Thread: Spread checkpoint sync
Final patch in this series for today spreads out the individual checkpoint fsync calls over time, and was written by Simon Riggs and myself. The patch is based against a system that's already had the two patches I sent over earlier today applied, rather than HEAD, as both are useful for measuring how well this one works. You can grab a tree with all three from my Github repo, via the "checkpoint" branch: https://github.com/greg2ndQuadrant/postgres/tree/checkpoint

This is a work in progress. While I've seen this reduce checkpoint spike latency significantly on a large system, I don't have any referenceable performance numbers I can share yet. There are also a couple of problems I know about, and I'm sure others I haven't thought of yet.

The first known issue is that it delays manual or other "forced" checkpoints, which is not necessarily wrong if you really are serious about spreading syncs out, but it is certainly surprising when you run into it. I notice this most when running createdb on a busy system. There's no real reason for this to happen; the code passes down the fact that it's a forced checkpoint, it just doesn't act on that yet.

The second issue is that the delay between sync calls is currently hard-coded, at 3 seconds. I believe the right path here is to consider the current checkpoint_completion_target to still be valid, then work back from there. That raises the question of what percentage of the time writes should now be compressed into relative to that, to leave some time to spread the sync calls. If we're willing to say "writes finish in first 1/2 of target, syncs execute in second 1/2", then I could implement that here. Maybe that ratio needs to be another tunable. Still thinking about that part, and it's certainly open to community debate. The thing that complicates the design is that the actual sync execution may take a considerable period of time. That's much more likely to happen than for an individual write, which is what the current spread checkpoint code paces, because writes are usually just cached. In the spread sync case, it's easy for one slow sync to make the rest turn into ones that fire in quick succession, to make up for lost time.

There's some history behind this design that impacts review. Circa 8.3 development in 2007, I had experimented with putting some delay between each of the fsync calls that the background writer executes during a checkpoint. It didn't help smooth things out at all at the time. It turns out that's mainly because all my tests were on Linux using ext3. On that filesystem, fsync is not very granular. It's quite likely it will push out data you haven't asked to sync yet, which means one giant sync is almost impossible to avoid no matter how you space the fsync calls. If you try to review this on ext3, I expect you'll find a big spike early in each checkpoint (where it flushes just about everything out) and then quick response for the later files involved. The system this patch was written to help fix was running XFS. There, I've confirmed that problem doesn't exist: individual syncs only seem to push out the data related to one file. The same should be true on ext4, but I haven't tested that myself. Not sure how granular the fsync calls are on Solaris, FreeBSD, Darwin, etc. yet. Note that it's still possible to get hung on one sync call for a while, even on XFS. The worst case seems to be if you've created a new 1GB database table chunk and fully populated it since the last checkpoint, on a system that's just cached the whole thing so far.
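To make the completion-target idea above concrete, here is a rough sketch (not code from the patch; the function and parameter names are hypothetical) of how a per-sync pause could be derived from checkpoint_completion_target instead of hard-coding 3 seconds:

#include <stdio.h>

/*
 * Hypothetical sketch: writes are compressed into the first write_fraction
 * of the checkpoint target, and the remaining window is divided evenly
 * among the files that still need an fsync.
 */
static double
sync_pause_secs(double checkpoint_timeout_secs,   /* checkpoint_timeout GUC */
                double completion_target,         /* checkpoint_completion_target GUC */
                double write_fraction,            /* e.g. 0.5: writes get the first half */
                int files_to_sync)                /* pending fsync requests at sync start */
{
    double target_secs = checkpoint_timeout_secs * completion_target;
    double sync_window = target_secs * (1.0 - write_fraction);

    if (files_to_sync <= 0)
        return 0.0;
    return sync_window / files_to_sync;
}

int
main(void)
{
    /* 300s checkpoint_timeout, target 0.5, writes confined to the first half */
    printf("pause per sync: %.2f s\n", sync_pause_secs(300.0, 0.5, 0.5, 300));
    return 0;
}

With those example numbers--300 files to sync in the second half of a 150 second target--the affordable pause works out to 0.25 seconds per sync rather than a fixed 3 seconds.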
One change that turned out to be necessary rather than optional--to get good performance from the system under tuning--was to make regular background writer activity, including fsync absorb checks, happen during these sync pauses. The existing code ran the checkpoint sync work in a pretty tight loop, which as I alluded to in an earlier patch today can lead to the backends competing with the background writer to get their sync calls executed. This squashes that problem if the background writer is set up properly. What does properly mean? Well, it can't do that cleanup if the background writer is sleeping. This whole area was refactored.

The current sync absorb code uses the constant WRITES_PER_ABSORB to make decisions. This new version replaces that hard-coded value with something that scales to the system size. It now puts off doing that work until the number of pending absorb requests has reached 10% of the number possible to store (BgWriterShmem->max_requests, which is set to the size of shared_buffers in 8K pages, AKA NBuffers). This may actually postpone the work for too long on systems with large shared_buffers settings; that's one area I'm still investigating.

As for concerns about this 10% setting not doing enough work, which is something I do see, you can always increase how often absorbing happens by decreasing bgwriter_delay now--which gives other benefits too. For example, if you run the fsync-stress-v2.sh script I included with the last patch I sent, you'll discover the spread sync version of the server leaves just as many unabsorbed writes behind as the old code did. Those are happening because of periods where the background writer is sleeping. They drop as you decrease the delay; here's a table showing some values I tested here, with all three patches installed:

bgwriter_delay   buffers_backend_sync
200 ms           90
50 ms            28
25 ms            3

There's a bunch of performance-related review work that needs to be done here, in addition to the usual code review for the patch. My hope is that I can get enough of that done--validating that this does what it's supposed to on public hardware--that a later version of this patch is considered for the next CommitFest. It's a little more raw than I'd like still, but the idea has been tested enough here that I believe it's fundamentally sound and valuable.

-- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us

diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c index 43a149e..0ce8e2b 100644 --- a/src/backend/postmaster/bgwriter.c +++ b/src/backend/postmaster/bgwriter.c @@ -143,8 +143,8 @@ typedef struct static BgWriterShmemStruct *BgWriterShmem; -/* interval for calling AbsorbFsyncRequests in CheckpointWriteDelay */ -#define WRITES_PER_ABSORB 1000 +/* Fraction of fsync absorb queue that needs to be filled before acting */ +#define ABSORB_ACTION_DIVISOR 10 /* * GUC parameters @@ -382,7 +382,7 @@ BackgroundWriterMain(void) /* * Process any requests or signals received recently. */ - AbsorbFsyncRequests(); + AbsorbFsyncRequests(false); if (got_SIGHUP) { @@ -636,7 +636,7 @@ BgWriterNap(void) (ckpt_active ?
ImmediateCheckpointRequested() : checkpoint_requested)) break; pg_usleep(1000000L); - AbsorbFsyncRequests(); + AbsorbFsyncRequests(true); udelay -= 1000000L; } @@ -684,8 +684,6 @@ ImmediateCheckpointRequested(void) void CheckpointWriteDelay(int flags, double progress) { - static int absorb_counter = WRITES_PER_ABSORB; - /* Do nothing if checkpoint is being executed by non-bgwriter process */ if (!am_bg_writer) return; @@ -705,22 +703,65 @@ CheckpointWriteDelay(int flags, double progress) ProcessConfigFile(PGC_SIGHUP); } - AbsorbFsyncRequests(); - absorb_counter = WRITES_PER_ABSORB; + AbsorbFsyncRequests(false); BgBufferSync(); CheckArchiveTimeout(); BgWriterNap(); } - else if (--absorb_counter <= 0) + else { /* - * Absorb pending fsync requests after each WRITES_PER_ABSORB write - * operations even when we don't sleep, to prevent overflow of the - * fsync request queue. + * Check for overflow of the fsync request queue. */ - AbsorbFsyncRequests(); - absorb_counter = WRITES_PER_ABSORB; + AbsorbFsyncRequests(false); + } +} + +/* + * CheckpointSyncDelay -- yield control to bgwriter during a checkpoint + * + * This function is called after each file sync performed by mdsync(). + * It is responsible for keeping the bgwriter's normal activities in + * progress during a long checkpoint. + */ +void +CheckpointSyncDelay(void) +{ + pg_time_t now; + pg_time_t sync_start_time; + int sync_delay_secs; + + /* + * Delay after each sync, in seconds. This could be a parameter. But + * since ideally this will be auto-tuning in the near future, not + * assigning it a GUC setting yet. + */ +#define EXTRA_SYNC_DELAY 3 + + /* Do nothing if checkpoint is being executed by non-bgwriter process */ + if (!am_bg_writer) + return; + + sync_start_time = (pg_time_t) time(NULL); + + /* + * Perform the usual bgwriter duties. + */ + for (;;) + { + AbsorbFsyncRequests(false); + BgBufferSync(); + CheckArchiveTimeout(); + BgWriterNap(); + + /* + * Are we there yet? + */ + now = (pg_time_t) time(NULL); + sync_delay_secs = now - sync_start_time; + if (sync_delay_secs >= EXTRA_SYNC_DELAY) + break; } } @@ -1116,16 +1157,41 @@ ForwardFsyncRequest(RelFileNodeBackend rnode, ForkNumber forknum, * non-bgwriter processes, do nothing if not bgwriter. */ void -AbsorbFsyncRequests(void) +AbsorbFsyncRequests(bool force) { BgWriterRequest *requests = NULL; BgWriterRequest *request; int n; + /* + * Divide the size of the request queue by this to determine when + * absorption action needs to be taken. Default here aims to empty the + * queue whenever 1 / 10 = 10% of it is full. If this isn't good enough, + * you probably need to lower bgwriter_delay, rather than presume + * this needs to be a tunable you can decrease. + */ + int absorb_action_divisor = 10; + if (!am_bg_writer) return; /* + * If the queue isn't very large, don't worry about absorbing yet. + * Access integer counter without lock, to avoid queuing. + */ + if (!force && BgWriterShmem->num_requests < + (BgWriterShmem->max_requests / ABSORB_ACTION_DIVISOR)) + { + if (BgWriterShmem->num_requests > 0) + elog(DEBUG1,"Absorb queue: %d fsync requests, not processing", + BgWriterShmem->num_requests); + return; + } + + elog(DEBUG1,"Absorb queue: %d fsync requests, processing", + BgWriterShmem->num_requests); + + /* * We have to PANIC if we fail to absorb all the pending requests (eg, * because our hashtable runs out of memory). 
This is because the system * cannot run safely if we are unable to fsync what we have been told to @@ -1167,4 +1233,9 @@ AbsorbFsyncRequests(void) pfree(requests); END_CRIT_SECTION(); + + /* + * Send off activity statistics to the stats collector + */ + pgstat_send_bgwriter(); } diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c index 7140b94..57066c4 100644 --- a/src/backend/storage/smgr/md.c +++ b/src/backend/storage/smgr/md.c @@ -36,9 +36,6 @@ */ #define DEBUG_FSYNC 1 -/* interval for calling AbsorbFsyncRequests in mdsync */ -#define FSYNCS_PER_ABSORB 10 - /* special values for the segno arg to RememberFsyncRequest */ #define FORGET_RELATION_FSYNC (InvalidBlockNumber) #define FORGET_DATABASE_FSYNC (InvalidBlockNumber-1) @@ -931,7 +928,6 @@ mdsync(void) HASH_SEQ_STATUS hstat; PendingOperationEntry *entry; - int absorb_counter; #ifdef DEBUG_FSYNC /* Statistics on sync times */ @@ -958,7 +954,7 @@ mdsync(void) * queued an fsync request before clearing the buffer's dirtybit, so we * are safe as long as we do an Absorb after completing BufferSync(). */ - AbsorbFsyncRequests(); + AbsorbFsyncRequests(true); /* * To avoid excess fsync'ing (in the worst case, maybe a never-terminating @@ -1001,7 +997,6 @@ mdsync(void) mdsync_in_progress = true; /* Now scan the hashtable for fsync requests to process */ - absorb_counter = FSYNCS_PER_ABSORB; hash_seq_init(&hstat, pendingOpsTable); while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL) { @@ -1026,17 +1021,9 @@ mdsync(void) int failures; /* - * If in bgwriter, we want to absorb pending requests every so - * often to prevent overflow of the fsync request queue. It is - * unspecified whether newly-added entries will be visited by - * hash_seq_search, but we don't care since we don't need to - * process them anyway. + * If in bgwriter, perform normal duties. */ - if (--absorb_counter <= 0) - { - AbsorbFsyncRequests(); - absorb_counter = FSYNCS_PER_ABSORB; - } + CheckpointSyncDelay(); /* * The fsync table could contain requests to fsync segments that @@ -1131,10 +1118,9 @@ mdsync(void) pfree(path); /* - * Absorb incoming requests and check to see if canceled. + * If in bgwriter, perform normal duties. */ - AbsorbFsyncRequests(); - absorb_counter = FSYNCS_PER_ABSORB; /* might as well... */ + CheckpointSyncDelay(); if (entry->canceled) break; diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h index e251da6..4939604 100644 --- a/src/include/postmaster/bgwriter.h +++ b/src/include/postmaster/bgwriter.h @@ -26,10 +26,11 @@ extern void BackgroundWriterMain(void); extern void RequestCheckpoint(int flags); extern void CheckpointWriteDelay(int flags, double progress); +extern void CheckpointSyncDelay(void); extern bool ForwardFsyncRequest(RelFileNodeBackend rnode, ForkNumber forknum, BlockNumber segno); -extern void AbsorbFsyncRequests(void); +extern void AbsorbFsyncRequests(bool force); extern Size BgWriterShmemSize(void); extern void BgWriterShmemInit(void);
On Sun, Nov 14, 2010 at 6:48 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> The second issue is that the delay between sync calls is currently
> hard-coded, at 3 seconds. I believe the right path here is to consider the
> current checkpoint_completion_target to still be valid, then work back from
> there. That raises the question of what percentage of the time writes
> should now be compressed into relative to that, to leave some time to spread
> the sync calls. If we're willing to say "writes finish in first 1/2 of
> target, syncs execute in second 1/2", that I could implement that here.
> Maybe that ratio needs to be another tunable. Still thinking about that
> part, and it's certainly open to community debate. The thing to realize
> that complicates the design is that the actual sync execution may take a
> considerable period of time. It's much more likely for that to happen than
> in the case of an individual write, as the current spread checkpoint does,
> because those are usually cached. In the spread sync case, it's easy for
> one slow sync to make the rest turn into ones that fire in quick succession,
> to make up for lost time.

I think the behavior of file systems and operating systems is highly relevant here. We seem to have a theory that allowing a delay between the write and the fsync should give the OS a chance to start writing the data out, but do we have any evidence indicating whether and under what circumstances that actually occurs? For example, if we knew that it's important to wait at least 30 s but waiting 60 s is no better, that would be useful information.

Another question I have is about how we're actually going to know when any given fsync can be performed. For any given segment, there are a certain number of pages A that are already dirty at the start of the checkpoint. Then there are a certain number of additional pages B that are going to be written out during the checkpoint. If it so happens that B = 0, we can call fsync() at the beginning of the checkpoint without losing anything (in fact, we gain something: any pages dirtied by cleaning scans or backend writes during the checkpoint won't need to hit the disk; and if the filesystem dumps more of its cache than necessary on fsync, we may as well take that hit before dirtying a bunch more stuff). But if B > 0, then we shouldn't attempt the fsync() until we've written them all; otherwise we'll end up having to fsync() that segment twice.

Doing all the writes and then all the fsyncs meets this requirement trivially, but I'm not so sure that's a good idea. For example, given files F1 ... Fn with dirty pages needing checkpoint writes, we could do the following: first, do any pending fsyncs for files not among F1 .. Fn; then, write all pages for F1 and fsync, write all pages for F2 and fsync, write all pages for F3 and fsync, etc. This might seem dumb because we're not really giving the OS a chance to write anything out before we fsync, but think about the ext3 case where the whole filesystem cache gets flushed anyway. It's much better to dump the cache at the beginning of the checkpoint and then again after every file than it is to spew many GB of dirty stuff into the cache and then drop the hammer.

I'm just brainstorming here; feel free to tell me I'm all wet.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
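As an illustration only of the per-file ordering described above (the types and the write_checkpoint_pages() helper are hypothetical, not PostgreSQL functions):

#include <stdbool.h>
#include <unistd.h>

typedef struct CkptFile
{
    int  fd;                        /* open segment file */
    bool has_checkpoint_writes;     /* does this file still have B > 0? */
} CkptFile;

/* hypothetical helper: write out this file's dirty checkpoint pages */
extern void write_checkpoint_pages(CkptFile *f);

static void
checkpoint_per_file(CkptFile *files, int nfiles)
{
    int i;

    /* Pass 1: fsync files with B = 0, i.e. nothing left to write for them */
    for (i = 0; i < nfiles; i++)
        if (!files[i].has_checkpoint_writes)
            fsync(files[i].fd);

    /* Pass 2: for each remaining file, write its pages then sync it at once */
    for (i = 0; i < nfiles; i++)
        if (files[i].has_checkpoint_writes)
        {
            write_checkpoint_pages(&files[i]);
            fsync(files[i].fd);
        }
}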
On Mon, Nov 15, 2010 at 6:15 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Sun, Nov 14, 2010 at 6:48 PM, Greg Smith <greg@2ndquadrant.com> wrote: >> The second issue is that the delay between sync calls is currently >> hard-coded, at 3 seconds. I believe the right path here is to consider the >> current checkpoint_completion_target to still be valid, then work back from >> there. That raises the question of what percentage of the time writes >> should now be compressed into relative to that, to leave some time to spread >> the sync calls. If we're willing to say "writes finish in first 1/2 of >> target, syncs execute in second 1/2", that I could implement that here. >> Maybe that ratio needs to be another tunable. Still thinking about that >> part, and it's certainly open to community debate. I would speculate that the answer is likely to be nearly binary. The best option would either be to do the writes as fast as possible and spread out the fsyncs, or spread out the writes and do the fsyncs as fast as possible. Depending on the system set up. >> The thing to realize >> that complicates the design is that the actual sync execution may take a >> considerable period of time. It's much more likely for that to happen than >> in the case of an individual write, as the current spread checkpoint does, >> because those are usually cached. In the spread sync case, it's easy for >> one slow sync to make the rest turn into ones that fire in quick succession, >> to make up for lost time. > > I think the behavior of file systems and operating systems is highly > relevant here. We seem to have a theory that allowing a delay between > the write and the fsync should give the OS a chance to start writing > the data out, I thought that the theory was that doing too many fsync in short order can lead to some kind of starvation of other IO. If the theory is that we want to wait between writes and fsyncs, then the current behavior is probably the best, Spreading out the writes and then doing all the syncs at the end gives the best delay time between an average write and the sync of that written to file. Or, spread the writes out over 150 seconds, sleep for 140 seconds, then do the fsyncs. But I don't think that that is the theory. > but do we have any evidence indicating whether and under > what circumstances that actually occurs? For example, if we knew that > it's important to wait at least 30 s but waiting 60 s is no better, > that would be useful information. > > Another question I have is about how we're actually going to know when > any given fsync can be performed. For any given segment, there are a > certain number of pages A that are already dirty at the start of the > checkpoint. Dirty in the shared pool, or dirty in the OS cache? > Then there are a certain number of additional pages B > that are going to be written out during the checkpoint. If it so > happens that B = 0, we can call fsync() at the beginning of the > checkpoint without losing anything (in fact, we gain something: any > pages dirtied by cleaning scans or backend writes during the > checkpoint won't need to hit the disk; Aren't those pages written out by cleaning scans and backend writes while the checkpoint is occurring exactly what you defined to be page set B, and then to be zero? > and if the filesystem dumps > more of its cache than necessary on fsync, we may as well take that > hit before dirtying a bunch more stuff). 
But if B > 0, then we should > attempt the fsync() until we've written them all; otherwise we'll end > up having to fsync() that segment twice. > > Doing all the writes and then all the fsyncs meets this requirement > trivially, but I'm not so sure that's a good idea. For example, given > files F1 ... Fn with dirty pages needing checkpoint writes, we could > do the following: first, do any pending fsyncs for files not among F1 > .. Fn; then, write all pages for F1 and fsync, write all pages for F2 > and fsync, write all pages for F3 and fsync, etc. This might seem > dumb because we're not really giving the OS a chance to write anything > out before we fsync, but think about the ext3 case where the whole > filesystem cache gets flushed anyway. It's much better to dump the > cache at the beginning of the checkpoint and then again after every > file than it is to spew many GB of dirty stuff into the cache and then > drop the hammer. But the kernel has knobs to prevent that from happening. dirty_background_ratio, dirty_ratio, dirty_background_bytes (on newer kernels), dirty_expire_centisecs. Don't these knobs work? Also, ext3 is supposed to do a journal commit every 5 seconds under default mount conditions. Cheers, Jeff
On Sat, Nov 20, 2010 at 6:21 PM, Jeff Janes <jeff.janes@gmail.com> wrote: >>> The thing to realize >>> that complicates the design is that the actual sync execution may take a >>> considerable period of time. It's much more likely for that to happen than >>> in the case of an individual write, as the current spread checkpoint does, >>> because those are usually cached. In the spread sync case, it's easy for >>> one slow sync to make the rest turn into ones that fire in quick succession, >>> to make up for lost time. >> >> I think the behavior of file systems and operating systems is highly >> relevant here. We seem to have a theory that allowing a delay between >> the write and the fsync should give the OS a chance to start writing >> the data out, > > I thought that the theory was that doing too many fsync in short order > can lead to some kind of starvation of other IO. > > If the theory is that we want to wait between writes and fsyncs, then > the current behavior is probably the best, Spreading out the writes > and then doing all the syncs at the end gives the best delay time > between an average write and the sync of that written to file. Or, > spread the writes out over 150 seconds, sleep for 140 seconds, then do > the fsyncs. But I don't think that that is the theory. Well, I've heard Bruce and, I think, possibly also Greg talk about wanting to wait after doing the writes in the hopes that the kernel will start to flush the dirty pages, but I'm wondering whether it wouldn't be better to just give up on that and do: small batch of writes - fsync those writes - another small batch of writes - fsync that batch - etc. >> but do we have any evidence indicating whether and under >> what circumstances that actually occurs? For example, if we knew that >> it's important to wait at least 30 s but waiting 60 s is no better, >> that would be useful information. >> >> Another question I have is about how we're actually going to know when >> any given fsync can be performed. For any given segment, there are a >> certain number of pages A that are already dirty at the start of the >> checkpoint. > > Dirty in the shared pool, or dirty in the OS cache? OS cache, sorry. >> Then there are a certain number of additional pages B >> that are going to be written out during the checkpoint. If it so >> happens that B = 0, we can call fsync() at the beginning of the >> checkpoint without losing anything (in fact, we gain something: any >> pages dirtied by cleaning scans or backend writes during the >> checkpoint won't need to hit the disk; > > Aren't those pages written out by cleaning scans and backend writes > while the checkpoint is occurring exactly what you defined to be page > set B, and then to be zero? No, sorry, I'm referring to cases where all the dirty pages in a segment have been written out the OS but we have not yet issued the necessary fsync. >> and if the filesystem dumps >> more of its cache than necessary on fsync, we may as well take that >> hit before dirtying a bunch more stuff). But if B > 0, then we should >> attempt the fsync() until we've written them all; otherwise we'll end >> up having to fsync() that segment twice. >> >> Doing all the writes and then all the fsyncs meets this requirement >> trivially, but I'm not so sure that's a good idea. For example, given >> files F1 ... Fn with dirty pages needing checkpoint writes, we could >> do the following: first, do any pending fsyncs for files not among F1 >> .. 
Fn; then, write all pages for F1 and fsync, write all pages for F2 >> and fsync, write all pages for F3 and fsync, etc. This might seem >> dumb because we're not really giving the OS a chance to write anything >> out before we fsync, but think about the ext3 case where the whole >> filesystem cache gets flushed anyway. It's much better to dump the >> cache at the beginning of the checkpoint and then again after every >> file than it is to spew many GB of dirty stuff into the cache and then >> drop the hammer. > > But the kernel has knobs to prevent that from happening. > dirty_background_ratio, dirty_ratio, dirty_background_bytes (on newer > kernels), dirty_expire_centisecs. Don't these knobs work? Also, ext3 > is supposed to do a journal commit every 5 seconds under default mount > conditions. I don't know in detail. dirty_expire_centisecs sounds useful; I think the problem with dirty_background_ratio and dirty_ratio is that the default ratios are large enough that on systems with a huge pile of memory, they allow more dirty data to accumulate than can be flushed without causing an I/O storm. I believe Greg Smith made a comment along the lines of - memory sizes are grow faster than I/O speeds; therefore a ratio that is OK for a low-end system with a modest amount of memory causes problems on a high-end system that has faster I/O but MUCH more memory. As a kernel developer, I suspect the tendency is to try to set the ratio so that you keep enough free memory around to service future allocation requests. Optimizing for possible future fsyncs is probably not the top priority... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sat, Nov 20, 2010 at 5:17 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Sat, Nov 20, 2010 at 6:21 PM, Jeff Janes <jeff.janes@gmail.com> wrote: >>> Doing all the writes and then all the fsyncs meets this requirement >>> trivially, but I'm not so sure that's a good idea. For example, given >>> files F1 ... Fn with dirty pages needing checkpoint writes, we could >>> do the following: first, do any pending fsyncs for files not among F1 >>> .. Fn; then, write all pages for F1 and fsync, write all pages for F2 >>> and fsync, write all pages for F3 and fsync, etc. This might seem >>> dumb because we're not really giving the OS a chance to write anything >>> out before we fsync, but think about the ext3 case where the whole >>> filesystem cache gets flushed anyway. It's much better to dump the >>> cache at the beginning of the checkpoint and then again after every >>> file than it is to spew many GB of dirty stuff into the cache and then >>> drop the hammer. >> >> But the kernel has knobs to prevent that from happening. >> dirty_background_ratio, dirty_ratio, dirty_background_bytes (on newer >> kernels), dirty_expire_centisecs. Don't these knobs work? Also, ext3 >> is supposed to do a journal commit every 5 seconds under default mount >> conditions. > > I don't know in detail. dirty_expire_centisecs sounds useful; I think > the problem with dirty_background_ratio and dirty_ratio is that the > default ratios are large enough that on systems with a huge pile of > memory, they allow more dirty data to accumulate than can be flushed > without causing an I/O storm. True, but I think that changing these from their defaults is not considered to be a dark art reserved for kernel hackers, i.e they are something that sysadmins are expected to tweak to suite their work load, just like the shmmax and such. And for very large memory systems, even 1% may be too much to cache (dirty*_ratio can only be set in integer percent points), so recent kernels introduced dirty*_bytes parameters. I like these better because they do what they say. With the dirty*_ratio, I could never figure out what it was a ratio of, and the results were unpredictable without extensive experimentation. > I believe Greg Smith made a comment > along the lines of - memory sizes are grow faster than I/O speeds; > therefore a ratio that is OK for a low-end system with a modest amount > of memory causes problems on a high-end system that has faster I/O but > MUCH more memory. Yes, but how much work do we want to put into redoing the checkpoint logic so that the sysadmin on a particular OS and configuration and FS can avoid having to change the kernel parameters away from their defaults? (Assuming of course I am correctly understanding the problem, always a dangerous assumption.) Some experiments I have just done show that dirty_expire_centisecs does not seem reliable on ext3, but the dirty*_ratio and dirty*_bytes seem reliable on ext2, ext3, and ext4. But that may not apply to RAID, I don't have one I can test. Cheers, Jeff
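For reference, a minimal Linux-only program that prints the current values of the writeback tunables under discussion; on kernels older than 2.6.29 the *_bytes files simply won't exist, and a value of 0 in them means the corresponding ratio is what's in effect:

#include <stdio.h>

static void
show(const char *path)
{
    FILE *f = fopen(path, "r");
    char buf[64];

    if (f == NULL)
    {
        printf("%-42s (not available)\n", path);
        return;
    }
    if (fgets(buf, sizeof(buf), f) != NULL)
        printf("%-42s %s", path, buf);
    fclose(f);
}

int
main(void)
{
    show("/proc/sys/vm/dirty_background_ratio");
    show("/proc/sys/vm/dirty_background_bytes");
    show("/proc/sys/vm/dirty_ratio");
    show("/proc/sys/vm/dirty_bytes");
    show("/proc/sys/vm/dirty_expire_centisecs");
    return 0;
}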
Jeff Janes wrote: > And for very large memory > systems, even 1% may be too much to cache (dirty*_ratio can only be > set in integer percent points), so recent kernels introduced > dirty*_bytes parameters. I like these better because they do what > they say. With the dirty*_ratio, I could never figure out what it was > a ratio of, and the results were unpredictable without extensive > experimentation. > Right, you can't set dirty_background_ratio low enough to make this problem go away. Even attempts to set it to 1%, back when that that was the right size for it, seem to be defeated by other mechanisms within the kernel. Last time I looked at the related source code, it seemed the "congestion control" logic that kicks in to throttle writes was a likely suspect. This is why I'm not real optimistic about newer mechanism like the dirty_background_bytes added 2.6.29 to help here, as that just gives a mapping to setting lower values; the same basic logic is under the hood. Like Jeff, I've never seen dirty_expire_centisecs help at all, possibly due to the same congestion mechanism. > Yes, but how much work do we want to put into redoing the checkpoint > logic so that the sysadmin on a particular OS and configuration and FS > can avoid having to change the kernel parameters away from their > defaults? (Assuming of course I am correctly understanding the > problem, always a dangerous assumption.) > I've been trying to make this problem go away using just the kernel tunables available since 2006. I adjusted them carefully on the server that ran into this problem so badly that it motivated the submitted patch, months before this issue got bad. It didn't help. Maybe if they were running a later kernel that supported dirty_background_bytes that would have worked better. During the last few years, the only thing that has consistently helped in every case is the checkpoint spreading logic that went into 8.3. I no longer expect that the kernel developers will ever make this problem go away the way checkpoints are written out right now, whereas the last good PostgreSQL work in this area definitely helped. The basic premise of the current checkpoint code is that if you write all of the buffers out early enough, by the time syncs execute enough of the data should have gone out that those don't take very long to process. That was usually true for the last few years, on systems with a battery-backed cache; the amount of memory cached by the OS was relatively small relative to the RAID cache size. That's not the case anymore, and that divergence is growing bigger. The idea that the checkpoint sync code can run in a relatively tight loop, without stopping to do the normal background writer cleanup work, is also busted by that observation. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
Robert Haas wrote:
> Doing all the writes and then all the fsyncs meets this requirement
> trivially, but I'm not so sure that's a good idea. For example, given
> files F1 ... Fn with dirty pages needing checkpoint writes, we could
> do the following: first, do any pending fsyncs for files not among F1
> .. Fn; then, write all pages for F1 and fsync, write all pages for F2
> and fsync, write all pages for F3 and fsync, etc. This might seem
> dumb because we're not really giving the OS a chance to write anything
> out before we fsync, but think about the ext3 case where the whole
> filesystem cache gets flushed anyway.

I'm not horribly interested in optimizing for the ext3 case per se, as I consider that filesystem fundamentally broken from the perspective of its ability to deliver low latency here. I wouldn't want a patch that improved behavior on filesystems with granular fsync to make the ext3 situation worse. That's as much as I'd want the design to lean toward considering its quirks. Jeff Janes made a case downthread for "why not make it the admin/OS's job to worry about this?" In cases where there is a reasonable solution available, in the form of "switch to XFS or ext4", I'm happy to take that approach.

Let me throw some numbers out to give a better idea of the shape and magnitude of the problem case I've been working on here. In the situation that leads to the near hour-long sync phase I've seen, checkpoints will start with about a 3GB backlog of data in the kernel write cache to deal with. That's about 4% of RAM, just under the 5% threshold set by dirty_background_ratio. Whether or not the 256MB write cache on the controller is also filled is a relatively minor detail I can't monitor easily. The checkpoint itself? <250MB each time. This proportion is why I didn't think to follow the alternate path of worrying about spacing the write and fsync calls out differently.

I shrunk shared_buffers down to make the actual checkpoints smaller, which helped to some degree; that's what got them down to smaller than the RAID cache size. But the amount of data cached by the operating system is the real driver of total sync time here. Whether or not you include all of the writes from the checkpoint itself before you start calling fsync didn't actually matter very much; in the case I've been chasing, those are getting cached anyway. The write storm from the fsync calls themselves forcing things out seems to be the driver of the I/O spikes, which is why I started with spacing those out.

Writes go out at a rate of around 5MB/s, so clearing the 3GB backlog takes a minimum of 10 minutes of real time. There are about 300 1GB relation files involved in the case I've been chasing. This is where the 3 second delay number came from: 300 files, 3 seconds each, 900 seconds = 15 minutes of sync spread. You can turn that math around to figure out how much delay per relation you can afford while still keeping checkpoints to a planned end time, which isn't done in the patch I submitted yet.

Ultimately what I want to do here is some sort of smarter write-behind sync operation, perhaps with an LRU on relations with pending fsync requests. The idea would be to sync relations that haven't been touched in a while in advance of the checkpoint even. I think that's similar to the general idea Robert is suggesting here, to get some sync calls flowing before all of the checkpoint writes have happened.
I think that the final sync calls will need to get spread out regardless, and since doing that requires a fairly small amount of code too that's why we started with that. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
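A rough sketch of the write-behind idea Greg describes above, using made-up structures and a made-up 30 second idle threshold rather than anything from the patch: track when each relation with a pending fsync request last received a new write, and opportunistically sync the ones that have gone quiet, ahead of the checkpoint's own sync phase.

#include <time.h>
#include <unistd.h>

#define IDLE_SYNC_THRESHOLD 30        /* seconds without new writes; illustrative value */

typedef struct PendingSync
{
    int     fd;               /* open segment file */
    time_t  last_write;       /* updated whenever another fsync request arrives */
    int     synced;           /* already flushed ahead of the checkpoint? */
} PendingSync;

static void
sync_idle_relations(PendingSync *pending, int n)
{
    time_t now = time(NULL);
    int i;

    for (i = 0; i < n; i++)
    {
        if (pending[i].synced)
            continue;
        if (now - pending[i].last_write >= IDLE_SYNC_THRESHOLD)
        {
            fsync(pending[i].fd);     /* cheap if the kernel already wrote it back */
            pending[i].synced = 1;    /* a later write would have to clear this again */
        }
    }
}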
On Sun, Nov 21, 2010 at 04:54:00PM -0500, Greg Smith wrote: > Ultimately what I want to do here is some sort of smarter write-behind > sync operation, perhaps with a LRU on relations with pending fsync > requests. The idea would be to sync relations that haven't been touched > in a while in advance of the checkpoint even. I think that's similar to > the general idea Robert is suggesting here, to get some sync calls > flowing before all of the checkpoint writes have happened. I think that > the final sync calls will need to get spread out regardless, and since > doing that requires a fairly small amount of code too that's why we > started with that. For a similar problem we had (kernel buffering too much) we had success using the fadvise and madvise WONTNEED syscalls to force the data to exit the cache much sooner than it would otherwise. This was on Linux and it had the side-effect that the data was deleted from the kernel cache, which we wanted, but probably isn't appropriate here. There is also sync_file_range, but that's linux specific, although close to what you want I think. It would allow you to work with blocks smaller than 1GB. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Patriotism is when love of your own people comes first; nationalism, > when hate for people other than your own comes first. > - Charles de Gaulle
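A minimal illustration of the Linux calls Martijn mentions, not how PostgreSQL does things today: the usual pattern is to make the pages clean first, so that POSIX_FADV_DONTNEED can actually evict them from the page cache rather than skipping over still-dirty data.

#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <unistd.h>

/* Flush a file's data and then hint the kernel to drop it from cache.
 * Returns 0 on success. */
static int
flush_and_drop(int fd)
{
    if (fdatasync(fd) != 0)
        return -1;
    /* offset 0, len 0 means "the whole file" */
    return posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
}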
On Sunday 21 November 2010 23:19:30 Martijn van Oosterhout wrote: > For a similar problem we had (kernel buffering too much) we had success > using the fadvise and madvise WONTNEED syscalls to force the data to > exit the cache much sooner than it would otherwise. This was on Linux > and it had the side-effect that the data was deleted from the kernel > cache, which we wanted, but probably isn't appropriate here. Yep, works fine. Although it has the issue that the data will get read again if archiving/SR is enabled. > There is also sync_file_range, but that's linux specific, although > close to what you want I think. It would allow you to work with blocks > smaller than 1GB. Unfortunately that puts the data under quite high write-out pressure inside the kernel - which is not what you actually want because it limits reordering and such significantly. It would be nicer if you could get a mix of both semantics (looking at it, depending on the approach that seems to be about a 10 line patch to the kernel). I.e. indicate that you want to write the pages soonish, but don't put it on the head of the writeout queue. Andres
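For reference, the sync_file_range() variants being discussed, as a Linux-specific sketch. The WRITE-only form just initiates writeback for a range; adding the WAIT flags makes it block until that range's data is written, though unlike fsync it covers neither file metadata nor the drive's write cache, so it is not a durability barrier on its own.

#define _GNU_SOURCE
#include <sys/types.h>
#include <fcntl.h>

/* Kick off asynchronous writeback for one 1MB chunk of a file. */
static int
start_writeback(int fd, off64_t offset)
{
    return sync_file_range(fd, offset, 1024 * 1024, SYNC_FILE_RANGE_WRITE);
}

/* Wait until that chunk's data has been written out (roughly a range fsync of the data). */
static int
wait_for_writeback(int fd, off64_t offset)
{
    return sync_file_range(fd, offset, 1024 * 1024,
                           SYNC_FILE_RANGE_WAIT_BEFORE |
                           SYNC_FILE_RANGE_WRITE |
                           SYNC_FILE_RANGE_WAIT_AFTER);
}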
On 11/20/10 6:11 PM, Jeff Janes wrote: > True, but I think that changing these from their defaults is not > considered to be a dark art reserved for kernel hackers, i.e they are > something that sysadmins are expected to tweak to suite their work > load, just like the shmmax and such. I disagree. Linux kernel hackers know about these kinds of parameters, and I suppose that Linux performance experts do. But very few sysadmins, in my experience, have any idea. -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
On Sun, Nov 21, 2010 at 4:54 PM, Greg Smith <greg@2ndquadrant.com> wrote: > Let me throw some numbers out [...] Interesting. > Ultimately what I want to do here is some sort of smarter write-behind sync > operation, perhaps with a LRU on relations with pending fsync requests. The > idea would be to sync relations that haven't been touched in a while in > advance of the checkpoint even. I think that's similar to the general idea > Robert is suggesting here, to get some sync calls flowing before all of the > checkpoint writes have happened. I think that the final sync calls will > need to get spread out regardless, and since doing that requires a fairly > small amount of code too that's why we started with that. Doing some kind of background fsyinc-ing might indeed be sensible, but I agree that's secondary to trying to spread out the fsyncs during the checkpoint itself. I guess the question is what we can do there sensibly without an unreasonable amount of new code. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
2010/11/21 Andres Freund <andres@anarazel.de>: > On Sunday 21 November 2010 23:19:30 Martijn van Oosterhout wrote: >> For a similar problem we had (kernel buffering too much) we had success >> using the fadvise and madvise WONTNEED syscalls to force the data to >> exit the cache much sooner than it would otherwise. This was on Linux >> and it had the side-effect that the data was deleted from the kernel >> cache, which we wanted, but probably isn't appropriate here. > Yep, works fine. Although it has the issue that the data will get read again if > archiving/SR is enabled. mmhh . the current code does call DONTNEED or WILLNEED for WAL depending of the archiving off or on. This matters *only* once the data is writen (fsync, fdatasync), before that it should not have an effect. > >> There is also sync_file_range, but that's linux specific, although >> close to what you want I think. It would allow you to work with blocks >> smaller than 1GB. > Unfortunately that puts the data under quite high write-out pressure inside > the kernel - which is not what you actually want because it limits reordering > and such significantly. > > It would be nicer if you could get a mix of both semantics (looking at it, > depending on the approach that seems to be about a 10 line patch to the > kernel). I.e. indicate that you want to write the pages soonish, but don't put > it on the head of the writeout queue. > > Andres > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers > -- Cédric Villemain 2ndQuadrant http://2ndQuadrant.fr/ PostgreSQL : Expertise, Formation et Support
Josh Berkus wrote: > On 11/20/10 6:11 PM, Jeff Janes wrote: >> True, but I think that changing these from their defaults is not >> considered to be a dark art reserved for kernel hackers, i.e they are >> something that sysadmins are expected to tweak to suite their work >> load, just like the shmmax and such. > > I disagree. Linux kernel hackers know about these kinds of parameters, > and I suppose that Linux performance experts do. But very few > sysadmins, in my experience, have any idea. To me, a lot of this conversation feels parallel to the arguments the occasionally come up debating writing directly to raw disks bypassing the filesystems altogether. Might smoother checkpoints be better solved by talking to the OS vendors & virtual-memory-tunning-knob-authors to work with them on exposing the ideal knobs; rather than saying that our only tool is a hammer(fsync) so the problem must be handled as a nail. Hypothetically - what would the ideal knobs be? Something like madvise WONTNEED but that leaves pages in the OS's cache after writing them?
Ron Mayer wrote: > Might smoother checkpoints be better solved by talking > to the OS vendors & virtual-memory-tunning-knob-authors > to work with them on exposing the ideal knobs; rather than > saying that our only tool is a hammer(fsync) so the problem > must be handled as a nail. > Maybe, but it's hard to argue that the current implementation--just doing all of the sync calls as fast as possible, one after the other--is going to produce worst-case behavior in a lot of situations. Given that it's not a huge amount of code to do better, I'd rather do some work in that direction, instead of presuming the kernel authors will ever make this go away. Spreading the writes out as part of the checkpoint rework in 8.3 worked better than any kernel changes I've tested since then, and I'm not real optimisic about this getting resolved at the system level. So long as the database changes aren't antagonistic toward kernel improvements, I'd prefer to have some options here that become effective as soon as the database code is done. I've attached an updated version of the initial sync spreading patch here, one that applies cleanly on top of HEAD and over top of the sync instrumentation patch too. The conflict that made that hard before is gone now. Having the pg_stat_bgwriter.buffers_backend_fsync patch available all the time now has made me reconsider how important one potential bit of refactoring here would be. I managed to catch one of the situations where really popular relations were being heavily updated in a way that was competing with the checkpoint on my test system (which I can happily share the logs of), with the instrumentation patch applied but not the spread sync one: LOG: checkpoint starting: xlog DEBUG: could not forward fsync request because request queue is full CONTEXT: writing block 7747 of relation base/16424/16442 DEBUG: could not forward fsync request because request queue is full CONTEXT: writing block 42688 of relation base/16424/16437 DEBUG: could not forward fsync request because request queue is full CONTEXT: writing block 9723 of relation base/16424/16442 DEBUG: could not forward fsync request because request queue is full CONTEXT: writing block 58117 of relation base/16424/16437 DEBUG: could not forward fsync request because request queue is full CONTEXT: writing block 165128 of relation base/16424/16437 [330 of these total, all referring to the same two relations] DEBUG: checkpoint sync: number=1 file=base/16424/16448_fsm time=10132.830000 msec DEBUG: checkpoint sync: number=2 file=base/16424/11645 time=0.001000 msec DEBUG: checkpoint sync: number=3 file=base/16424/16437 time=7.796000 msec DEBUG: checkpoint sync: number=4 file=base/16424/16448 time=4.679000 msec DEBUG: checkpoint sync: number=5 file=base/16424/11607 time=0.001000 msec DEBUG: checkpoint sync: number=6 file=base/16424/16437.1 time=3.101000 msec DEBUG: checkpoint sync: number=7 file=base/16424/16442 time=4.172000 msec DEBUG: checkpoint sync: number=8 file=base/16424/16428_vm time=0.001000 msec DEBUG: checkpoint sync: number=9 file=base/16424/16437_fsm time=0.001000 msec DEBUG: checkpoint sync: number=10 file=base/16424/16428 time=0.001000 msec DEBUG: checkpoint sync: number=11 file=base/16424/16425 time=0.000000 msec DEBUG: checkpoint sync: number=12 file=base/16424/16437_vm time=0.001000 msec DEBUG: checkpoint sync: number=13 file=base/16424/16425_vm time=0.001000 msec LOG: checkpoint complete: wrote 3032 buffers (74.0%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=1.742 s, sync=10.153 s, 
total=37.654 s; sync files=13, longest=10.132 s, average=0.779 s Note here how the checkpoint was hung on trying to get 16448_fsm written out, but the backends were issuing constant competing fsync calls to these other relations. This is very similar to the production case this patch was written to address, which I hadn't been able to share a good example of yet. That's essentially what it looks like, except with the contention going on for minutes instead of seconds. One of the ideas Simon and I had been considering at one point was adding some better de-duplication logic to the fsync absorb code, which I'm reminded by the pattern here might be helpful independently of other improvements. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c index 620b197..501cab8 100644 --- a/src/backend/postmaster/bgwriter.c +++ b/src/backend/postmaster/bgwriter.c @@ -143,8 +143,8 @@ typedef struct static BgWriterShmemStruct *BgWriterShmem; -/* interval for calling AbsorbFsyncRequests in CheckpointWriteDelay */ -#define WRITES_PER_ABSORB 1000 +/* Fraction of fsync absorb queue that needs to be filled before acting */ +#define ABSORB_ACTION_DIVISOR 10 /* * GUC parameters @@ -382,7 +382,7 @@ BackgroundWriterMain(void) /* * Process any requests or signals received recently. */ - AbsorbFsyncRequests(); + AbsorbFsyncRequests(false); if (got_SIGHUP) { @@ -636,7 +636,7 @@ BgWriterNap(void) (ckpt_active ? ImmediateCheckpointRequested() : checkpoint_requested)) break; pg_usleep(1000000L); - AbsorbFsyncRequests(); + AbsorbFsyncRequests(true); udelay -= 1000000L; } @@ -684,8 +684,6 @@ ImmediateCheckpointRequested(void) void CheckpointWriteDelay(int flags, double progress) { - static int absorb_counter = WRITES_PER_ABSORB; - /* Do nothing if checkpoint is being executed by non-bgwriter process */ if (!am_bg_writer) return; @@ -705,22 +703,65 @@ CheckpointWriteDelay(int flags, double progress) ProcessConfigFile(PGC_SIGHUP); } - AbsorbFsyncRequests(); - absorb_counter = WRITES_PER_ABSORB; + AbsorbFsyncRequests(false); BgBufferSync(); CheckArchiveTimeout(); BgWriterNap(); } - else if (--absorb_counter <= 0) + else { /* - * Absorb pending fsync requests after each WRITES_PER_ABSORB write - * operations even when we don't sleep, to prevent overflow of the - * fsync request queue. + * Check for overflow of the fsync request queue. */ - AbsorbFsyncRequests(); - absorb_counter = WRITES_PER_ABSORB; + AbsorbFsyncRequests(false); + } +} + +/* + * CheckpointSyncDelay -- yield control to bgwriter during a checkpoint + * + * This function is called after each file sync performed by mdsync(). + * It is responsible for keeping the bgwriter's normal activities in + * progress during a long checkpoint. + */ +void +CheckpointSyncDelay(void) +{ + pg_time_t now; + pg_time_t sync_start_time; + int sync_delay_secs; + + /* + * Delay after each sync, in seconds. This could be a parameter. But + * since ideally this will be auto-tuning in the near future, not + * assigning it a GUC setting yet. + */ +#define EXTRA_SYNC_DELAY 3 + + /* Do nothing if checkpoint is being executed by non-bgwriter process */ + if (!am_bg_writer) + return; + + sync_start_time = (pg_time_t) time(NULL); + + /* + * Perform the usual bgwriter duties. 
+ */ + for (;;) + { + AbsorbFsyncRequests(false); + BgBufferSync(); + CheckArchiveTimeout(); + BgWriterNap(); + + /* + * Are we there yet? + */ + now = (pg_time_t) time(NULL); + sync_delay_secs = now - sync_start_time; + if (sync_delay_secs >= EXTRA_SYNC_DELAY) + break; } } @@ -1116,16 +1157,41 @@ ForwardFsyncRequest(RelFileNodeBackend rnode, ForkNumber forknum, * non-bgwriter processes, do nothing if not bgwriter. */ void -AbsorbFsyncRequests(void) +AbsorbFsyncRequests(bool force) { BgWriterRequest *requests = NULL; BgWriterRequest *request; int n; + /* + * Divide the size of the request queue by this to determine when + * absorption action needs to be taken. Default here aims to empty the + * queue whenever 1 / 10 = 10% of it is full. If this isn't good enough, + * you probably need to lower bgwriter_delay, rather than presume + * this needs to be a tunable you can decrease. + */ + int absorb_action_divisor = 10; + if (!am_bg_writer) return; /* + * If the queue isn't very large, don't worry about absorbing yet. + * Access integer counter without lock, to avoid queuing. + */ + if (!force && BgWriterShmem->num_requests < + (BgWriterShmem->max_requests / ABSORB_ACTION_DIVISOR)) + { + if (BgWriterShmem->num_requests > 0) + elog(DEBUG1,"Absorb queue: %d fsync requests, not processing", + BgWriterShmem->num_requests); + return; + } + + elog(DEBUG1,"Absorb queue: %d fsync requests, processing", + BgWriterShmem->num_requests); + + /* * We have to PANIC if we fail to absorb all the pending requests (eg, * because our hashtable runs out of memory). This is because the system * cannot run safely if we are unable to fsync what we have been told to @@ -1164,4 +1230,9 @@ AbsorbFsyncRequests(void) pfree(requests); END_CRIT_SECTION(); + + /* + * Send off activity statistics to the stats collector + */ + pgstat_send_bgwriter(); } diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c index cadd938..c89486e 100644 --- a/src/backend/storage/smgr/md.c +++ b/src/backend/storage/smgr/md.c @@ -31,9 +31,6 @@ #include "pg_trace.h" -/* interval for calling AbsorbFsyncRequests in mdsync */ -#define FSYNCS_PER_ABSORB 10 - /* special values for the segno arg to RememberFsyncRequest */ #define FORGET_RELATION_FSYNC (InvalidBlockNumber) #define FORGET_DATABASE_FSYNC (InvalidBlockNumber-1) @@ -926,7 +923,6 @@ mdsync(void) HASH_SEQ_STATUS hstat; PendingOperationEntry *entry; - int absorb_counter; /* Statistics on sync times */ int processed = 0; @@ -951,7 +947,7 @@ mdsync(void) * queued an fsync request before clearing the buffer's dirtybit, so we * are safe as long as we do an Absorb after completing BufferSync(). */ - AbsorbFsyncRequests(); + AbsorbFsyncRequests(true); /* * To avoid excess fsync'ing (in the worst case, maybe a never-terminating @@ -994,7 +990,6 @@ mdsync(void) mdsync_in_progress = true; /* Now scan the hashtable for fsync requests to process */ - absorb_counter = FSYNCS_PER_ABSORB; hash_seq_init(&hstat, pendingOpsTable); while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL) { @@ -1019,17 +1014,9 @@ mdsync(void) int failures; /* - * If in bgwriter, we want to absorb pending requests every so - * often to prevent overflow of the fsync request queue. It is - * unspecified whether newly-added entries will be visited by - * hash_seq_search, but we don't care since we don't need to - * process them anyway. + * If in bgwriter, perform normal duties. 
*/ - if (--absorb_counter <= 0) - { - AbsorbFsyncRequests(); - absorb_counter = FSYNCS_PER_ABSORB; - } + CheckpointSyncDelay(); /* * The fsync table could contain requests to fsync segments that @@ -1121,10 +1108,9 @@ mdsync(void) pfree(path); /* - * Absorb incoming requests and check to see if canceled. + * If in bgwriter, perform normal duties. */ - AbsorbFsyncRequests(); - absorb_counter = FSYNCS_PER_ABSORB; /* might as well... */ + CheckpointSyncDelay(); if (entry->canceled) break; diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h index e251da6..4939604 100644 --- a/src/include/postmaster/bgwriter.h +++ b/src/include/postmaster/bgwriter.h @@ -26,10 +26,11 @@ extern void BackgroundWriterMain(void); extern void RequestCheckpoint(int flags); extern void CheckpointWriteDelay(int flags, double progress); +extern void CheckpointSyncDelay(void); extern bool ForwardFsyncRequest(RelFileNodeBackend rnode, ForkNumber forknum, BlockNumber segno); -extern void AbsorbFsyncRequests(void); +extern void AbsorbFsyncRequests(bool force); extern Size BgWriterShmemSize(void); extern void BgWriterShmemInit(void);
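The fsync-request de-duplication idea mentioned just before the patch above could look roughly like this sketch, with simplified stand-in types rather than the real BgWriterRequest; a real version would presumably use a hash table instead of this O(n^2) scan, but the effect is the same: repeated requests for the same relation segment collapse to one entry, so the absorb queue fills more slowly when a few hot relations are being pounded.

#include <stdbool.h>

typedef struct FsyncRequest
{
    unsigned int rel_oid;     /* stand-in for the RelFileNode identity */
    int          fork;
    unsigned int segno;
} FsyncRequest;

/* Returns the new queue length after removing duplicates (order preserved). */
static int
dedupe_requests(FsyncRequest *q, int n)
{
    int out = 0;
    int i, j;

    for (i = 0; i < n; i++)
    {
        bool seen = false;

        for (j = 0; j < out; j++)
            if (q[j].rel_oid == q[i].rel_oid &&
                q[j].fork == q[i].fork &&
                q[j].segno == q[i].segno)
            {
                seen = true;
                break;
            }
        if (!seen)
            q[out++] = q[i];
    }
    return out;
}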
> Maybe, but it's hard to argue that the current implementation--just > doing all of the sync calls as fast as possible, one after the other--is > going to produce worst-case behavior in a lot of situations. Given that > it's not a huge amount of code to do better, I'd rather do some work in > that direction, instead of presuming the kernel authors will ever make > this go away. Spreading the writes out as part of the checkpoint rework > in 8.3 worked better than any kernel changes I've tested since then, and > I'm not real optimisic about this getting resolved at the system level. > So long as the database changes aren't antagonistic toward kernel > improvements, I'd prefer to have some options here that become effective > as soon as the database code is done. Besides, even if kernel/FS authors did improve things, the improvements would not be available on production platforms for years. And, for that matter, while Linux and BSD are pretty responsive to our feedback, Apple, Microsoft and Oracle are most definitely not. -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
On Sun, Nov 14, 2010 at 3:48 PM, Greg Smith <greg@2ndquadrant.com> wrote:
...
> One change that turned out be necessary rather than optional--to get good
> performance from the system under tuning--was to make regular background
> writer activity, including fsync absorb checks, happen during these sync
> pauses. The existing code ran the checkpoint sync work in a pretty tight
> loop, which as I alluded to in an earlier patch today can lead to the
> backends competing with the background writer to get their sync calls
> executed. This squashes that problem if the background writer is setup
> properly.

Have you tested out this "absorb during syncing phase" code without the sleep between the syncs? I.e. so that it's still a tight loop, but the loop alternates between sync and absorb, with no intentional pause? I wonder if all the improvement you see might not be due entirely to the absorb between syncs, and none or very little from the sleep itself.

I ask because I don't have a mental model of how the pause can help. Given that this dirty data has been hanging around for many minutes already, what is a 3 second pause going to heal? The healing power of clearing out the absorb queue seems much more obvious.

Cheers, Jeff
Jeff Janes wrote: > Have you tested out this "absorb during syncing phase" code without > the sleep between the syncs? > I.e. so that it still a tight loop, but the loop alternates between > sync and absorb, with no intentional pause? > Yes; that's how it was developed. It helped to have just the extra absorb work without the pauses, but that alone wasn't enough to really improve things on the server we ran into this problem badly on. > I ask because I don't have a mental model of how the pause can help. > Given that this dirty data has been hanging around for many minutes > already, what is a 3 second pause going to heal? > The difference is that once an fsync call is made, dirty data is much more likely to be forced out. It's the one thing that bypasses all other ways the kernel might try to avoid writing the data--both the dirty ratio guidelines and the congestion control logic--and forces those writes to happen as soon as they can be scheduled. If you graph the amount of data shown "Dirty:" by /proc/meminfo over time, once the sync calls start happening it's like a descending staircase pattern, dropping a little bit as each sync fires. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
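That staircase is easy to watch for yourself. As a minimal monitoring sketch--Linux-specific, and not part of any patch in this thread--you can just poll the Dirty: line of /proc/meminfo once a second while a checkpoint's sync phase runs:

/*
 * Print the kernel's count of dirty page cache data once a second, to
 * watch for the "descending staircase" as each fsync fires.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
	for (;;)
	{
		FILE	   *f = fopen("/proc/meminfo", "r");
		char		line[128];

		if (f == NULL)
			return 1;
		while (fgets(line, sizeof(line), f) != NULL)
		{
			if (strncmp(line, "Dirty:", 6) == 0)
				fputs(line, stdout);	/* e.g. "Dirty:   123456 kB" */
		}
		fclose(f);
		sleep(1);
	}
	return 0;
}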
On 01.12.2010 06:25, Greg Smith wrote: > Jeff Janes wrote: >> I ask because I don't have a mental model of how the pause can help. >> Given that this dirty data has been hanging around for many minutes >> already, what is a 3 second pause going to heal? > > The difference is that once an fsync call is made, dirty data is much > more likely to be forced out. It's the one thing that bypasses all other > ways the kernel might try to avoid writing the data--both the dirty > ratio guidelines and the congestion control logic--and forces those > writes to happen as soon as they can be scheduled. If you graph the > amount of data shown "Dirty:" by /proc/meminfo over time, once the sync > calls start happening it's like a descending staircase pattern, dropping > a little bit as each sync fires. Do you have any idea how to autotune the delay between fsyncs? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas wrote: > Do you have any idea how to autotune the delay between fsyncs? I'm thinking to start by counting the number of relations that need them at the beginning of the checkpoint. Then use the same basic math that drives the spread writes, where you assess whether you're on schedule or not based on segment/time progress relative to how many have been sync'd out of that total. At a high level I think that idea translates over almost directly into the existing write spread code. Was hoping for a sanity check from you in particular about whether that seems reasonable or not before diving into the coding. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
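As a rough illustration of the pacing math being proposed--the names below are hypothetical and nothing like this exists in the posted patch--the check has the same shape as the existing write-phase throttle: compare the fraction of files already synced against the fraction of the sync window that has elapsed, and only nap when ahead of schedule:

/*
 * Hypothetical sketch of a sync-phase schedule check.  files_synced of
 * files_total have had their fsync issued; the sync phase started at
 * "start" and should finish by "deadline".  Returns true when it is
 * safe to sleep before issuing the next fsync.
 */
#include <stdbool.h>
#include <time.h>

static bool
sync_ahead_of_schedule(int files_synced, int files_total,
					   time_t start, time_t deadline)
{
	double		done = (double) files_synced / files_total;
	double		elapsed = (double) (time(NULL) - start) / (deadline - start);

	return done > elapsed;		/* ahead of schedule: OK to nap briefly */
}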
On 01.12.2010 23:30, Greg Smith wrote: > Heikki Linnakangas wrote: >> Do you have any idea how to autotune the delay between fsyncs? > > I'm thinking to start by counting the number of relations that need them > at the beginning of the checkpoint. Then use the same basic math that > drives the spread writes, where you assess whether you're on schedule or > not based on segment/time progress relative to how many have been sync'd > out of that total. At a high level I think that idea translates over > almost directly into the existing write spread code. Was hoping for a > sanity check from you in particular about whether that seems reasonable > or not before diving into the coding. Sounds reasonable to me. fsync()s are a lot less uniform than write()s, though. If you fsync() a file with one dirty page in it, it's going to return very quickly, but a 1GB file will take a while. That could be problematic if you have a thousand small files and a couple of big ones, as you would want to reserve more time for the big ones. I'm not sure what to do about it, maybe it's not a problem in practice. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Wed, Dec 1, 2010 at 4:25 AM, Greg Smith <greg@2ndquadrant.com> wrote: >> I ask because I don't have a mental model of how the pause can help. >> Given that this dirty data has been hanging around for many minutes >> already, what is a 3 second pause going to heal? >> > > The difference is that once an fsync call is made, dirty data is much more > likely to be forced out. It's the one thing that bypasses all other ways > the kernel might try to avoid writing the data I had always assumed the problem was that other I/O had been done to the files in the meantime. I.e. the fsync is not just syncing the checkpoint but any other blocks that had been flushed since the checkpoint started. The longer the checkpoint is spread over, the more other I/O is included as well. Using sync_file_range you can specify the set of blocks to sync and then block on them only after some time has passed. But there's no documentation on how this relates to the I/O scheduler so it's not clear it would have any effect on the problem. We might still have to delay the beginning of the sync to allow the dirty blocks to be synced naturally and then when we issue it still end up catching a lot of other i/o as well. -- greg
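For reference, the call being discussed is Linux-only (glibc exposes it under _GNU_SOURCE), and the two-pass usage described above looks roughly like this; the split into an early non-blocking pass and a later waiting pass is the pattern under discussion, not something any posted patch does yet. Note that sync_file_range() does not force out file metadata or the drive's write cache, which is part of why it cannot simply replace fsync():

#define _GNU_SOURCE
#include <fcntl.h>

/* Kick off writeback for the whole file without waiting on it. */
static void
start_writeback(int fd)
{
	(void) sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);
}

/* Later, wait until all dirty pages in the file have been written out. */
static void
finish_writeback(int fd)
{
	(void) sync_file_range(fd, 0, 0,
						   SYNC_FILE_RANGE_WAIT_BEFORE |
						   SYNC_FILE_RANGE_WRITE |
						   SYNC_FILE_RANGE_WAIT_AFTER);
}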
> Using sync_file_range you can specify the set of blocks to sync and > then block on them only after some time has passed. But there's no > documentation on how this relates to the I/O scheduler so it's not > clear it would have any effect on the problem. We might still have to > delay the begining of the sync to allow the dirty blocks to be synced > naturally and then when we issue it still end up catching a lot of > other i/o as well. This *really* sounds like we should be working with the FS geeks on making the OS do this work for us. Greg, you wanna go to LinuxCon next year? -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
On Thu, Dec 2, 2010 at 2:24 PM, Greg Stark <gsstark@mit.edu> wrote: > On Wed, Dec 1, 2010 at 4:25 AM, Greg Smith <greg@2ndquadrant.com> wrote: >>> I ask because I don't have a mental model of how the pause can help. >>> Given that this dirty data has been hanging around for many minutes >>> already, what is a 3 second pause going to heal? >>> >> >> The difference is that once an fsync call is made, dirty data is much more >> likely to be forced out. It's the one thing that bypasses all other ways >> the kernel might try to avoid writing the data > > I had always assumed the problem was that other I/O had been done to > the files in the meantime. I.e. the fsync is not just syncing the > checkpoint but any other blocks that had been flushed since the > checkpoint started. It strikes me that we might start the syncs of the files that the checkpoint isn't going to dirty further at the start of the checkpoint, and do the rest at the end. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Greg Stark wrote: > Using sync_file_range you can specify the set of blocks to sync and > then block on them only after some time has passed. But there's no > documentation on how this relates to the I/O scheduler so it's not > clear it would have any effect on the problem. I believe this is the exact spot we're stalled at in regards to getting this improved on the Linux side, as I understand it at least. *The* answer for this class of problem on Linux is to use sync_file_range, and I don't think we'll ever get any sympathy from those kernel developers until we do. But that's a Linux specific call, so doing that is going to add a write path fork with platform-specific code into the database. If I thought sync_file_range was a silver bullet guaranteed to make this better, maybe I'd go for that. I think there's some relatively low-hanging fruit on the database side that would do better before going to that extreme though, thus the patch. > We might still have to delay the begining of the sync to allow the dirty blocks to be synced > naturally and then when we issue it still end up catching a lot of > other i/o as well. > Whether it's "lots" or not is really workload dependent. I work from the assumption that the blocks being written out by the checkpoint are the most popular ones in the database, the ones that accumulate a high usage count and stay there. If that's true, my guess is that the writes being done while the checkpoint is executing are a bit less likely to be touching the same files. You raise a valid concern, I just haven't seen that actually happen in practice yet. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us
Heikki Linnakangas wrote: > If you fsync() a file with one dirty page in it, it's going to return > very quickly, but a 1GB file will take a while. That could be > problematic if you have a thousand small files and a couple of big > ones, as you would want to reserve more time for the big ones. I'm not > sure what to do about it, maybe it's not a problem in practice. It's a problem in practice all right, with the bulk-loading situation being the main one where you'll hit it. If somebody is running a giant COPY to populate a table at the time the checkpoint starts, there's probably a 1GB file of dirty data that's unsynced around there somewhere. I think doing anything about that situation requires an additional leap in thinking about buffer cache eviction and fsync absorption though. Ultimately I think we'll end up doing sync calls for relations that have gone "cold" for a while all the time as part of BGW activity, not just at checkpoint time, to try and avoid this whole area better. That's a lot more than I'm trying to do in my first pass of improvements though. In the interest of cutting the number of messy items left in the official CommitFest, I'm going to mark my patch here "Returned with Feedback" and continue working in the general direction I was already going. Concept shared, underlying patches continue to advance, good discussion around it; those were my goals for this CF and I think we're there. I have a good idea how to autotune the sync spread that's hardcoded in the current patch. I'll work on finishing that up and organizing some more extensive performance tests. Right now I'm more concerned about finishing the tests around the wal_sync_method issues, which are related to this and need to get sorted out a bit more urgently. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us
On Sun, Dec 5, 2010 at 2:53 PM, Greg Smith <greg@2ndquadrant.com> wrote: > Heikki Linnakangas wrote: >> >> If you fsync() a file with one dirty page in it, it's going to return very >> quickly, but a 1GB file will take a while. That could be problematic if you >> have a thousand small files and a couple of big ones, as you would want to >> reserve more time for the big ones. I'm not sure what to do about it, maybe >> it's not a problem in practice. > > It's a problem in practice allright, with the bulk-loading situation being > the main one you'll hit it. If somebody is running a giant COPY to populate > a table at the time the checkpoint starts, there's probably a 1GB file of > dirty data that's unsynced around there somewhere. I think doing anything > about that situation requires an additional leap in thinking about buffer > cache evicition and fsync absorption though. Ultimately I think we'll end > up doing sync calls for relations that have gone "cold" for a while all the > time as part of BGW activity, not just at checkpoint time, to try and avoid > this whole area better. That's a lot more than I'm trying to do in my first > pass of improvements though. > > In the interest of cutting the number of messy items left in the official > CommitFest, I'm going to mark my patch here "Returned with Feedback" and > continue working in the general direction I was already going. Concept > shared, underlying patches continue to advance, good discussion around it; > those were my goals for this CF and I think we're there. > > I have a good idea how to autotune the sync spread that's hardcoded in the > current patch. I'll work on finishing that up and organizing some more > extensive performance tests. Right now I'm more concerned about finishing > the tests around the wal_sync_method issues, which are related to this and > need to get sorted out a bit more urgently. > > -- > Greg Smith 2ndQuadrant US greg@2ndquadrant.com Baltimore, MD > PostgreSQL Training, Services and Support www.2ndQuadrant.us > Forgive me, but is all of this a step on the slippery slope to direct io? And is this a bad thing? -- Rob Wultsch wultsch@gmail.com
Rob Wultsch wrote: > Forgive me, but is all of this a step on the slippery slope to > direct io? And is this a bad thing I don't really think so. There's an important difference in my head between direct I/O, where the kernel is told "write this immediately!", and what I'm trying to achieve. I want to give the kernel an opportunity to write blocks out in an efficient way, so that it can take advantage of elevator sorting, write combining, and similar tricks. But, eventually, those writes have to make it out to disk. Linux claims to have concepts like a "deadline" for I/O to happen, but they turn out to not be so effective once the system gets backed up with enough writes. Since fsync time is the only effective deadline, I'm progressing from the standpoint that adjusting when it happens relative to the write will help, while still allowing the kernel an opportunity to get the writes out on its own schedule. What ends up happening if you push toward fully sync I/O is the design you see in some other databases, where you need multiple writer processes. Then requests for new pages can continue to allocate as needed, while keeping any one write from blocking things. That's one sort of a way to simulate asynchronous I/O, and you can substitute true async I/O instead in many of those implementations. We didn't have much luck with portability on async I/O when that was last experimented with, and having multiple background writer processes seems like overkill; that whole direction worries me. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us
Excerpts from Greg Smith's message of dom dic 05 20:02:48 -0300 2010: > When ends up happening if you push toward fully sync I/O is the design > you see in some other databases, where you need multiple writer > processes. Then requests for new pages can continue to allocate as > needed, while keeping any one write from blocking things. That's one > sort of a way to simulate asynchronous I/O, and you can substitute true > async I/O instead in many of those implementations. We didn't have much > luck with portability on async I/O when that was last experimented with, > and having multiple background writer processes seems like overkill; > that whole direction worries me. Why would multiple bgwriter processes worry you? Of course, it wouldn't work to have multiple processes trying to execute a checkpoint simultaneously, but what if we separated the tasks so that one process is in charge of checkpoints, and another one is in charge of the LRU scan? -- Álvaro Herrera <alvherre@commandprompt.com> The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera wrote: > Why would multiple bgwriter processes worry you? > > Of course, it wouldn't work to have multiple processes trying to execute > a checkpoint simultaneously, but what if we separated the tasks so that > one process is in charge of checkpoints, and another one is in charge of > the LRU scan? > I was commenting more in the context of development resource allocation. Moving toward that design would be helpful, but it alone isn't enough to improve the checkpoint sync issues. My concern is that putting work into that area will be a distraction from making progress on those. If individual syncs take so long that the background writer gets lost for a while executing them, and therefore doesn't do LRU cleanup, you've got a problem that LRU-related improvements probably aren't enough to solve. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
On Mon, 2010-12-06 at 23:26 -0300, Alvaro Herrera wrote: > Why would multiple bgwriter processes worry you? Because it complicates the tracking of files requiring fsync. As Greg says, the last attempt to do that was a lot of code. -- Simon Riggs http://www.2ndQuadrant.com/books/PostgreSQL Development, 24x7 Support, Training and Services
On Tue, Nov 30, 2010 at 3:29 PM, Greg Smith <greg@2ndquadrant.com> wrote: > Having the pg_stat_bgwriter.buffers_backend_fsync patch available all the > time now has made me reconsider how important one potential bit of > refactoring here would be. I managed to catch one of the situations where > really popular relations were being heavily updated in a way that was > competing with the checkpoint on my test system (which I can happily share > the logs of), with the instrumentation patch applied but not the spread sync > one: > > LOG: checkpoint starting: xlog > DEBUG: could not forward fsync request because request queue is full > CONTEXT: writing block 7747 of relation base/16424/16442 > DEBUG: could not forward fsync request because request queue is full > CONTEXT: writing block 42688 of relation base/16424/16437 > DEBUG: could not forward fsync request because request queue is full > CONTEXT: writing block 9723 of relation base/16424/16442 > DEBUG: could not forward fsync request because request queue is full > CONTEXT: writing block 58117 of relation base/16424/16437 > DEBUG: could not forward fsync request because request queue is full > CONTEXT: writing block 165128 of relation base/16424/16437 > [330 of these total, all referring to the same two relations] > > DEBUG: checkpoint sync: number=1 file=base/16424/16448_fsm > time=10132.830000 msec > DEBUG: checkpoint sync: number=2 file=base/16424/11645 time=0.001000 msec > DEBUG: checkpoint sync: number=3 file=base/16424/16437 time=7.796000 msec > DEBUG: checkpoint sync: number=4 file=base/16424/16448 time=4.679000 msec > DEBUG: checkpoint sync: number=5 file=base/16424/11607 time=0.001000 msec > DEBUG: checkpoint sync: number=6 file=base/16424/16437.1 time=3.101000 msec > DEBUG: checkpoint sync: number=7 file=base/16424/16442 time=4.172000 msec > DEBUG: checkpoint sync: number=8 file=base/16424/16428_vm time=0.001000 > msec > DEBUG: checkpoint sync: number=9 file=base/16424/16437_fsm time=0.001000 > msec > DEBUG: checkpoint sync: number=10 file=base/16424/16428 time=0.001000 msec > DEBUG: checkpoint sync: number=11 file=base/16424/16425 time=0.000000 msec > DEBUG: checkpoint sync: number=12 file=base/16424/16437_vm time=0.001000 > msec > DEBUG: checkpoint sync: number=13 file=base/16424/16425_vm time=0.001000 > msec > LOG: checkpoint complete: wrote 3032 buffers (74.0%); 0 transaction log > file(s) added, 0 removed, 0 recycled; write=1.742 s, sync=10.153 s, > total=37.654 s; sync files=13, longest=10.132 s, average=0.779 s > > Note here how the checkpoint was hung on trying to get 16448_fsm written > out, but the backends were issuing constant competing fsync calls to these > other relations. This is very similar to the production case this patch was > written to address, which I hadn't been able to share a good example of yet. > That's essentially what it looks like, except with the contention going on > for minutes instead of seconds. > > One of the ideas Simon and I had been considering at one point was adding > some better de-duplication logic to the fsync absorb code, which I'm > reminded by the pattern here might be helpful independently of other > improvements. Hopefully I'm not stepping on any toes here, but I thought this was an awfully good idea and had a chance to take a look at how hard it would be today while en route from point A to point B. The answer turned out to be "not very", so PFA a patch that seems to work. 
I tested it by attaching gdb to the background writer while running pgbench, and it eliminated the backend fsyncs without even breaking a sweat. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas wrote: > On Tue, Nov 30, 2010 at 3:29 PM, Greg Smith <greg@2ndquadrant.com> wrote: >> One of the ideas Simon and I had been considering at one point was adding >> some better de-duplication logic to the fsync absorb code, which I'm >> reminded by the pattern here might be helpful independently of other >> improvements. > > Hopefully I'm not stepping on any toes here, but I thought this was an > awfully good idea and had a chance to take a look at how hard it would > be today while en route from point A to point B. The answer turned > out to be "not very", so PFA a patch that seems to work. I tested it > by attaching gdb to the background writer while running pgbench, and > it eliminate the backend fsyncs without even breaking a sweat. No toe damage, this is great, I hadn't gotten to coding for this angle yet at all. Suffering from an overload of ideas and (mostly wasted) test data, so thanks for exploring this concept and proving it works. I'm not sure what to do with the rest of the work I've been doing in this area here, so I'm tempted to just combine this new bit from you with the older patch I submitted, streamline, and see if that makes sense. Expected to be there already, then "how about spending 5 minutes first checking out that autovacuum lock patch again" turned out to be a wild underestimate. Part of the problem is that it's become obvious to me the last month that right now is a terrible time to be doing Linux benchmarks that impact filesystem sync behavior. The recent kernel changes that are showing in the next rev of the enterprise distributions--like RHEL6 and Debian Squeeze both working to get a stable 2.6.32--have made testing a nightmare. I don't want to dump a lot of time into optimizing for <2.6.32 if this problem changes its form in newer kernels, but the distributions built around newer kernels are just not fully baked enough yet to tell. For example, the pre-release Squeeze numbers we're seeing are awful so far, but it's not really done yet either. I expect 3-6 months from today, that all will have settled down enough that I can make some sense of it. Lately my work with the latest distributions has just been burning time installing stuff that doesn't work quite right yet. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
On Sat, Jan 15, 2011 at 5:47 AM, Greg Smith <greg@2ndquadrant.com> wrote: > No toe damage, this is great, I hadn't gotten to coding for this angle yet > at all. Suffering from an overload of ideas and (mostly wasted) test data, > so thanks for exploring this concept and proving it works. Yeah - obviously I want to make sure that someone reviews the logic carefully, since a loss of fsyncs or a corruption of the request queue could affect system stability, but only very rarely, since you'd need full fsync queue + crash. But the code is pretty simple, so it should be possible to convince ourselves as to its correctness (or otherwise). Obviously, major credit to you and Simon for identifying the problem and coming up with a proposed fix. > I'm not sure what to do with the rest of the work I've been doing in this > area here, so I'm tempted to just combine this new bit from you with the > older patch I submitted, streamline, and see if that makes sense. Expected > to be there already, then "how about spending 5 minutes first checking out > that autovacuum lock patch again" turned out to be a wild underestimate. I'd rather not combine the patches, because this one is pretty simple and just does one thing, but feel free to write something that applies over top of it. Looking through your old patch (sync-spread-v3), there seem to be a couple of components there: - Compact the fsync queue based on percentage fill rather than number of writes per absorb. If we apply my queue-compacting logic, do we still need this? The queue compaction may hold the BgWriterCommLock for slightly longer than AbsorbFsyncRequests() would, but I'm not inclined to jump to the conclusion that this is worth getting excited about. The whole idea of accessing BgWriterShmem->num_requests without the lock gives me the willies anyway - sure, it'll probably work OK most of the time, especially on x86, but it seems hard to predict whether there will be occasional bad behavior on platforms with weak memory ordering. - Call pgstat_send_bgwriter() at the end of AbsorbFsyncRequests(). Not sure what the motivation for this is. - CheckpointSyncDelay(), to make sure that we absorb fsync requests and free up buffers during a long checkpoint. I think this part is clearly valuable, although I'm not sure we've satisfactorily solved the problem of how to spread out the fsyncs and still complete the checkpoint on schedule. As to that, I have a couple of half-baked ideas I'll throw out so you can laugh at them. Some of these may be recycled versions of ideas you've already had/mentioned, so, again, credit to you for getting the ball rolling. Idea #1: When we absorb fsync requests, don't just remember that there was an fsync request; also remember the time of said fsync request. If a new fsync request arrives for a segment for which we're already remembering an fsync request, update the timestamp on the request. Periodically scan the fsync request queue for requests older than, say, 30 s, and perform one such request. The idea is - if we wrote a bunch of data to a relation and then haven't touched it for a while, force it out to disk before the checkpoint actually starts so that the volume of work required by the checkpoint is lessened. Idea #2: At the beginning of a checkpoint when we scan all the buffers, count the number of buffers that need to be synced for each relation. Use the same hashtable that we use for tracking pending fsync requests. Then, interleave the writes and the fsyncs. 
Start by performing any fsyncs that need to happen but have no buffers to sync (i.e. everything that must be written to that relation has already been written). Then, start performing the writes, decrementing the pending-write counters as you go. If the pending-write count for a relation hits zero, you can add it to the list of fsyncs that can be performed before the writes are finished. If it doesn't hit zero (perhaps because a non-bgwriter process wrote a buffer that we were going to write anyway), then we'll do it at the end. One problem with this - aside from complexity - is that most likely most fsyncs would either happen at the beginning or very near the end, because there's no reason to assume that buffers for the same relation would be clustered together in shared_buffers. But I'm inclined to think that at least the idea of performing fsyncs for which no dirty buffers remain in shared_buffers at the beginning of the checkpoint rather than at the end might have some value. Idea #3: Stick with the idea of a fixed delay between fsyncs, but compute how many fsyncs you think you're ultimately going to need at the start of the checkpoint, and back up the target completion time by 3 s per fsync from the get-go, so that the checkpoint still finishes on schedule. Idea #4: For ext3 filesystems that like to dump the entire buffer cache instead of only the requested file, write a little daemon that runs alongside of (and completely independently of) PostgreSQL. Every 30 s, it opens a 1-byte file, changes the byte, fsyncs the file, and closes the file, thus dumping the cache and preventing a ridiculous growth in the amount of data to be sync'd at checkpoint time. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
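To make Idea #1 concrete, here is a hedged sketch of the bookkeeping it implies; the struct and field names are made up for illustration and do not match the real pendingOpsTable entries:

#include <stdbool.h>
#include <time.h>

/* Illustrative only: one remembered fsync request per relation segment. */
typedef struct PendingSync
{
	unsigned int	rel_oid;		/* stand-in for the real file identity */
	unsigned int	segno;
	time_t			last_request;	/* bumped whenever another request arrives */
} PendingSync;

#define SYNC_IF_IDLE_SECS 30

/* Called whenever another fsync request is absorbed for this segment. */
static void
note_request(PendingSync *p)
{
	p->last_request = time(NULL);
}

/* A segment nobody has written to for a while can be synced early. */
static bool
ready_for_early_sync(const PendingSync *p, time_t now)
{
	return (now - p->last_request) >= SYNC_IF_IDLE_SECS;
}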
On Sat, 2011-01-15 at 05:47 -0500, Greg Smith wrote: > Robert Haas wrote: > > On Tue, Nov 30, 2010 at 3:29 PM, Greg Smith <greg@2ndquadrant.com> wrote: > > > > > One of the ideas Simon and I had been considering at one point was adding > > > some better de-duplication logic to the fsync absorb code, which I'm > > > reminded by the pattern here might be helpful independently of other > > > improvements. > > > > > > > Hopefully I'm not stepping on any toes here, but I thought this was an > > awfully good idea and had a chance to take a look at how hard it would > > be today while en route from point A to point B. The answer turned > > out to be "not very", so PFA a patch that seems to work. I tested it > > by attaching gdb to the background writer while running pgbench, and > > it eliminate the backend fsyncs without even breaking a sweat. > > > > No toe damage, this is great, I hadn't gotten to coding for this angle > yet at all. Suffering from an overload of ideas and (mostly wasted) > test data, so thanks for exploring this concept and proving it works. No toe damage either, but are we sure we want the de-duplication logic and in this place? I was originally of the opinion that de-duplicating the list would save time in the bgwriter, but that guess was wrong by about two orders of magnitude, IIRC. The extra time in the bgwriter wasn't even noticeable. -- Simon Riggs http://www.2ndQuadrant.com/books/PostgreSQL Development, 24x7 Support, Training and Services
On Sat, Jan 15, 2011 at 8:55 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > On Sat, 2011-01-15 at 05:47 -0500, Greg Smith wrote: >> Robert Haas wrote: >> > On Tue, Nov 30, 2010 at 3:29 PM, Greg Smith <greg@2ndquadrant.com> wrote: >> > >> > > One of the ideas Simon and I had been considering at one point was adding >> > > some better de-duplication logic to the fsync absorb code, which I'm >> > > reminded by the pattern here might be helpful independently of other >> > > improvements. >> > > >> > >> > Hopefully I'm not stepping on any toes here, but I thought this was an >> > awfully good idea and had a chance to take a look at how hard it would >> > be today while en route from point A to point B. The answer turned >> > out to be "not very", so PFA a patch that seems to work. I tested it >> > by attaching gdb to the background writer while running pgbench, and >> > it eliminate the backend fsyncs without even breaking a sweat. >> > >> >> No toe damage, this is great, I hadn't gotten to coding for this angle >> yet at all. Suffering from an overload of ideas and (mostly wasted) >> test data, so thanks for exploring this concept and proving it works. > > No toe damage either, but are we sure we want the de-duplication logic > and in this place? > > I was originally of the opinion that de-duplicating the list would save > time in the bgwriter, but that guess was wrong by about two orders of > magnitude, IIRC. The extra time in the bgwriter wasn't even noticeable. Well, the point of this is not to save time in the bgwriter - I'm not surprised to hear that wasn't noticeable. The point is that when the fsync request queue fills up, backends start performing an fsync *for every block they write*, and that's about as bad for performance as it's possible to be. So it's worth going to a little bit of trouble to try to make sure it doesn't happen. It didn't happen *terribly* frequently before, but it does seem to be common enough to worry about - e.g. on one occasion, I was able to reproduce it just by running pgbench -i -s 25 or something like that on a laptop. With this patch applied, there's no performance impact vs. current code in the very, very common case where space remains in the queue - 999 times out of 1000, writing to the fsync queue will be just as fast as ever. But in the unusual case where the queue has been filled up, compacting the queue is much much faster than performing an fsync, and the best part is that the compaction is generally massive. I was seeing things like "4096 entries compressed to 14". So clearly even if the compaction took as long as the fsync itself it would be worth it, because the next 4000+ guys who come along again go through the fast path. But in fact I think it's much faster than an fsync. In order to get pathological behavior even with this patch applied, you'd need to have NBuffers pending fsync requests and they'd all have to be different. I don't think that's theoretically impossible, but Greg's research seems to indicate that even on busy systems we don't come even a little bit close to the circumstances that would cause it to occur in practice. Every other change we might make in this area will further improve this case, too: for example, doing an absorb after each fsync would presumably help, as would the more drastic step of splitting the bgwriter into two background processes (one to do background page cleaning, and the other to do checkpoints, for example). 
But even without those sorts of changes, I think this is enough to effectively eliminate the full fsync queue problem in practice, which seems worth doing independently of anything else. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
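What "compacting the queue" amounts to is sketched below; this is only a stand-in for the real patch, which works under the communication lock and uses a hash table rather than this quadratic scan:

#include <stdbool.h>

/* Stand-in for the shared-memory fsync request entries. */
typedef struct Request
{
	unsigned int	rel_oid;
	int				forknum;
	unsigned int	segno;
} Request;

/*
 * Drop duplicate requests in place, keeping the first occurrence of each,
 * and return the new queue length.  With thousands of requests against a
 * handful of files, this typically shrinks the queue by orders of magnitude.
 */
static int
compact_request_queue(Request *req, int n)
{
	int		kept = 0;
	int		i, j;

	for (i = 0; i < n; i++)
	{
		bool	dup = false;

		for (j = 0; j < kept; j++)
		{
			if (req[i].rel_oid == req[j].rel_oid &&
				req[i].forknum == req[j].forknum &&
				req[i].segno == req[j].segno)
			{
				dup = true;
				break;
			}
		}
		if (!dup)
			req[kept++] = req[i];
	}
	return kept;
}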
Robert Haas wrote: > Idea #2: At the beginning of a checkpoint when we scan all the > buffers, count the number of buffers that need to be synced for each > relation. Use the same hashtable that we use for tracking pending > fsync requests. Then, interleave the writes and the fsyncs... > > Idea #3: Stick with the idea of a fixed delay between fsyncs, but > compute how many fsyncs you think you're ultimately going to need at > the start of the checkpoint, and back up the target completion time by > 3 s per fsync from the get-go, so that the checkpoint still finishes > on schedule. > What I've been working on is something halfway between these two ideas. I have a patch, and it doesn't work right yet because I just broke it, but since I have some faint hope this will all come together any minute now I'm going to share it before someone announces a deadline has passed or something. (whistling). I'm going to add this messy thing and the patch you submitted upthread to the CF list; I'll review yours, I'll either fix the remaining problem in this one myself or rewrite to one of your ideas, and then it's onto a round of benchmarking. Once upon a time we got a patch from Itagaki Takahiro whose purpose was to sort writes before sending them out: http://archives.postgresql.org/pgsql-hackers/2007-06/msg00541.php This didn't really work repeatedly for everyone because of the now well understood ext3 issues--I never replicated that speedup at the time for example. And this was before the spread checkpoint code was in 8.3. The hope was that it wasn't really going to be necessary after that anyway. Back to today...instead of something complicated, it struck me that if I just had a count of exactly how many files were involved in each checkpoint, that would be helpful. I could keep the idea of a fixed delay between fsyncs, but just auto-tune that delay amount based on the count. And how do you count the number of unique things in a list? Well, you can always sort them. I thought that if the sorted writes patch got back to functional again, it could serve two purposes. It would group all of the writes for a file together, and if you did the syncs in the same sorted order they would have the maximum odds of discovering the data was already written. So rather than this possible order: table block a 1 b 1 c 1 c 2 b 2 a 2 sync a sync b sync c Which has very low odds of the sync on "a" finishing quickly, we'd get this one: table block a 1 a 2 b 1 b 2 c 1 c 2 sync a sync b sync c Which sure seems like a reasonable way to improve the odds data has been written before the associated sync comes along. Also, I could just traverse the sorted list with some simple logic to count the number of unique files, and then set the delay between fsync writes based on it. In the above, once the list was sorted, easy to just see how many times the table name changes on a linear scan of the sorted data. 3 files, so if the checkpoint target gives me, say, a minute of time to sync them, I can delay 20 seconds between. Simple math, and exactly the sort I used to get reasonable behavior on the busy production system this all started on. There's some unresolved trickiness in the segment-driven checkpoint case, but one thing at a time. So I fixed the bitrot on the old sorted patch, which was fun as it came from before the 8.3 changes. It seemed to work. I then moved the structure it uses to hold the list of buffers to write, the thing that's sorted, into shared memory. 
It's got a predictable maximum size, relying on palloc in the middle of the checkpoint code seems bad, and there's some potential gain from not reallocating it every time through.

Somewhere along the way, it started doing this instead of what I wanted:

BadArgument("!(((header->context) != ((void *)0) && (((((Node*)((header->context)))->type) == T_AllocSetContext))))", File: "mcxt.c", Line: 589)

(that's from initdb, not a good sign)

And it's left me wondering whether this whole idea is a dead end I used up my window of time wandering down. There's good bits in the patch I submitted for the last CF and in the patch you wrote earlier this week. This unfinished patch may be a valuable idea to fit in there too once I fix it, or maybe it's fundamentally flawed and one of the other ideas you suggested (or I have sitting on the potential design list) will work better. There's a patch integration problem that needs to be solved here, but I think almost all the individual pieces are available. I'd hate to see this fail to get integrated now just for lack of time, considering the problem is so serious when you run into it.

-- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books

diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index dadb49d..c8c0f67 100644
*** a/src/backend/storage/buffer/buf_init.c
--- b/src/backend/storage/buffer/buf_init.c
***************
*** 20,25 ****
--- 20,26 ----
  BufferDesc *BufferDescriptors;
  char	   *BufferBlocks;
+ BufAndTag  *BufferTags;
  int32	   *PrivateRefCount;
*************** int32	   *PrivateRefCount;
*** 72,79 ****
  void
  InitBufferPool(void)
  {
! 	bool		foundBufs,
! 				foundDescs;
  
  	BufferDescriptors = (BufferDesc *)
  		ShmemInitStruct("Buffer Descriptors",
--- 73,81 ----
  void
  InitBufferPool(void)
  {
! 	bool		foundBufs;
! 	bool		foundDescs;
! 	bool		foundTags;
  
  	BufferDescriptors = (BufferDesc *)
  		ShmemInitStruct("Buffer Descriptors",
*************** InitBufferPool(void)
*** 83,92 ****
  		ShmemInitStruct("Buffer Blocks",
  						NBuffers * (Size) BLCKSZ, &foundBufs);
  
! 	if (foundDescs || foundBufs)
  	{
! 		/* both should be present or neither */
! 		Assert(foundDescs && foundBufs);
  		/* note: this path is only taken in EXEC_BACKEND case */
  	}
  	else
--- 85,98 ----
  		ShmemInitStruct("Buffer Blocks",
  						NBuffers * (Size) BLCKSZ, &foundBufs);
  
! 	BufferTags = (BufAndTag *)
! 		ShmemInitStruct("Dirty Buffer Tags",
! 						NBuffers * sizeof(BufAndTag), &foundTags);
! 
! 	if (foundDescs || foundBufs || foundTags)
  	{
! 		/* all should be present or none */
! 		Assert(foundDescs && foundBufs && foundTags);
  		/* note: this path is only taken in EXEC_BACKEND case */
  	}
  	else
*************** BufferShmemSize(void)
*** 171,176 ****
--- 177,185 ----
  	/* size of data pages */
  	size = add_size(size, mul_size(NBuffers, BLCKSZ));
  
+ 	/* size of checkpoint buffer tags */
+ 	size = add_size(size, mul_size(NBuffers, sizeof(BufAndTag)));
+ 
  	/* size of stuff controlled by freelist.c */
  	size = add_size(size, StrategyShmemSize());
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 1f89e52..bd779bf 100644
*** a/src/backend/storage/buffer/bufmgr.c
--- b/src/backend/storage/buffer/bufmgr.c
*************** UnpinBuffer(volatile BufferDesc *buf, bo
*** 1158,1163 ****
--- 1158,1181 ----
  	}
  }
  
+ static int
+ bufcmp(const void *a, const void *b)
+ {
+ 	const BufAndTag *lhs = (const BufAndTag *) a;
+ 	const BufAndTag *rhs = (const BufAndTag *) b;
+ 	int			r;
+ 
+ 	r = memcmp(&lhs->tag.rnode, &rhs->tag.rnode, sizeof(lhs->tag.rnode));
+ 	if (r != 0)
+ 		return r;
+ 	if (lhs->tag.blockNum < rhs->tag.blockNum)
+ 		return -1;
+ 	else if (lhs->tag.blockNum > rhs->tag.blockNum)
+ 		return 1;
+ 	else
+ 		return 0;
+ }
+ 
  /*
   * BufferSync -- Write out all dirty buffers in the pool.
   *
*************** static void
*** 1171,1180 ****
  BufferSync(int flags)
  {
  	int			buf_id;
- 	int			num_to_scan;
  	int			num_to_write;
  	int			num_written;
  	int			mask = BM_DIRTY;
  
  	/* Make sure we can handle the pin inside SyncOneBuffer */
  	ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
--- 1189,1202 ----
  BufferSync(int flags)
  {
  	int			buf_id;
  	int			num_to_write;
  	int			num_written;
  	int			mask = BM_DIRTY;
+ 	int			dirty_buf;
+ 	int			dirty_files;
+ 	Oid			last_seen_rel;
+ 	ForkNumber	last_seen_fork;
+ 	BlockNumber last_seen_block;
  
  	/* Make sure we can handle the pin inside SyncOneBuffer */
  	ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
*************** BufferSync(int flags)
*** 1216,1221 ****
--- 1238,1245 ----
  		if ((bufHdr->flags & mask) == mask)
  		{
  			bufHdr->flags |= BM_CHECKPOINT_NEEDED;
+ 			BufferTags[num_to_write].buf_id = buf_id;
+ 			BufferTags[num_to_write].tag = bufHdr->tag;
  			num_to_write++;
  		}
*************** BufferSync(int flags)
*** 1225,1246 ****
  	if (num_to_write == 0)
  		return;					/* nothing to do */
  
  	TRACE_POSTGRESQL_BUFFER_SYNC_START(NBuffers, num_to_write);
  
  	/*
  	 * Loop over all buffers again, and write the ones (still) marked with
! 	 * BM_CHECKPOINT_NEEDED.  In this loop, we start at the clock sweep point
! 	 * since we might as well dump soon-to-be-recycled buffers first.
  	 *
  	 * Note that we don't read the buffer alloc count here --- that should be
  	 * left untouched till the next BgBufferSync() call.
! 	 */
! 	buf_id = StrategySyncStart(NULL, NULL);
! 	num_to_scan = NBuffers;
  	num_written = 0;
! 	while (num_to_scan-- > 0)
! 	{
! 		volatile BufferDesc *bufHdr = &BufferDescriptors[buf_id];
  
  		/*
  		 * We don't need to acquire the lock here, because we're only looking
--- 1249,1307 ----
  	if (num_to_write == 0)
  		return;					/* nothing to do */
  
+ 	/*
+ 	 * Sort the list of buffers to write.  It's then straightforward to
+ 	 * count the approximate number of files involved.  There may be
+ 	 * some small error from buffers that turn out to be skipped below,
+ 	 * but for the purposes the file count is needed that's acceptable.
+ 	 */
+ 	qsort(BufferTags, num_to_write, sizeof(*BufferTags), bufcmp);
+ 
+ 	/*
+ 	 * Count the number of unique node/fork combinations, relying on the
+ 	 * sorted order
+ 	 */
+ 
+ 	/* Initialize with the first entry in the dirty buffer list */
+ 	last_seen_rel = BufferTags[0].tag.rnode.relNode;
+ 	last_seen_fork = BufferTags[0].tag.forkNum;
+ 	last_seen_block = BufferTags[0].tag.blockNum;
+ 	dirty_files = 1;
+ 
+ 	for (dirty_buf = 1; dirty_buf < num_to_write; dirty_buf++)
+ 	{
+ 		if ((last_seen_rel != BufferTags[dirty_buf].tag.rnode.relNode) ||
+ 			(last_seen_fork != BufferTags[dirty_buf].tag.forkNum))
+ 		{
+ 			last_seen_rel=BufferTags[dirty_buf].tag.rnode.relNode;
+ 			last_seen_fork=BufferTags[dirty_buf].tag.forkNum;
+ 			dirty_files++;
+ 		}
+ 	}
+ 
+ 	/*
+ 	 * TODO: This doesn't account for the fact that blocks might span multiple
+ 	 * files within the same relation yet.
+ 	 */
+ 
+ 	elog(DEBUG1, "BufferSync found %d buffers to write involving %d files",
+ 		 num_to_write,dirty_files)
+ 
  	TRACE_POSTGRESQL_BUFFER_SYNC_START(NBuffers, num_to_write);
  
  	/*
  	 * Loop over all buffers again, and write the ones (still) marked with
! 	 * BM_CHECKPOINT_NEEDED.
  	 *
  	 * Note that we don't read the buffer alloc count here --- that should be
  	 * left untouched till the next BgBufferSync() call.
! 	 */
  	num_written = 0;
! 	for (dirty_buf = 0; dirty_buf < num_to_write; dirty_buf++)
! 	{
! 		volatile BufferDesc *bufHdr;
! 		buf_id = BufferTags[dirty_buf].buf_id;
! 		bufHdr = &BufferDescriptors[buf_id];
  
  		/*
  		 * We don't need to acquire the lock here, because we're only looking
*************** BufferSync(int flags)
*** 1263,1282 ****
  			num_written++;
  
  			/*
- 			 * We know there are at most num_to_write buffers with
- 			 * BM_CHECKPOINT_NEEDED set; so we can stop scanning if
- 			 * num_written reaches num_to_write.
- 			 *
- 			 * Note that num_written doesn't include buffers written by
- 			 * other backends, or by the bgwriter cleaning scan.  That
- 			 * means that the estimate of how much progress we've made is
- 			 * conservative, and also that this test will often fail to
- 			 * trigger.  But it seems worth making anyway.
- 			 */
- 			if (num_written >= num_to_write)
- 				break;
- 
- 			/*
  			 * Perform normal bgwriter duties and sleep to throttle our
  			 * I/O rate.
  			 */
--- 1324,1329 ----
*************** BufferSync(int flags)
*** 1284,1292 ****
  							 (double) num_written / num_to_write);
  		}
  	}
- 
- 		if (++buf_id >= NBuffers)
- 			buf_id = 0;
  	}
  
  	/*
--- 1331,1336 ----
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 0652bdf..1c9c910 100644
*** a/src/include/storage/buf_internals.h
--- b/src/include/storage/buf_internals.h
*************** typedef struct sbufdesc
*** 167,175 ****
--- 167,185 ----
  #define LockBufHdr(bufHdr)		SpinLockAcquire(&(bufHdr)->buf_hdr_lock)
  #define UnlockBufHdr(bufHdr)	SpinLockRelease(&(bufHdr)->buf_hdr_lock)
  
+ /*
+  * Checkpoint time mapping between the buffer id values and the associated
+  * buffer tags of dirty buffers to write
+  */
+ typedef struct BufAndTag
+ {
+ 	int			buf_id;
+ 	BufferTag	tag;
+ } BufAndTag;
  
  /* in buf_init.c */
  extern PGDLLIMPORT BufferDesc *BufferDescriptors;
+ extern PGDLLIMPORT BufAndTag *BufferTags;
  
  /* in localbuf.c */
  extern BufferDesc *LocalBufferDescriptors;
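The spacing math described before the patch--3 files and a one-minute sync window giving 20 seconds between fsync calls--reduces to something like this hypothetical helper, which is not part of the patch above:

/*
 * Given the file count from the sorted buffer scan and the seconds
 * reserved for the sync phase, return the pause between fsync calls.
 * E.g. 3 files over a 60 second window => 20 s between syncs.
 */
static double
seconds_between_syncs(int dirty_files, double sync_window_secs)
{
	if (dirty_files < 1)
		return 0.0;
	return sync_window_secs / dirty_files;
}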
On Sat, Jan 15, 2011 at 9:25 AM, Greg Smith <greg@2ndquadrant.com> wrote: > Once upon a time we got a patch from Itagaki Takahiro whose purpose was to > sort writes before sending them out: > > http://archives.postgresql.org/pgsql-hackers/2007-06/msg00541.php Ah, a fine idea! > Which has very low odds of the sync on "a" finishing quickly, we'd get this > one: > > table block > a 1 > a 2 > b 1 > b 2 > c 1 > c 2 > sync a > sync b > sync c > > Which sure seems like a reasonable way to improve the odds data has been > written before the associated sync comes along. I'll believe it when I see it. How about this: a 1 a 2 sync a b 1 b 2 sync b c 1 c 2 sync c Or maybe some variant, where we become willing to fsync a file a certain number of seconds after writing the last block, or when all the writes are done, whichever comes first. It seems to me that it's going to be a bear to figure out what fraction of the checkpoint you've completed if you put all of the syncs at the end, and this whole problem appears to be predicated on the assumption that the OS *isn't* writing out in a timely fashion. Are we sure that postponing the fsync relative to the writes is anything more than wishful thinking? > Also, I could just traverse the sorted list with some simple logic to count > the number of unique files, and then set the delay between fsync writes > based on it. In the above, once the list was sorted, easy to just see how > many times the table name changes on a linear scan of the sorted data. 3 > files, so if the checkpoint target gives me, say, a minute of time to sync > them, I can delay 20 seconds between. Simple math, and exactly the sort I How does the checkpoint target give you any time to sync them? Unless you squeeze the writes together more tightly, but that seems sketchy. > So I fixed the bitrot on the old sorted patch, which was fun as it came from > before the 8.3 changes. It seemed to work. I then moved the structure it > uses to hold the list of buffers to write, the thing that's sorted, into > shared memory. It's got a predictable maximum size, relying on palloc in > the middle of the checkpoint code seems bad, and there's some potential gain > from not reallocating it every time through. Well you don't have to put it in shared memory on account of any of that. You can just hang it on a global variable. > There's good bits in the patch I submitted for the last CF and in the patch > you wrote earlier this week. This unfinished patch may be a valuable idea > to fit in there too once I fix it, or maybe it's fundamentally flawed and > one of the other ideas you suggested (or I have sitting on the potential > design list) will work better. There's a patch integration problem that > needs to be solved here, but I think almost all the individual pieces are > available. I'd hate to see this fail to get integrated now just for lack of > time, considering the problem is so serious when you run into it. Likewise, but committing something half-baked is no good either. I think we're in a position to crush the full-fsync-queue problem flat (my patch should do that, and there are several other obvious things we can do for extra certainty) but the problem of spreading out the fsyncs looks to me like something we don't completely know how to solve. If we can find something that's a modest improvement on the status quo and we can be confident in quickly, good, but I'd rather have 9.1 go out the door on time without fully fixing this than delay the release.
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas wrote: > I'll believe it when I see it. How about this: > > a 1 > a 2 > sync a > b 1 > b 2 > sync b > c 1 > c 2 > sync c > > Or maybe some variant, where we become willing to fsync a file a > certain number of seconds after writing the last block, or when all > the writes are done, whichever comes first. That's going to give worse performance than the current code in some cases. The goal of what's in there now is that you get a sequence like this: a1 b1 a2 [Filesystem writes a1] b2 [Filesystem writes b1] sync a [Only has to write a2] sync b [Only has to write b2] This idea works until you get to where the filesystem write cache is so large that it becomes lazier about writing things. The fundamental idea--push writes out some time before the sync, in hopes the filesystem will get to them before that--is not unsound. On some systems, doing the sync more aggressively than that will be a regression. This approach just breaks down in some cases, and those cases are happening more now because their likelihood scales with total RAM. I don't want to screw the people with smaller systems, who may be getting considerable benefit from the existing sequence. Today's little systems--which are very similar to the high-end ones the spread checkpoint stuff was developed on during 8.3--do get some benefit from it as far as I know. Anyway, now that the ability to get logging on all this stuff went in during the last CF, it's way easier to just set up a random system to run tests in this area than it used to be. Whatever testing does happen should include, say, a 2GB laptop with a single hard drive in it. I think that's the bottom of what is reasonable to consider a target for tweaking write performance on, given hardware 9.1 is likely to be deployed on. > How does the checkpoint target give you any time to sync them? Unless > you squeeze the writes together more tightly, but that seems sketchy. > Obviously the checkpoint target idea needs to be shuffled around some too. I was thinking of making the new default 0.8, and having it split the time in half for write and sync. That will make the write phase close to the speed people are seeing now, at the default of 0.5, while giving some window for spread sync too. The exact way to redistribute that around I'm not so concerned about yet. When I get to where that's the most uncertain thing left I'll benchmark the TPS vs. latency trade-off and see what happens. If the rest of the code is good enough but this just needs to be tweaked, that's a perfect thing to get beta feedback to finalize. > Well you don't have to put it in shared memory on account of any of > that. You can just hang it on a global variable. > Hmm. Because it's so similar to other things being allocated in shared memory, I just automatically pushed it over to there. But you're right; it doesn't need to be that complicated. Nobody is touching it but the background writer. > If we can find something that's a modest improvement on the > status quo and we can be confident in quickly, good, but I'd rather > have 9.1 go out the door on time without fully fixing this than delay > the release. > I'm not somebody who needs to be convinced of that. There are two near commit quality pieces of this out there now: 1) Keep some BGW cleaning and fsync absorption going while sync is happening, rather than starting it and ignoring everything else until it's done. 2) Compact fsync requests when the queue fills If that's all we can get for 9.1, it will still be a major improvement.
I realize I only have a very short period of time to complete a major integration breakthrough on the pieces floating around before the goal here has to drop to something less ambitious. I head to the West Coast for a week on the 23rd; I'll be forced to throw in the towel at that point if I can't get the better ideas we have in pieces here all assembled well by then. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
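As arithmetic, the split sketched earlier in that message comes down to something like the following; the 0.4/0.4 numbers are only the example from the mail, and nothing in the posted patches computes this yet:

/*
 * Hypothetical sketch of splitting the time before the next checkpoint
 * is due (interval_secs) into a write budget and a sync budget, instead
 * of giving checkpoint_completion_target entirely to the writes.
 */
typedef struct CheckpointBudget
{
	double		write_secs;		/* spread the buffer writes over this much time */
	double		sync_secs;		/* then spread the fsync calls over this much */
} CheckpointBudget;

static CheckpointBudget
split_checkpoint_time(double interval_secs)
{
	CheckpointBudget b;

	b.write_secs = 0.4 * interval_secs;
	b.sync_secs = 0.4 * interval_secs;
	/* the remaining 20% is slack, much like today's margin past the target */
	return b;
}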
On Sat, 2011-01-15 at 09:15 -0500, Robert Haas wrote: > On Sat, Jan 15, 2011 at 8:55 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > > On Sat, 2011-01-15 at 05:47 -0500, Greg Smith wrote: > >> Robert Haas wrote: > >> > On Tue, Nov 30, 2010 at 3:29 PM, Greg Smith <greg@2ndquadrant.com> wrote: > >> > > >> > > One of the ideas Simon and I had been considering at one point was adding > >> > > some better de-duplication logic to the fsync absorb code, which I'm > >> > > reminded by the pattern here might be helpful independently of other > >> > > improvements. > >> > > > >> > > >> > Hopefully I'm not stepping on any toes here, but I thought this was an > >> > awfully good idea and had a chance to take a look at how hard it would > >> > be today while en route from point A to point B. The answer turned > >> > out to be "not very", so PFA a patch that seems to work. I tested it > >> > by attaching gdb to the background writer while running pgbench, and > >> > it eliminate the backend fsyncs without even breaking a sweat. > >> > > >> > >> No toe damage, this is great, I hadn't gotten to coding for this angle > >> yet at all. Suffering from an overload of ideas and (mostly wasted) > >> test data, so thanks for exploring this concept and proving it works. > > > > No toe damage either, but are we sure we want the de-duplication logic > > and in this place? > > > > I was originally of the opinion that de-duplicating the list would save > > time in the bgwriter, but that guess was wrong by about two orders of > > magnitude, IIRC. The extra time in the bgwriter wasn't even noticeable. > > Well, the point of this is not to save time in the bgwriter - I'm not > surprised to hear that wasn't noticeable. The point is that when the > fsync request queue fills up, backends start performing an fsync *for > every block they write*, and that's about as bad for performance as > it's possible to be. So it's worth going to a little bit of trouble > to try to make sure it doesn't happen. It didn't happen *terribly* > frequently before, but it does seem to be common enough to worry about > - e.g. on one occasion, I was able to reproduce it just by running > pgbench -i -s 25 or something like that on a laptop. > > With this patch applied, there's no performance impact vs. current > code in the very, very common case where space remains in the queue - > 999 times out of 1000, writing to the fsync queue will be just as fast > as ever. But in the unusual case where the queue has been filled up, > compacting the queue is much much faster than performing an fsync, and > the best part is that the compaction is generally massive. I was > seeing things like "4096 entries compressed to 14". So clearly even > if the compaction took as long as the fsync itself it would be worth > it, because the next 4000+ guys who come along again go through the > fast path. But in fact I think it's much faster than an fsync. > > In order to get pathological behavior even with this patch applied, > you'd need to have NBuffers pending fsync requests and they'd all have > to be different. I don't think that's theoretically impossible, but > Greg's research seems to indicate that even on busy systems we don't > come even a little bit close to the circumstances that would cause it > to occur in practice. 
Every other change we might make in this area > will further improve this case, too: for example, doing an absorb > after each fsync would presumably help, as would the more drastic step > of splitting the bgwriter into two background processes (one to do > background page cleaning, and the other to do checkpoints, for > example). But even without those sorts of changes, I think this is > enough to effectively eliminate the full fsync queue problem in > practice, which seems worth doing independently of anything else. You've persuaded me. -- Simon Riggs http://www.2ndQuadrant.com/books/PostgreSQL Development, 24x7 Support, Training and Services
On Sat, Jan 15, 2011 at 10:31 AM, Greg Smith <greg@2ndquadrant.com> wrote: > That's going to give worse performance than the current code in some cases. OK. >> How does the checkpoint target give you any time to sync them? Unless >> you squeeze the writes together more tightly, but that seems sketchy. > > Obviously the checkpoint target idea needs to be shuffled around some too. > I was thinking of making the new default 0.8, and having it split the time > in half for write and sync. That will make the write phase close to the > speed people are seeing now, at the default of 0.5, while giving some window > for spread sync too. The exact way to redistribute that around I'm not so > concerned about yet. When I get to where that's the most uncertain thing > left I'll benchmark the TPS vs. latency trade-off and see what happens. If > the rest of the code is good enough but this just needs to be tweaked, > that's a perfect thing to get beta feedback to finalize. That seems like a bad idea - don't we routinely recommend that people crank this up to 0.9? You'd be effectively bounding the upper range of this setting to a value less than the lowest value we recommend anyone use today. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas wrote: > That seems like a bad idea - don't we routinely recommend that people > crank this up to 0.9? You'd be effectively bounding the upper range > of this setting to a value to the less than the lowest value we > recommend anyone use today. > I was just giving an example of how I might do an initial split. There's a checkpoint happening now at time T; we have a rough idea that it needs to be finished before some upcoming time T+D. Currently with default parameters this becomes: Write: 0.5 * D; Sync: 0 Even though Sync obviously doesn't take zero. The slop here is enough that it usually works anyway. I was suggesting that a quick reshuffling to: Write: 0.4 * D; Sync: 0.4 * D Might be a good first candidate for how to split the time up better. The fact that this gives less writing time than the current biggest spread possible: Write: 0.9 * D; Sync: 0 Is true. It's also true that in the case where sync time really is zero, this new default would spread writes less than the current default. I think that's optimistic, but it could happen if checkpoints are small and you have a good write cache. Step back from that a second though. Ultimately, the person who is getting checkpoints at a 5 minute interval, and is being nailed by spikes, should have the option of just increasing the parameters to make that interval bigger. First you increase the measly default segments to a reasonable range, then checkpoint_completion_target is the second one you can try. But from there, you quickly move onto making checkpoint_timeout longer. At some point, there is no option but to give up checkpoints every 5 minutes as being practical, and make the average interval longer. Whether or not a refactoring here makes things slightly worse for cases closer to the default doesn't bother me too much. What bothers me is the way trying to stretch checkpoints out further fails to deliver as well as it should. I'd be OK with saying "to get the exact same spread situation as in older versions, you may need to retarget for checkpoints every 6 minutes" *if* in the process I get a much better sync latency distribution in most cases. Here's an interesting data point from the customer site this all started at, one I don't think they'll mind sharing since it helps make the situation more clear to the community. After applying this code to spread sync out, in order to get their server back to functional we had to move all the parameters involved up to where checkpoints were spaced 35 minutes apart. It just wasn't possible to write any faster than that without disrupting foreground activity. The whole current model where people think of this stuff in terms of segments and completion targets is a UI disaster. The direction I want to go in is where users can say "make sure checkpoints happen every N minutes", and something reasonable happens without additional parameter fiddling. And if the resulting checkpoint I/O spike is too big, they just increase the timeout to N+1 or N*2 to spread the checkpoint further. Getting too bogged down thinking in terms of the current, really terrible interface is something I'm trying to break myself of. Long-term, I want there to be checkpoint_timeout, and all the other parameters are gone, replaced by an internal implementation of the best practices proven to work even on busy systems. I don't have as much clarity on exactly what that best practice is the way that, say, I just suggested exactly how to eliminate wal_buffers as an important thing to manually set. 
But that's the dream UI: you shoot for a checkpoint interval, and something reasonable happens; if that's too intense, you just increase the interval to spread further. There will probably be small performance regressions possible vs. the current code with parameter combinations that happen to work well on it. Preserving every one of those is not as important to me as making the tuning interface simple and clear. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
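To make the split arithmetic above concrete, here is a minimal sketch, assuming a hypothetical split knob alongside the existing completion target (plain C with illustrative names only, not the actual checkpointer code):

#include <stdio.h>

/*
 * Hypothetical illustration of splitting a checkpoint interval D into a
 * write phase and a sync phase.  With target = 0.8 and split = 0.5, a
 * 300 s interval yields 120 s of writing and 120 s of syncing, leaving
 * 60 s of slack before the next checkpoint is due.
 */
int
main(void)
{
    double  interval_d = 300.0;     /* seconds until the checkpoint deadline */
    double  target = 0.8;           /* fraction of D allowed for checkpoint I/O */
    double  split = 0.5;            /* share of that budget given to writes */

    double  write_budget = interval_d * target * split;
    double  sync_budget = interval_d * target * (1.0 - split);

    printf("write phase: %.0f s, sync phase: %.0f s, slack: %.0f s\n",
           write_budget, sync_budget,
           interval_d - write_budget - sync_budget);
    return 0;
}

With those numbers the write phase gets 0.4 * D and the sync phase gets 0.4 * D, which is the "quick reshuffling" described above.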
On Sat, Jan 15, 2011 at 14:05, Robert Haas <robertmhaas@gmail.com> wrote: > Idea #4: For ext3 filesystems that like to dump the entire buffer > cache instead of only the requested file, write a little daemon that > runs alongside of (and completely indepdently of) PostgreSQL. Every > 30 s, it opens a 1-byte file, changes the byte, fsyncs the file, and > closes the file, thus dumping the cache and preventing a ridiculous > growth in the amount of data to be sync'd at checkpoint time. Wouldn't it be easier to just mount in data=writeback mode? This provides a similar level of journaling as most other file systems. Regards, Marti
On Sat, Jan 15, 2011 at 5:57 PM, Greg Smith <greg@2ndquadrant.com> wrote: > I was just giving an example of how I might do an initial split. There's a > checkpoint happening now at time T; we have a rough idea that it needs to be > finished before some upcoming time T+D. Currently with default parameters > this becomes: > > Write: 0.5 * D; Sync: 0 > > Even though Sync obviously doesn't take zero. The slop here is enough that > it usually works anyway. > > I was suggesting that a quick reshuffling to: > > Write: 0.4 * D; Sync: 0.4 * D > > Might be a good first candidate for how to split the time up better. What is the basis for thinking that the sync should get the same amount of time as the writes? That seems pretty arbitrary. Right now, you're allowing 3 seconds per fsync, which could be a lot more or a lot less than 40% of the total checkpoint time, but I have a pretty clear sense of why that's a sensible thing to try: you give the rest of the system a moment or two to get some I/O done for something other than the checkpoint before flushing the next batch of buffers. But the checkpoint activity is always going to be spikey if it does anything at all, so spacing it out *more* isn't obviously useful. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas wrote: > What is the basis for thinking that the sync should get the same > amount of time as the writes? That seems pretty arbitrary. Right > now, you're allowing 3 seconds per fsync, which could be a lot more or > a lot less than 40% of the total checkpoint time... Just that it's where I ended up when fighting with this for a month on the system I've seen the most problems at. The 3 second number was reverse-engineered from a computation that said "aim for an interval of X minutes; we have Y relations on average involved in the checkpoint". The direction my latest patch is struggling to go is computing a reasonable time automatically in the same way--count the relations, do a time estimate, add enough delay so the sync calls should be spread linearly over the given time range. > the checkpoint activity is always going to be spikey if it does > anything at all, so spacing it out *more* isn't obviously useful. > One of the components to the write queue is some notion that writes that have been waiting longest should eventually be flushed out. Linux has this number called dirty_expire_centisecs which suggests it enforces just that, set to a default of 30 seconds. This is why some 5-minute interval checkpoints with default parameters, effectively spreading the checkpoint over 2.5 minutes, can work under the current design. Anything you wrote at T+0 to T+2:00 *should* have been written out already when you reach T+2:30 and sync. Unfortunately, when the system gets busy, there is this "congestion control" logic that basically throws out any guarantee of writes starting shortly after the expiration time. It turns out that the only things that really work are the tunables that block new writes from happening once the queue is full, but they can't be set low enough to work well in earlier kernels when combined with lots of RAM. Using the terminology of http://www.mjmwired.net/kernel/Documentation/sysctl/vm.txt at some point you hit a point where "a process generating disk writes will itself start writeback." This is analogous to the PostgreSQL situation where backends do their own fsync calls. The kernel will eventually move to where those trying to write new data are instead recruited into being additional sources of write flushing. That's the part you just can't make aggressive enough on older kernels; dirty writers can always win. Ideally, the system never digs itself into a hole larger than you can afford to wait to write out. It's a transaction speed vs. latency thing though, and the older kernels just don't consider the latency side well enough. There is a new mechanism in the latest kernels to control this much better: dirty_bytes and dirty_background_bytes are the tunables. I haven't had a chance to test yet. As mentioned upthread, some of the bleeding edge kernels that have this feature available are showing such large general performance regressions in our tests, compared to the boring old RHEL5 kernel, that whether this feature works or not is irrelevant. I haven't tracked down which new kernel distributions work well performance-wise and which don't yet for PostgreSQL. I'm hoping that when I get there, I'll see results like http://serverfault.com/questions/126413/limit-linux-background-flush-dirty-pages , where the ideal setting for dirty_bytes to keep latency under control with BBWC was 15MB. To put that into perspective, the lowest useful setting you can set dirty_ratio to is 5% of RAM.
That's 410MB on my measly 8GB desktop, and 3.3GB on the 64GB production server I've been trying to tune. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
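As a rough illustration of the "count the relations, estimate a time, spread the syncs linearly" approach described above, here is a hedged sketch of deadline-based pacing (the helper names are hypothetical; the real patch would drive this from the absorbed fsync request list). A slow fsync simply consumes the pauses that would have followed it, so the remaining files fire back-to-back to catch up:

#include <time.h>
#include <unistd.h>

/* Hypothetical stand-ins for the real checkpointer machinery. */
extern int  checkpoint_file_count(void);
extern void sync_one_checkpoint_file(int i);

/*
 * Spread the per-file fsync calls evenly across sync_window_secs.
 * Each file gets a target start time; we only sleep while we are ahead
 * of schedule, so one slow fsync eats into the pauses that would have
 * followed it instead of pushing the whole phase past its deadline.
 */
static void
spread_sync_phase(double sync_window_secs)
{
    int     nfiles = checkpoint_file_count();
    time_t  start = time(NULL);
    double  spacing = (nfiles > 0) ? sync_window_secs / nfiles : 0.0;

    for (int i = 0; i < nfiles; i++)
    {
        double  target = i * spacing;           /* when file i should start */
        double  elapsed = difftime(time(NULL), start);

        if (elapsed < target)
            sleep((unsigned int) (target - elapsed));

        sync_one_checkpoint_file(i);
    }
}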
Hello Postgres Hackers, In reference to this todo item about clustering system table indexes, ( http://archives.postgresql.org/pgsql-hackers/2004-05/msg00989.php ) I have been studying the system tables to see which would benefit from clustering. I have some index suggestions and a question if you have a moment. Cluster Candidates: pg_attribute: Make the existing index ( attrelid, attnum ) clustered to order it by table and column. pg_attrdef: Existing index ( adrelid, adnum ) clustered to order it by table and column. pg_constraint: Existing index ( conrelid ) clustered to get table constraints contiguous. pg_depend: Existing Index (refclassid, refobjid, refobjsubid) clustered so that when the referenced object is changed its dependencies are contiguous. pg_description: Make the existing index ( Objoid, classoid, objsubid ) clustered to order it by entity, catalog, and optional column. * reversing the first two columns makes more sense to me ... catalog, object, column or since object implies catalog ( right? ) just dispensing with catalog altogether, but that would mean creating a new index. pg_shdependent: Existing index (refclassid, refobjid) clustered for same reason as pg_depend. pg_statistic: Existing index (starelid, staattnum) clustered to order it by table and column. pg_trigger: Make the existing index ( tgrelid, tgname ) clustered to order it by table then name getting all the triggers on a table together. Maybe Cluster: pg_rewrite: Not sure about this one ... The existing index ( ev_class, rulename ) seems logical to cluster to get all the rewrite rules for a given table contiguous but in the db's available to me virtually every table only has one rewrite rule. pg_auth_members: We could order it by role or by member of that role. Not sure which would be more valuable. Stupid newbie question: is there a way to make queries on the system tables show me what is actually there when I'm poking around? So for example: Select * from pg_type limit 1; tells me that the typoutput is 'boolout'. An english string rather than a number. So even though the documentation says that column maps to pg_proc.oid I can't then write: Select * from pg_proc where oid = 'boolout'; It would be very helpful if I wasn't learning the system but since I am I'd like to turn it off for now. Fewer layers of abstraction. Thanks, Simone Aiken 303-956-7188 Quietly Competent Consulting
2011/1/16 Simone Aiken <saiken@ulfheim.net>: > is there a way to make queries on the system tables show me what > is actually there when I'm poking around? So for example: > > Select * from pg_type limit 1; > > tells me that the typoutput is 'boolout'. An english string rather than > a number. So even though the documentation says that column > maps to pg_proc.oid I can't then write: > > Select * from pg_proc where oid = 'boolout'; The type of typoutput is "regproc", which is really an oid with a different output function. To get the numeric value, do: Select typoutput::oid from pg_type limit 1; Nicolas
Nicolas Barbier <nicolas.barbier@gmail.com> writes: > 2011/1/16 Simone Aiken <saiken@ulfheim.net>: >> ... So even though the documentation says that column >> maps to pg_proc.oid I can't then write: >> Select * from pg_proc where oid = 'boolout'; > Type type of typoutput is "regproc", which is really an oid with a > different output function. To get the numeric value, do: > Select typoutput::oid from pg_type limit 1; Also, you *can* go back the other way. It's very common to write Select * from pg_proc where oid = 'boolout'::regproc rather than looking up the OID first. There are similar pseudotypes for relation and operator names; see "Object Identifier Types" in the manual. regards, tom lane
>> Select typoutput::oid from pg_type limit 1; > Also, you *can* go back the other way. It's very common to write > > Select * from pg_proc where oid = 'boolout'::regproc > > rather than looking up the OID first. > see "Object Identifier Types" in the manual. Many thanks to you both, that helps tremendously. - Simone Aiken
On Tue, Jan 11, 2011 at 5:27 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Nov 30, 2010 at 3:29 PM, Greg Smith <greg@2ndquadrant.com> wrote: >> One of the ideas Simon and I had been considering at one point was adding >> some better de-duplication logic to the fsync absorb code, which I'm >> reminded by the pattern here might be helpful independently of other >> improvements. > > Hopefully I'm not stepping on any toes here, but I thought this was an > awfully good idea and had a chance to take a look at how hard it would > be today while en route from point A to point B. The answer turned > out to be "not very", so PFA a patch that seems to work. I tested it > by attaching gdb to the background writer while running pgbench, and > it eliminate the backend fsyncs without even breaking a sweat. I had been concerned about how long the lock would be held, and I was pondering ways to do only partial deduplication to reduce the time. But since you already wrote a patch to do the whole thing, I figured I'd time it. I arranged to test an instrumented version of your patch under large shared_buffers of 4GB, conditions that would maximize the opportunity for it to take a long time. Running your compaction to go from 524288 to a handful (14 to 29, depending on run) took between 36 and 39 milliseconds. For comparison, doing just the memcpy part of AbsorbFsyncRequest on a full queue took from 24 to 27 milliseconds. They are close enough to each other that I am no longer interested in partial deduplication. But both are long enough that I wonder if having a hash table in shared memory that is kept unique automatically at each update might not be worthwhile. Cheers, Jeff
On Sun, Jan 16, 2011 at 7:32 PM, Jeff Janes <jeff.janes@gmail.com> wrote: > But since you already wrote a patch to do the whole thing, I figured > I'd time it. Thanks! > I arranged to test an instrumented version of your patch under large > shared_buffers of 4GB, conditions that would maximize the opportunity > for it to take a long time. Running your compaction to go from 524288 > to a handful (14 to 29, depending on run) took between 36 and 39 > milliseconds. > > For comparison, doing just the memcpy part of AbsorbFsyncRequest on > a full queue took from 24 to 27 milliseconds. > > They are close enough to each other that I am no longer interested in > partial deduplication. But both are long enough that I wonder if > having a hash table in shared memory that is kept unique automatically > at each update might not be worthwhile. There are basically three operations that we care about here: (1) time to add an fsync request to the queue, (2) time to absorb requests from the queue, and (3) time to compact the queue. The first is by far the most common, and at least in any situation that anyone's analyzed so far, the second will be far more common than the third. Therefore, it seems unwise to accept any slowdown in #1 to speed up either #2 or #3, and a hash table probe is definitely going to be slower than what's required to add an element under the status quo. We could perhaps mitigate this by partitioning the hash table. Alternatively, we could split the queue in half and maintain a global variable - protected by the same lock - indicating which half is currently open for insertions. The background writer would grab the lock, flip the global, release the lock, and then drain the half not currently open to insertions; the next iteration would flush the other half. However, it's unclear to me that either of these things has any value. I can't remember any reports of contention on the BgWriterCommLock, so it seems like changing the logic as minimally as possible is the way to go. (In contrast, note that the WAL insert lock, proc array lock, and lock manager/buffer manager partition locks are all known to be heavily contended in certain workloads.) -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
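To make the compaction being timed here easier to follow, here is a heavily simplified sketch of the de-duplication idea (the struct and function names are hypothetical; the real patch works on the shared-memory request array under the lock, preserves request ordering, and has to handle "forget" requests, none of which this does). Sorting is just the simplest way to show the duplicates collapsing:

#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified fsync request: which relation segment to sync. */
typedef struct FsyncRequest
{
    unsigned int db_oid;
    unsigned int rel_oid;
    unsigned int segno;
} FsyncRequest;

static int
request_cmp(const void *a, const void *b)
{
    /* any consistent total order works for grouping duplicates */
    return memcmp(a, b, sizeof(FsyncRequest));
}

/*
 * Sort the pending requests and squeeze out duplicates in place,
 * returning the new count.  Thousands of requests against the same
 * handful of segments collapse to one entry each, so backends that
 * found the queue full can enqueue again instead of fsyncing themselves.
 */
static int
compact_requests(FsyncRequest *req, int n)
{
    int     kept = 0;

    if (n == 0)
        return 0;

    qsort(req, n, sizeof(FsyncRequest), request_cmp);

    for (int i = 1; i < n; i++)
    {
        if (request_cmp(&req[kept], &req[i]) != 0)
            req[++kept] = req[i];
    }
    return kept + 1;
}

Even if a pass like this costs a few tens of milliseconds, it is far cheaper than the fsync a backend would otherwise be forced to perform itself.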
I have finished a first run of benchmarking the current 9.1 code at various sizes. See http://www.2ndquadrant.us/pgbench-results/index.htm for many details. The interesting stuff is in Test Set 3, near the bottom. That's the first one that includes buffer_backend_fsync data. This is all on ext3 so far, but is using a newer 2.6.32 kernel, the one from Ubuntu 10.04. The results are classic Linux in 2010: latency pauses from checkpoint sync will easily leave the system at a dead halt for a minute, with the worst one observed this time standing still for 108 seconds. That one is weird, but these two are completely average cases: http://www.2ndquadrant.us/pgbench-results/210/index.html http://www.2ndquadrant.us/pgbench-results/215/index.html I think a helpful next step here would be to put Robert's fsync compaction patch into here and see if that helps. There are enough backend syncs showing up in the difficult workloads (scale>=1000, clients >=32) that its impact should be obvious. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
On Sun, Jan 16, 2011 at 10:13 PM, Greg Smith <greg@2ndquadrant.com> wrote: > I have finished a first run of benchmarking the current 9.1 code at various > sizes. See http://www.2ndquadrant.us/pgbench-results/index.htm for many > details. The interesting stuff is in Test Set 3, near the bottom. That's > the first one that includes buffer_backend_fsync data. This iall on ext3 so > far, but is using a newer 2.6.32 kernel, the one from Ubuntu 10.04. > > The results are classic Linux in 2010: latency pauses from checkpoint sync > will easily leave the system at a dead halt for a minute, with the worst one > observed this time dropping still for 108 seconds. I wish I understood better what makes Linux systems "freeze up" under heavy I/O load. Linux - like other UNIX-like systems - generally has reasonably effective mechanisms for preventing a single task from monopolizing the (or a) CPU in the presence of other processes that also wish to be time-sliced, but the same thing doesn't appear to be true of I/O. > I think a helpful next step here would be to put Robert's fsync compaction > patch into here and see if that helps. There are enough backend syncs > showing up in the difficult workloads (scale>=1000, clients >=32) that its > impact should be obvious. Thanks for doing this work. I look forward to the results. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Greg Smith wrote: > One of the components to the write queue is some notion that writes that > have been waiting longest should eventually be flushed out. Linux has > this number called dirty_expire_centiseconds which suggests it enforces > just that, set to a default of 30 seconds. This is why some 5-minute > interval checkpoints with default parameters, effectively spreading the > checkpoint over 2.5 minutes, can work under the current design. > Anything you wrote at T+0 to T+2:00 *should* have been written out > already when you reach T+2:30 and sync. Unfortunately, when the system > gets busy, there is this "congestion control" logic that basically > throws out any guarantee of writes starting shortly after the expiration > time. Should we be writing until 2:30 then sleep 30 seconds and fsync at 3:00? -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On Sun, Jan 16, 2011 at 7:13 PM, Greg Smith <greg@2ndquadrant.com> wrote: > I have finished a first run of benchmarking the current 9.1 code at various > sizes. See http://www.2ndquadrant.us/pgbench-results/index.htm for many > details. The interesting stuff is in Test Set 3, near the bottom. That's > the first one that includes buffer_backend_fsync data. This iall on ext3 so > far, but is using a newer 2.6.32 kernel, the one from Ubuntu 10.04. > > The results are classic Linux in 2010: latency pauses from checkpoint sync > will easily leave the system at a dead halt for a minute, with the worst one > observed this time dropping still for 108 seconds. That one is weird, but > these two are completely averge cases: > > http://www.2ndquadrant.us/pgbench-results/210/index.html > http://www.2ndquadrant.us/pgbench-results/215/index.html > > I think a helpful next step here would be to put Robert's fsync compaction > patch into here and see if that helps. There are enough backend syncs > showing up in the difficult workloads (scale>=1000, clients >=32) that its > impact should be obvious. Have you ever tested Robert's other idea of having a metronome process do a periodic fsync on a dummy file which is located on the same ext3fs as the table files? I think that that would be interesting to see. Cheers, Jeff
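For reference, a minimal sketch of what that metronome process might look like, assuming a scratch file path and the 30 second period from the quoted idea (standalone POSIX C, nothing PostgreSQL-specific):

#include <fcntl.h>
#include <unistd.h>

/*
 * Toy "metronome": rewrite and fsync a one-byte file every 30 seconds,
 * so an ext3 filesystem in ordered mode never accumulates more than
 * roughly 30 s of dirty data to flush when the checkpoint fsyncs arrive.
 */
int
main(void)
{
    int     fd = open("/var/tmp/fsync-metronome", O_CREAT | O_WRONLY, 0600);

    if (fd < 0)
        return 1;

    for (;;)
    {
        char    byte = 'x';

        if (pwrite(fd, &byte, 1, 0) == 1)
            (void) fsync(fd);
        sleep(30);
    }
}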
Jeff Janes wrote: > Have you ever tested Robert's other idea of having a metronome process > do a periodic fsync on a dummy file which is located on the same ext3fs > as the table files? I think that that would be interesting to see. > To be frank, I really don't care about fixing this behavior on ext3, especially in the context of that sort of hack. That filesystem is not the future, it's not possible to ever really make it work right, and every minute spent on pandering to its limitations would be better spent elsewhere IMHO. I'm starting with the ext3 benchmarks just to provide some proper context for the worst-case behavior people can see right now, and to make sure refactoring here doesn't make things worse on it. My target is same or slightly better on ext3, much better on XFS and ext4. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
On Jan 15, 2011, at 8:15 AM, Robert Haas wrote: > Well, the point of this is not to save time in the bgwriter - I'm not > surprised to hear that wasn't noticeable. The point is that when the > fsync request queue fills up, backends start performing an fsync *for > every block they write*, and that's about as bad for performance as > it's possible to be. So it's worth going to a little bit of trouble > to try to make sure it doesn't happen. It didn't happen *terribly* > frequently before, but it does seem to be common enough to worry about > - e.g. on one occasion, I was able to reproduce it just by running > pgbench -i -s 25 or something like that on a laptop. Wow, that's the kind of thing that would be incredibly difficult to figure out, especially while your production system is in flames... Can we change the ereport that happens in that case from DEBUG1 to WARNING? Or provide some other means to track it? -- Jim C. Nasby, Database Architect jim@nasby.net 512.569.9461 (cell) http://jim.nasby.net
On Mon, Jan 17, 2011 at 6:07 PM, Jim Nasby <jim@nasby.net> wrote: > On Jan 15, 2011, at 8:15 AM, Robert Haas wrote: >> Well, the point of this is not to save time in the bgwriter - I'm not >> surprised to hear that wasn't noticeable. The point is that when the >> fsync request queue fills up, backends start performing an fsync *for >> every block they write*, and that's about as bad for performance as >> it's possible to be. So it's worth going to a little bit of trouble >> to try to make sure it doesn't happen. It didn't happen *terribly* >> frequently before, but it does seem to be common enough to worry about >> - e.g. on one occasion, I was able to reproduce it just by running >> pgbench -i -s 25 or something like that on a laptop. > > Wow, that's the kind of thing that would be incredibly difficult to figure out, especially while your production systemis in flames... Can we change ereport that happens in that case from DEBUG1 to WARNING? Or provide some other meansto track it? Something like this? http://git.postgresql.org/gitweb?p=postgresql.git;a=commit;h=3134d8863e8473e3ed791e27d484f9e548220411 -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Jim Nasby wrote: > Wow, that's the kind of thing that would be incredibly difficult to figure out, especially while your production systemis in flames... Can we change ereport that happens in that case from DEBUG1 to WARNING? Or provide some other meansto track it That's why we already added pg_stat_bgwriter.buffers_backend_fsync to track the problem before trying to improve it. It was driving me crazy on a production server not having any visibility into when it happened. I haven't seen that we need anything beyond that so far. In the context of this new patch for example, if you get to where a backend does its own sync, you'll know it did a compaction as part of that. The existing statistic would tell you enough. There's now enough data in test set 3 at http://www.2ndquadrant.us/pgbench-results/index.htm to start to see how this breaks down on a moderately big system (well, by most people's standards, but not Jim for whom this is still a toy). Note the backend_sync column on the right, very end of the page; that's the relevant counter I'm commenting on: scale=175: Some backend fsync with 64 clients, 2/3 runs. scale=250: Significant backend fsync with 32 and 64 clients, every run. scale=500: Moderate to large backend fsync at any client count >=16. This seems to be worst spot of those mapped. Above here, I would guess the TPS numbers start slowing enough that the fsync request queue activity drops, too. scale=1000: Backend fsync starting at 8 clients scale=2000: Backend fsync starting at 16 clients. By here I think the TPS volumes are getting low enough that clients are stuck significantly more often waiting for seeks rather than fsync. Looks like the most effective spot for me to focus testing on with this server is scales of 500 and 1000, with 16 to 64 clients. Now that I've got the scale fine tuned better, I may crank up the client counts too and see what that does. I'm glad these are appearing in reasonable volume here though, was starting to get nervous about only having NDA restricted results to work against. Some days you just have to cough up for your own hardware. I just tagged pgbench-tools-0.6.0 and pushed to GitHub/git.postgresql.org with the changes that track and report on buffers_backend_fsync if anyone else wants to try this out. It includes those numbers if you have a 9.1 with them, otherwise just reports 0 for it all the time; detection of the feature wasn't hard to add. The end portion of a config file for the program (the first part specifies host/username info and the like) that would replicate the third test set here is: MAX_WORKERS="4" SCRIPT="tpc-b.sql" SCALES="1 10 100 175 250 500 1000 2000" SETCLIENTS="4 8 16 32 64" SETTIMES=3 RUNTIME=600 TOTTRANS="" -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
Bruce Momjian wrote: > Should we be writing until 2:30 then sleep 30 seconds and fsync at 3:00? > The idea of having a dead period doing no work at all between write phase and sync phase may have some merit. I don't have enough test data yet on some more fundamental issues in this area to comment on whether that smaller optimization would be valuable. It may be a worthwhile concept to throw into the sequencing. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
Followup on System Table Index clustering ToDo -
It looks like to implement this I need to do the following:
1 - Add statements to indexing.h to cluster the selected indexes.
A do-nothing define at the top to suppress warnings and then
lines below for perl to parse out.
#define DECLARE_CLUSTER_INDEX(table,index) ...
( add the defines under the index declarations; a rough sketch follows after step 3 ).
2 - Alter genbki.pl to produce the appropriate statements in
postgres.bki when it reads the new lines in indexing.h.
Will hold them in memory until the end of the file so they
will come in after 'Build Indices' is called.
CLUSTER tablename USING indexname
3 - Initdb will pipe the commands in postgres.bki to the
postgres executable running in --boot mode. Code
will need to be added to bootparse.y to recognize
this new command and resolve it into a call to
cluster_rel( tabOID, indOID, 0, 0, -1, -1 );
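A rough sketch of what the step 1 additions might look like, assuming the macro mirrors the existing do-nothing DECLARE_INDEX pattern in indexing.h (the macro name is hypothetical; the index names are the existing catalog indexes):

/* hypothetical additions to src/include/catalog/indexing.h */

/* do-nothing define so the declarations compile; genbki.pl greps for them */
#define DECLARE_CLUSTER_INDEX(table,index) extern int no_such_variable

/* placed just below the DECLARE_INDEX lines for the same catalogs */
DECLARE_CLUSTER_INDEX(pg_attribute, pg_attribute_relid_attnum_index);
DECLARE_CLUSTER_INDEX(pg_depend, pg_depend_reference_index);
DECLARE_CLUSTER_INDEX(pg_trigger, pg_trigger_tgrelid_tgname_index);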
Speak now before I learn Bison ... actually I should probably
learn Bison anyway. After ProC other pre-compilation languages
can't be that bad.
Sound all right?
Thanks,
-Simone Aiken
On Jan 15, 2011, at 10:11 PM, Simone Aiken wrote:
Hello Postgres Hackers,
In reference to this todo item about clustering system table indexes,
( http://archives.postgresql.org/pgsql-hackers/2004-05/msg00989.php )
I have been studying the system tables to see which would benefit from
clustering. I have some index suggestions and a question if you have a
moment.
Cluster Candidates:
pg_attribute: Make the existing index ( attrelid, attnum ) clustered to
order it by table and column.
pg_attrdef: Existing index ( adrelid, adnum ) clustered to order it
by table and column.
pg_constraint: Existing index ( conrelid ) clustered to get table
constraints contiguous.
pg_depend: Existing Index (refclassid, refobjid, refobjsubid) clustered
so that when the referenced object is changed its dependencies
are contiguous.
pg_description: Make the existing index ( Objoid, classoid, objsubid )
clustered to order it by entity, catalog, and optional column.
* reversing the first two columns makes more sense to me ...
catalog, object, column or since object implies catalog ( right? )
just dispensing with catalog altogether, but that would mean
creating a new index.
pg_shdependent: Existing index (refclassid, refobjid) clustered for
same reason as pg_depend.
pg_statistic: Existing index (starelid, staattnum) clustered to order
it by table and column.
pg_trigger: Make the existing index ( tgrelid, tgname ) clustered to
order it by table then name getting all the triggers on a table together.
Maybe Cluster:
pg_rewrite: Not sure about this one ... The existing index ( ev_class,
rulename ) seems logical to cluster to get all the rewrite rules for a
given table contiguous but in the db's available to me virtually every
table only has one rewrite rule.
pg_auth_members: We could order it by role or by member of
that role. Not sure which would be more valuable.
Stupid newbie question:
is there a way to make queries on the system tables show me what
is actually there when I'm poking around? So for example:
Select * from pg_type limit 1;
tells me that the typoutput is 'boolout'. An english string rather than
a number. So even though the documentation says that column
maps to pg_proc.oid I can't then write:
Select * from pg_proc where oid = 'boolout';
It would be very helpful if I wasn't learning the system but since I
am I'd like to turn it off for now. Fewer layers of abstraction.
Thanks,
Simone Aiken
303-956-7188
Quietly Competent Consulting
2011/1/18 Greg Smith <greg@2ndquadrant.com>: > Bruce Momjian wrote: >> >> Should we be writing until 2:30 then sleep 30 seconds and fsync at 3:00? >> > > The idea of having a dead period doing no work at all between write phase > and sync phase may have some merit. I don't have enough test data yet on > some more fundamental issues in this area to comment on whether that smaller > optimization would be valuable. It may be a worthwhile concept to throw > into the sequencing. Are we able to have some pause without strict rules like 'stop for 30 sec' ? (case : my hardware is very good and I can write 400MB/sec with no interruption, XXX IOPS) I wonder if we are not going to have an issue with "RAID firmware + BBU + linux scheduler" because we are adding 'unexpected' behavior in the middle. -- Cédric Villemain 2ndQuadrant http://2ndQuadrant.fr/ PostgreSQL : Expertise, Formation et Support
Robert Haas wrote: > Idea #4: For ext3 filesystems that like to dump the entire buffer > cache instead of only the requested file, write a little daemon that > runs alongside of (and completely indepdently of) PostgreSQL. Every > 30 s, it opens a 1-byte file, changes the byte, fsyncs the file, and > closes the file, thus dumping the cache and preventing a ridiculous > growth in the amount of data to be sync'd at checkpoint time. > Today's data suggests this problem has been resolved in the latest kernels. I saw the "giant flush/series of small flushes" pattern quite easily on the CentOS5 system I last did heavy pgbench testing on. The one I'm testing now has kernel 2.6.32 (Ubuntu 10.04), and it doesn't show it at all. Here's what a bad checkpoint looks like on this system:
LOG: checkpoint starting: xlog
DEBUG: checkpoint sync: number=1 file=base/24746/36596.8 time=7651.601 msec
DEBUG: checkpoint sync: number=2 file=base/24746/36506 time=0.001 msec
DEBUG: checkpoint sync: number=3 file=base/24746/36596.2 time=1891.695 msec
DEBUG: checkpoint sync: number=4 file=base/24746/36596.4 time=7431.441 msec
DEBUG: checkpoint sync: number=5 file=base/24746/36515 time=0.216 msec
DEBUG: checkpoint sync: number=6 file=base/24746/36596.9 time=4422.892 msec
DEBUG: checkpoint sync: number=7 file=base/24746/36596.12 time=954.242 msec
DEBUG: checkpoint sync: number=8 file=base/24746/36237_fsm time=0.002 msec
DEBUG: checkpoint sync: number=9 file=base/24746/36503 time=0.001 msec
DEBUG: checkpoint sync: number=10 file=base/24746/36584 time=41.401 msec
DEBUG: checkpoint sync: number=11 file=base/24746/36596.7 time=885.921 msec
DEBUG: checkpoint sync: number=12 file=base/24813/30774 time=0.002 msec
DEBUG: checkpoint sync: number=13 file=base/24813/24822 time=0.005 msec
DEBUG: checkpoint sync: number=14 file=base/24746/36801 time=49.801 msec
DEBUG: checkpoint sync: number=15 file=base/24746/36601.2 time=610.996 msec
DEBUG: checkpoint sync: number=16 file=base/24746/36596 time=16154.361 msec
DEBUG: checkpoint sync: number=17 file=base/24746/36503_vm time=0.001 msec
DEBUG: checkpoint sync: number=18 file=base/24746/36508 time=0.000 msec
DEBUG: checkpoint sync: number=19 file=base/24746/36596.10 time=9759.898 msec
DEBUG: checkpoint sync: number=20 file=base/24746/36596.3 time=3392.727 msec
DEBUG: checkpoint sync: number=21 file=base/24746/36237 time=0.150 msec
DEBUG: checkpoint sync: number=22 file=base/24746/36596.11 time=9153.437 msec
DEBUG: could not forward fsync request because request queue is full
CONTEXT: writing block 1057833 of relation base/24746/36596
[>800 more of these]
DEBUG: checkpoint sync: number=23 file=base/24746/36601.1 time=48697.179 msec
DEBUG: could not forward fsync request because request queue is full
DEBUG: checkpoint sync: number=24 file=base/24746/36597 time=0.080 msec
DEBUG: checkpoint sync: number=25 file=base/24746/36237_vm time=0.001 msec
DEBUG: checkpoint sync: number=26 file=base/24813/24822_fsm time=0.001 msec
DEBUG: checkpoint sync: number=27 file=base/24746/36503_fsm time=0.000 msec
DEBUG: could not forward fsync request because request queue is full
CONTEXT: writing block 20619 of relation base/24746/36601
DEBUG: checkpoint sync: number=28 file=base/24746/36506_fsm time=0.000 msec
DEBUG: checkpoint sync: number=29 file=base/24746/36596_vm time=0.040 msec
DEBUG: could not forward fsync request because request queue is full
CONTEXT: writing block 278967 of relation base/24746/36596
DEBUG: could not forward fsync request because request queue is full
CONTEXT: writing block 1582400 of relation base/24746/36596
DEBUG: checkpoint sync: number=30 file=base/24746/36596.6 time=0.002 msec
DEBUG: checkpoint sync: number=31 file=base/24813/11647 time=0.004 msec
DEBUG: checkpoint sync: number=32 file=base/24746/36601 time=201.632 msec
DEBUG: checkpoint sync: number=33 file=base/24746/36801_fsm time=0.001 msec
DEBUG: checkpoint sync: number=34 file=base/24746/36596.5 time=0.001 msec
DEBUG: checkpoint sync: number=35 file=base/24746/36599 time=0.000 msec
DEBUG: checkpoint sync: number=36 file=base/24746/36587 time=0.005 msec
DEBUG: checkpoint sync: number=37 file=base/24746/36596_fsm time=0.001 msec
DEBUG: checkpoint sync: number=38 file=base/24746/36596.1 time=0.001 msec
DEBUG: checkpoint sync: number=39 file=base/24746/36801_vm time=0.001 msec
LOG: checkpoint complete: wrote 9515 buffers (29.0%); 0 transaction log file(s) added, 0 removed, 64 recycled; write=32.409 s, sync=111.615 s, total=144.052 s; sync files=39, longest=48.697 s, average=2.853 s
Here the file that's been brutally delayed via backend contention is #23, after already seeing quite long delays on the earlier ones. That I've never seen with earlier kernels running ext3. This is good in that it makes it more likely a spread sync approach that works on XFS will also work on these newer kernels with ext4. Then the only group we wouldn't be able to help, if that works out, would be the ext3 + old kernel crowd. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
Excerpts from Simone Aiken's message of dom ene 16 02:11:26 -0300 2011: > > Hello Postgres Hackers, > > In reference to this todo item about clustering system table indexes, > ( http://archives.postgresql.org/pgsql-hackers/2004-05/msg00989.php ) > I have been studying the system tables to see which would benefit from > clustering. I have some index suggestions and a question if you have a > moment. Wow, this is really old stuff. I don't know if this is really of any benefit, given that these catalogs are loaded into syscaches anyway. Furthermore, if you cluster at initdb time, they will soon lose the ordering, given that updates move tuples around and inserts put them anywhere. So you'd need the catalogs to be re-clustered once in a while, and I don't see how you'd do that (except by asking the user to do it, which doesn't sound so great). I think you need some more discussion on the operational details before engaging in the bootstrap bison stuff (unless you just want to play with Bison for educational purposes, of course, which is always a good thing to do). -- Álvaro Herrera <alvherre@commandprompt.com> The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Excerpts from Simone Aiken's message of dom ene 16 02:11:26 -0300 2011: > > Hello Postgres Hackers, BTW whatever you do, don't start a new thread by replying to an existing message and just changing the subject line. It will mess up the threading for some readers, and some might not even see your message. Compose a fresh message instead. -- Álvaro Herrera <alvherre@commandprompt.com> The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Tue, Jan 18, 2011 at 8:35 AM, Alvaro Herrera <alvherre@commandprompt.com> wrote: > Excerpts from Simone Aiken's message of dom ene 16 02:11:26 -0300 2011: >> >> Hello Postgres Hackers, >> >> In reference to this todo item about clustering system table indexes, >> ( http://archives.postgresql.org/pgsql-hackers/2004-05/msg00989.php ) >> I have been studying the system tables to see which would benefit from >> clustering. I have some index suggestions and a question if you have a >> moment. > > Wow, this is really old stuff. I don't know if this is really of any > benefit, given that these catalogs are loaded into syscaches anyway. > Furthermore, if you cluster at initdb time, they will soon lose the > ordering, given that updates move tuples around and inserts put them > anywhere. So you'd need the catalogs to be re-clustered once in a > while, and I don't see how you'd do that (except by asking the user to > do it, which doesn't sound so great). The idea of the TODO seems to have been to set the default clustering to something reasonable. That doesn't necessarily seem like a bad idea even if we can't automatically maintain the cluster order, but there's some question in my mind whether we'd get any measurable benefit from the clustering. Even on a database with a gigantic number of tables, it seems likely that the relevant system catalogs will stay fully cached and, as you point out, the system caches will further blunt the impact of any work in this area. I think the first thing to do would be to try to come up with a reproducible test case where clustering the tables improves performance. If we can't, that might mean it's time to remove this TODO. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Jan 18, 2011, at 6:35 AM, Alvaro Herrera wrote:
Wow, this is really old stuff. I don't know if this is really of any
benefit, given that these catalogs are loaded into syscaches anyway.
The benefit is educational primarily. I was looking for a todo list item
that would expose me to the system tables. Learning the data model
of a new system is always step 1 for me. So that one was perfect as
it would have me study and consider each one to determine if there
was any benefit from clustering on its initial load into cache.
Furthermore, if you cluster at initdb time, they will soon lose the
ordering, given that updates move tuples around and inserts put them
anywhere. So you'd need the catalogs to be re-clustered once in a
while, and I don't see how you'd do that (except by asking the user to
do it, which doesn't sound so great).
I did discover that last night. I'm used to databases that keep up their
clustering. One that falls apart over time is distinctly strange. And the
way you guys do your re-clustering logic is overkill if just a few rows
are out of place. On the upside, a call to mass re-clustering goes
and updates all the clustered indexes in the system and that includes
these tables. Will have to study auto-vacuum as well to consider that.
(unless you just want to play with
Bison for educational purposes, of course, which is always a good thing
to do).
Pretty much, yeah.
- Simone Aiken
On Tue, Jan 18, 2011 at 8:35 AM, Alvaro Herrera <alvherre@commandprompt.com> wrote: >> Excerpts from Simone Aiken's message of dom ene 16 02:11:26 -0300 2011: >>> Hello Postgres Hackers, >>> In reference to this todo item about clustering system table indexes, >>> ( http://archives.postgresql.org/pgsql-hackers/2004-05/msg00989.php ) >> Wow, this is really old stuff. I don't know if this is really of any > If we can't, that might mean it's time to remove this TODO. When I'm learning a new system I like to first learn how to use it, second learn its data model, third start seriously looking at the code. So that TODO is ideal for my learning method. If there is something else that would also involve studying all the system tables it would also be great. For example, I noticed we have column level comments on the web but not in the database itself. This seems silly. Why not have the comments in the database and have the web query the tables of template databases for the given versions? That way \d+ pg_tablename would provide instant gratification for users. And we all like our gratification to be instant. They could be worked into the .h files as inserts to pg_description, though they wouldn't provide an excuse to learn Bison. I'm open to other suggestions as well. -Simone Aiken
> To be frank, I really don't care about fixing this behavior on ext3, > especially in the context of that sort of hack. That filesystem is not > the future, it's not possible to ever really make it work right, and > every minute spent on pandering to its limitations would be better spent > elsewhere IMHO. I'm starting with the ext3 benchmarks just to provide > some proper context for the worst-case behavior people can see right > now, and to make sure refactoring here doesn't make things worse on it. > My target is same or slightly better on ext3, much better on XFS and ext4. Please don't forget that we need to avoid performance regressions on NTFS and ZFS as well. They don't need to improve, but we can't let them regress. I think we can ignore BSD/UFS and Solaris/UFS, as well as HFS+, though. -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
Robert Haas wrote: > On Tue, Jan 18, 2011 at 8:35 AM, Alvaro Herrera > <alvherre@commandprompt.com> wrote: > > Excerpts from Simone Aiken's message of dom ene 16 02:11:26 -0300 2011: > >> > >> Hello Postgres Hackers, > >> > >> In reference to this todo item about clustering system table indexes, > >> ( http://archives.postgresql.org/pgsql-hackers/2004-05/msg00989.php ) > >> I have been studying the system tables to see which would benefit from > >> clustering. I have some index suggestions and a question if you have a > >> moment. > > > > Wow, this is really old stuff. I don't know if this is really of any > > benefit, given that these catalogs are loaded into syscaches anyway. > > Furthermore, if you cluster at initdb time, they will soon lose the > > ordering, given that updates move tuples around and inserts put them > > anywhere. So you'd need the catalogs to be re-clustered once in a > > while, and I don't see how you'd do that (except by asking the user to > > do it, which doesn't sound so great). > > The idea of the TODO seems to have been to set the default clustering > to something reasonable. That doesn't necessarily seem like a bad > idea even if we can't automatically maintain the cluster order, but > there's some question in my mind whether we'd get any measurable > benefit from the clustering. Even on a database with a gigantic > number of tables, it seems likely that the relevant system catalogs > will stay fully cached and, as you point out, the system caches will > further blunt the impact of any work in this area. I think the first > thing to do would be to try to come up with a reproducible test case > where clustering the tables improves performance. If we can't, that > might mean it's time to remove this TODO. I think CLUSTER is a win when you are looking up multiple rows in the same table, either using a non-unique index or a range search. What places do such lookups? Having them all in adjacent pages would be a win --- single-row lookups are usually not. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On Tue, Jan 18, 2011 at 12:16 PM, Simone Aiken <saiken@ulfheim.net> wrote: > When I'm learning a new system I like to first learn how to use it, > second learn its data model, third start seriously looking at the code. > So that Todo is ideal for my learning method. Sure - my point is just that we usually have as a criterion for any performance-related patch that it actually does improve performance. So, we'd need a test case. > If there is something else that would also involve studying all the system > tables it would also be great. For example, I noticed we have column > level comments on the web but not in the database itself. This seems > silly. Why not have the comments in the database and have the web > query the tables of template databases for the given versions? Uh... I don't know what this means. > I'm open to other suggestions as well. Here are a few TODO items that look relatively easy to me (they may not actually be easy when you dig in, of course):
Clear table counters on TRUNCATE
Allow the clearing of cluster-level statistics
Allow ALTER TABLE ... ALTER CONSTRAINT ... RENAME
Allow ALTER TABLE to change constraint deferrability and actions
Unfortunately we don't have a lot of easy TODOs. People keep doing the ones we think up... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
-----Original Message----- From: Robert Haas [mailto:robertmhaas@gmail.com] Sent: Tuesday, January 18, 2011 2:53 PM To: Simone Aiken Cc: Alvaro Herrera; pgsql-hackers Subject: Re: [HACKERS] ToDo List Item - System Table Index Clustering >Sure - my point is just that we usually have as a criteria for any >performance related patch that it actually does improve performance. Sorry, wasn't arguing your point. Conceding it actually. =) I wasn't explaining why I chose it to contest your statements, but as an invitation for you to point me towards something more useful that fits what I was looking for in a task. > > Uh... I don't know what this means. > Pages like this one have column comments for the system tables: http://www.psql.it/manuale/8.3/catalog-pg-attribute.html But in my database when I look for comments they aren't there:
qcc=> \d+ pg_attribute
              Table "pg_catalog.pg_attribute"
    Column     |   Type   | Modifiers | Description
---------------+----------+-----------+-------------
 attrelid      | oid      | not null  |
 attname       | name     | not null  |
 atttypid      | oid      | not null  |
 attstattarget | integer  | not null  |
 attlen        | smallint | not null  |
 attnum        | smallint | not null  |
 attndims      | integer  | not null  |
 attcacheoff   | integer  | not null  |
 atttypmod     | integer  | not null  |
 attbyval      | boolean  | not null  |
 attstorage    | "char"   | not null  |
 attalign      | "char"   | not null  |
 attnotnull    | boolean  | not null  |
 atthasdef     | boolean  | not null  |
 attisdropped  | boolean  | not null  |
 attislocal    | boolean  | not null  |
 attinhcount   | integer  | not null  |
So I have to fire up a web browser and start googling to learn about the columns. Putting them in pg_description would be more handy, no? -Simone Aiken
> Robert > > I think the first > thing to do would be to try to come up with a reproducible test case > where clustering the tables improves performance. > On that note, is there any standard way you guys do benchmarks? > Bruce > > I think CLUSTER is a win when you are looking up multiple rows in the same table, either using a non-unique index or a range search. What places do such lookups? > Having them all in adjacent pages would be a win --- single-row lookups are usually not. > Mostly the tables that track column level data. Typically you will want to grab rows for multiple columns for a given table at once, so it would be helpful to have them be contiguous on disk. I could design a benchmark to display this by building a thousand tables one column at a time using 'alter add column' to scatter the catalog rows for the tables across many blocks. So there'll be a range with column 1 for each table, then column 2 for each table, then column 3 for each table. Then fill a couple data tables with a lot of data and set some noise makers to loop through them over and over with full table scans ... filling up cache with unrelated data and hopefully ageing out the cache of the pg_catalog tables. Then do some benchmark index lookup queries to see the retrieval time before and after clustering the pg_catalog tables to record a difference. If the criterion is "doesn't hurt anything and helps a little" I think this passes. Especially since clusters aren't maintained automatically, so adding them has no negative impact on insert or update. It'd just be a nice thing to do for those who know it can be done, and it doesn't harm anyone who doesn't know. -Simone Aiken
On Tue, Jan 18, 2011 at 6:49 PM, Simone Aiken <saiken@quietlycompetent.com> wrote: > Pages like this one have column comments for the system tables: > > http://www.psql.it/manuale/8.3/catalog-pg-attribute.html Oh, I see. I don't think we want to go there. We'd need some kind of system for keeping the two places in sync. And there'd be no easy way to upgrade the in-database descriptions when we upgraded to a newer minor release, supposing they'd changed in the meantime. And some of the descriptions are quite long, so they wouldn't fit nicely in the amount of space you typically have available when you run \d+. And it would enlarge the size of an empty database by however much was required to store all those comments, which could be an issue for PostgreSQL instances that have many small databases. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Tue, Jan 18, 2011 at 6:49 PM, Simone Aiken > <saiken@quietlycompetent.com> wrote: >> Pages like this one have column comments for the system tables: >> >> http://www.psql.it/manuale/8.3/catalog-pg-attribute.html > Oh, I see. I don't think we want to go there. We'd need some kind of > system for keeping the two places in sync. I seem to recall some muttering about teaching genbki to extract such comments from the SGML sources or perhaps the C header files. I tend to agree though that it would be a lot more work than it's worth. And as you say, pg_description entries aren't free. Which brings up another point though. I have a personal TODO item to make the comments for operator support functions more consistent: http://archives.postgresql.org/message-id/21407.1287157253@sss.pgh.pa.us Should we consider removing those comments altogether, instead? regards, tom lane
Excerpts from Robert Haas's message of Wed Jan 19 15:25:00 -0300 2011: > Oh, I see. I don't think we want to go there. We'd need some kind of > system for keeping the two places in sync. Maybe autogenerate both the .sgml and the postgres.description files from a single source. > And there'd be no easy way > to upgrade the in-database descriptions when we upgraded to a newer > minor release, supposing they'd changed in the meantime. I wouldn't worry about this issue. We don't do many catalog changes in minor releases anyway. -- Álvaro Herrera <alvherre@commandprompt.com> The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Wed, Jan 19, 2011 at 2:26 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Tue, Jan 18, 2011 at 6:49 PM, Simone Aiken >> <saiken@quietlycompetent.com> wrote: >>> Pages like this one have column comments for the system tables: >>> >>> http://www.psql.it/manuale/8.3/catalog-pg-attribute.html > >> Oh, I see. I don't think we want to go there. We'd need some kind of >> system for keeping the two places in sync. > > I seem to recall some muttering about teaching genbki to extract such > comments from the SGML sources or perhaps the C header files. I tend to > agree though that it would be a lot more work than it's worth. And as > you say, pg_description entries aren't free. > > Which brings up another point though. I have a personal TODO item to > make the comments for operator support functions more consistent: > http://archives.postgresql.org/message-id/21407.1287157253@sss.pgh.pa.us > Should we consider removing those comments altogether, instead? I could go either way on that. Most of those comments are pretty short, aren't they? How much storage are they really costing us? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Wed, Jan 19, 2011 at 2:26 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Which brings up another point though. I have a personal TODO item to >> make the comments for operator support functions more consistent: >> http://archives.postgresql.org/message-id/21407.1287157253@sss.pgh.pa.us >> Should we consider removing those comments altogether, instead? > I could go either way on that. Most of those comments are pretty > short, aren't they? How much storage are they really costing us? Well, on my machine pg_description is about 210K (per database) as of HEAD. 90% of its contents are pg_proc entries, though I have no good fix on how much of that is for internal-use-only functions. A very rough estimate from counting pg_proc and pg_operator entries suggests that the answer might be "about a third". So if we do what was said in the above-cited thread, ie move existing comments to pg_operator and add boilerplate ones to pg_proc, we probably would pay <100K for it. regards, tom lane
On Wed, Jan 19, 2011 at 3:10 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Wed, Jan 19, 2011 at 2:26 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> Which brings up another point though. I have a personal TODO item to >>> make the comments for operator support functions more consistent: >>> http://archives.postgresql.org/message-id/21407.1287157253@sss.pgh.pa.us >>> Should we consider removing those comments altogether, instead? > >> I could go either way on that. Most of those comments are pretty >> short, aren't they? How much storage are they really costing us? > > Well, on my machine pg_description is about 210K (per database) as of > HEAD. 90% of its contents are pg_proc entries, though I have no good > fix on how much of that is for internal-use-only functions. A very > rough estimate from counting pg_proc and pg_operator entries suggests > that the answer might be "about a third". So if we do what was said in > the above-cited thread, ie move existing comments to pg_operator and > add boilerplate ones to pg_proc, we probably would pay <100K for it. I guess that's not enormously expensive, but it's not insignificant either. On my machine, a template database is 5.5MB. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Wed, Jan 19, 2011 at 3:10 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Well, on my machine pg_description is about 210K (per database) as of >> HEAD. 90% of its contents are pg_proc entries, though I have no good >> fix on how much of that is for internal-use-only functions. A very >> rough estimate from counting pg_proc and pg_operator entries suggests >> that the answer might be "about a third". So if we do what was said in >> the above-cited thread, ie move existing comments to pg_operator and >> add boilerplate ones to pg_proc, we probably would pay <100K for it. > I guess that's not enormously expensive, but it's not insignificant > either. On my machine, a template database is 5.5MB. The implementation I was thinking about was to have initdb run a SQL command that would do something like

INSERT INTO pg_description
    SELECT oprcode, 'pg_proc'::regclass, 0, 'implementation of ' || oprname
    FROM pg_operator
    WHERE there's-not-already-a-description-of-the-oprcode-function

So it would be minimal work to either provide or omit the boilerplate descriptions. I think we can postpone the decision till we have a closer fix on the number of entries we're talking about. regards, tom lane
> >I seem to recall some muttering about teaching genbki to extract such comments from the SGML sources or perhaps the C header files. I tend to agree though that it would be a lot >more work than it's worth. And as you say, pg_description entries aren't free. > I know I can't do all of the work, any submission requires review etc, but it is worth it to me provided it does no harm to the codebase. So the only outstanding question is the impact of increased size. In my experience size increases related to documentation are almost always worth it. So I'm prejudiced right out of the gate. I was wondering if every pg_ table gets copied out to every database .. if there is already a mechanism for not replicating all of them we could utilize views or rewrite rules to merge a single copy of catalog comments in a separate table with each deployed database's pg_descriptions. If all catalog descriptions were handled this way it would actually decrease the size of a deployed database (by 210K?) by absorbing the pg_descriptions that are currently being duplicated. Since users shouldn't be messing with them anyway and they are purely for humans to refer to - not computers to calculate explain plans with - there shouldn't be anything inherently wrong with moving static descriptions out of user space. In theory at least. -Simone Aiken
On Wed, Jan 19, 2011 at 4:27 PM, Simone Aiken <saiken@ulfheim.net> wrote: > In my experience size increases related to documentation are almost always > worth it. So I'm prejudiced right out of the gate. I was wondering if > every pg_ table gets copied out to every database .. if there is already a > mechanism for not replicating all of them we could utilize views or > re-writes rules to merge a single copy of catalog comments in a separate > table with each deployed database's pg_descriptions. All of them get copied, except for a handful of so-called shared catalogs. Changing that would be difficult. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
After playing with this in benchmarks and researching the weird results I got I'm going to advise dropping the todo for now unless something happens to change how postgres handles clustering. You guys probably already grokked this so I am just recording it for the list archives. The primary factor here is that postgres doesn't maintain clustered indexes. Clustering is a one-time operation that clusters the table at this current point in time. Basically, there really isn't any such thing in postgres as a clustered index. There is an operation - Cluster - which takes an index and a table as input and re-orders the table according to the index. But it is borderline fiction to call the index used "clustered" because the next row inserted will pop in at the end of the table instead of slipping into the middle of the table per the desired ordering. All the pg_catalog cluster candidates are candidates because they have a row per table column and we expect that a query will want to get several of these rows at once. These rows are naturally clustered because the scripts that create them insert their information into the catalog contiguously. When you create a catalog table the pg_attribute rows for its columns are inserted together. When you then create all its triggers they too are put into pg_trigger one after the other. So calling the Cluster operation after initdb doesn't help anything. Over time table alterations can fragment this information. If a user loads a bunch of tables, then alters them over time, the columns added later on will have their metadata stored separately from the columns created originally. Which gets us to the down and dirty of how the Cluster function works. It puts an access exclusive lock on the entire table - blocking all attempts to read and write to the table - creates a copy of the table in the desired order, drops the original, and renames the copy. Doing this to a catalog table that is relevant to queries pretty much brings everything else in the database to a halt while the system table is locked up. And the brute force logic makes this time consuming even if the table is perfectly ordered already. Additionally, snapshots taken of the table during the Cluster operation make the table appear to be empty which introduces the possibility of system table corruption if transactions are run concurrently with a Cluster operation. So basically, the Cluster operation in its current form is not something you want running automatically on a bunch of system tables as it is currently implemented. It gives your system the hiccups. You would only want to run it manually during downtime. And you can do that just as easily with or without any preparation during initdb. Thanks everyone, -Simone Aiken
On Thu, Jan 20, 2011 at 4:40 PM, Simone Aiken <saiken@quietlycompetent.com> wrote: > After playing with this in benchmarks and researching the weird results I > got I'm going to advise dropping the todo for now unless something happens > to change how postgres handles clustering. I agree, let's remove it. That having been said, analyzing TODO items to figure out which ones are worthless is a useful thing to do, so please feel free to keep at it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas wrote: > On Thu, Jan 20, 2011 at 4:40 PM, Simone Aiken > <saiken@quietlycompetent.com> wrote: > > After playing with this in benchmarks and researching the weird results I > > got I'm going to advise dropping the todo for now unless something happens > > to change how postgres handles clustering. > > I agree, let's remove it. > > That having been said, analyzing TODO items to figure out which ones > are worthless is a useful thing to do, so please feel free to keep at > it. OK, removed. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Greg Smith wrote:
> I think a helpful next step here would be to put Robert's fsync
> compaction patch into here and see if that helps. There are enough
> backend syncs showing up in the difficult workloads (scale>=1000, clients >=32) that its impact should be obvious.

Initial tests show everything expected from this change and more. This took me a while to isolate because of issues where the filesystem involved degraded over time, giving a heavy bias toward a faster first test run, before anything was fragmented. I just had to do a whole new mkfs on the database/xlog disks when switching between test sets in order to eliminate that.

At a scale of 500, I see the following average behavior:

Clients  TPS  backend-fsync
16       557    155
32       587    572
64       628    843
128      621   1442
256      632   2504

On one run through with the fsync compaction patch applied this turned into:

Clients  TPS  backend-fsync
16       637      0
32       621      0
64       721      0
128      716      0
256      841      0

So not only are all the backend fsyncs gone, there is a very clear TPS improvement too. The change in results at >=64 clients is well above the usual noise threshold in these tests.

The problem where individual fsync calls during checkpoints can take a long time is not appreciably better. But I think this will greatly reduce the odds of running into the truly dysfunctional breakdown, where checkpoint and backend fsync calls compete with one another, that caused the worst-case situation kicking off this whole line of research here.

-- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
On Thu, Jan 27, 2011 at 12:18 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> Greg Smith wrote:
>>
>> I think a helpful next step here would be to put Robert's fsync compaction
>> patch into here and see if that helps. There are enough backend syncs
>> showing up in the difficult workloads (scale>=1000, clients >=32) that its
>> impact should be obvious.
>
> Initial tests show everything expected from this change and more. This took
> me a while to isolate because of issues where the filesystem involved
> degraded over time, giving a heavy bias toward a faster first test run,
> before anything was fragmented. I just had to do a whole new mkfs on the
> database/xlog disks when switching between test sets in order to eliminate
> that.
>
> At a scale of 500, I see the following average behavior:
>
> Clients TPS backend-fsync
> 16 557 155
> 32 587 572
> 64 628 843
> 128 621 1442
> 256 632 2504
>
> On one run through with the fsync compaction patch applied this turned into:
>
> Clients TPS backend-fsync
> 16 637 0
> 32 621 0
> 64 721 0
> 128 716 0
> 256 841 0
>
> So not only are all the backend fsyncs gone, there is a very clear TPS
> improvement too. The change in results at >=64 clients are well above the
> usual noise threshold in these tests.
> The problem where individual fsync calls during checkpoints can take a long
> time is not appreciably better. But I think this will greatly reduce the
> odds of running into the truly dysfunctional breakdown, where checkpoint and
> backend fsync calls compete with one another, that caused the worst-case
> situation kicking off this whole line of research here.

Dude! That's pretty cool. Thanks for doing that measurement work - that's really awesome. Barring objections, I'll go ahead and commit my patch.

Based on what I saw looking at this, I'm thinking that the backend fsyncs probably happen in clusters - IOW, it's not 2504 backend fsyncs spread uniformly throughout the test, but clusters of 100 or more that happen in very quick succession, followed by relief when the background writer gets around to emptying the queue. During each cluster, the system probably slows way down, and then recovers when the queue is emptied. So the TPS improvement isn't at all a uniform speedup, but simply relief from the stall that would otherwise result from a full queue.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas wrote: > Based on what I saw looking at this, I'm thinking that the backend > fsyncs probably happen in clusters - IOW, it's not 2504 backend fsyncs > spread uniformly throughout the test, but clusters of 100 or more that > happen in very quick succession, followed by relief when the > background writer gets around to emptying the queue. That's exactly the case. You'll be running along fine, the queue will fill, and then hundreds of them can pile up in seconds. Since the worst of that seemed to be during the sync phase of the checkpoint, adding additional queue management logic to there is where we started at. I thought this compaction idea would be more difficult to implement than your patch proved to be though, so doing this first is working out quite well instead. This is what all the log messages from the patch look like here, at scale=500 and shared_buffers=256MB:

DEBUG: compacted fsync request queue from 32768 entries to 11 entries

That's an 8GB database, and from looking at the relative sizes I'm guessing 7 entries refer to the 1GB segments of the accounts table, 2 to its main index, and the other 2 are likely branches/tellers data. Since I know the production system I ran into this on has about 400 file segments on it regularly dirtied and a higher shared_buffers than that, I expect this will demolish this class of problem on it, too. I'll have all the TPS over time graphs available to publish by the end of my day here, including tests at a scale of 1000 as well. Those should give a little more insight into how the patch is actually impacting high-level performance. I don't dare disturb the ongoing tests by copying all that data out of there until they're finished, will be a few hours yet. My only potential concern over committing this is that I haven't done a sanity check over whether it impacts the fsync mechanics in a way that might cause an issue. Your assumptions there are documented and look reasonable on quick review; I just haven't had much time yet to look for flaws in them. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
Robert Haas wrote: > During each cluster, the system probably slows way down, and then recovers when > the queue is emptied. So the TPS improvement isn't at all a uniform > speedup, but simply relief from the stall that would otherwise result > from a full queue. > That does seem to be the case here. http://www.2ndquadrant.us/pgbench-results/index.htm now has results from a long test series, at two database scales that caused many backend fsyncs during earlier tests. Set #5 is the existing server code, #6 is with the patch applied. There are zero backend fsync calls with the patch applied, which isn't surprising given how simple the schema is on this test case. An average of a 14% TPS gain appears at a scale of 500 and an 8% one at 1000; the attached CSV file summarizes the average figures for the archives. The gains do appear to be from smoothing out the dead periods that normally occur during the sync phase of the checkpoint. For example, here are the fastest runs at scale=1000/clients=256 with and without the patch:

http://www.2ndquadrant.us/pgbench-results/436/index.html (tps=361)
http://www.2ndquadrant.us/pgbench-results/486/index.html (tps=380)

Here the difference in how much less of a slowdown there is around the checkpoint end points is really obvious, and obviously an improvement. You can see the same thing to a lesser extent at the other end of the scale; here's the fastest runs at scale=500/clients=16:

http://www.2ndquadrant.us/pgbench-results/402/index.html (tps=590)
http://www.2ndquadrant.us/pgbench-results/462/index.html (tps=643)

Where there are still very ugly maximum latency figures here in every case, these periods just aren't as wide with the patch in place. I'm moving on to some brief testing of the newer kernel behavior here, then returning to testing the other checkpoint spreading ideas on top of this compaction patch, presuming something like it will end up being committed first. I think it's safe to say I can throw away the changes to try and alter the fsync absorption code present in what I submitted before, as this scheme does a much better job of avoiding that problem than those earlier queue alteration ideas. I'm glad Robert grabbed the right one from the pile of ideas I threw out for what else might help here. P.S. Yes, I know I have other review work to do as well. Starting on the rest of that tomorrow. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books

,,"Unmodified",,"Compacted Fsync",,,
"scale","clients","tps","max_latency","tps","max_latency","TPS Gain","% Gain"
500,16,557,17963.41,631,17116.31,74,13.3%
500,32,587,25838.8,655,24311.54,68,11.6%
500,64,628,35198.39,727,38040.39,99,15.8%
500,128,621,41001.91,687,48195.77,66,10.6%
500,256,632,49610.39,747,46799.48,115,18.2%
,,,,,,,
1000,16,306,39298.95,321,40826.58,15,4.9%
1000,32,314,40120.35,345,27910.51,31,9.9%
1000,64,334,46244.86,358,45138.1,24,7.2%
1000,128,343,72501.57,372,47125.46,29,8.5%
1000,256,321,80588.63,350,83232.14,29,9.0%
On Fri, Jan 28, 2011 at 12:53 AM, Greg Smith <greg@2ndquadrant.com> wrote: > Where there are still very ugly maximum latency figures here in every case, > these periods just aren't as wide with the patch in place. OK, committed the patch, with some additional commenting, and after fixing the compiler warning Chris Browne noticed. > P.S. Yes, I know I have other review work to do as well. Starting on the > rest of that tomorrow. *cracks whip* Man, this thing doesn't work at all. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Nov 30, 2010 at 3:29 PM, Greg Smith <greg@2ndquadrant.com> wrote: > I've attached an updated version of the initial sync spreading patch here, > one that applies cleanly on top of HEAD and over top of the sync > instrumentation patch too. The conflict that made that hard before is gone > now. With the fsync queue compaction patch applied, I think most of this is now not needed. Attached please find an attempt to isolate the portion that looks like it might still be useful. The basic idea of what remains here is to make the background writer still do its normal stuff even when it's checkpointing. In particular, with this patch applied, PG will: 1. Absorb fsync requests a lot more often during the sync phase. 2. Still try to run the cleaning scan during the sync phase. 3. Pause for 3 seconds after every fsync. I suspect that #1 is probably a good idea. It seems pretty clear based on your previous testing that the fsync compaction patch should be sufficient to prevent us from hitting the wall, but if we're going to do any kind of nontrivial work here then cleaning the queue is a sensible thing to do along the way, and there's little downside. I also suspect #2 is a good idea. The fact that we're checkpointing doesn't mean that the system suddenly doesn't require clean buffers, and the experimentation I've done recently (see: limiting hint bit I/O) convinces me that it's pretty expensive from a performance standpoint when backends have to start writing out their own buffers, so continuing to do that work during the sync phase of a checkpoint, just as we do during the write phase, seems pretty sensible. I think something along the lines of #3 is probably a good idea, but the current coding doesn't take checkpoint_completion_target into account. The underlying problem here is that it's at least somewhat reasonable to assume that if we write() a whole bunch of blocks, each write() will take approximately the same amount of time. But this is not true at all with respect to fsync() - they neither take the same amount of time as each other, nor is there any fixed ratio between write() time and fsync() time to go by. So if we want the checkpoint to finish in, say, 20 minutes, we can't know whether the write phase needs to be finished by minute 10 or 15 or 16 or 19 or only by 19:59. One idea I have is to try to get some of the fsyncs out of the queue at times other than end-of-checkpoint. Even if this resulted in some modest increase in the total number of fsync() calls, it might improve performance by causing data to be flushed to disk in smaller chunks. For example, suppose we kept an LRU list of pending fsync requests - every time we remember an fsync request for a particular relation, we move it to the head (hot end) of the LRU. And periodically we pull the tail entry off the list and fsync it - say, after checkpoint_timeout / (# of items in the list). That way, when we arrive at the end of the checkpoint and start syncing everything, the syncs hopefully complete more quickly because we've already forced a bunch of the data down to disk. That algorithm may well be too stupid or just not work in real life, but perhaps there's some variation that would be sensible. The point is: instead of or in addition to trying to spread out the sync phase, we might want to investigate whether it's possible to reduce its size. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
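(A rough, self-contained C sketch of the LRU idea at the end of that message, for illustration only; the fixed-size array and the function names here are hypothetical stand-ins for the server's real fsync request queue, not code from any posted patch.)

#include <unistd.h>

#define MAX_PENDING 1024

/* Pending-fsync LRU: slot 0 holds the most recently dirtied file (hot end),
 * slot lru_count-1 the one that has gone longest without a write. */
static int lru_fds[MAX_PENDING];
static int lru_count = 0;

/* Remember (or refresh) an fsync request for fd by moving it to the hot end. */
static void
remember_fsync_request(int fd)
{
    int i, j;

    for (i = 0; i < lru_count; i++)
        if (lru_fds[i] == fd)
            break;
    if (i == lru_count)
    {
        if (lru_count == MAX_PENDING)
            return;             /* queue-full handling elided */
        lru_count++;
    }
    for (j = i; j > 0; j--)     /* shift hotter entries down one slot */
        lru_fds[j] = lru_fds[j - 1];
    lru_fds[0] = fd;
}

/* Called periodically (say every checkpoint_timeout / lru_count): sync the
 * coldest entry so its data is already on disk before the checkpoint-end
 * sync phase has to deal with it. */
static void
sync_coldest_entry(void)
{
    if (lru_count > 0)
        (void) fsync(lru_fds[--lru_count]);
}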
On Mon, Jan 31, 2011 at 13:41, Robert Haas <robertmhaas@gmail.com> wrote: > 1. Absorb fsync requests a lot more often during the sync phase. > 2. Still try to run the cleaning scan during the sync phase. > 3. Pause for 3 seconds after every fsync. > > So if we want the checkpoint > to finish in, say, 20 minutes, we can't know whether the write phase > needs to be finished by minute 10 or 15 or 16 or 19 or only by 19:59. We probably need deadline-based scheduling, like what is already being used in the write() phase. If we want to sync 100 files in 20 minutes, each file should be sync'ed in 12 seconds if we think each fsync takes the same time. If we had a better estimation algorithm (file size? dirty ratio?), each fsync could have some weight factor. But deadline-based scheduling is still needed then. BTW, we should not sleep in a full-speed checkpoint. CHECKPOINT command, shutdown, pg_start_backup(), and some of the checkpoints during recovery might not want to sleep. -- Itagaki Takahiro
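(To make the arithmetic concrete, a minimal sketch of that deadline-based spacing; the weight argument is purely an assumption standing in for a file-size or dirty-ratio factor.)

/* Space the remaining fsyncs evenly across the remaining sync-phase budget,
 * e.g. 20 minutes / 100 files = 12 seconds between syncs at weight 1.0. */
static long
fsync_pause_usec(long budget_usec, int files_remaining, double weight)
{
    if (files_remaining <= 0 || budget_usec <= 0)
        return 0;               /* behind schedule: don't sleep at all */
    return (long) (weight * ((double) budget_usec / files_remaining));
}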
On Mon, Jan 31, 2011 at 3:04 AM, Itagaki Takahiro <itagaki.takahiro@gmail.com> wrote: > On Mon, Jan 31, 2011 at 13:41, Robert Haas <robertmhaas@gmail.com> wrote: >> 1. Absorb fsync requests a lot more often during the sync phase. >> 2. Still try to run the cleaning scan during the sync phase. >> 3. Pause for 3 seconds after every fsync. >> >> So if we want the checkpoint >> to finish in, say, 20 minutes, we can't know whether the write phase >> needs to be finished by minute 10 or 15 or 16 or 19 or only by 19:59. > > We probably need deadline-based scheduling, that is being used in write() > phase. If we want to sync 100 files in 20 minutes, each file should be > sync'ed in 12 seconds if we think each fsync takes the same time. > If we would have better estimation algorithm (file size? dirty ratio?), > each fsync chould have some weight factor. But deadline-based scheduling > is still needed then. Right. I think the problem is balancing the write and sync phases. For example, if your operating system is very aggressively writing out dirty pages to disk, then you want the write phase to be as long as possible and the sync phase can be very short because there won't be much work to do. But if your operating system is caching lots of stuff in memory and writing dirty pages out to disk only when absolutely necessary, then the write phase could be relatively quick without much hurting anything, but the sync phase will need to be long to keep from crushing the I/O system. The trouble is, we don't really have a priori way to know which it's doing. Maybe we could try to tune based on the behavior of previous checkpoints, but I'm wondering if we oughtn't to take the cheesy path first and split checkpoint_completion_target into checkpoint_write_target and checkpoint_sync_target. That's another parameter to set, but I'd rather add a parameter that people have to play with to find the right value than impose an arbitrary rule that creates unavoidable bad performance in certain environments. > BTW, we should not sleep in full-speed checkpoint. CHECKPOINT command, > shutdown, pg_start_backup(), and some of checkpoints during recovery > might don't want to sleep. Yeah, I think that's understood. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 31.01.2011 16:44, Robert Haas wrote: > On Mon, Jan 31, 2011 at 3:04 AM, Itagaki Takahiro > <itagaki.takahiro@gmail.com> wrote: >> On Mon, Jan 31, 2011 at 13:41, Robert Haas <robertmhaas@gmail.com> wrote: >>> 1. Absorb fsync requests a lot more often during the sync phase. >>> 2. Still try to run the cleaning scan during the sync phase. >>> 3. Pause for 3 seconds after every fsync. >>> >>> So if we want the checkpoint >>> to finish in, say, 20 minutes, we can't know whether the write phase >>> needs to be finished by minute 10 or 15 or 16 or 19 or only by 19:59. >> >> We probably need deadline-based scheduling, that is being used in write() >> phase. If we want to sync 100 files in 20 minutes, each file should be >> sync'ed in 12 seconds if we think each fsync takes the same time. >> If we would have better estimation algorithm (file size? dirty ratio?), >> each fsync chould have some weight factor. But deadline-based scheduling >> is still needed then. > > Right. I think the problem is balancing the write and sync phases. > For example, if your operating system is very aggressively writing out > dirty pages to disk, then you want the write phase to be as long as > possible and the sync phase can be very short because there won't be > much work to do. But if your operating system is caching lots of > stuff in memory and writing dirty pages out to disk only when > absolutely necessary, then the write phase could be relatively quick > without much hurting anything, but the sync phase will need to be long > to keep from crushing the I/O system. The trouble is, we don't really > have a priori way to know which it's doing. Maybe we could try to > tune based on the behavior of previous checkpoints, ... IMHO we should re-consider the patch to sort the writes. Not so much because of the performance gain that gives, but because we can then re-arrange the fsyncs so that you write one file, then fsync it, then write the next file and so on. That way the time taken by the fsyncs is distributed between the writes, so we don't need to accurately estimate how long each will take. If one fsync takes a long time, the writes that follow will just be done a bit faster to catch up. > ... but I'm wondering > if we oughtn't to take the cheesy path first and split > checkpoint_completion_target into checkpoint_write_target and > checkpoint_sync_target. That's another parameter to set, but I'd > rather add a parameter that people have to play with to find the right > value than impose an arbitrary rule that creates unavoidable bad > performance in certain environments. That is of course simpler... -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
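(A tiny illustrative sketch of the interleaving being described, assuming the checkpoint writes have already been sorted by file; write_dirty_blocks() is a placeholder for the paced per-file write loop, not a real server function.)

#include <unistd.h>

static void write_dirty_blocks(int fd) { (void) fd; }   /* placeholder */

/* One file at a time: write its dirty blocks, fsync it, move on, so the
 * fsync time is interleaved with the writes instead of piling up at the end. */
static void
checkpoint_sorted_files(const int *fds, int nfiles)
{
    int i;

    for (i = 0; i < nfiles; i++)
    {
        write_dirty_blocks(fds[i]);
        (void) fsync(fds[i]);
    }
}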
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > IMHO we should re-consider the patch to sort the writes. Not so much > because of the performance gain that gives, but because we can then > re-arrange the fsyncs so that you write one file, then fsync it, then > write the next file and so on. Isn't that going to make performance worse not better? Generally you want to give the kernel as much scheduling flexibility as possible, which you do by issuing the write as far before the fsync as you can. An arrangement like the above removes all cross-file scheduling freedom. For example, if two files are on different spindles, you've just guaranteed that no I/O overlap is possible. > That way we the time taken by the fsyncs > is distributed between the writes, That sounds like you have an entirely wrong mental model of where the cost comes from. Those times are not independent. regards, tom lane
On Mon, Jan 31, 2011 at 11:29 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: >> IMHO we should re-consider the patch to sort the writes. Not so much >> because of the performance gain that gives, but because we can then >> re-arrange the fsyncs so that you write one file, then fsync it, then >> write the next file and so on. > > Isn't that going to make performance worse not better? Generally you > want to give the kernel as much scheduling flexibility as possible, > which you do by issuing the write as far before the fsync as you can. > An arrangement like the above removes all cross-file scheduling freedom. > For example, if two files are on different spindles, you've just > guaranteed that no I/O overlap is possible. > >> That way we the time taken by the fsyncs >> is distributed between the writes, > > That sounds like you have an entirely wrong mental model of where the > cost comes from. Those times are not independent. Yeah, Greg Smith made the same point a week or three ago. But it seems to me that there is potential value in overlaying the write and sync phases to some degree. For example, if the write phase is spread over 15 minutes and you have 30 files, then by, say, minute 7, it's probably OK to flush the file you wrote first. Waiting longer isn't necessarily going to help - the kernel has probably written what it is going to write without prodding. In fact, it might be that on a busy system, you could lose by waiting *too long* to perform the fsync. The cleaning scan and/or backends may kick out additional dirty buffers that will now have to get forced down to disk, even though you don't really care about them (because they were dirtied after the checkpoint write had already been done). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Mon, Jan 31, 2011 at 11:29 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> That sounds like you have an entirely wrong mental model of where the >> cost comes from. Those times are not independent. > Yeah, Greg Smith made the same point a week or three ago. But it > seems to me that there is potential value in overlaying the write and > sync phases to some degree. For example, if the write phase is spread > over 15 minutes and you have 30 files, then by, say, minute 7, it's a > probably OK to flush the file you wrote first. Yeah, probably, but we can't do anything as stupid as file-by-file. I wonder whether it'd be useful to keep track of the total amount of data written-and-not-yet-synced, and to issue fsyncs often enough to keep that below some parameter; the idea being that the parameter would limit how much dirty kernel disk cache there is. Of course, ideally the kernel would have a similar tunable and this would be a waste of effort on our part... regards, tom lane
On Mon, Jan 31, 2011 at 11:51 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Mon, Jan 31, 2011 at 11:29 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> That sounds like you have an entirely wrong mental model of where the >>> cost comes from. Those times are not independent. > >> Yeah, Greg Smith made the same point a week or three ago. But it >> seems to me that there is potential value in overlaying the write and >> sync phases to some degree. For example, if the write phase is spread >> over 15 minutes and you have 30 files, then by, say, minute 7, it's a >> probably OK to flush the file you wrote first. > > Yeah, probably, but we can't do anything as stupid as file-by-file. Eh? > I wonder whether it'd be useful to keep track of the total amount of > data written-and-not-yet-synced, and to issue fsyncs often enough to > keep that below some parameter; the idea being that the parameter would > limit how much dirty kernel disk cache there is. Of course, ideally the > kernel would have a similar tunable and this would be a waste of effort > on our part... It's not clear to me how you'd maintain that information without it turning into a contention bottleneck. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > 3. Pause for 3 seconds after every fsync. > I think something along the lines of #3 is probably a good idea, Really? Any particular delay is guaranteed wrong. regards, tom lane
On Mon, Jan 31, 2011 at 12:01 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> 3. Pause for 3 seconds after every fsync. > >> I think something along the lines of #3 is probably a good idea, > > Really? Any particular delay is guaranteed wrong. What I was getting at was - I think it's probably a good idea not to do the fsyncs at top speed, but I'm not too sure how they should be spaced out. I agree a fixed delay isn't necessarily right. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Mon, Jan 31, 2011 at 11:51 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I wonder whether it'd be useful to keep track of the total amount of >> data written-and-not-yet-synced, and to issue fsyncs often enough to >> keep that below some parameter; the idea being that the parameter would >> limit how much dirty kernel disk cache there is. Of course, ideally the >> kernel would have a similar tunable and this would be a waste of effort >> on our part... > It's not clear to me how you'd maintain that information without it > turning into a contention bottleneck. What contention bottleneck? I was just visualizing the bgwriter process locally tracking how many writes it'd issued. Backend-issued writes should happen seldom enough to be ignorable for this purpose. regards, tom lane
On Mon, Jan 31, 2011 at 12:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Mon, Jan 31, 2011 at 11:51 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> I wonder whether it'd be useful to keep track of the total amount of >>> data written-and-not-yet-synced, and to issue fsyncs often enough to >>> keep that below some parameter; the idea being that the parameter would >>> limit how much dirty kernel disk cache there is. Of course, ideally the >>> kernel would have a similar tunable and this would be a waste of effort >>> on our part... > >> It's not clear to me how you'd maintain that information without it >> turning into a contention bottleneck. > > What contention bottleneck? I was just visualizing the bgwriter process > locally tracking how many writes it'd issued. Backend-issued writes > should happen seldom enough to be ignorable for this purpose. Ah. Well, if you ignore backend writes, then yes, there's no contention bottleneck. However, I seem to recall Greg Smith showing a system at PGCon last year with a pretty respectable volume of backend writes (30%?) and saying "OK, so here's a healthy system". Perhaps I'm misremembering. But at any rate any backend that is using a BufferAccessStrategy figures to do a lot of its own writes. This is probably an area for improvement in future releases, if we can figure out how to do it: if we're doing a bulk load into a system with 4GB of shared_buffers using a 16MB ring buffer, we'd ideally like the background writer - or somebody other than the foreground process - to go nuts on those buffers, writing them out as fast as it possibly can - rather than letting the backend do it when the ring wraps around. Back to the idea at hand - I proposed something a bit along these lines upthread, but my idea was to proactively perform the fsyncs on the relations that had gone the longest without a write, rather than the ones with the most dirty data. I'm not sure which is better. Obviously, doing the ones that have "gone idle" gives the OS more time to write out the data, but OTOH it might not succeed in purging much dirty data. Doing the ones with the most dirty data will definitely reduce the size of the final checkpoint, but might also cause a latency spike if it's triggered immediately after heavy write activity on that file. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas wrote: > Back to the idea at hand - I proposed something a bit along these > lines upthread, but my idea was to proactively perform the fsyncs on > the relations that had gone the longest without a write, rather than > the ones with the most dirty data. I'm not sure which is better. > Obviously, doing the ones that have "gone idle" gives the OS more time > to write out the data, but OTOH it might not succeed in purging much > dirty data. Doing the ones with the most dirty data will definitely > reduce the size of the final checkpoint, but might also cause a > latency spike if it's triggered immediately after heavy write activity > on that file. Crazy idea #2 --- it would be interesting if you issued an fsync _before_ you wrote out data to a file that needed an fsync. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Tom Lane wrote: > I wonder whether it'd be useful to keep track of the total amount of > data written-and-not-yet-synced, and to issue fsyncs often enough to > keep that below some parameter; the idea being that the parameter would > limit how much dirty kernel disk cache there is. Of course, ideally the > kernel would have a similar tunable and this would be a waste of effort > on our part... > I wanted to run the tests again before reporting in detail here, because the results are so bad, but I threw out an initial report about trying to push this down to be the kernel's job at http://blog.2ndquadrant.com/en/2011/01/tuning-linux-for-low-postgresq.html So far it looks like the newish Linux dirty_bytes parameter works well at reducing latency by limiting how much dirty data can pile up before it gets nudged heavily toward disk. But the throughput drop you pay on VACUUM in particular is brutal; I'm seeing over a 50% slowdown in some cases. I suspect we need to let the regular cleaner and backend writes queue up in the largest possible cache for VACUUM, so it benefits as much as possible from elevator sorting of writes. I suspect this being the worst case now for a tightly controlled write cache is an unintended side-effect of the ring buffer implementation it uses now. Right now I'm running the same tests on XFS instead of ext3, and those are just way more sensible all around; I'll revisit this on that filesystem and ext4. The scale=500 tests I've been running lots of lately are a full 3X TPS faster on XFS relative to ext3, with about 1/8 as much worst-case latency. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
Robert Haas <robertmhaas@gmail.com> writes: > Back to the idea at hand - I proposed something a bit along these > lines upthread, but my idea was to proactively perform the fsyncs on > the relations that had gone the longest without a write, rather than > the ones with the most dirty data. Yeah. What I meant to suggest, but evidently didn't explain well, was to use that or something much like it as the rule for deciding *what* to fsync next, but to use amount-of-unsynced-data-versus-threshold as the method for deciding *when* to do the next fsync. regards, tom lane
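(Putting those two rules together as a self-contained sketch, with illustrative bookkeeping rather than the server's actual data structures: the running byte count decides *when* to issue the next fsync, and the file that has gone longest without a write decides *what* gets synced.)

#include <stddef.h>
#include <unistd.h>

#define MAX_FILES 1024

typedef struct
{
    int    fd;
    size_t unsynced;        /* bytes written to this file since its last fsync */
    long   last_write;      /* tick of the most recent write to this file */
} FileState;

static FileState files[MAX_FILES];
static int    nfiles = 0;
static size_t total_unsynced = 0;
static long   write_clock = 0;

/* Record a write, then fsync the longest-idle file whenever the total amount
 * of written-but-unsynced data exceeds the limit. */
static void
note_write(int slot, size_t nbytes, size_t limit)
{
    files[slot].unsynced += nbytes;
    files[slot].last_write = ++write_clock;
    total_unsynced += nbytes;

    while (total_unsynced > limit)
    {
        int i, coldest = -1;

        for (i = 0; i < nfiles; i++)
            if (files[i].unsynced > 0 &&
                (coldest < 0 || files[i].last_write < files[coldest].last_write))
                coldest = i;
        if (coldest < 0)
            break;
        (void) fsync(files[coldest].fd);
        total_unsynced -= files[coldest].unsynced;
        files[coldest].unsynced = 0;
    }
}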
Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> 3. Pause for 3 seconds after every fsync.
>
>> I think something along the lines of #3 is probably a good idea,
>
> Really? Any particular delay is guaranteed wrong.

'3 seconds' is just a placeholder for whatever comes out of a "total time scheduled to sync / relations to sync" computation. (Still doing all my thinking in terms of time, although I recognize a showdown with segment-based checkpoints is coming too)

I think the right way to compute "relations to sync" is to finish the sorted writes patch I sent over a not quite right yet update to already, which is my next thing to work on here. I remain pessimistic that any attempt to issue fsync calls without the maximum possible delay after asking the kernel to write things out first will work out well. My recent tests with low values of dirty_bytes on Linux just reinforce how bad that can turn out. In addition to computing the relation count while sorting them, placing writes in-order by relation and then doing all writes followed by all syncs should place the database right in the middle of the throughput/latency trade-off here. It will have had the maximum amount of time we can give it to sort and flush writes for any given relation before it is asked to sync it. I don't want to try and be any smarter than that without trying to be a *lot* smarter--timing individual sync calls, feedback loops on time estimation, etc.

At this point I have to agree with Robert's observation that splitting checkpoints into checkpoint_write_target and checkpoint_sync_target is the only reasonable thing left that might be possible to complete in a short period. So that's how this can compute the total time numerator here.

The main thing I will warn about in relation to today's discussion is the danger of true deadline-oriented scheduling in this area. The checkpoint process may discover the sync phase is falling behind expectations because the individual sync calls are taking longer than expected. If that happens, aiming for the "finish on target anyway" goal puts you right back to a guaranteed nasty write spike again. I think many people would prefer logging the overrun as tuning feedback for the DBA rather than to accelerate, which is likely to make the problem even worse if the checkpoint is falling behind. But since ultimately the feedback for this will be "make the checkpoints longer or increase checkpoint_sync_target", sync acceleration to meet the deadline isn't unacceptable; the DBA can try both of those themselves if seeing spikes.

-- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
Greg Smith wrote:
> I think the right way to compute "relations to sync" is to finish the sorted writes patch I sent over a not quite right yet update to already
Attached update now makes much more sense than the misguided patch I submitted two weeks ago. This takes the original sorted write code, first adjusting it so it only allocates the memory its tag structure is stored in once (in a kind of lazy way I can improve on right now). It then computes a bunch of derived statistics from a single walk of the sorted data on each pass through. Here's an example of what comes out:
DEBUG: BufferSync 1 dirty blocks in relation.segment_fork 11809.0_0
DEBUG: BufferSync 2 dirty blocks in relation.segment_fork 11811.0_0
DEBUG: BufferSync 3 dirty blocks in relation.segment_fork 11812.0_0
DEBUG: BufferSync 3 dirty blocks in relation.segment_fork 16496.0_0
DEBUG: BufferSync 28 dirty blocks in relation.segment_fork 16499.0_0
DEBUG: BufferSync 1 dirty blocks in relation.segment_fork 11638.0_0
DEBUG: BufferSync 1 dirty blocks in relation.segment_fork 11640.0_0
DEBUG: BufferSync 2 dirty blocks in relation.segment_fork 11641.0_0
DEBUG: BufferSync 1 dirty blocks in relation.segment_fork 11642.0_0
DEBUG: BufferSync 1 dirty blocks in relation.segment_fork 11644.0_0
DEBUG: BufferSync 2048 dirty blocks in relation.segment_fork 16508.0_0
DEBUG: BufferSync 1 dirty blocks in relation.segment_fork 11645.0_0
DEBUG: BufferSync 1 dirty blocks in relation.segment_fork 11661.0_0
DEBUG: BufferSync 1 dirty blocks in relation.segment_fork 11663.0_0
DEBUG: BufferSync 1 dirty blocks in relation.segment_fork 11664.0_0
DEBUG: BufferSync 1 dirty blocks in relation.segment_fork 11672.0_0
DEBUG: BufferSync 1 dirty blocks in relation.segment_fork 11685.0_0
DEBUG: BufferSync 2097 buffers to write, 17 total dirty segment file(s) expected to need sync
This is the first checkpoint after starting to populate a new pgbench database. The next four show it extending into new segments:
DEBUG: BufferSync 2048 dirty blocks in relation.segment_fork 16508.1_0
DEBUG: BufferSync 2048 buffers to write, 1 total dirty segment file(s) expected to need sync
DEBUG: BufferSync 2048 dirty blocks in relation.segment_fork 16508.2_0
DEBUG: BufferSync 2048 buffers to write, 1 total dirty segment file(s) expected to need sync
DEBUG: BufferSync 2048 dirty blocks in relation.segment_fork 16508.3_0
DEBUG: BufferSync 2048 buffers to write, 1 total dirty segment file(s) expected to need sync
DEBUG: BufferSync 2048 dirty blocks in relation.segment_fork 16508.4_0
DEBUG: BufferSync 2048 buffers to write, 1 total dirty segment file(s) expected to need sync
The fact that it's always showing 2048 dirty blocks on these makes me think I'm computing something wrong still, but the general idea here is working now. I had to use some magic from the md layer to let bufmgr.c know how its writes were going to get mapped into file segments and correspondingly fsync calls later. Not happy about breaking the API encapsulation there, but don't see an easy way to compute that data at the per-segment level--and it's not like that's going to change in the near future anyway.
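(For readers following along, the md.c mapping involved is essentially blockNum / RELSEG_SIZE; a stripped-down sketch of counting dirty segment files during the walk of the sorted tags, using simplified stand-ins for the real buffer tags, could look like this.)

#define RELSEG_SIZE 131072      /* blocks per 1GB segment with the default 8K BLCKSZ */

typedef struct
{
    unsigned int relNode;       /* simplified stand-in for the full buffer tag */
    unsigned int forkNum;
    unsigned int blockNum;
} SortedTag;

/* Walk tags already sorted by (relation, fork, block) and count how many
 * distinct segment files will eventually need an fsync. */
static int
count_dirty_segments(const SortedTag *tags, int ntags)
{
    int          nsegs = 0;
    int          have_prev = 0;
    unsigned int prev_rel = 0, prev_fork = 0, prev_seg = 0;
    int          i;

    for (i = 0; i < ntags; i++)
    {
        unsigned int seg = tags[i].blockNum / RELSEG_SIZE;

        if (!have_prev || tags[i].relNode != prev_rel ||
            tags[i].forkNum != prev_fork || seg != prev_seg)
        {
            nsegs++;            /* first dirty block seen in this segment file */
            prev_rel = tags[i].relNode;
            prev_fork = tags[i].forkNum;
            prev_seg = seg;
            have_prev = 1;
        }
    }
    return nsegs;
}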
I like this approach for providing a map of how to spread syncs out for a couple of reasons:
-It computes data that could be used to drive sync spread timing in a relatively short amount of simple code.
-You get write sorting at the database level helping out the OS. Everything I've been seeing recently on benchmarks says Linux at least needs all the help it can get in that regard, even if block order doesn't necessarily align perfectly with disk order.
-It's obvious how to take this same data and build a future model where the time allocated for fsyncs was proportional to how much that particular relation was touched.
Benchmarks of just the impact of the sorting step and continued bug swatting to follow.
-- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
On Mon, Jan 31, 2011 at 4:28 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> Back to the idea at hand - I proposed something a bit along these >> lines upthread, but my idea was to proactively perform the fsyncs on >> the relations that had gone the longest without a write, rather than >> the ones with the most dirty data. > > Yeah. What I meant to suggest, but evidently didn't explain well, was > to use that or something much like it as the rule for deciding *what* to > fsync next, but to use amount-of-unsynced-data-versus-threshold as the > method for deciding *when* to do the next fsync. Oh, I see. Yeah, that could be a good algorithm. I also think Bruce's idea of calling fsync() on each relation just *before* we start writing the pages from that relation might have some merit. (I'm assuming here that we are sorting the writes.) That should tend to result in the end-of-checkpoint fsyncs being quite fast, because we'll only have as much dirty data floating around as we actually wrote during the checkpoint, which according to Greg Smith is usually a small fraction of the total data in need of flushing. Also, if one of the pre-write fsyncs takes a long time, then that'll get factored into our calculations of how fast we need to write the remaining data to finish the checkpoint on schedule. Of course there's still the possibility that the I/O system literally can't finish a checkpoint in X minutes, but even in that case, the I/O saturation will hopefully be more spread out across the entire checkpoint instead of falling like a hammer at the very end. Back to your idea: One problem with trying to bound the unflushed data is that it's not clear what the bound should be. I've had this mental model where we want the OS to write out pages to disk, but that's not always true, per Greg Smith's recent posts about Linux kernel tuning slowing down VACUUM. A possible advantage of the Momjian algorithm (as it's known in the literature) is that we don't actually start forcing anything out to disk until we have a reason to do so - namely, an impending checkpoint. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
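(A hedged sketch of that pre-write fsync ordering, again assuming sorted writes; write_dirty_blocks() is a placeholder for the paced checkpoint writes, and nothing here comes from a posted patch.)

#include <unistd.h>

static void write_dirty_blocks(int fd) { (void) fd; }   /* placeholder */

/* fsync each file just before writing its checkpoint blocks, so the final
 * sync pass only has to flush what this checkpoint itself wrote. */
static void
checkpoint_with_presync(const int *fds, int nfiles)
{
    int i;

    for (i = 0; i < nfiles; i++)
    {
        (void) fsync(fds[i]);       /* flush whatever accumulated beforehand */
        write_dirty_blocks(fds[i]); /* paced writes for this file */
    }
    for (i = 0; i < nfiles; i++)
        (void) fsync(fds[i]);       /* now only this checkpoint's own data */
}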
Robert Haas <robertmhaas@gmail.com> wrote: > I also think Bruce's idea of calling fsync() on each relation just > *before* we start writing the pages from that relation might have > some merit. What bothers me about that is that you may have a lot of the same dirty pages in the OS cache as the PostgreSQL cache, and you've just ensured that the OS will write those *twice*. I'm pretty sure that the reason the aggressive background writer settings we use have not caused any noticeable increase in OS disk writes is that many PostgreSQL writes of the same buffer keep an OS buffer page from becoming stale enough to get flushed until PostgreSQL writes to it taper off. Calling fsync() right before doing "one last push" of the data could be really pessimal for some workloads. -Kevin
Robert Haas wrote: > Back to your idea: One problem with trying to bound the unflushed data > is that it's not clear what the bound should be. I've had this mental > model where we want the OS to write out pages to disk, but that's not > always true, per Greg Smith's recent posts about Linux kernel tuning > slowing down VACUUM. A possible advantage of the Momjian algorithm > (as it's known in the literature) is that we don't actually start > forcing anything out to disk until we have a reason to do so - namely, > an impending checkpoint. My trivial idea was: let's assume we checkpoint every 10 minutes, and it takes 5 minutes for us to write the data to the kernel. If no one else is writing to those files, we can safely wait maybe 5 more minutes before issuing the fsync. If, however, hundreds of writes are coming in for the same files in those final 5 minutes, we should fsync right away. My idea is that our delay between writes and fsync should somehow be controlled by how many writes to the same files are coming to the kernel while we are considering waiting, because the only downside to delay is the accumulation of non-critical writes coming into the kernel for the same files we are going to fsync later. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
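Read as code, Bruce's rule amounts to a per-file counter plus a deadline. The structure and threshold below are my own assumptions; nothing here is from an actual patch:

    #include <stdbool.h>
    #include <time.h>

    /* Hypothetical per-file bookkeeping for one checkpoint cycle. */
    typedef struct
    {
        time_t write_phase_done;    /* when the checkpoint finished writing it */
        int    writes_since;        /* non-checkpoint writes seen since then   */
    } FileSyncState;

    /*
     * Keep waiting while the file is quiet; once enough new writes to it pile
     * up, further delay only adds non-critical data to the eventual fsync, so
     * sync right away.
     */
    bool
    should_sync_now(const FileSyncState *f, time_t now,
                    double max_delay_secs, int write_threshold)
    {
        if (f->writes_since >= write_threshold)
            return true;                                 /* activity: go now */
        return difftime(now, f->write_phase_done) >= max_delay_secs; /* quiet */
    }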
Greg Smith wrote: > Greg Smith wrote: > > I think the right way to compute "relations to sync" is to finish the > > sorted writes patch I sent over a not quite right yet update to already > Attached update now makes much more sense than the misguided patch I > submitted two weeks ago. This takes the original sorted write code, > first adjusting it so it only allocates the memory its tag structure is > stored in once (in a kind of lazy way I can improve on right now). It > then computes a bunch of derived statistics from a single walk of the > sorted data on each pass through. Here's an example of what comes out: In that patch, I would like to see a meta-comment explaining why the sorting is happening and what we hope to gain. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On Tue, Feb 1, 2011 at 12:58 PM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote: > Robert Haas <robertmhaas@gmail.com> wrote: > >> I also think Bruce's idea of calling fsync() on each relation just >> *before* we start writing the pages from that relation might have >> some merit. > > What bothers me about that is that you may have a lot of the same > dirty pages in the OS cache as the PostgreSQL cache, and you've just > ensured that the OS will write those *twice*. I'm pretty sure that > the reason the aggressive background writer settings we use have not > caused any noticeable increase in OS disk writes is that many > PostgreSQL writes of the same buffer keep an OS buffer page from > becoming stale enough to get flushed until PostgreSQL writes to it > taper off. Calling fsync() right before doing "one last push" of > the data could be really pessimal for some workloads. I was thinking about what Greg reported here: http://archives.postgresql.org/pgsql-hackers/2010-11/msg01387.php If the amount of pre-checkpoint dirty data is 3GB and the checkpoint is writing 250MB, then you shouldn't have all that many extra writes... but you might have some, and that might be enough to send the whole thing down the tubes. InnoDB apparently handles this problem by advancing the redo pointer in small steps instead of in large jumps. AIUI, in addition to tracking the LSN of each page, they also track the first-dirtied LSN. That lets you checkpoint to an arbitrary LSN by flushing just the pages with an older first-dirtied LSN. So instead of doing a checkpoint every hour, you might do a mini-checkpoint every 10 minutes. Since the mini-checkpoints each need to flush less data, they should be less disruptive than a full checkpoint. But that, too, will generate some extra writes. Basically, any idea that involves calling fsync() more often is going to tend to smooth out the I/O load at the cost of some increase in the total number of writes. If we don't want any increase at all in the number of writes, spreading out the fsync() calls is pretty much the only other option. I'm worried that even with good tuning that won't be enough to tamp down the latency spikes. But maybe it will be... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
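For readers who have not seen the InnoDB scheme Robert describes, the essential bookkeeping is a second LSN per dirty page. This is a rough sketch with made-up types, not a description of the actual InnoDB code:

    #include <stdint.h>
    #include <stddef.h>

    typedef uint64_t LSN;

    /* Hypothetical dirty-page entry: newest change plus first-dirtied LSN. */
    typedef struct
    {
        LSN page_lsn;         /* LSN of the latest change to the page */
        LSN first_dirty_lsn;  /* LSN at which the page became dirty   */
    } DirtyPage;

    /* Stand-in for writing and syncing one page. */
    static void
    flush_page(DirtyPage *p)
    {
        (void) p;
    }

    /*
     * Mini-checkpoint to an arbitrary target LSN: flush only the pages that
     * were first dirtied before the target.  Once they are flushed, the redo
     * pointer can be advanced to 'target' because recovery no longer needs
     * anything older than that.
     */
    static void
    mini_checkpoint(DirtyPage *pages, size_t n, LSN target)
    {
        for (size_t i = 0; i < n; i++)
            if (pages[i].first_dirty_lsn < target)
                flush_page(&pages[i]);
    }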
Kevin Grittner wrote: > Robert Haas <robertmhaas@gmail.com> wrote: > > > I also think Bruce's idea of calling fsync() on each relation just > > *before* we start writing the pages from that relation might have > > some merit. > > What bothers me about that is that you may have a lot of the same > dirty pages in the OS cache as the PostgreSQL cache, and you've just > ensured that the OS will write those *twice*. I'm pretty sure that > the reason the aggressive background writer settings we use have not > caused any noticeable increase in OS disk writes is that many > PostgreSQL writes of the same buffer keep an OS buffer page from > becoming stale enough to get flushed until PostgreSQL writes to it > taper off. Calling fsync() right before doing "one last push" of > the data could be really pessimal for some workloads. OK, maybe my idea needs to be adjusted and we should trigger an early fsync if non-fsync writes are coming in for blocks _other_ than the ones we already wrote for that checkpoint. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Bruce Momjian <bruce@momjian.us> writes: > My trivial idea was: let's assume we checkpoint every 10 minutes, and > it takes 5 minutes for us to write the data to the kernel. If no one > else is writing to those files, we can safely wait maybe 5 more minutes > before issuing the fsync. If, however, hundreds of writes are coming in > for the same files in those final 5 minutes, we should fsync right away. Huh? I would surely hope we could assume that nobody but Postgres is writing the database files? Or are you considering that the bgwriter doesn't know exactly what the backends are doing? That's true, but I still maintain that we should design the bgwriter's behavior on the assumption that writes from backends are negligible. Certainly the backends aren't issuing fsyncs. regards, tom lane
Tom Lane wrote: > Bruce Momjian <bruce@momjian.us> writes: > > My trivial idea was: let's assume we checkpoint every 10 minutes, and > > it takes 5 minutes for us to write the data to the kernel. If no one > > else is writing to those files, we can safely wait maybe 5 more minutes > > before issuing the fsync. If, however, hundreds of writes are coming in > > for the same files in those final 5 minutes, we should fsync right away. > > Huh? I would surely hope we could assume that nobody but Postgres is > writing the database files? Or are you considering that the bgwriter > doesn't know exactly what the backends are doing? That's true, but > I still maintain that we should design the bgwriter's behavior on the > assumption that writes from backends are negligible. Certainly the > backends aren't issuing fsyncs. Right, no one else is writing but us. When I said "no one else" I meant no other bgwriter writes are going to the files we wrote as part of the checkpoint but have not fsync'ed yet. I assume we have two write streams --- the checkpoint writes, which we know at the start of the checkpoint, and the bgwriter writes that are happening in an unpredictable way based on database activity. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On Sat, Jan 15, 2011 at 05:47:24AM -0500, Greg Smith wrote: > For example, the pre-release Squeeze numbers we're seeing are awful so > far, but it's not really done yet either. Unfortunately, it does not look like Debian squeeze will change any more (or has changed much since your post) at this point, except for maybe further stable kernel updates. Which file system did you see those awful numbers on and could you maybe go into some more detail? Thanks, Michael -- <marco_g> I did send an email to propose multithreading to grub-devel on the first of april. <marco_g> Unfortunately everyone thought I was serious ;-)
Michael Banck wrote: > On Sat, Jan 15, 2011 at 05:47:24AM -0500, Greg Smith wrote: > > For example, the pre-release Squeeze numbers we're seeing are awful so > > far, but it's not really done yet either. > Unfortunately, it does not look like Debian squeeze will change any more > (or has changed much since your post) at this point, except for maybe > further stable kernel updates. Which file system did you see those awful > numbers on and could you maybe go into some more detail? Once the release comes out any day now I'll see if I can duplicate them on hardware I can talk about fully, and share the ZCAV graphs if it's still there. The server I've been running all of the extended pgbench tests in this thread on is running Ubuntu simply as a temporary way to get 2.6.32 before Squeeze ships. Last time I tried installing one of the Squeeze betas I didn't get anywhere; hoping the installer bug I ran into has been sorted when I try again. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
As already mentioned in the broader discussion at http://archives.postgresql.org/message-id/4D4C4610.1030109@2ndquadrant.com , I'm seeing no solid performance swing in the checkpoint sorting code itself. Better sometimes, worse others, but never by a large amount. Here's what the statistics part derived from the sorted data looks like on a real checkpoint spike: 2011-02-04 07:02:51 EST: LOG: checkpoint starting: xlog 2011-02-04 07:02:51 EST: DEBUG: BufferSync 10 dirty blocks in relation.segment_fork 17216.0_2 2011-02-04 07:02:51 EST: DEBUG: BufferSync 159 dirty blocks in relation.segment_fork 17216.0_1 2011-02-04 07:02:51 EST: DEBUG: BufferSync 10 dirty blocks in relation.segment_fork 17216.3_0 2011-02-04 07:02:51 EST: DEBUG: BufferSync 548 dirty blocks in relation.segment_fork 17216.4_0 2011-02-04 07:02:51 EST: DEBUG: BufferSync 808 dirty blocks in relation.segment_fork 17216.5_0 2011-02-04 07:02:51 EST: DEBUG: BufferSync 799 dirty blocks in relation.segment_fork 17216.6_0 2011-02-04 07:02:51 EST: DEBUG: BufferSync 807 dirty blocks in relation.segment_fork 17216.7_0 2011-02-04 07:02:51 EST: DEBUG: BufferSync 716 dirty blocks in relation.segment_fork 17216.8_0 2011-02-04 07:02:51 EST: DEBUG: BufferSync 3857 buffers to write, 8 total dirty segment file(s) expected to need sync 2011-02-04 07:03:31 EST: DEBUG: checkpoint sync: number=1 file=base/16384/17216.5 time=1324.614 msec 2011-02-04 07:03:31 EST: DEBUG: checkpoint sync: number=2 file=base/16384/17216.4 time=0.002 msec 2011-02-04 07:03:31 EST: DEBUG: checkpoint sync: number=3 file=base/16384/17216_fsm time=0.001 msec 2011-02-04 07:03:47 EST: DEBUG: checkpoint sync: number=4 file=base/16384/17216.10 time=16446.753 msec 2011-02-04 07:03:53 EST: DEBUG: checkpoint sync: number=5 file=base/16384/17216.8 time=5804.252 msec 2011-02-04 07:03:53 EST: DEBUG: checkpoint sync: number=6 file=base/16384/17216.7 time=0.001 msec 2011-02-04 07:03:54 EST: DEBUG: compacted fsync request queue from 32768 entries to 2 entries 2011-02-04 07:03:54 EST: CONTEXT: writing block 1642223 of relation base/16384/17216 2011-02-04 07:04:00 EST: DEBUG: checkpoint sync: number=7 file=base/16384/17216.11 time=6350.577 msec 2011-02-04 07:04:00 EST: DEBUG: checkpoint sync: number=8 file=base/16384/17216.9 time=0.001 msec 2011-02-04 07:04:00 EST: DEBUG: checkpoint sync: number=9 file=base/16384/17216.6 time=0.001 msec 2011-02-04 07:04:00 EST: DEBUG: checkpoint sync: number=10 file=base/16384/17216.3 time=0.001 msec 2011-02-04 07:04:00 EST: DEBUG: checkpoint sync: number=11 file=base/16384/17216_vm time=0.001 msec 2011-02-04 07:04:00 EST: LOG: checkpoint complete: wrote 3813 buffers (11.6%); 0 transaction log file(s) added, 0 removed, 64 recycled; write=39.073 s, sync=29.926 s, total=69.003 s; sync files=11, longest=16.446 s, average=2.720 s You can see that it ran out of fsync absorption space in the middle of the sync phase, which is usually when compaction is needed, but the recent patch to fix that kicked in and did its thing. Couple of observations: -The total number of buffers I'm computing based on the checkpoint writes being sorted is not a perfect match to the number reported by the "checkpoint complete" status line. Sometimes they are the same, sometimes not. Not sure why yet. -The estimate for "expected to need sync" computed as a by-product of the checkpoint sorting is not completely accurate either. This particular one has a fairly large error in it, percentage-wise, being off by 3 with a total of 11. 
Presumably these are absorbed fsync requests that were already queued up before the checkpoint even started. So any time estimate I drive based off of this count is only going to be approximate. -The order in which the sync phase processes files is unrelated to the order in which they are written out. Note that 17216.10 here, the biggest victim (cause?) of the I/O spike, isn't even listed among the checkpoint writes! The fuzziness here is a bit disconcerting, and I'll keep digging for why it happens. But I don't see any reason not to continue forward using the rough count here to derive a nap time from, which I can then feed into the "useful leftovers" patch that Robert already refactored here. Can always sharpen up that estimate later; I need to get some solid results I can share on what the delay time does to the throughput/latency pattern next. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
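For a sense of what deriving a nap time from that rough count might look like: divide the slice of the checkpoint interval reserved for syncs by the number of files expected to need one. The function and the write/sync split values below are only illustrative; the patch posted later in the thread ended up scheduling against sync progress instead:

    /*
     * Illustrative only: seconds to nap between fsync calls if writes are
     * supposed to finish by write_target and syncs by sync_target, both
     * expressed as fractions of the checkpoint interval.  With a 300 s
     * interval, write_target 0.5, sync_target 0.8 and 8 files, this gives
     * (0.8 - 0.5) * 300 / 8 = 11.25 s per nap.
     */
    double
    sync_nap_seconds(double checkpoint_interval_secs,
                     double write_target, double sync_target,
                     int files_to_sync)
    {
        if (files_to_sync < 1)
            files_to_sync = 1;           /* avoid division by zero */
        return (sync_target - write_target) * checkpoint_interval_secs
               / files_to_sync;
    }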
On Fri, Feb 4, 2011 at 2:08 PM, Greg Smith <greg@2ndquadrant.com> wrote: > -The total number of buffers I'm computing based on the checkpoint writes > being sorted it not a perfect match to the number reported by the > "checkpoint complete" status line. Sometimes they are the same, sometimes > not. Not sure why yet. My first guess would be that in the cases where it's not the same, some backend evicted the buffer before the background writer got to it. That's expected under heavy contention for shared_buffers. > -The estimate for "expected to need sync" computed as a by-product of the > checkpoint sorting is not completely accurate either. This particular one > has a fairly large error in it, percentage-wise, being off by 3 with a total > of 11. Presumably these are absorbed fsync requests that were already > queued up before the checkpoint even started. So any time estimate I drive > based off of this count is only going to be approximate. As previously noted, I wonder if we ought to sync queued-up requests that don't require writes before beginning the write phase. > -The order in which the sync phase processes files is unrelated to the order > in which they are written out. Note that 17216.10 here, the biggest victim > (cause?) of the I/O spike, isn't even listed among the checkpoint writes! That's awful. If more than 50% of the I/O is going to happen from one fsync() call, that seems to put a pretty pessimal bound on how much improvement we can hope to achieve here. Or am I missing something? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas wrote: > With the fsync queue compaction patch applied, I think most of this is > now not needed. Attached please find an attempt to isolate the > portion that looks like it might still be useful. The basic idea of > what remains here is to make the background writer still do its normal > stuff even when it's checkpointing. In particular, with this patch > applied, PG will: > > 1. Absorb fsync requests a lot more often during the sync phase. > 2. Still try to run the cleaning scan during the sync phase. > 3. Pause for 3 seconds after every fsync. > Yes, the bits you extracted were the remaining useful parts from the original patch. Today was quiet here because there were sports on or something, and I added full auto-tuning magic to the attached update. I need to kick off benchmarks and report back tomorrow to see how well this does, but any additional patch here would only be code cleanup on the messy stuff I did in here (plus proper implementation of the pair of GUCs). This has finally gotten to the exact logic I've been meaning to complete as spread sync since the idea was first postponed in 8.3, with the benefit of some fsync absorption improvements along the way too. The automatic timing is modeled on the existing checkpoint_completion_target concept, except with a new tunable (not yet added as a GUC) currently called CheckPointSyncTarget, set to 0.8 right now. What I think I want to do is make the existing checkpoint_completion_target now be the target for the end of the sync phase, matching its name; people who bumped it up won't necessarily even have to change anything. Then the new GUC can be checkpoint_write_target, representing the target that is in there right now. I tossed the earlier idea of counting relations to sync based on the write phase data as too inaccurate after testing, and with it for now goes checkpoint sorting. Instead, I just take a first pass over pendingOpsTable to get a total number of things to sync, which will always match the real count barring strange circumstances (like dropping a table). As for automatically determining the interval, I take the number of syncs that have finished so far, divide by the total, and get a number between 0.0 and 1.0 that represents progress on the sync phase. I then use the same basic CheckpointWriteDelay logic that is there for spreading writes out, except with the new sync target. I realized that if we assume the checkpoint writes should have finished in CheckPointCompletionTarget worth of time or segments, we can compute a new progress metric with the formula: progress = CheckPointCompletionTarget + (1.0 - CheckPointCompletionTarget) * finished / goal; Where "finished" is the number of segments synced so far, while "goal" is the total. To turn this into an example, let's say the default parameters are set, we've finished the writes, and finished 1 out of 4 syncs; that much work will be considered: progress = 0.5 + (1.0 - 0.5) * 1 / 4 = 0.625 On a scale that effectively aims to have the sync work finished by 0.8. I don't use quite the same logic as the CheckpointWriteDelay though. It turns out the existing checkpoint_completion implementation doesn't always work like I thought it did, which provides some very interesting insight into why my attempts to work around checkpoint problems haven't worked as well as expected the last few years. I thought that what it did was wait until an amount of time determined by the target had passed before doing the next write. 
That's not quite it; what it actually does is check progress against the target, then sleep exactly one nap interval if it is ahead of schedule. That is only the same thing if you have a lot of buffers to write relative to the amount of time involved. There's some alternative logic if you don't have bgwriter_lru_maxpages set, but in the normal situation it effectively means that: maximum write spread time = bgwriter_delay * checkpoint dirty blocks No matter how far apart you try to spread the checkpoints. Now, typically, when people run into these checkpoint spikes in production, reducing shared_buffers improves that. But I now realize that doing so will then reduce the average number of dirty blocks participating in the checkpoint, and therefore potentially pull the spread down at the same time! Also, if you try and tune bgwriter_delay down to get better background cleaning, you're also reducing the maximum spread. Between this issue and the bad behavior when the fsync queue fills, no wonder this has been so hard to tune out of production systems. At some point, the reduction in spread defeats further attempts to reduce the size of what's written at checkpoint time, by lowering the amount of data involved. What I do instead is nap until just after the planned schedule, then execute the sync. What ends up happening then is that there can be a long pause between the end of the write phase and when syncs start to happen, which I consider a good thing. Gives the kernel a little more time to try and get writes moving out to disk. Here's what that looks like on my development desktop: 2011-02-07 00:46:24 EST: LOG: checkpoint starting: time 2011-02-07 00:48:04 EST: DEBUG: checkpoint sync: estimated segments=10 2011-02-07 00:48:24 EST: DEBUG: checkpoint sync: naps=99 2011-02-07 00:48:36 EST: DEBUG: checkpoint sync: number=1 file=base/16736/16749.1 time=12033.898 msec 2011-02-07 00:48:36 EST: DEBUG: checkpoint sync: number=2 file=base/16736/16749 time=60.799 msec 2011-02-07 00:48:48 EST: DEBUG: checkpoint sync: naps=59 2011-02-07 00:48:48 EST: DEBUG: checkpoint sync: number=3 file=base/16736/16756 time=0.003 msec 2011-02-07 00:49:00 EST: DEBUG: checkpoint sync: naps=60 2011-02-07 00:49:00 EST: DEBUG: checkpoint sync: number=4 file=base/16736/16750 time=0.003 msec 2011-02-07 00:49:12 EST: DEBUG: checkpoint sync: naps=60 2011-02-07 00:49:12 EST: DEBUG: checkpoint sync: number=5 file=base/16736/16737 time=0.004 msec 2011-02-07 00:49:24 EST: DEBUG: checkpoint sync: naps=60 2011-02-07 00:49:24 EST: DEBUG: checkpoint sync: number=6 file=base/16736/16749_fsm time=0.004 msec 2011-02-07 00:49:36 EST: DEBUG: checkpoint sync: naps=60 2011-02-07 00:49:36 EST: DEBUG: checkpoint sync: number=7 file=base/16736/16740 time=0.003 msec 2011-02-07 00:49:48 EST: DEBUG: checkpoint sync: naps=60 2011-02-07 00:49:48 EST: DEBUG: checkpoint sync: number=8 file=base/16736/16749_vm time=0.003 msec 2011-02-07 00:50:00 EST: DEBUG: checkpoint sync: naps=60 2011-02-07 00:50:00 EST: DEBUG: checkpoint sync: number=9 file=base/16736/16752 time=0.003 msec 2011-02-07 00:50:12 EST: DEBUG: checkpoint sync: naps=60 2011-02-07 00:50:12 EST: DEBUG: checkpoint sync: number=10 file=base/16736/16754 time=0.003 msec 2011-02-07 00:50:12 EST: LOG: checkpoint complete: wrote 14335 buffers (43.7%); 0 transaction log file(s) added, 0 removed, 64 recycled; write=47.873 s, sync=127.819 s, total=227.990 s; sync files=10, longest=12.033 s, average=1.209 s Since this is ext3 the spike during the first sync is brutal, anyway, but it tried very hard 
to avoid that: it waited 99 * 200ms = 19.8 seconds between writing the last buffer and when it started syncing them (00:42:04 to 00:48:24). Given the slow write for #1, it was then behind, so it immediately moved onto #2. But after that, it was able to insert a moderate nap time between successive syncs--60 naps is 12 seconds, and it keeps that pace for the remainder of the sync. This is the same sort of thing I'd worked out as optimal on the system this patch originated from, except it had a lot more dirty relations; that's why its naptime was the 3 seconds hard-coded into earlier versions of this patch. Results on XFS with mini-server class hardware should be interesting... -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c index 4df69c2..f58ac3e 100644 *** a/src/backend/postmaster/bgwriter.c --- b/src/backend/postmaster/bgwriter.c *************** static bool am_bg_writer = false; *** 168,173 **** --- 168,175 ---- static bool ckpt_active = false; + static int checkpoint_flags = 0; + /* these values are valid when ckpt_active is true: */ static pg_time_t ckpt_start_time; static XLogRecPtr ckpt_start_recptr; *************** static pg_time_t last_xlog_switch_time; *** 180,186 **** static void CheckArchiveTimeout(void); static void BgWriterNap(void); ! static bool IsCheckpointOnSchedule(double progress); static bool ImmediateCheckpointRequested(void); static bool CompactBgwriterRequestQueue(void); --- 182,188 ---- static void CheckArchiveTimeout(void); static void BgWriterNap(void); ! static bool IsCheckpointOnSchedule(double progress,double target); static bool ImmediateCheckpointRequested(void); static bool CompactBgwriterRequestQueue(void); *************** CheckpointWriteDelay(int flags, double p *** 691,696 **** --- 693,701 ---- if (!am_bg_writer) return; + /* Cache this value for a later spread sync */ + checkpoint_flags=flags; + /* * Perform the usual bgwriter duties and take a nap, unless we're behind * schedule, in which case we just try to catch up as quickly as possible. *************** CheckpointWriteDelay(int flags, double p *** 698,704 **** if (!(flags & CHECKPOINT_IMMEDIATE) && !shutdown_requested && !ImmediateCheckpointRequested() && ! IsCheckpointOnSchedule(progress)) { if (got_SIGHUP) { --- 703,709 ---- if (!(flags & CHECKPOINT_IMMEDIATE) && !shutdown_requested && !ImmediateCheckpointRequested() && ! IsCheckpointOnSchedule(progress,CheckPointCompletionTarget)) { if (got_SIGHUP) { *************** CheckpointWriteDelay(int flags, double p *** 726,731 **** --- 731,799 ---- } /* + * CheckpointSyncDelay -- yield control to bgwriter during a checkpoint + * + * This function is called after each file sync performed by mdsync(). + * It is responsible for keeping the bgwriter's normal activities in + * progress during a long checkpoint. + */ + void + CheckpointSyncDelay(int finished,int goal) + { + int flags = checkpoint_flags; + int nap_count = 0; + double progress; + double CheckPointSyncTarget = 0.8; + + /* Do nothing if checkpoint is being executed by non-bgwriter process */ + if (!am_bg_writer) + return; + + /* + * Limit progress to the goal, which + * may be possible if the segments to sync were calculated wrong. 
+ */ + ckpt_cached_elapsed = 0; + if (finished > goal) finished=goal; + + /* + * Base our progress on the assumption that the write took + * checkpoint_completion_target worth of time, and that sync + * progress is advancing from beyond that point. + */ + progress = CheckPointCompletionTarget + + (1.0 - CheckPointCompletionTarget) * finished / goal; + + /* + * Perform the usual bgwriter duties and nap until we've just + * crossed our deadline. + */ + elog(DEBUG2,"checkpoint sync: considering a nap after progress=%.1f",progress); + while (!(flags & CHECKPOINT_IMMEDIATE) && + !shutdown_requested && + !ImmediateCheckpointRequested() && + (IsCheckpointOnSchedule(progress,CheckPointSyncTarget))) + { + if (got_SIGHUP) + { + got_SIGHUP = false; + ProcessConfigFile(PGC_SIGHUP); + } + + elog(DEBUG2,"checkpoint sync: nap count=%d",nap_count); + nap_count++; + + AbsorbFsyncRequests(); + + BgBufferSync(); + CheckArchiveTimeout(); + BgWriterNap(); + } + if (nap_count > 0) + elog(DEBUG1,"checkpoint sync: naps=%d",nap_count); + } + + /* * IsCheckpointOnSchedule -- are we on schedule to finish this checkpoint * in time? * *************** CheckpointWriteDelay(int flags, double p *** 734,740 **** * than the elapsed time/segments. */ static bool ! IsCheckpointOnSchedule(double progress) { XLogRecPtr recptr; struct timeval now; --- 802,808 ---- * than the elapsed time/segments. */ static bool ! IsCheckpointOnSchedule(double progress,double target) { XLogRecPtr recptr; struct timeval now; *************** IsCheckpointOnSchedule(double progress) *** 743,750 **** Assert(ckpt_active); ! /* Scale progress according to checkpoint_completion_target. */ ! progress *= CheckPointCompletionTarget; /* * Check against the cached value first. Only do the more expensive --- 811,820 ---- Assert(ckpt_active); ! /* Scale progress according to given target. */ ! progress *= target; ! ! elog(DEBUG2,"checkpoint schedule check: scaled progress=%.1f target=%.1f",progress,target); /* * Check against the cached value first. Only do the more expensive *************** IsCheckpointOnSchedule(double progress) *** 773,778 **** --- 843,850 ---- ((double) recptr.xrecoff - (double) ckpt_start_recptr.xrecoff) / XLogSegSize) / CheckPointSegments; + elog(DEBUG2,"checkpoint schedule: elapsed xlogs=%.1f",elapsed_xlogs); + if (progress < elapsed_xlogs) { ckpt_cached_elapsed = elapsed_xlogs; *************** IsCheckpointOnSchedule(double progress) *** 787,792 **** --- 859,866 ---- elapsed_time = ((double) ((pg_time_t) now.tv_sec - ckpt_start_time) + now.tv_usec / 1000000.0) / CheckPointTimeout; + elog(DEBUG2,"checkpoint schedule: elapsed time=%.1f",elapsed_time); + if (progress < elapsed_time) { ckpt_cached_elapsed = elapsed_time; diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c index 9d585b6..f294f6f 100644 *** a/src/backend/storage/smgr/md.c --- b/src/backend/storage/smgr/md.c *************** *** 31,39 **** #include "pg_trace.h" - /* interval for calling AbsorbFsyncRequests in mdsync */ - #define FSYNCS_PER_ABSORB 10 - /* * Special values for the segno arg to RememberFsyncRequest. 
* --- 31,36 ---- *************** mdsync(void) *** 932,938 **** HASH_SEQ_STATUS hstat; PendingOperationEntry *entry; - int absorb_counter; /* Statistics on sync times */ int processed = 0; --- 929,934 ---- *************** mdsync(void) *** 943,948 **** --- 939,948 ---- uint64 longest = 0; uint64 total_elapsed = 0; + /* Sync spreading counters */ + int sync_segments = 0; + int current_segment = 0; + /* * This is only called during checkpoints, and checkpoints should only * occur in processes that have created a pendingOpsTable. *************** mdsync(void) *** 1001,1008 **** /* Set flag to detect failure if we don't reach the end of the loop */ mdsync_in_progress = true; /* Now scan the hashtable for fsync requests to process */ - absorb_counter = FSYNCS_PER_ABSORB; hash_seq_init(&hstat, pendingOpsTable); while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL) { --- 1001,1033 ---- /* Set flag to detect failure if we don't reach the end of the loop */ mdsync_in_progress = true; + /* For spread sync timing purposes, make a scan through the + * hashtable to count its entries. Using hash_get_num_entries + * instead would require a stronger lock than we want to have at + * this point, and we don't want to count requests destined for + * next cycle anyway + * + * XXX Should we skip this if there is no sync spreading, or if + * fsync is off? + */ + hash_seq_init(&hstat, pendingOpsTable); + while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL) + { + if (entry->cycle_ctr == mdsync_cycle_ctr) + continue; + sync_segments++; + } + + /* + * In the unexpected situation where the above estimate says there + * is nothing to sync, avoid division by zero errors in the + * progress computation below. + */ + if (sync_segments == 0) + sync_segments = 1; + elog(DEBUG1, "checkpoint sync: estimated segments=%d",sync_segments); + /* Now scan the hashtable for fsync requests to process */ hash_seq_init(&hstat, pendingOpsTable); while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL) { *************** mdsync(void) *** 1027,1043 **** int failures; /* ! * If in bgwriter, we want to absorb pending requests every so ! * often to prevent overflow of the fsync request queue. It is ! * unspecified whether newly-added entries will be visited by ! * hash_seq_search, but we don't care since we don't need to ! * process them anyway. */ ! if (--absorb_counter <= 0) ! { ! AbsorbFsyncRequests(); ! absorb_counter = FSYNCS_PER_ABSORB; ! } /* * The fsync table could contain requests to fsync segments that --- 1052,1060 ---- int failures; /* ! * If in bgwriter, perform normal duties. */ ! CheckpointSyncDelay(current_segment,sync_segments); /* * The fsync table could contain requests to fsync segments that *************** mdsync(void) *** 1131,1140 **** pfree(path); /* ! * Absorb incoming requests and check to see if canceled. */ ! AbsorbFsyncRequests(); ! absorb_counter = FSYNCS_PER_ABSORB; /* might as well... */ if (entry->canceled) break; --- 1148,1156 ---- pfree(path); /* ! * If in bgwriter, perform normal duties. */ ! 
CheckpointSyncDelay(current_segment,sync_segments); if (entry->canceled) break; *************** mdsync(void) *** 1149,1154 **** --- 1165,1172 ---- if (hash_search(pendingOpsTable, &entry->tag, HASH_REMOVE, NULL) == NULL) elog(ERROR, "pendingOpsTable corrupted"); + + current_segment++; } /* end loop over hashtable entries */ /* Return sync performance metrics for report at checkpoint end */ diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h index eaf2206..5da0aa2 100644 *** a/src/include/postmaster/bgwriter.h --- b/src/include/postmaster/bgwriter.h *************** extern void BackgroundWriterMain(void); *** 26,31 **** --- 26,32 ---- extern void RequestCheckpoint(int flags); extern void CheckpointWriteDelay(int flags, double progress); + extern void CheckpointSyncDelay(int finished,int goal); extern bool ForwardFsyncRequest(RelFileNodeBackend rnode, ForkNumber forknum, BlockNumber segno);
2011/2/7 Greg Smith <greg@2ndquadrant.com>: > Robert Haas wrote: >> >> With the fsync queue compaction patch applied, I think most of this is >> now not needed. Attached please find an attempt to isolate the >> portion that looks like it might still be useful. The basic idea of >> what remains here is to make the background writer still do its normal >> stuff even when it's checkpointing. In particular, with this patch >> applied, PG will: >> >> 1. Absorb fsync requests a lot more often during the sync phase. >> 2. Still try to run the cleaning scan during the sync phase. >> 3. Pause for 3 seconds after every fsync. >> > > Yes, the bits you extracted were the remaining useful parts from the > original patch. Today was quiet here because there were sports on or > something, and I added full auto-tuning magic to the attached update. I > need to kick off benchmarks and report back tomorrow to see how well this > does, but any additional patch here would only be code cleanup on the messy > stuff I did in here (plus proper implementation of the pair of GUCs). This > has finally gotten to the exact logic I've been meaning to complete as > spread sync since the idea was first postponed in 8.3, with the benefit of > some fsync aborption improvements along the way too > > The automatic timing is modeled on the existing checkpoint_completion_target > concept, except with a new tunable (not yet added as a GUC) currently called > CheckPointSyncTarget, set to 0.8 right now. What I think I want to do is > make the existing checkpoint_completion_target now be the target for the end > of the sync phase, matching its name; people who bumped it up won't > necessarily even have to change anything. Then the new guc can be > checkpoint_write_target, representing the target that is in there right now. Is it worth a new thread with the different IO improvements done so far or on-going and how we may add new GUC(if required !!!) with intelligence between those patches ? ( For instance, hint bit IO limit needs probably a tunable to define something similar to hint_write_completion_target and/or IO_throttling strategy, ...items which are still in gestation...) > > I tossed the earlier idea of counting relations to sync based on the write > phase data as too inaccurate after testing, and with it for now goes > checkpoint sorting. Instead, I just take a first pass over pendingOpsTable > to get a total number of things to sync, which will always match the real > count barring strange circumstances (like dropping a table). > > As for the automatically determining the interval, I take the number of > syncs that have finished so far, divide by the total, and get a number > between 0.0 and 1.0 that represents progress on the sync phase. I then use > the same basic CheckpointWriteDelay logic that is there for spreading writes > out, except with the new sync target. I realized that if we assume the > checkpoint writes should have finished in CheckPointCompletionTarget worth > of time or segments, we can compute a new progress metric with the formula: > > progress = CheckPointCompletionTarget + (1.0 - CheckPointCompletionTarget) * > finished / goal; > > Where "finished" is the number of segments written out, while "goal" is the > total. To turn this into an example, let's say the default parameters are > set, we've finished the writes, and finished 1 out of 4 syncs; that much > work will be considered: > > progress = 0.5 + (1.0 - 0.5) * 1 / 4 = 0.625 > > On a scale that effectively aimes to be finished sync work by 0.8. 
> > I don't use quite the same logic as the CheckpointWriteDelay though. It > turns out the existing checkpoint_completion implementation doesn't always > work like I thought it did, which provide some very interesting insight into > why my attempts to work around checkpoint problems haven't worked as well as > expected the last few years. I thought that what it did was wait until an > amount of time determined by the target was reached until it did the next > write. That's not quite it; what it actually does is check progress against > the target, then sleep exactly one nap interval if it is is ahead of > schedule. That is only the same thing if you have a lot of buffers to write > relative to the amount of time involved. There's some alternative logic if > you don't have bgwriter_lru_maxpages set, but in the normal situation it > effectively it means that: > > maximum write spread time=bgwriter_delay * checkpoint dirty blocks > > No matter how far apart you try to spread the checkpoints. Now, typically, > when people run into these checkpoint spikes in production, reducing > shared_buffers improves that. But I now realize that doing so will then > reduce the average number of dirty blocks participating in the checkpoint, > and therefore potentially pull the spread down at the same time! Also, if > you try and tune bgwriter_delay down to get better background cleaning, > you're also reducing the maximum spread. Between this issue and the bad > behavior when the fsync queue fills, no wonder this has been so hard to tune > out of production systems. At some point, the reduction in spread defeats > further attempts to reduce the size of what's written at checkpoint time, by > lowering the amount of data involved. interesting! > > What I do instead is nap until just after the planned schedule, then execute > the sync. What ends up happening then is that there can be a long pause > between the end of the write phase and when syncs start to happen, which I > consider a good thing. Gives the kernel a little more time to try and get > writes moving out to disk. Sounds like a really good idea like that. 
> Here's what that looks like on my development > desktop: > > 2011-02-07 00:46:24 EST: LOG: checkpoint starting: time > 2011-02-07 00:48:04 EST: DEBUG: checkpoint sync: estimated segments=10 > 2011-02-07 00:48:24 EST: DEBUG: checkpoint sync: naps=99 > 2011-02-07 00:48:36 EST: DEBUG: checkpoint sync: number=1 > file=base/16736/16749.1 time=12033.898 msec > 2011-02-07 00:48:36 EST: DEBUG: checkpoint sync: number=2 > file=base/16736/16749 time=60.799 msec > 2011-02-07 00:48:48 EST: DEBUG: checkpoint sync: naps=59 > 2011-02-07 00:48:48 EST: DEBUG: checkpoint sync: number=3 > file=base/16736/16756 time=0.003 msec > 2011-02-07 00:49:00 EST: DEBUG: checkpoint sync: naps=60 > 2011-02-07 00:49:00 EST: DEBUG: checkpoint sync: number=4 > file=base/16736/16750 time=0.003 msec > 2011-02-07 00:49:12 EST: DEBUG: checkpoint sync: naps=60 > 2011-02-07 00:49:12 EST: DEBUG: checkpoint sync: number=5 > file=base/16736/16737 time=0.004 msec > 2011-02-07 00:49:24 EST: DEBUG: checkpoint sync: naps=60 > 2011-02-07 00:49:24 EST: DEBUG: checkpoint sync: number=6 > file=base/16736/16749_fsm time=0.004 msec > 2011-02-07 00:49:36 EST: DEBUG: checkpoint sync: naps=60 > 2011-02-07 00:49:36 EST: DEBUG: checkpoint sync: number=7 > file=base/16736/16740 time=0.003 msec > 2011-02-07 00:49:48 EST: DEBUG: checkpoint sync: naps=60 > 2011-02-07 00:49:48 EST: DEBUG: checkpoint sync: number=8 > file=base/16736/16749_vm time=0.003 msec > 2011-02-07 00:50:00 EST: DEBUG: checkpoint sync: naps=60 > 2011-02-07 00:50:00 EST: DEBUG: checkpoint sync: number=9 > file=base/16736/16752 time=0.003 msec > 2011-02-07 00:50:12 EST: DEBUG: checkpoint sync: naps=60 > 2011-02-07 00:50:12 EST: DEBUG: checkpoint sync: number=10 > file=base/16736/16754 time=0.003 msec > 2011-02-07 00:50:12 EST: LOG: checkpoint complete: wrote 14335 buffers > (43.7%); 0 transaction log file(s) added, 0 removed, 64 recycled; > write=47.873 s, sync=127.819 s, total=227.990 s; sync files=10, > longest=12.033 s, average=1.209 s > > Since this is ext3 the spike during the first sync is brutal, anyway, but it > tried very hard to avoid that: it waited 99 * 200ms = 19.8 seconds between > writing the last buffer and when it started syncing them (00:42:04 to > 00:48:24). Given the slow write for #1, it was then behind, so it > immediately moved onto #2. But after that, it was able to insert a moderate > nap time between successive syncs--60 naps is 12 seconds, and it keeps that > pace for the remainder of the sync. This is the same sort of thing I'd > worked out as optimal on the system this patch originated from, except it > had a lot more dirty relations; that's why its naptime was the 3 seconds > hard-coded into earlier versions of this patch. > > Results on XFS with mini-server class hardware should be interesting... > > -- > Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD > PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us > "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books > > > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers > > -- Cédric Villemain 2ndQuadrant http://2ndQuadrant.fr/ PostgreSQL : Expertise, Formation et Support
Cédric Villemain wrote: > Is it worth a new thread with the different IO improvements done so > far or on-going and how we may add new GUC(if required !!!) with > intelligence between those patches ? ( For instance, hint bit IO limit > needs probably a tunable to define something similar to > hint_write_completion_target and/or IO_throttling strategy, ...items > which are still in gestation...) > Maybe, but I wouldn't bring all that up right now. Trying to wrap up the CommitFest, too distracting, etc. As a larger statement on this topic, I'm never very excited about redesigning here starting from any point other than "saw a bottleneck doing <x> on a production system". There's a long list of such things already around waiting to be addressed, and I've never seen any good evidence of work related to hint bits being on it. Please correct me if you know of some--I suspect you do from the way you're bringing this up. If we were to consider kicking off some larger work here, I would drive that by first asking where the data is that supports that work being necessary. It's hard enough to fix a bottleneck that's staring right at you; trying to address one that's just theorized is impossible. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
Greg Smith <greg@2ndquadrant.com> wrote: > As a larger statement on this topic, I'm never very excited about > redesigning here starting from any point other than "saw a > bottleneck doing <x> on a production system". There's a long list > of such things already around waiting to be addressed, and I've > never seen any good evidence of work related to hint bits being on > it. Please correct me if you know of some--I suspect you do from > the way you're brining this up. There are occasional posts from those wondering why their read-only queries are so slow after a bulk load, and why they are doing heavy writes. (I remember when I posted about that, as a relative newbie, and I know I've seen others.) I think worst case is probably: - Bulk load data. - Analyze (but don't vacuum) the new data. - Start a workload with a lot of small, concurrent random reads. - Watch performance tank when the write cache gluts. This pattern is why we've adopted a pretty strict rule in our shop that we run VACUUM FREEZE ANALYZE between a bulk load and putting the database back into production. It's probably a bigger issue for those who can't do that. -Kevin
Kevin Grittner wrote: > There are occasional posts from those wondering why their read-only > queries are so slow after a bulk load, and why they are doing heavy > writes. (I remember when I posted about that, as a relative newbie, > and I know I've seen others.) > Sure; I created http://wiki.postgresql.org/wiki/Hint_Bits a while back specifically to have a resource to explain that mystery to offer people. But there's a difference between having a performance issue that people don't understand, and having a real bottleneck you can't get rid of. My experience is that people who have hint bit issues run into them as a minor side-effect of a larger vacuum issue, and that if you get that under control they're only a minor detail in comparison. Makes it hard to get too excited about optimizing them. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
Looks like it's time to close the book on this one for 9.1 development...the unfortunate results are at http://www.2ndquadrant.us/pgbench-results/index.htm Test set #12 is the one with spread sync that I was hoping would turn out better than #9, the reference I was trying to improve on. TPS is about 5% slower on the scale=500 and 15% slower on the scale=1000 tests with sync spread out. Even worse, maximum latency went up a lot. I am convinced of a couple of things now: 1) Most of the benefit we were seeing from the original patch I submitted was simply from doing much better at absorbing fsync requests from backends while the checkpoint sync was running. The already committed fsync compaction patch effectively removes that problem though, to the extent it's possible to do so, making the remaining pieces here not as useful in its wake. 2) I need to start over testing here with something that isn't 100% write all of the time the way pgbench is. It's really hard to isolate out latency improvements when the test program guarantees all associated write caches will be completely filled at every moment. Also, with a test like that I can't see any benefit from changes that improve performance only for readers, and an all-write load is quite unrealistic relative to real-world workloads. 3) The existing write spreading code in the background writer needs to be overhauled, too, before spreading the syncs around is going to give the benefits I was hoping for. Given all that, I'm going to take my feedback and give the test server a much deserved break. I'm happy that the fsync compaction patch has made 9.1 much more tolerant of write-heavy loads than earlier versions, so it's not like no progress was made in this release. For anyone who wants more details here...the news on this spread sync implementation is not all bad. If you compare this result from HEAD, with scale=1000 and clients=256: http://www.2ndquadrant.us/pgbench-results/611/index.html Against its identically configured result with spread sync: http://www.2ndquadrant.us/pgbench-results/708/index.html There are actually significantly fewer entries in the >2000 ms latency area. That shows up as a reduction in the 90th percentile latency figures I compute, and you can see it in the graph if you look at how much denser the points are in the 2000 - 4000 ms area on #611. But that's a pretty weak change. But the most disappointing part here relative to what I was hoping is what happens with bigger buffer caches. The main idea driving this approach was that it would enable larger values of shared_buffers without the checkpoint spikes being as bad. Test set #13 tries that out, by increasing shared_buffers from 256MB to 4GB, along with a big enough increase in checkpoint_segments to make most checkpoints time based. Not only did smaller scale TPS drop in half, but all kinds of bad things happened to latency. 
Here's a sample of the sort of dysfunctional checkpoints that came out of that: 2011-02-10 02:41:17 EST: LOG: checkpoint starting: xlog 2011-02-10 02:53:15 EST: DEBUG: checkpoint sync: estimated segments=22 2011-02-10 02:53:15 EST: DEBUG: checkpoint sync: number=1 file=base/16384/16768 time=150.008 msec 2011-02-10 02:53:15 EST: DEBUG: checkpoint sync: number=2 file=base/16384/16749 time=0.002 msec 2011-02-10 02:53:15 EST: DEBUG: checkpoint sync: number=3 file=base/16384/16749_fsm time=0.001 msec 2011-02-10 02:53:23 EST: DEBUG: checkpoint sync: number=4 file=base/16384/16761 time=8014.102 msec 2011-02-10 02:53:23 EST: DEBUG: checkpoint sync: number=5 file=base/16384/16752_vm time=0.002 msec 2011-02-10 02:53:35 EST: DEBUG: checkpoint sync: number=6 file=base/16384/16761.5 time=11739.038 msec 2011-02-10 02:53:37 EST: DEBUG: checkpoint sync: number=7 file=base/16384/16761.6 time=2205.721 msec 2011-02-10 02:53:45 EST: DEBUG: checkpoint sync: number=8 file=base/16384/16761.2 time=8273.849 msec 2011-02-10 02:54:06 EST: DEBUG: checkpoint sync: number=9 file=base/16384/16766 time=20874.167 msec 2011-02-10 02:54:06 EST: DEBUG: checkpoint sync: number=10 file=base/16384/16762 time=0.002 msec 2011-02-10 02:54:08 EST: DEBUG: checkpoint sync: number=11 file=base/16384/16761.3 time=2440.441 msec 2011-02-10 02:54:09 EST: DEBUG: checkpoint sync: number=12 file=base/16384/16766.1 time=635.839 msec 2011-02-10 02:54:09 EST: DEBUG: checkpoint sync: number=13 file=base/16384/16752_fsm time=0.001 msec 2011-02-10 02:54:09 EST: DEBUG: checkpoint sync: number=14 file=base/16384/16764 time=0.001 msec 2011-02-10 02:54:09 EST: DEBUG: checkpoint sync: number=15 file=base/16384/16768_fsm time=0.001 msec 2011-02-10 02:54:09 EST: DEBUG: checkpoint sync: number=16 file=base/16384/16761_vm time=0.001 msec 2011-02-10 02:54:09 EST: DEBUG: checkpoint sync: number=17 file=base/16384/16761.4 time=150.702 msec 2011-02-10 02:54:09 EST: DEBUG: checkpoint sync: number=18 file=base/16384/16752 time=0.002 msec 2011-02-10 02:54:09 EST: DEBUG: checkpoint sync: number=19 file=base/16384/16761_fsm time=0.001 msec 2011-02-10 02:54:09 EST: DEBUG: checkpoint sync: number=20 file=base/16384/16749_vm time=0.001 msec 2011-02-10 02:54:09 EST: DEBUG: checkpoint sync: number=21 file=base/16384/16385 time=0.001 msec 2011-02-10 02:54:09 EST: DEBUG: checkpoint sync: number=22 file=base/16384/16761.1 time=175.575 msec 2011-02-10 02:54:10 EST: LOG: checkpoint complete: wrote 242614 buffers (46.3%); 0 transaction log file(s) added, 0 removed, 34 recycled; write=716.637 s, sync=54.659 s, total=772.976 s; sync files=22, longest=20.874 s, average=2.484 s That's 12 minutes for the write phase, even though checkpoints should be happening every 5 minutes here. With that bad of a write phase overrun, spread sync had no room to work, so no net improvement at all. What is happening here is similar to the behavior I described seeing on my client system but didn't have an example to share until now. 
During the write phase, looking at "Dirty:" in /proc/meminfo showed the value peaking at over 1GB while writes were happening, and eventually the background writer process wasn't getting any serious CPU time compared to the backends; this is what it looked like via ps: %CPU %MEM TIME+ COMMAND 4 0 01:51.28 /home/gsmith/pgwork/inst/spread-sync/bin/pgbench -f /home/gsmith/pgbench-tools 2 8.1 00:39.71 postgres: gsmith pgbench ::1(43871) UPDATE 2 8 00:39.28 postgres: gsmith pgbench ::1(43875) UPDATE 2 8.1 00:39.92 postgres: gsmith pgbench ::1(43865) UPDATE 2 8.1 00:39.54 postgres: gsmith pgbench ::1(43868) UPDATE 2 8 00:39.36 postgres: gsmith pgbench ::1(43870) INSERT 2 8.1 00:39.47 postgres: gsmith pgbench ::1(43877) UPDATE 1 8 00:39.39 postgres: gsmith pgbench ::1(43864) COMMIT 1 8.1 00:39.78 postgres: gsmith pgbench ::1(43866) UPDATE 1 8 00:38.99 postgres: gsmith pgbench ::1(43867) UPDATE 1 8.1 00:39.55 postgres: gsmith pgbench ::1(43872) UPDATE 1 8.1 00:39.90 postgres: gsmith pgbench ::1(43873) UPDATE 1 8.1 00:39.64 postgres: gsmith pgbench ::1(43876) UPDATE 1 8.1 00:39.93 postgres: gsmith pgbench ::1(43878) UPDATE 1 8.1 00:39.83 postgres: gsmith pgbench ::1(43863) UPDATE 1 8 00:39.47 postgres: gsmith pgbench ::1(43869) UPDATE 1 8.1 00:40.11 postgres: gsmith pgbench ::1(43874) UPDATE 1 0 00:11.91 [flush-9:1] 0 0 27:43.75 [xfsdatad/6] 0 9.4 00:02.21 postgres: writer process I want to make this problem go away, but as you can see spreading the sync calls around isn't enough. I think the main write loop needs to get spread out more, too, so that the background writer is trying to work at a more reasonable pace. I am pleased I've been able to reproduce this painful behavior at home using test data, because that much improves my odds of being able to isolate its cause and test solutions. But it's a tricky problem, and I'm certainly not going to fix it in the next week. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
On Thu, Feb 10, 2011 at 10:30 PM, Greg Smith <greg@2ndquadrant.com> wrote: > 3) The existing write spreading code in the background writer needs to be > overhauled, too, before spreading the syncs around is going to give the > benefits I was hoping for. I've been thinking about this problem a bit. It strikes me that the whole notion of a background writer delay is probably wrong-headed. Instead of having fixed-length cycles, we might want to make the delay dependent on whether we're actually keeping up. So during each cycle, we decide how many buffers we want to clean, and we write 'em. Then we go to sleep. When we wake up again, we figure out whether we kept up. If the number of buffers we wrote during the prior cycle was more than the required number, then we'll sleep longer the next time, up to some maximum; if we didn't write enough, we'll reduce the sleep. Along with this, we'd want to change the minimum rate of writing checkpoint buffers from 1 per cycle to 1 for every 200 ms, or something like that. We could even possibly have a system where backends wake the background writer up early if they notice that it's not keeping up, although it's not exactly clear what a good algorithm would be. Another thing that would be really nice is if backends could somehow let the background writer know when they're using a BufferAccessStrategy, and somehow convince the background writer to write those buffers out to the OS at top speed. > I want to make this problem go away, but as you can see spreading the sync > calls around isn't enough. I think the main write loop needs to get spread > out more, too, so that the background writer is trying to work at a more > reasonable pace. I am pleased I've been able to reproduce this painful > behavior at home using test data, because that much improves my odds of > being able to isolate its cause and test solutions. But it's a tricky > problem, and I'm certainly not going to fix it in the next week. Thanks for working on this. I hope we get a better handle on it for 9.2. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
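As a toy version of that self-tuning delay, assuming the adjustment is a simple clamped speed-up/back-off (the factors and bounds here are placeholders, not a proposal):

    /*
     * Sketch of an adaptive background writer sleep: lengthen the nap when the
     * previous cycle wrote at least what it needed to, shorten it when the
     * cycle fell behind.  Factors and bounds are illustrative only.
     */
    int
    next_bgwriter_delay_ms(int current_delay_ms,
                           int buffers_written, int buffers_required)
    {
        const int min_delay_ms = 10;
        const int max_delay_ms = 1000;
        int next;

        if (buffers_written >= buffers_required)
            next = current_delay_ms * 2;   /* keeping up: sleep longer       */
        else
            next = current_delay_ms / 2;   /* falling behind: wake up sooner */

        if (next < min_delay_ms)
            next = min_delay_ms;
        if (next > max_delay_ms)
            next = max_delay_ms;
        return next;
    }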