Thread: Load distributed checkpoint
This is a proposal for load distributed checkpoint. (It was presented at the PostgreSQL Anniversary Summit last summer.)

We often encounter a performance gap during checkpoints. The reason is write bursts: storage devices are so overworked during a checkpoint that they cannot serve normal transaction processing. A checkpoint consists of the following four steps, and the major performance problem is the 2nd step, in which all dirty buffers are written without any interval.

1. Query information (REDO pointer, next XID etc.)
2. Write dirty pages in the buffer pool
3. Flush all modified files
4. Update the control file

I suggest writing pages with sleeps inserted in the 2nd step, using the normal activity of the background writer. It is something like cost-based vacuum delay. The background writer has two pointers, 'ALL' and 'LRU', indicating where to write out in the buffer pool. We can wait for the ALL clock-hand to go all the way around to guarantee that all pages have been written.

Here is pseudo-code for the proposed method. The internal loop is just the same as the bgwriter's normal activity.

    PrepareCheckPoint();  -- do step 1
    Reset num_of_scanned_pages by ALL activity;
    do {
        BgBufferSync();   -- do a part of step 2
        sleep(bgwriter_delay);
    } while (num_of_scanned_pages < shared_buffers);
    CreateCheckPoint();   -- do steps 3 and 4

We could instead accelerate the background writer to reduce the work done at checkpoint time, but that introduces another performance problem: extra pressure is put on the storage devices all the time to keep the number of dirty pages low.

I'm working on automatically adjusting the progress of the checkpoint to checkpoint_timeout and the WAL segment limit, to avoid the overlap of two checkpoints. I'll post a patch sometime soon.

Comments and suggestions welcome.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
Hello, Itagaki-san

> This is a proposal for load distributed checkpoint.
> We often encounter a performance gap during checkpoints. The reason is write
> bursts: storage devices are so overworked during a checkpoint that they cannot
> serve normal transaction processing.

Good! You are focusing on a very important problem. System designers don't like unsteady performance -- sudden slowdowns. Commercial database systems have made efforts to provide steady performance. I've seen a report somewhere that Oracle provides stable throughput even during checkpoints; I wonder how that is implemented.

> I'm working on automatically adjusting the progress of the checkpoint to
> checkpoint_timeout and the WAL segment limit, to avoid the overlap of two
> checkpoints.

Have you already tried your patch? What's your first impression of the improvement? I'm very interested. On my machine, pgbench shows 210 tps at first, but it drops to 70 tps during a checkpoint.

> A checkpoint consists of the following four steps, and the major performance
> problem is the 2nd step, in which all dirty buffers are written without any
> interval.
>
> 1. Query information (REDO pointer, next XID etc.)
> 2. Write dirty pages in the buffer pool
> 3. Flush all modified files
> 4. Update the control file

Hmm. Isn't it possible that step 3 affects the performance greatly? I'm sorry if you have already identified step 2 as what disturbs the backends.

As you know, PostgreSQL does not transfer the data to disk when write()ing. The actual transfer occurs when fsync()ing at checkpoints, unless the filesystem cache runs short. So the disk is overworked at fsync()s.

Which processing of the bgwriter (shared locking of buffers, write(), fsync(), flushing of the WAL, etc.) do you consider (or did you detect) as disturbing which processing of the backends (exclusive locking of buffers, putting log records onto WAL buffers, flushing the log at commit, writing dirty buffers when shared buffers run short)?
>>> On Thu, Dec 7, 2006 at 12:05 AM, in message <20061207144843.6269.ITAGAKI.TAKAHIRO@oss.ntt.co.jp>, ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> wrote:

> We often encounter a performance gap during checkpoints. The reason is write
> bursts: storage devices are so overworked during a checkpoint that they cannot
> serve normal transaction processing.

When we first switched our web site to PostgreSQL, this was one of our biggest problems. Queries which normally run in a few milliseconds were hitting the 20 second limit we impose in our web application. These were happening in bursts which suggested that they were caused by checkpoints. We adjusted the background writer configuration and nearly eliminated the problem.

bgwriter_all_maxpages | 600
bgwriter_all_percent  | 10
bgwriter_delay        | 200
bgwriter_lru_maxpages | 200
bgwriter_lru_percent  | 20

Between the xfs caching and the batter backed cache in the RAID controller, the disk writes seemed to settle out pretty well.

> A checkpoint consists of the following four steps, and the major performance
> problem is the 2nd step, in which all dirty buffers are written without any
> interval.
>
> 1. Query information (REDO pointer, next XID etc.)
> 2. Write dirty pages in the buffer pool
> 3. Flush all modified files
> 4. Update the control file
>
> I suggest writing pages with sleeps inserted in the 2nd step, using the
> normal activity of the background writer. It is something like cost-based
> vacuum delay. The background writer has two pointers, 'ALL' and 'LRU',
> indicating where to write out in the buffer pool. We can wait for the ALL
> clock-hand to go all the way around to guarantee that all pages have been
> written.
>
> Here is pseudo-code for the proposed method. The internal loop is just the
> same as the bgwriter's normal activity.
>
>     PrepareCheckPoint();  -- do step 1
>     Reset num_of_scanned_pages by ALL activity;
>     do {
>         BgBufferSync();   -- do a part of step 2
>         sleep(bgwriter_delay);
>     } while (num_of_scanned_pages < shared_buffers);
>     CreateCheckPoint();   -- do steps 3 and 4

Would the background writer be disabled during this extended checkpoint? How is it better to concentrate step 2 in an extended checkpoint periodically rather than consistently in the background writer?

> We could instead accelerate the background writer to reduce the work done at
> checkpoint time, but that introduces another performance problem: extra
> pressure is put on the storage devices all the time to keep the number of
> dirty pages low.

Doesn't the file system caching logic combined with a battery backed cache in the controller cover this, or is your patch to help out those who don't have battery backed controller cache? What would the impact of your patch be on environments like ours? Will there be any effect on PITR techniques, in terms of how current the copied WAL files would be?

> I'm working on automatically adjusting the progress of the checkpoint to
> checkpoint_timeout and the WAL segment limit, to avoid the overlap of two
> checkpoints. I'll post a patch sometime soon.
>
> Comments and suggestions welcome.
>
> Regards,
> ---
> ITAGAKI Takahiro
> NTT Open Source Software Center
Takayuki Tsunakawa wrote: > Hello, Itagaki-san >> Checkpoint consists of the following four steps, and the major >> performance >> problem is 2nd step. All dirty buffers are written without interval >> in it. >> 1. Query information (REDO pointer, next XID etc.) >> 2. Write dirty pages in buffer pool >> 3. Flush all modified files >> 4. Update control file > > Hmm. Isn't it possible that step 3 affects the performance greatly? > I'm sorry if you have already identified step 2 as disturbing > backends. > > As you know, PostgreSQL does not transfer the data to disk when > write()ing. Actual transfer occurs when fsync()ing at checkpoints, > unless the filesystem cache runs short. So, disk is overworked at > fsync()s. It seems to me that virtual memory settings of the OS will determine if step 2 or step 3 causes much of the actual disk I/O. In particular, on Linux, things like /proc/sys/vm/dirty_expire_centisecs and dirty_writeback_centisecs and possibly dirty_background_ratio would affect this. If those numbers are high, ISTM most write()s from step 2 would wait for the flush in step 3. If I understand correctly, if the dirty_expire_centisecs number is low, most write()s from step 2 would happen before step 3 because of the pdflush daemons. I expect other OS's would have different but similar knobs to tune this. It seems to me that the most portable way postgresql could force the I/O to be balanced would be to insert otherwise unnecessary fsync()s into step 2; but that it might (not sure why) be better to handle this through OS-specific tuning outside of postgres.
Hello,

As Mr. Mayer points out, which of step 2 or 3 actually causes I/O depends on the VM settings and the amount of RAM available for the file system cache.

"Ron Mayer" <rm_pg@cheapcomplexdevices.com> wrote in message news:45786549.2000602@cheapcomplexdevices.com...
> It seems to me that the most portable way postgresql could force
> the I/O to be balanced would be to insert otherwise unnecessary
> fsync()s into step 2; but that it might (not sure why) be better
> to handle this through OS-specific tuning outside of postgres.

I'm afraid it is difficult for system designers to expect steady throughput/response time as long as PostgreSQL depends on the flushing of the file system cache. How does Oracle provide stable performance? Though I'm not sure, isn't the key to use O_SYNC so that write()s transfer data to disk? That is, PostgreSQL would completely control the timing of data transfer. Moreover, if possible, it's better to bypass the file system cache, using flags such as O_DIRECT for open() on UNIX and FILE_FLAG_NO_BUFFERING for CreateFile() on Windows. As far as I know, SQL Server and Oracle do this. I think commercial DBMSs do the same thing to control and anticipate I/O activity without being influenced by VM policy.

If PostgreSQL is to use these, writing of dirty buffers has to be improved. To decrease the number of I/Os, pages that are adjacent on disk and also adjacent in memory must be written with one write().

From: "Kevin Grittner" <Kevin.Grittner@wicourts.gov>
> Would the background writer be disabled during this extended checkpoint? How is it better to concentrate step 2 in an extended checkpoint periodically rather than consistently in the background writer?
> Will there be any effect on PITR techniques, in terms of how current the copied WAL files would be?

Extending the checkpoint can also cause extended downtime, to put it in an extreme way. I understand that checkpoints occur during crash recovery and PITR, so the time for those operations would get longer. The checkpoint also occurs at server shutdown. However, a distinction among these cases might be made, so the undesirable extension could be avoided.
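To make the flags mentioned above concrete, here is a minimal sketch for the UNIX side only. The helper name is invented for illustration; this is not code from PostgreSQL or from any of the commercial products discussed, just a demonstration of what the flags do.

    /*
     * Illustrative sketch only: open a data file so that write() itself
     * transfers data to disk.  O_SYNC makes every write synchronous;
     * O_DIRECT (Linux) additionally bypasses the kernel page cache, in
     * which case the buffer, offset, and length must be suitably aligned
     * (e.g. allocated with posix_memalign()).
     */
    #define _GNU_SOURCE             /* for O_DIRECT on Linux */
    #include <fcntl.h>

    int
    open_datafile_sync(const char *path, int bypass_fs_cache)
    {
        int     flags = O_RDWR | O_SYNC;

        if (bypass_fs_cache)
            flags |= O_DIRECT;      /* skip the filesystem cache entirely */

        return open(path, flags);
    }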
On Thu, 7 Dec 2006, Kevin Grittner wrote:

> Between the xfs caching and the batter backed cache in the RAID...

Mmmmm, battered cache. You can deep fry anything nowadays.

> Would the background writer be disabled during this extended checkpoint?

The background writer is the same process that does the full buffer sweep at checkpoint time. You wouldn't have to disable it, because it would be busy doing this extended checkpoint instead of its normal job.

> How is it better to concentrate step 2 in an extended checkpoint
> periodically rather than consistently in the background writer?

Right now, when the checkpoint flush is occurring, there is no background writer active -- that process is handling the checkpoint. Itagaki's suggestion is basically to take the current checkpoint code, which runs all in one burst, and spread it out over time. I like the concept, as I've seen the behavior he's describing (even after tuning the background writer like you suggest and doing Linux disk tuning as Ron describes), but I think solving the problem is a little harder than suggested.

I have two concerns with the logic behind this approach. The first is that if the background writer isn't keeping up with writing out all the dirty pages, what makes you think that running the checkpoint with a similar level of activity is going to? If your checkpoint is taking a long time, it's because the background writer has an overwhelming load and needs to be bailed out. Slowing down the writes with a lazier checkpoint process introduces the possibility that you'll hit a second checkpoint request before you're even finished cleaning up the first one, and then you're really in trouble.

Second, the assumption here is that writing the dirty buffers out is the primary cause of the ugly slowdown. I too believe it could just as easily be the fsync when it's done that's killing you, and slowing down the writes isn't necessarily going to make that faster.

> Doesn't the file system caching logic combined with a battery backed
> cache in the controller cover this, or is your patch to help out those
> who don't have battery backed controller cache?

Unless your shared buffer pool is so small that you can write it all out onto the cache, that won't help much with this problem.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
On Fri, 8 Dec 2006, Takayuki Tsunakawa wrote: > Though I'm not sure, isn't it the key to use O_SYNC so that write()s > transfer data to disk? If disk writes near checkpoint time aren't happening fast enough now, I doubt forcing a sync after every write will make that better. > If PostgreSQL is to use these, writing of dirty buffers has to be > improved. To decrease the count of I/O, pages adjacent on disk that > are also adjacent on memory must be written with one write(). Sorting out which pages are next to one another on disk is one of the jobs the file system cache does; bypassing it will then make all that complicated sorting logic the job of the database engine. And unlike the operating system, the engine doesn't even really know anything about the filesystem or physical disks involved, so what are the odds it's going to do a better job? That's the concern with assuming direct writes are the solution here--you have to be smarter than the OS is for that to be an improvement, and you have a lot less information to make your decisions with than it does. I like Itagaki's idea of some automatic tuning of the checkpoint timeout and wal segment parameters to help out with checkpoint behavior; that's an idea that would help a lot of people. I'm not so sure that trying to make PostgreSQL take over operations that it's relying on the OS to handle intelligently right now will be as helpful, and it's a big programming job to even try. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
Ron Mayer <rm_pg@cheapcomplexdevices.com> wrote:

> >> 1. Query information (REDO pointer, next XID etc.)
> >> 2. Write dirty pages in buffer pool
> >> 3. Flush all modified files
> >> 4. Update control file
> >
> > Hmm. Isn't it possible that step 3 affects the performance greatly?
> > I'm sorry if you have already identified step 2 as disturbing
> > backends.
>
> It seems to me that virtual memory settings of the OS will determine
> if step 2 or step 3 causes much of the actual disk I/O.
>
> if the dirty_expire_centisecs number is low, most write()s
> from step 2 would happen before step 3 because of the pdflush daemons.

Exactly. It depends on the OS, kernel settings, and filesystem. I tested the patch on Linux kernel 2.6.9-39, default settings, and ext3fs. Maybe the pdflush daemons were strong enough to write out dirty buffers in the kernel, so step 2 was the main part of the I/O and step 3 was not.

There are technical issues in distributing step 3. We can write buffers on a page basis, which is granular enough. However, fsync() is on a file basis (1GB), so we can only control the granularity of fsync roughly. sync_file_range (http://lwn.net/Articles/178199/) or some other special APIs would be a help, but there are portability issues...

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
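For reference, here is a minimal Linux-only sketch of the sync_file_range() call mentioned above. The wrapper names are invented, this is not part of any posted patch, and it is exactly the kind of non-portable API being cautioned about; the point is only that writeback can be requested for a byte range of a file instead of the whole file, which fsync() cannot do.

    #define _GNU_SOURCE
    #include <sys/types.h>
    #include <fcntl.h>

    /* Ask the kernel to start writing back the given range, without waiting. */
    static int
    start_range_writeback(int fd, off_t offset, off_t nbytes)
    {
        return sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);
    }

    /* Write back the given range and wait until it has reached the disk. */
    static int
    flush_range(int fd, off_t offset, off_t nbytes)
    {
        return sync_file_range(fd, offset, nbytes,
                               SYNC_FILE_RANGE_WAIT_BEFORE |
                               SYNC_FILE_RANGE_WRITE |
                               SYNC_FILE_RANGE_WAIT_AFTER);
    }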
Greg Smith <gsmith@gregsmith.com> writes: > On Fri, 8 Dec 2006, Takayuki Tsunakawa wrote: >> Though I'm not sure, isn't it the key to use O_SYNC so that write()s >> transfer data to disk? > If disk writes near checkpoint time aren't happening fast enough now, I > doubt forcing a sync after every write will make that better. I think the idea would be to force the writes to actually occur, rather than just being scheduled (and then forced en-masse by an fsync at checkpoint time). Since the point of the bgwriter is to try to force writes to occur *outside* checkpoint times, this seems to make sense. I share your doubts about the value of slowing down checkpoints --- but to the extent that bgwriter-issued writes are delayed by the kernel until the next checkpoint, we are certainly not getting the desired effect of leveling the write load. >> To decrease the count of I/O, pages adjacent on disk that >> are also adjacent on memory must be written with one write(). > Sorting out which pages are next to one another on disk is one of the jobs > the file system cache does; bypassing it will then make all that > complicated sorting logic the job of the database engine. Indeed --- the knowledge that we don't know the physical layout has always been the strongest argument against using O_SYNC in this way. But I don't think anyone's made any serious tests. A properly tuned bgwriter should be eating only a "background" level of I/O effort between checkpoints, so maybe it doesn't matter too much if it's not optimally scheduled. regards, tom lane
"Takayuki Tsunakawa" <tsunakawa.takay@jp.fujitsu.com> wrote: > I'm afraid it is difficult for system designers to expect steady > throughput/response time, as long as PostgreSQL depends on the > flushing of file system cache. How does Oracle provide stable > performance? > Though I'm not sure, isn't it the key to use O_SYNC so that write()s > transfer data to disk? AFAIK, other databases use write() and fsync() in combination. They call fsync() immediately after they write buffers in some small batches. Otherwise, they uses asynchronous and direct I/O options. Therefore, dirty pages in kernel buffers are keeped to be low at any time. Regards, --- ITAGAKI Takahiro NTT Open Source Software Center
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote: > We adjusted the background writer configuration and nearly eliminated > the problem. > > bgwriter_all_maxpages | 600 > bgwriter_all_percent | 10 > bgwriter_delay | 200 > bgwriter_lru_maxpages | 200 > bgwriter_lru_percent | 20 > > Between the xfs caching and the batter backed cache in the RAID > controller, the disk writes seemed to settle out pretty well. Yes, higher bgwriter_all_maxpages is better for stability. I also do so up to now. However, if some processes makes lots of dirty buffers in the shortest time, ex.VACUUM, bgwriter begins to write many pages and affects responce time. We will be able to set bgwriter_all_maxpages to lower value with load distributed checkpoint. It expands the range of tuning of bgwriter. Regards, --- ITAGAKI Takahiro NTT Open Source Software Center
On Thu, Dec 07, 2006 at 10:03:05AM -0600, Kevin Grittner wrote: > When we first switched our web site to PostgreSQL, this was one of our biggest problems. Queries which normally run ina few milliseconds were hitting the 20 second limit we impose in our web application. These were happening in bursts whichsuggested that they were caused by checkpoints. We adjusted the background writer configuration and nearly eliminatedthe problem. > > bgwriter_all_maxpages | 600 > bgwriter_all_percent | 10 > bgwriter_delay | 200 > bgwriter_lru_maxpages | 200 > bgwriter_lru_percent | 20 Bear in mind that bgwriter settings should be considered in conjunction with shared_buffer and checkpoint_timeout settings. For example, if you have 60,000 shared buffers and a 300 second checkpoint interval, those settings are going to be pretty aggressive. Generally, I try and configure the all* settings so that you'll get 1 clock-sweep per checkpoint_timeout. It's worked pretty well, but I don't have any actual tests to back that methodology up. -- Jim Nasby jim@nasby.net EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
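Some arithmetic (mine, not from the mails) to compare the two rules of thumb being discussed: with bgwriter_delay = 200 ms the bgwriter runs 5 rounds per second, i.e. about 1,500 rounds in a 300 second checkpoint interval. Sweeping 60,000 shared buffers exactly once per interval therefore needs only 60,000 / 1,500 = 40 buffers scanned per round. Kevin's settings are far more aggressive: bgwriter_all_percent = 10 on 20,000 buffers scans about 2,000 buffers per round, so (assuming the 600-page write cap is not hit) the ALL clock-hand completes a full sweep roughly every 10 rounds, i.e. about every 2 seconds, rather than once per checkpoint.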
On Fri, Dec 08, 2006 at 02:22:14PM +0900, ITAGAKI Takahiro wrote:
> AFAIK, other databases use write() and fsync() in combination. They call
> fsync() immediately after they write buffers in small batches. Otherwise,
> they use asynchronous and direct I/O options. Either way, dirty pages in
> kernel buffers are kept low at all times.

The "easy" solution I can think of is, when a session/backend is exiting cleanly (the client sent the quit command), to execute fsync() on some of the descriptors before actually closing them. At this point the user isn't waiting anymore, so it can take its time. The problem with fsync() remains that it can cause a write spike, although the more often you do it the less of an effect it would have.

A longer term solution may be to create a daemon with system-specific information that monitors the load and tweaks parameters in response -- not just postgresql parameters, but also system parameters. Even if it never becomes part of postgresql, it would provide a way to test all these "hunches" people have about optimising the system.

BTW, has anyone ever considered having the bgwriter do a NOTIFY whenever it starts/ends a checkpoint, so clients could monitor the activity without reading the logs?

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.
On 12/7/06, Ron Mayer <rm_pg@cheapcomplexdevices.com> wrote:
> Takayuki Tsunakawa wrote:
>> Hello, Itagaki-san
>>> Checkpoint consists of the following four steps, and the major
>>> performance
>>> problem is 2nd step. All dirty buffers are written without interval
>>> in it.
>>> 1. Query information (REDO pointer, next XID etc.)
>>> 2. Write dirty pages in buffer pool
>>> 3. Flush all modified files
>>> 4. Update control file
>>
>> Hmm. Isn't it possible that step 3 affects the performance greatly?
>> I'm sorry if you have already identified step 2 as disturbing
>> backends.
>>
>> As you know, PostgreSQL does not transfer the data to disk when
>> write()ing. Actual transfer occurs when fsync()ing at checkpoints,
>> unless the filesystem cache runs short. So, disk is overworked at
>> fsync()s.
>
> It seems to me that virtual memory settings of the OS will determine
> if step 2 or step 3 causes much of the actual disk I/O.
> In particular, on Linux, things like /proc/sys/vm/dirty_expire_centisecs

dirty_expire_centisecs will have little, if any, effect on a box with a consistent workload. Under uniform load the bgwriter will keep pushing buffers to the fs cache, which will result in eviction/flushing of pages to disk. That the pages age quickly can lower the cap on dirty pages, but it won't/can't handle the sudden spike at checkpoint time.

> and dirty_writeback_centisecs

Again, on a system that encounters I/O chokes at checkpoints, pdflush is presumably working like crazy at that time. Reducing the gap between its wakeup calls will probably have very little impact on checkpoint performance.

> and possibly dirty_background_ratio

I have seen this put a real cap on the number of dirty pages during normal running. As regards checkpoints, this again seems to have little effect.

The problem while dealing with checkpoints is that we are dealing with two starkly different types of I/O load. The larger the number of shared_buffers, the greater the spike in I/O activity at checkpoint. AFAICS no specific vm tunables can smooth out checkpoint spikes by themselves. There has to be some intelligence in the bgwriter to even the load out.

> would affect this. If those numbers are high, ISTM most write()s
> from step 2 would wait for the flush in step 3. If I understand
> correctly, if the dirty_expire_centisecs number is low, most write()s
> from step 2 would happen before step 3 because of the pdflush daemons.
> I expect other OS's would have different but similar knobs to tune this.
> It seems to me that the most portable way postgresql could force
> the I/O to be balanced would be to insert otherwise unnecessary
> fsync()s into step 2; but that it might (not sure why) be better
> to handle this through OS-specific tuning outside of postgres.
>>> On Fri, Dec 8, 2006 at 1:13 AM, in message <20061208071305.GG44124@nasby.net>, "Jim C. Nasby" <jim@nasby.net> wrote:
> On Thu, Dec 07, 2006 at 10:03:05AM -0600, Kevin Grittner wrote:
>> We adjusted the background writer configuration
>> and nearly eliminated the problem.
>>
>> bgwriter_all_maxpages | 600
>> bgwriter_all_percent  | 10
>> bgwriter_delay        | 200
>> bgwriter_lru_maxpages | 200
>> bgwriter_lru_percent  | 20
>
> Bear in mind that bgwriter settings should be considered in conjunction
> with shared_buffer and checkpoint_timeout settings. For example, if you
> have 60,000 shared buffers and a 300 second checkpoint interval, those
> settings are going to be pretty aggressive.
>
> Generally, I try and configure the all* settings so that you'll get 1
> clock-sweep per checkpoint_timeout. It's worked pretty well, but I don't
> have any actual tests to back that methodology up.

We have 20,000 shared buffers and a 300 second checkpoint interval. We got to these numbers somewhat scientifically. I studied I/O patterns under production load and figured we should be able to handle about 800 writes per 200 ms without causing problems. I have to admit that I based the percentages and the ratio between "all" and "lru" on gut feel after musing over the documentation.

Since my values were such a dramatic change from the default, I boosted the production settings a little bit each day and looked for feedback from our web team. Things improved with each incremental increase. When I got to my calculated values (above) they reported that these timeouts had dropped to an acceptable level -- a few per day on a web site with 2 million hits per day. We may benefit from further adjustments, but since the problem is negligible with these settings, there are bigger fish to fry at the moment.

By the way, if I remember correctly, these boxes have 256 MB of battery backed cache, while 20,000 buffers is 156.25 MB.

-Kevin
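Some back-of-the-envelope arithmetic (mine, not from the mail) on what that budget amounts to: 800 pages per 200 ms at the default 8 kB block size is about 6.5 MB per round, i.e. roughly 32 MB/s of sustained write bandwidth reserved for background writing. It also matches the configured ceilings above, since bgwriter_all_maxpages (600) plus bgwriter_lru_maxpages (200) allow at most 800 page writes per 200 ms round.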
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > "Jim C. Nasby" <jim@nasby.net> wrote: >> Generally, I try and configure the all* settings so that you'll get 1 >> clock-sweep per checkpoint_timeout. It's worked pretty well, but I don't >> have any actual tests to back that methodology up. > We got to these numbers somewhat scientifically. I studied I/O > patterns under production load and figured we should be able to handle > about 800 writes in per 200 ms without causing problems. I have to > admit that I based the percentages and the ratio between "all" and "lru" > on gut feel after musing over the documentation. I like Kevin's settings better than what Jim suggests. If the bgwriter only makes one sweep between checkpoints then it's hardly going to make any impact at all on the number of dirty buffers the checkpoint will have to write. The point of the bgwriter is to reduce the checkpoint I/O spike by doing writes between checkpoints, and to have any meaningful impact on that, you'll need it to make the cycle several times. Another point here is that you want checkpoints to be pretty far apart to minimize the WAL load from full-page images. So again, a bgwriter that's only making one loop per checkpoint is not gonna be doing much. I wonder whether it would be feasible to teach the bgwriter to get more aggressive as the time for the next checkpoint approaches? Writes issued early in the interval have a much higher probability of being wasted (because the page gets re-dirtied later). But maybe that just reduces to what Takahiro-san already suggested, namely that checkpoint-time writes should be done with the same kind of scheduling the bgwriter uses outside checkpoints. We still have the problem that the real I/O storm is triggered by fsync() not write(), and we don't have a way to spread out the consequences of fsync(). regards, tom lane
>>> On Fri, Dec 8, 2006 at 10:43 AM, in message <6439.1165596207@sss.pgh.pa.us>, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> I wonder whether it would be feasible to teach the bgwriter to get more
> aggressive as the time for the next checkpoint approaches? Writes
> issued early in the interval have a much higher probability of being
> wasted (because the page gets re-dirtied later).

But wouldn't the ones written earlier stand a much better chance of being scheduled for a physical write before the fsync? They aren't ALL re-dirtied. In our environment, we seem to be getting a lot written before the fsync with our current settings.

> But maybe that just
> reduces to what Takahiro-san already suggested, namely that
> checkpoint-time writes should be done with the same kind of scheduling
> the bgwriter uses outside checkpoints. We still have the problem that
> the real I/O storm is triggered by fsync() not write(), and we don't
> have a way to spread out the consequences of fsync().

We don't have a way to force writes before the fsync, but early writes to the file system encourage it. After reading this thread, I'm tempted to nudge our settings a little higher -- especially the percentages. How much overhead is there in checking whether buffer pages are dirty?

-Kevin
>>> On Fri, Dec 8, 2006 at 11:01 AM, in message <4579461A.EE98.0025.0@wicourts.gov>, "Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote: I'm going to correct a previous statement, based on having just chatted with a member of the web team. The remaining dribble of these was only occurring on the Windows machines. (We were running two servers from Windows and two from Linux for a while to see how things compared.) The settings I previously posted have completely eliminated the problem on Linux, and we are moving all servers to Linux, so this has become a total non-issue for us. I really encourage people to try some much more aggressive background writer settings, and possibly try tuning the OS dirty write settings. Unless PostgreSQL gets much closer to the hardware than the community consensus seems to support, I don't understand what you could do in the checkpoint phase that would improve on that. (That, of course, doesn't mean I'm not missing something, just that the arguments made so far haven't shown me that the suggested changes would do anything but move the problem around a little bit.) -Kevin
On Fri, 2006-12-08 at 11:55 -0600, Kevin Grittner wrote: > >>> On Fri, Dec 8, 2006 at 11:01 AM, in message > <4579461A.EE98.0025.0@wicourts.gov>, "Kevin Grittner" > <Kevin.Grittner@wicourts.gov> wrote: > > I'm going to correct a previous statement, based on having just chatted > with a member of the web team. > > The remaining dribble of these was only occurring on the Windows > machines. (We were running two servers from Windows and two from Linux > for a while to see how things compared.) The settings I previously > posted have completely eliminated the problem on Linux, and we are > moving all servers to Linux, so this has become a total non-issue for > us. I really encourage people to try some much more aggressive > background writer settings, and possibly try tuning the OS dirty write > settings. How much increased I/O usage have you seen in regular operation with those settings? -- Brad Nicholson 416-673-4106 Database Administrator, Afilias Canada Corp.
>>> On Fri, Dec 8, 2006 at 12:18 PM, in message <1165601938.10248.56.camel@dba5.int.libertyrms.com>, Brad Nicholson <bnichols@ca.afilias.info> wrote:
>
> How much increased I/O usage have you seen in regular operation with
> those settings?

We have not experienced any increase in I/O, just a smoothing. Keep in mind that the file system cache will collapse repeated writes to the same location until things settle, and the controller's cache also has a chance of doing so. If we just push dirty pages out to the OS as soon as possible, and let the file system do its job, I think we're in better shape than if we try to micro-manage it within our buffer pages. Your mileage may vary, of course, but I'm curious whether any real-world production examples exist where this approach is a loser.

-Kevin
> Unless PostgreSQL gets much closer to the hardware than the community
> consensus seems to support, I don't understand what you could do in the
> checkpoint phase that would improve on that. (That, of course, doesn't
> mean I'm not missing something, just that the arguments made so far
> haven't shown me that the suggested changes would do anything but move
> the problem around a little bit.)
If we can try out things like O_SYNC and/or O_DIRECT that will give us exact information about dirty pages that need be written to the disk at any given point in time. We can then use this information to design light weight checkpoints.
On Fri, 2006-12-08 at 11:05 +0900, Takayuki Tsunakawa wrote: > I understand that checkpoints occur during crash > recovery and PITR, so time for those operations would get longer. A restorepoint happens during recovery, not a checkpoint. The recovery is merely repeating the work of the checkpoint that occurred in the original WAL stream. Elongating the checkpoint would not have any effect on a restorepoint: we only record the checkpoint when it is complete and we only create a restorepoint when we see the checkpoint record. Crash recovery and PITR use almost exactly the same code path (by design), so there isn't anything special to say about PITR either. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote: > We have not experience any increase in I/O, just a smoothing. Keep in > mind that the file system cache will collapse repeated writes to the > same location until things settle, and the controller's cache also has a > chance of doing so. If we just push dirty pages out to the OS as soon > as possible, and let the file system do its job, I think we're in better > shape than if we try to micro-manage it within our buffer pages. Maybe we have two entirely different tuning approaches: 1. Retain dirty buffers in database, and keep OS buffers clean. 2.Keep database clean, and entrust OS to manage dirty buffers. I suggested the 1st one, and you did the 2nd. Bottle-neck in checkpoints vary in the approaches; write() will be worse in 1st, fsync() in 2nd. Distributed write() is easier than distributed fsync(), because we can use write() on a page basis, but fsync() only on a file basis. Also, database has own access-frequency information for its buffers, so I think 1st approach behaves better in handling re-dirty of buffers. Regards, --- ITAGAKI Takahiro NTT Open Source Software Center
Hello, From: "ITAGAKI Takahiro" <itagaki.takahiro@oss.ntt.co.jp> "Takayuki Tsunakawa" <tsunakawa.takay@jp.fujitsu.com> wrote: >> I'm afraid it is difficult for system designers to expect steady >> throughput/response time, as long as PostgreSQL depends on the >> flushing of file system cache. How does Oracle provide stable >> performance? >> Though I'm not sure, isn't it the key to use O_SYNC so that write()s >> transfer data to disk? > > AFAIK, other databases use write() and fsync() in combination. They call > fsync() immediately after they write buffers in some small batches. Otherwise, > they uses asynchronous and direct I/O options. Therefore, dirty pages in > kernel buffers are keeped to be low at any time. Oracle seems to use O_SYNC. I've found a related page in the Oracle manuals. -------------------------------------------------- http://download-west.oracle.com/docs/cd/B19306_01/win.102/b15688/ap_unix.htm Direct Writes to Disk On both UNIX and Windows platforms, bypassing the file system buffer cache ensures data is written to disk. On UNIX, Oracle Database uses the O_SYNC flag to bypass the file system buffer cache. The flag name depends on the UNIX port. On Windows, Oracle Database bypasses the file system buffer cache completely. -------------------------------------------------- As Itagaki-san says, asynchronous+direct I/O provides best performance, I believe. Oracle supplies the combination as follows: -------------------------------------------------- http://download-west.oracle.com/docs/cd/B19306_01/server.102/b15658/appc_linux.htm#sthref870 Oracle Database supports kernel asynchronous I/O. This feature is disabled by default. By default, the DISK_ASYNCH_IO initialization parameter in the parameter file is set to TRUE to enable asynchronous I/O on raw devices. To enable asynchronous I/O on file system files: Ensure that all Oracle Database files are located on file systems that support asynchronous I/O. Set the FILESYSTEMIO_OPTIONS initialization parameter in the parameter file to one of the following values: Linux Distribution Recommended Value SUSE Linux Enterprise Server 9 SETALL Other distributions ASYNCH -------------------------------------------------- I believe SQL Server also uses direct+asynchronous I/O, because Microsoft recommends in MSDN that the combination plus appropriate multi-threading provides best performance. I tested Oracle9i on RHEL 2.1. I straced DBWn (database writer, which is like the bgwriter of PostgreSQL) while creating tables, indexes, etc. and shutting down the database server. Oracle surely uses the O_SYNC as follows, but it doesn't use fsync(). 24462 open("/work4/ora/tuna/users01.dbf", O_RDWR|O_SYNC|O_LARGEFILE) = 16 I wonder how the other big DBMS, IBM DB2, handles this. Is Itagaki-san referring to DB2?
Mr. Riggs,

Thank you for teaching me the following. I seem to have misunderstood. I'll learn more.

From: "Simon Riggs" <simon@2ndquadrant.com>
> On Fri, 2006-12-08 at 11:05 +0900, Takayuki Tsunakawa wrote:
>> I understand that checkpoints occur during crash
>> recovery and PITR, so time for those operations would get longer.
>
> A restorepoint happens during recovery, not a checkpoint. The recovery
> is merely repeating the work of the checkpoint that occurred in the
> original WAL stream. Elongating the checkpoint would not have any effect
> on a restorepoint: we only record the checkpoint when it is complete and
> we only create a restorepoint when we see the checkpoint record.

Regards,
> I wonder how the other big DBMS, IBM DB2, handles this. Is Itagaki-san
> referring to DB2?
DB2 would also open data files with O_SYNC option and page_cleaners (counterparts of bgwriter) would exploit AIO if available on the system.
Inaam Rana
EnterpriseDB http://www.enterprisedb.com
ITAGAKI Takahiro wrote:
> "Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote:
>
>> ...the file system cache will collapse repeated writes to the
>> same location until things settle ...
>> If we just push dirty pages out to the OS as soon as possible,
>> and let the file system do its job, I think we're in better
>> shape...
>
> Maybe we have two entirely different tuning approaches:
> 1. Retain dirty buffers in the database, and keep OS buffers clean.
> 2. Keep the database clean, and entrust the OS to manage dirty buffers.
>
> I suggested the 1st one, and you did the 2nd. The bottleneck at checkpoints
> varies between the approaches: write() will be worse in the 1st, fsync() in
> the 2nd.

The fsync() won't necessarily be worse in the second approach. OS's have quite a few tunable parameters that can encourage the system to physically write the pending write()s before the fsync() -- either in the background or by the process doing the write() itself when there are too many dirty pages. For checkpoints, I think the main question is whether postgresql's background writer is smarter or less smart than pdflush or the equivalent on your system for database workloads.

> Also, the database has its own access-frequency information for its buffers,
> so I think the 1st approach behaves better in handling re-dirtying of buffers.

I'm curious what access-frequency info the OS and the database have.

One thing I do worry about is if both postgresql and the OS are delaying write()s in the hope of collapsing them at the same time. If so, this would cause both to experience bigger delays than expected, and make checkpoints worse. I'm guessing that if you use approach 1 you might be better off turning down the amount of buffering that the OS does with dirty pages, and if you use approach 2, you might be better off turning down the amount of delay that postgresql adds.
>>> On Mon, Dec 11, 2006 at 3:31 PM, in message <457DCE4D.7090402@cheapcomplexdevices.com>, Ron Mayer <rm_pg@cheapcomplexdevices.com> wrote: > > One thing I do worry about is if both postgresql and the OS > are both delaying write()s in the hopes of collapsing them > at the same time. If so, this would cause both to be experience > bigger delays than expected, and make checkpoints worse. That is my concern. Letting 30 seconds worth of dirty pages accumulate between checkpoints and then trying to smooth the writes within checkpoint code seems like a doomed effort. Either we take over control of everything, like many other products, and schedule it all ourselves, or we push it out to the file system fairly quickly and let it do its job. Let's not stand with one foot on the pier and one on the boat. It is entirely possible that the current setup, where a delay of a couple seconds may collapse several OS API calls, could have a performance benefit; but, let's be clear that we're talking about saving that overhead, not disk I/O. -Kevin
> > One thing I do worry about is if both postgresql and the OS > > are both delaying write()s in the hopes of collapsing them > > at the same time. If so, this would cause both to be experience > > bigger delays than expected, and make checkpoints worse. > > That is my concern. Letting 30 seconds worth of dirty pages accumulate > between checkpoints and then trying to smooth the writes within > checkpoint code seems like a doomed effort. 30 seconds ??? You are supposed to avoid excessive checkpoints. If you are reducing checkpoint_timeout to avoid the spike, you are imho definitely tuning the wrong knob. You are supposed to increase checkpoint_timeout as far as you can to still have an acceptable recovery delay after a crash. If you cannot afford a huge spike during checkpoint, say every 30 minutes, you have to make bgwriter more aggressive. This would generally be true for both of the variants, db driven direct io and buffered filesystem io. Andreas
>>> On Tue, Dec 12, 2006 at 3:22 AM, in message <E1539E0ED7043848906A8FF995BDA5790198ECC8@m0143.s-mxs.net>, "Zeugswetter Andreas ADI SD" <ZeugswetterA@spardat.at> wrote: >> > One thing I do worry about is if both postgresql and the OS >> > are both delaying write()s in the hopes of collapsing them >> > at the same time. If so, this would cause both to be experience >> > bigger delays than expected, and make checkpoints worse. >> >> That is my concern. Letting 30 seconds worth of dirty pages > accumulate >> between checkpoints and then trying to smooth the writes within >> checkpoint code seems like a doomed effort. > > 30 seconds ??? You are supposed to avoid excessive checkpoints. > If you are reducing checkpoint_timeout to avoid the spike, you are imho > definitely tuning the wrong knob. Sorry for the fuzzy language -- I was intending to describe a situation where dirty pages accumulate by checkpoint time which would take 30 seconds to write to disk. We were into this situation (and worse) with the default bgwriter settings. > you have to make bgwriter more aggressive. This is what I've been saying. I've also been saying that if the PostgreSQL way is to let the file system handle the caching and I/O scheduling, we should trust it to know what to do with dirty pages, and not try to second-guess it. (Of course there are knobs to tune the file system if needed.) Our checkpoint performance issues went away when we went to settings which basically never leave a dirty page hidden from the file system for more than two seconds. -Kevin
I have thought a while about this and I have some ideas.

Ideally, we would be able to trickle the sync of individual blocks during the checkpoint, but we can't, because we rely on the kernel to sync all dirty blocks that haven't made it to disk using fsync(). We could trickle the fsync() calls, but that just extends the amount of data we are writing that has been dirtied post-checkpoint. In an ideal world, we would be able to fsync() only part of a file at a time, and only those blocks that were dirtied pre-checkpoint, but I don't see that happening anytime soon (and this is one reason why many commercial databases bypass the kernel cache).

So, in the real world, one conclusion seems to be that our existing method of tuning the background writer just isn't good enough for the average user:

#bgwriter_delay = 200ms          # 10-10000ms between rounds
#bgwriter_lru_percent = 1.0      # 0-100% of LRU buffers scanned/round
#bgwriter_lru_maxpages = 5       # 0-1000 buffers max written/round
#bgwriter_all_percent = 0.333    # 0-100% of all buffers scanned/round
#bgwriter_all_maxpages = 5       # 0-1000 buffers max written/round

These settings control what the bgwriter does, but they do not clearly relate to the checkpoint timing, which is the purpose of the bgwriter, and they don't change during the checkpoint interval, which is also less than ideal. If set too aggressively, it writes too much, and if set too low, the checkpoint does too much I/O. We clearly need more bgwriter activity as the checkpoint approaches, and behavior that is more auto-tuned, like many of our other parameters. I think we created these settings to see how they worked in the field, so it is probably time to reevaluate them based on field reports.

I think the bgwriter should keep track of how far it is to the next checkpoint, and use that information to increase write activity. Basically now, during a checkpoint, the bgwriter does a full buffer scan and fsync's all dirty files, so it changes from the configuration-parameter-defined behavior right to 100% activity. I think it would be ideal if we could ramp up the writes so that when we are 95% of the way to the next checkpoint, the bgwriter is operating at 95% of the activity it would perform during a checkpoint. My guess is that if we can do that, we will have much smoother performance, because we have more WAL writes just after checkpoint for newly-dirtied pages, and the new setup will give us more write activity just before checkpoint.

One other idea is for the bgwriter to use O_DIRECT or O_SYNC to avoid the kernel cache, so we are sure data will be on disk by checkpoint time. This was avoided in the past because of the expense of second-guessing the kernel disk I/O scheduling algorithms.

---------------------------------------------------------------------------

Tom Lane wrote:
> "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
>> "Jim C. Nasby" <jim@nasby.net> wrote:
>>> Generally, I try and configure the all* settings so that you'll get 1
>>> clock-sweep per checkpoint_timeout. It's worked pretty well, but I don't
>>> have any actual tests to back that methodology up.
>
>> We got to these numbers somewhat scientifically. I studied I/O
>> patterns under production load and figured we should be able to handle
>> about 800 writes per 200 ms without causing problems. I have to
>> admit that I based the percentages and the ratio between "all" and "lru"
>> on gut feel after musing over the documentation.
>
> I like Kevin's settings better than what Jim suggests. If the bgwriter
> only makes one sweep between checkpoints then it's hardly going to make
> any impact at all on the number of dirty buffers the checkpoint will
> have to write. The point of the bgwriter is to reduce the checkpoint
> I/O spike by doing writes between checkpoints, and to have any
> meaningful impact on that, you'll need it to make the cycle several times.
>
> Another point here is that you want checkpoints to be pretty far apart
> to minimize the WAL load from full-page images. So again, a bgwriter
> that's only making one loop per checkpoint is not gonna be doing much.
>
> I wonder whether it would be feasible to teach the bgwriter to get more
> aggressive as the time for the next checkpoint approaches? Writes
> issued early in the interval have a much higher probability of being
> wasted (because the page gets re-dirtied later). But maybe that just
> reduces to what Takahiro-san already suggested, namely that
> checkpoint-time writes should be done with the same kind of scheduling
> the bgwriter uses outside checkpoints. We still have the problem that
> the real I/O storm is triggered by fsync() not write(), and we don't
> have a way to spread out the consequences of fsync().
>
> regards, tom lane

--
Bruce Momjian bruce@momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +
> Tom Lane wrote:
>>
>> I like Kevin's settings better than what Jim suggests. If the bgwriter
>> only makes one sweep between checkpoints then it's hardly going to make
>> any impact at all on the number of dirty buffers the checkpoint will
>> have to write. The point of the bgwriter is to reduce the checkpoint
>> I/O spike by doing writes between checkpoints, and to have any
>> meaningful impact on that, you'll need it to make the cycle several times.
>>
>> Another point here is that you want checkpoints to be pretty far apart
>> to minimize the WAL load from full-page images. So again, a bgwriter
>> that's only making one loop per checkpoint is not gonna be doing much.

I missed the previous message, but it sounds like you're operating under a different set of assumptions than the original poster. If you do a single sweep through all of the buffers *and sync them* then you've just finished a checkpoint -- the *previous* checkpoint, not the subsequent one.

That is, rather than trying to spread the load of the checkpoint out by getting the writes into the kernel sooner but making no attempt to sync them until checkpoint time, start the checkpoint as soon as the previous checkpoint finishes, and dribble the blocks of the checkpoint out slowly throughout an entire checkpoint cycle, syncing them immediately using O_SYNC/O_DIRECT.

It's a fundamental shift in the idea of the purpose of bgwriter. Instead of trying to suck i/o away from the subsequent checkpoint, it would be responsible for all the i/o of the previous checkpoint, which would still be in progress for the entire time of checkpoint_timeout. It would only complete when bgwriter had finished doing its one full sweep.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Gregory Stark <stark@enterprisedb.com> writes: > It's a fundamental shift in the idea of the purpose of bgwriter. Instead of > trying to suck i/o away from the subsequent checkpoint it would be responsible > for all the i/o of the previous checkpoint which would still be in progress > for the entire time of checkpoint_timeout. It would only complete when > bgwriter had finished doing its one full sweep. I think that's basically the same as the original suggestion, which is to do checkpoints using less than the full I/O bandwidth of the machine --- tying it exactly to the default bgwriter rate may or may not be appropriate. The difficulty with such schemes is that if you go over to using O_DIRECT writes instead of fsync in the bgwriter, it's hard to avoid doing the same when a random backend has to write a dirty buffer --- yet you'd really rather that such a backend not have to wait for the ensuing I/O. And this gets a lot worse if checkpoints are slowed down, because it gets more likely that the bufmgr will run out of clean buffers and have to do a write() from a backend. regards, tom lane
On Fri, Dec 08, 2006 at 11:43:27AM -0500, Tom Lane wrote: > "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > > "Jim C. Nasby" <jim@nasby.net> wrote: > >> Generally, I try and configure the all* settings so that you'll get 1 > >> clock-sweep per checkpoint_timeout. It's worked pretty well, but I don't > >> have any actual tests to back that methodology up. > > > We got to these numbers somewhat scientifically. I studied I/O > > patterns under production load and figured we should be able to handle > > about 800 writes in per 200 ms without causing problems. I have to > > admit that I based the percentages and the ratio between "all" and "lru" > > on gut feel after musing over the documentation. > > I like Kevin's settings better than what Jim suggests. If the bgwriter > only makes one sweep between checkpoints then it's hardly going to make > any impact at all on the number of dirty buffers the checkpoint will > have to write. The point of the bgwriter is to reduce the checkpoint > I/O spike by doing writes between checkpoints, and to have any > meaningful impact on that, you'll need it to make the cycle several times. It would be good if the docs included more detailed info on how exactly the bgwriter goes about flushing stuff to disk. You can certainly read them and think that the bgwriter just goes through and issues writes for any dirty buffers it finds. Though, looking at BgBufferSync, I think it actually does write out pages during the "all" scan, regardless of what usage_count says. > I wonder whether it would be feasible to teach the bgwriter to get more > aggressive as the time for the next checkpoint approaches? Writes > issued early in the interval have a much higher probability of being > wasted (because the page gets re-dirtied later). But maybe that just > reduces to what Takahiro-san already suggested, namely that > checkpoint-time writes should be done with the same kind of scheduling > the bgwriter uses outside checkpoints. We still have the problem that > the real I/O storm is triggered by fsync() not write(), and we don't > have a way to spread out the consequences of fsync(). Would the ramp-up of write activity push the kernel to actually write stuff? My understanding is that most OSes have a time limit on how long they'll let a write-request sit in cache, so ISTM a better way to smooth out disk IO is to write things in a steady stream. If the bgwriter takes the buffer access counter into account when deciding what to write out, it might make sense to write more recently accessed pages as checkpoint nears. The idea being that odds are good those buffers are about to get flushed by BufferSync() anyway. Also, I have a dumb question... BgBufferSync uses buf_id1 to keep track of what buffer the bgwriter_all scan is looking at, which means that it should remember where it was at the end of the last scan; yet it's initialized to 0 every time BgBufferSync is called. Is there someplace else that is remembering where the complete scan is leaving off when bgwriter_all_percent or bgwriter_all_maxpages is hit? Or does the scan in fact just keep re-scanning the beginning of the buffers? -- Jim Nasby jim@nasby.net EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
Hello, From: "Jim C. Nasby" <jim@nasby.net>Also, I have a dumb question... BgBufferSync uses buf_id1 to keep track > of what buffer the bgwriter_all scan is looking at, which means that > it should remember where it was at the end of the last scan; yet it's > initialized to 0 every time BgBufferSync is called. Is there someplace > else that is remembering where the complete scan is leaving off when > bgwriter_all_percent or bgwriter_all_maxpages is hit? Or does the scan > in fact just keep re-scanning the beginning of the buffers? No. BgBufferSync() correctly keeps track of the position to restart scanning at. bufid1 is not initialized to 0 every time BgBufferSync() is called, because bufid1 is a static local variable. Please see the following code. It prints: a=0 a=1 a=2 #include <stdio.h> void func(void) {static int a = 0; printf("a=%d\n", a);a++; } int main(void) {func();func();func(); return 0; }
On Wed, Dec 13, 2006 at 06:27:38PM +0900, Takayuki Tsunakawa wrote: > No. BgBufferSync() correctly keeps track of the position to restart > scanning at. bufid1 is not initialized to 0 every time BgBufferSync() > is called, because bufid1 is a static local variable. Please see the > following code. It prints: Doh, I completely missed the static part of the declaration. Thanks for the C tutorial. :) -- Jim Nasby jim@nasby.net EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
Hello, Itagaki-san, all

Sorry for my long mail. I've had trouble sending it because it's too long for pgsql-hackers to accept (I couldn't find what size of mail is accepted), so I'm sending it in several parts. Please see the attachment for the content.
"Takayuki Tsunakawa" <tsunakawa.takay@jp.fujitsu.com> wrote: > I performed some simple tests, and I'll show the results below. > (1) The default case > 235 80 226 77 240 > (2) No write case > 242 250 244 253 280 > (3) No checkpoint case > 229 252 256 292 276 > (4) No fsync() case > 236 112 215 216 221 > (5) No write by PostgreSQL, but fsync() by another program case > 9 223 260 283 292 > (6) case (5) + O_SYNC by write_fsync > 97 114 126 112 125 > (7) O_SYNC case > 182 103 41 50 74 I posted a patch to PATCHES. Please try out it. It does write() smoothly, but fsync() at a burst. I suppose the result will be between (3) and (5). Regards, --- ITAGAKI Takahiro NTT Open Source Software Center
Hello, Itagaki-san

> I posted a patch to PATCHES. Please try it out.

Really!? I've just joined pgsql-patches. When did you post it, yesterday? I couldn't find the patch on the following page, which lists the mails to pgsql-patches for this month:

http://archives.postgresql.org/pgsql-patches/2006-12/index.php

Could you send me the patch if it has not already been registered on the above page? I want to try the patch by all means, because smoothing response times is very important if we are to recommend PostgreSQL to system designers. I really hope your patch will be the real solution.

> It does write() smoothly, but still fsync()s in a burst.
> I suppose the result will be between (3) and (5).

Hmm... I think some logical reasoning is needed to win agreement from community members (sorry if they have already agreed). Excuse me for repeating myself, but I'm afraid fsync() will become the problem at some point in some environments. Success in one test environment is not real success.
Hello, Itagaki-san, all

I have to report a sad result. Your patch didn't work. Let's consider the solution together. What you are addressing is very important for the system designers in the real world -- smoothing response time.

Recall that unpatched PostgreSQL showed the following tps's in case (1) (i.e. with the default bgwriter_* and checkpoint_* settings):

235 80 226 77 240

The patched PostgreSQL showed the following tps's:

230 228 77 209 66

[disk usage]
The same tendency can be seen as with the unpatched PostgreSQL. That is: when the tps is low, the %util of the disk for data files is high, and the %util of the disk for WAL is low. Why is transaction logging disturbed by cleaning and/or syncing activity? While the bgwriter is fsync()ing, it does not lock any data structures that the transactions want to access. Even though they share the same SCSI controller and bus, they are different disks. The bandwidth does not appear to be exhausted, since Ultra320 is said to provide 256MB/s of bandwidth in practice. (Recall that WAL is on sdd and data files are on sde.)

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sdd 0.00 810.78 0.00 102.20 0.00 7306.99 0.00 3653.49 71.50 1.12 10.95 7.32 74.77
sde 0.00 25.35 0.00 6.19 0.00 252.30 0.00 126.15 40.77 0.50 81.32 5.94 3.67
sdd 0.00 884.20 0.00 126.00 0.00 8080.00 0.00 4040.00 64.13 1.26 10.00 7.11 89.64
sde 0.00 21.40 0.00 5.00 0.00 211.20 0.00 105.60 42.24 0.31 62.56 6.52 3.26
sdd 0.00 924.80 0.00 116.20 0.00 8326.40 0.00 4163.20 71.66 1.23 10.59 7.37 85.64
sde 0.00 27.60 0.00 26.60 0.00 433.60 0.00 216.80 16.30 4.24 159.29 2.44 6.50
sdd 0.00 721.20 0.00 102.40 0.00 6588.80 0.00 3294.40 64.34 0.99 9.71 7.07 72.40
sde 0.00 1446.80 0.00 101.60 0.00 20289.60 0.00 10144.80 199.70 1192.40 572.45 2.29 23.30
sdd 0.00 0.00 0.00 0.20 0.00 1.60 0.00 0.80 8.00 0.11 539.00 539.00 10.80
sde 0.00 0.00 0.00 452.10 0.00 0.00 0.00 0.00 0.00 3829.57 3715.83 2.22 100.22
sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sde 0.00 0.00 0.00 349.80 0.00 0.00 0.00 0.00 0.00 1745.52 8515.74 2.86 100.02
sdd 0.00 442.40 0.00 51.00 0.00 3948.80 0.00 1974.40 77.43 0.60 11.73 7.54 38.46
sde 0.00 2.80 0.00 184.00 0.00 25.60 0.00 12.80 0.14 277.52 12629.41 3.19 58.74
sdd 0.00 898.00 0.00 124.80 0.00 8182.40 0.00 4091.20 65.56 1.30 10.40 7.24 90.30
sde 0.00 19.20 0.00 3.80 0.00 184.00 0.00 92.00 48.42 0.24 62.11 14.11 5.36
sdd 0.00 842.28 0.00 109.02 0.00 7612.02 0.00 3806.01 69.82 1.33 12.26 8.35 91.02
sde 0.00 45.49 0.00 46.89 0.00 739.08 0.00 369.54 15.76 9.04 192.73 3.38 15.85
sdd 0.00 1198.41 0.00 71.51 0.00 10505.18 0.00 5252.59 146.90 128.19 99.76 13.48 96.43
sde 0.00 1357.77 0.00 199.80 0.00 19263.75 0.00 9631.87 96.41 2251.09 1179.42 2.39 47.81
sdd 0.00 0.00 0.00 7.20 0.00 0.00 0.00 0.00 0.00 203.87 5671.83 138.92 100.02
sde 0.00 0.00 0.00 409.60 0.00 0.00 0.00 0.00 0.00 3171.04 4779.83 2.44 100.02
sdd 0.00 0.00 0.00 17.80 0.00 0.00 0.00 0.00 0.00 137.87 10240.90 56.19 100.02
sde 0.00 0.00 0.00 240.60 0.00 0.00 0.00 0.00 0.00 1573.85 9815.29 4.16 100.02
sdd 0.00 109.80 0.00 35.40 0.00 1012.80 0.00 506.40 28.61 42.14 7974.47 27.86 98.64
sde 0.00 2.80 0.00 198.80 0.00 30.40 0.00 15.20 0.15 428.49 14474.39 4.30 85.56
sdd 0.00 466.20 0.00 62.80 0.00 4230.40 0.00 2115.20 67.36 0.59 9.49 6.79 42.62
sde 0.00 5.20 0.00 0.80 0.00 48.00 0.00 24.00 60.00 0.01 16.25 11.25 0.90
sdd 0.00 0.00 0.00 0.20 0.00 1.60 0.00 0.80 8.00 0.01 35.00 35.00 0.70
sde 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

I suspect that fsync()
is the criminal, as I feared. I'll show you some interesting data. I collected the stack traces of backend processes while a checkpoint was happening.

[bgwriter]
Oh, he is fsync()ing hard.

#0  0x0000003a629bfbb2 in __fsync_nocancel () from /lib64/tls/libc.so.6
#1  0x00000000005742a1 in mdsync ()
#2  0x00000000005753d7 in smgrsync ()
#3  0x0000000000564d65 in FlushBufferPool ()
...

[some backends]
They are forced to wait for some lock pertaining to WAL when they try to insert a log record.

#0  0x0000003a629c7b79 in semop () from /lib64/tls/libc.so.6
#1  0x000000000054c9c6 in PGSemaphoreLock ()
#2  0x0000000000572374 in LWLockAcquire ()
#3  0x000000000046ab10 in XLogInsert ()

[some backends]
They are waiting for some lock when they commit.

#0  0x0000003a629c7b79 in semop () from /lib64/tls/libc.so.6
#1  0x000000000054c9c6 in PGSemaphoreLock ()
#2  0x0000000000572374 in LWLockAcquire ()
#3  0x000000000046a746 in XLogFlush ()
#4  0x000000000046228e in RecordTransactionCommit ()

[the second criminal backend]
He is making other backends wait for the WALWRITE lock, because he is doing a long-running operation -- fdatasync() -- while holding the WALWRITE lock. But I think he is also a victim. He is probably forced to wait by the bgwriter's fsync() activity.

#0  0x0000003a629bfc59 in fdatasync () from /lib64/tls/libc.so.6
#1  0x0000000000469ac0 in issue_xlog_fsync ()
#2  0x000000000046a210 in XLogWrite ()
#3  0x000000000046a7df in XLogFlush ()
#4  0x000000000046228e in RecordTransactionCommit ()

The following stack trace sometimes appears as the second criminal instead of the above:

#0  0x0000003a629bfbb2 in __fsync_nocancel () from /lib64/tls/libc.so.6
#1  0x000000000046530e in UpdateControlFile ()
#2  0x000000000046a05d in XLogWrite ()
#3  0x000000000046a7df in XLogFlush ()
#4  0x000000000046228e in RecordTransactionCommit ()

[Conclusion]
I believe that the problem cannot be solved in a real sense by avoiding fsync/fdatasync(). We can't ignore what commercial databases have done so far. The kernel does as much as he likes when PostgreSQL requests him to fsync().

One question is the disk utilization. While the bgwriter is fsync()ing, the %util of the WAL disk drops to almost 0, yet the bandwidth of Ultra320 SCSI does not appear to be fully used. Any idea?

So what should we do next?
On Wed, Dec 20, 2006 at 08:10:56PM +0900, Takayuki Tsunakawa wrote: > One question is the disk utilization. While bgwriter is fsync()ing, > %util of WAL disk drops to almost 0. But the the bandwidth of > Ultra320 SCSI does not appear to be used fully. Any idea? That implies that fsyncing a datafile blocks fsyncing the WAL. That seems terribly unlikely (although...). What OS/Kernel/Filesystem is this. I note a sync bug in linux for ext3 that may have relevence. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > From each according to his ability. To each according to his ability to litigate.
> That implies that fsyncing a datafile blocks fsyncing the WAL. That
> seems terribly unlikely (although...). What OS/kernel/filesystem is
> this? I note a sync bug in Linux for ext3 that may have relevance.

Oh, really? What bug? I've heard that ext3 reports wrong data to iostat when it performs writes (the data is correct when performing reads).

My env is:
OS: RHEL 4.0 for AMD64/EM64T
kernel: 2.6.9-42.ELsmp
The file system is ext3.

Terribly unlikely? But I've seen this disk utilization pattern quite often.
On Wed, Dec 20, 2006 at 09:14:50PM +0900, Takayuki Tsunakawa wrote:
> > That implies that fsyncing a datafile blocks fsyncing the WAL. That
> > seems terribly unlikely (although...). What OS/kernel/filesystem is
> > this? I note a sync bug in Linux for ext3 that may have relevance.
>
> Oh, really? What bug? I've heard that ext3 reports wrong data to
> iostat when it performs writes (the data is correct when performing
> reads).

I was referring to this in the 2.6.6 changelog:

http://www.linuxhq.com/kernel/changelog/v2.6/6/index.html

    ext3's fsync/fdatasync implementation is currently syncing the inode
    via a full journal commit even if it was unaltered.

However you're running a later version so that's not it.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
From each according to his ability. To each according to his ability to litigate.
On 12/20/06, Takayuki Tsunakawa <tsunakawa.takay@jp.fujitsu.com> wrote:
> [Conclusion]
> I believe that the problem cannot be solved in a real sense by
> avoiding fsync/fdatasync(). We can't ignore what commercial databases
> have done so far. The kernel does as much as he likes when PostgreSQL
> requests him to fsync().
I am new to the community and am very interested in the tests that you have done. I am also working on resolving the sudden IO spikes at checkpoint time. I agree with you that fsync() is the core issue here.
Being a new member, I was wondering if someone on this list has done testing with O_DIRECT and/or O_SYNC for datafiles, as that seems to be the most logical way of dealing with the fsync() flood at checkpoint time. If so, I'll be very interested in the results. As mentioned in this thread, a single bgwriter with O_DIRECT will not be able to keep pace with the cleaning effort, causing backend writes. I think (i.e. IMHO) multiple bgwriters and/or async I/O with O_DIRECT can resolve this issue.
Talking of bgwriter_* parameters, I think we are missing a crucial internal counter, i.e. the number of dirty pages. How much work bgwriter has to do at each wakeup call should be a function of the total buffers and the currently dirty buffers. Relying on both these values instead of just the static NBuffers should allow bgwriter to adapt more quickly to workload changes and ensure that not much work is accumulated for the checkpoint.
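As a rough illustration of that idea, the per-round write quota could be derived from the dirty fraction instead of from a fixed percentage of NBuffers. The sketch below is purely hypothetical -- the counter, the names, and the scaling formula are not taken from the PostgreSQL source -- it only shows the shape of such a policy:

/*
 * Hypothetical sketch: size each bgwriter round from the fraction of the
 * buffer pool that is currently dirty, bounded by configured limits.
 */
static int
writes_for_this_round(int n_buffers, int n_dirty,
                      int min_writes, int max_writes)
{
    double  dirty_frac = (double) n_dirty / (double) n_buffers;
    int     quota;

    quota = min_writes + (int) (dirty_frac * (max_writes - min_writes));
    if (quota > n_dirty)
        quota = n_dirty;        /* never plan more writes than dirty pages */
    return quota;
}

With such a counter, a mostly-clean pool would get small rounds and a pool that is filling up with dirty pages would get progressively larger ones, which is exactly the adaptivity asked for above.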
--
Inaam Rana
EnterpriseDB http://www.enterprisedb.com
On 12/20/06, Takayuki Tsunakawa <tsunakawa.takay@jp.fujitsu.com> wrote:
> > [Conclusion]
> > I believe that the problem cannot be solved in a real sense by
> > avoiding fsync/fdatasync(). We can't ignore what commercial databases
> > have done so far. The kernel does as much as he likes when PostgreSQL
> > requests him to fsync().
From: Inaam Rana
> I am new to the community and am very interested in the tests that you have done. I am also working on resolving the sudden IO spikes at checkpoint time. I agree with you that fsync() is the core issue here.
Thank you for understanding my bad English correctly. Yes, what I've been insisting is that it is necessary to avoid fsync()/fdatasync() and to use O_SYNC (plus O_DIRECT if supported on the target platform) to really eliminate the big spikes.
In my mail, the following sentence made a small mistake.
"I believe that the problem cannot be solved in a real sense by avoiding fsync/fdatasync()."
The correct sentence is:
"I believe that the problem cannot be solved in a real sense without avoiding fsync/fdatasync()."
> Being a new member I was wondering if someone on this list has done testing with O_DIRECT and/or O_SYNC for datafiles as that seems to be the most logical way of dealing with fsync() flood at checkpoint time. If so, I'll be very interested in the results.
Could you see the mail I sent on Dec 18? Its content was so long that I zipped the whole content and attached it to the mail. I just performed the same test, simply adding O_SYNC to open() in mdopen() and another function in md.c. I couldn't succeed in running with O_DIRECT because O_DIRECT requires the shared buffers to be aligned on a sector-size boundary. To perform the O_DIRECT test, a little more modification is necessary to the code where the shared buffers are allocated.
The result was bad, but that's just a starting point. We need some of the improvements that commercial databases have made. I think some approaches we should take are:
(1) two-checkpoint (described in Jim Gray's textbook "Transaction Processing: Concepts and Techniques")
(2) what Oracle suggests in its manual (see my previous mails)
(3) write multiple contiguous buffers with one write() to decrease the count of write() calls
> As mentioned in this thread that a single bgwriter with O_DIRECT will not be able to keep pace with cleaning effort causing backend writes. I think (i.e. IMHO) multiple bgwriters and/or AsyncIO with O_DIRECT can resolve this issue.
I agree with you. Oracle provides a parameter called DB_WRITER_PROCESSES to set the number of database writer processes. Oracle also provides asynchronous I/O to solve the problem you are talking about. Please see section 10.3.9 of the following page:
> Talking of bgwriter_* parameters I think we are missing a crucial internal counter i.e. number of dirty pages. How much work bgwriter has to do at each wakeup call should be a function of total buffers and currently dirty buffers. Relying on both these values instead of just one static NBuffers should allow bgwriter to adapt more quickly to workload changes and ensure that not much work is accumulated for checkpoint.
I agree with you in the sense that the current bgwriter is a bit careless about the system load. I believe that PostgreSQL should be more gentle to OLTP transactions -- and, as a result, to the many users of the system. I think the speed of WAL accumulation should also be taken into account. Let's list the problems and ideas.
--
"Takayuki Tsunakawa" <tsunakawa.takay@jp.fujitsu.com> wrote: > I have to report a sad result. Your patch didn't work. Let's > consider the solution together. What you are addressing is very > important for the system designers in the real world -- smoothing > response time. You were running the test on the very memory-depend machine. > shared_buffers = 4GB / The scaling factor is 50, 800MB of data. Thet would be why the patch did not work. I tested it with DBT-2, 10GB of data and 2GB of memory. Storage is always the main part of performace here, even not in checkpoints. If you use Linux, it has very unpleased behavior in fsync(); It locks all metadata of the file being fsync-ed. We have to wait for the completion of fsync when we do read(), write(), and even lseek(). Almost of your data is in the accounts table and it was stored in a single file. All of transactions must wait for fsync to the single largest file, so you saw the bottleneck was in the fsync. > [Conclusion] > I believe that the problem cannot be solved in a real sense by > avoiding fsync/fdatasync(). I think so, too. However, I assume we can resolve a part of the checkpoint spikes with smoothing of write() alone. BTW, can we use the same way to fsync? We call fsync()s to all modified files without rest in mdsync(), but it's not difficult at all to insert sleeps between fsync()s. Do you think it helps us? One of issues is that we have to sleep in file unit, which is maybe rough granularity. Regards, --- ITAGAKI Takahiro NTT Open Source Software Center
From: "ITAGAKI Takahiro" <itagaki.takahiro@oss.ntt.co.jp> > You were running the test on the very memory-depend machine. >> shared_buffers = 4GB / The scaling factor is 50, 800MB of data. > Thet would be why the patch did not work. I tested it with DBT-2, 10GB of > data and 2GB of memory. Storage is always the main part of performace here, > even not in checkpoints. Yes, I used half the size of RAM as the shared buffers, which is reasonable. And I cached all the data. The effect of fsync() is a heavier offence, isn't it? System administrators would say "I have enough memory. The data hasn't exhausted the DB cache yet. But the users complain to me about the response. Why? What should I do? What? Checkpoint?? Why doesn't PostgreSQL take care of frontend users?" BTW, is DBT-2 an OLTP benchmark which randomly access some parts of data, or a batch application which accesses all data? I'm not familiar with it. I know that IPA opens it to the public. > If you use Linux, it has very unpleased behavior in fsync(); It locks all > metadata of the file being fsync-ed. We have to wait for the completion of > fsync when we do read(), write(), and even lseek(). > Almost of your data is in the accounts table and it was stored in a single > file. All of transactions must wait for fsync to the single largest file, > so you saw the bottleneck was in the fsync. Oh, really, what an evil fsync is! Yes, I sometimes saw a backend waiting for lseek() to complete when it committed. But why does the backend which is syncing WAL/pg_control have to wait for syncing the data file? They are, not to mention, different files, and WAL and data files are stored on separate disks. >> [Conclusion] >> I believe that the problem cannot be solved in a real sense by >> avoiding fsync/fdatasync(). > > I think so, too. However, I assume we can resolve a part of the > checkpoint spikes with smoothing of write() alone. First, what's the goal (if possible numerically? Have you explained to community members why the patch would help many people? At least, I haven't heard that fsync() can be seriously bad and we would close our eyes to what fsync() does. By the way, what good results did you get with DBT-2? If you don't mind, can you show us? > BTW, can we use the same way to fsync? We call fsync()s to all modified > files without rest in mdsync(), but it's not difficult at all to insert > sleeps between fsync()s. Do you think it helps us? One of issues is that > we have to sleep in file unit, which is maybe rough granularity. No, it definitely won't help us. There is no reason why it will help. It might help in some limited environments, though, how can we characterize such environments? Can we say "our approach helps our environments, but it won't help you. The kernel VM settings may help you. Good luck!"? We have to consider seriously. I think it's time to face the problem and we should follow the approaches of experts like Jim Gray and DBMS vendors, unless we have a new clever idea like them.
"Takayuki Tsunakawa" <tsunakawa.takay@jp.fujitsu.com> wrote: > > If you use Linux, it has very unpleased behavior in fsync(); It locks all > > metadata of the file being fsync-ed. We have to wait for the completion of > > fsync when we do read(), write(), and even lseek(). > > Oh, really, what an evil fsync is! Yes, I sometimes saw a backend > waiting for lseek() to complete when it committed. But why does the > backend which is syncing WAL/pg_control have to wait for syncing the > data file? They are, not to mention, different files, and WAL and > data files are stored on separate disks. Backends call lseek() in planning, so they have to wait fsync() to the table that they will access. Even if all of data in the file is in the cache, lseek() conflict with fsync(). You can see a lot of backends are waiting in planning phase in checkpoints, not executing phase. > > it's not difficult at all to insert sleeps between fsync()s. > Can we say "our approach helps our > environments, but it won't help you. The kernel VM settings may help > you. Good luck!"? I didn't say such a thing at all. There are several opinions in the discussion: 1. High bgwriter setting is enough! 2. Change your OS :-) 3. Use O_SYNC orO_DIRECT, but very poor performance. 4. We may settle for single fsync(), but not many fsync()s in a short time. I just suggested 4. Regards, --- ITAGAKI Takahiro NTT Open Source Software Center
To: "Takayuki Tsunakawa" <tsunakawa.takay@jp.fujitsu.com> Cc: <pgsql-hackers@postgresql.org> Sent: Thursday, December 21, 2006 6:46 PM Subject: Re: [HACKERS] Load distributed checkpoint > From: "ITAGAKI Takahiro" <itagaki.takahiro@oss.ntt.co.jp> > "Takayuki Tsunakawa" <tsunakawa.takay@jp.fujitsu.com> wrote: >> Oh, really, what an evil fsync is! Yes, I sometimes saw a backend >> waiting for lseek() to complete when it committed. But why does the >> backend which is syncing WAL/pg_control have to wait for syncing the >> data file? They are, not to mention, different files, and WAL and >> data files are stored on separate disks. > > Backends call lseek() in planning, so they have to wait fsync() to > the table that they will access. Even if all of data in the file is in > the cache, lseek() conflict with fsync(). You can see a lot of backends > are waiting in planning phase in checkpoints, not executing phase. I see. I found one backend like the following. But one in my case one out of 16 backends. Most of others are waiting to acquire WALWRITE lock. #0 0x0000003a629c6902 in __lseek_nocancel () from /lib64/tls/libc.so.6 #1 0x000000000056789f in FileSeek () #2 0x0000000000574053 in mdnblocks () #3 0x0000000000574f4a in smgrnblocks () #4 0x00000000005489e8 in estimate_rel_size () #5 0x0000000000548bee in get_relation_info () #6 0x000000000054aa3d in build_simple_rel () #7 0x0000000000539c6b in add_base_rels_to_query () #8 0x000000000053b955 in query_planner () #9 0x000000000053c1c9 in grouping_planner () #10 0x000000000053d3b4 in subquery_planner () #11 0x000000000053d5b3 in planner () #12 0x00000000005778fc in pg_plan_query () #13 0x000000000057798c in pg_plan_queries () #14 0x0000000000577c53 in exec_simple_query ()
On Thu, Dec 21, 2006 at 06:46:36PM +0900, ITAGAKI Takahiro wrote:
> > Oh, really, what an evil fsync is! Yes, I sometimes saw a backend
> > waiting for lseek() to complete when it committed. But why does the
> > backend which is syncing WAL/pg_control have to wait for the data file
> > to be synced? They are, needless to say, different files, and the WAL and
> > data files are stored on separate disks.
>
> Backends call lseek() in planning, so they have to wait for fsync() of
> the table that they will access. Even if all of the data in the file is in
> the cache, lseek() conflicts with fsync(). You can see that a lot of backends
> are waiting in the planning phase during checkpoints, not in the executing phase.

Hmm, there are other ways to sync parts of a file. For example, doing an mmap()/msync()/munmap() cycle to start an asynchronous flush. But given what you're saying, that might suffer from the same problem.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
From each according to his ability. To each according to his ability to litigate.
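For reference, the mmap()/msync()/munmap() cycle mentioned above would look roughly like the sketch below. This is only an illustration of the idea: the offset must be page-aligned, error handling is minimal, and whether MS_ASYNC actually starts write-out promptly is kernel-dependent, which is exactly the doubt raised here.

#include <sys/types.h>
#include <sys/mman.h>

/* Sketch: ask the kernel to begin flushing one region of an open file
 * without blocking, by mapping the region and issuing an asynchronous
 * msync().  offset must be a multiple of the page size. */
static int
start_flush(int fd, off_t offset, size_t len)
{
    void *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, offset);

    if (p == MAP_FAILED)
        return -1;
    if (msync(p, len, MS_ASYNC) != 0)   /* schedule writeback, don't wait */
    {
        munmap(p, len);
        return -1;
    }
    return munmap(p, len);
}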
> > You were running the test on a very memory-dependent machine.
> >> shared_buffers = 4GB / The scaling factor is 50, 800MB of data.
> > That would be why the patch did not work. I tested it with DBT-2, 10GB of
> > data and 2GB of memory. Storage is always the main part of performance here,
> > even outside checkpoints.
>
> Yes, I used half the size of RAM as the shared buffers, which is
> reasonable. And I cached all the data.

For pg, half RAM for shared_buffers is too much. The ratio is good for other db software that does not use the OS cache.

Andreas
----- Original Message -----
From: "Zeugswetter Andreas ADI SD" <ZeugswetterA@spardat.at>
To: "Takayuki Tsunakawa" <tsunakawa.takay@jp.fujitsu.com>; "ITAGAKI Takahiro" <itagaki.takahiro@oss.ntt.co.jp>

> > Yes, I used half the size of RAM as the shared buffers, which is
> > reasonable. And I cached all the data.
>
> For pg, half RAM for shared_buffers is too much. The ratio is good for
> other db software that does not use the OS cache.

What percentage of RAM is recommended for shared buffers in general? 40%? 30%? Or is the general recommendation more like "According to the amount of your data, this much RAM should be left for the kernel cache -- but that's the story on Linux; it may be different for other OSes"? Hmm, if so, it sounds hard for system designers/administrators to judge.
"Takayuki Tsunakawa" <tsunakawa.takay@jp.fujitsu.com> wrote: > > For pg, half RAM for shared_buffers is too much. The ratio is good for > > other db software, that does not use the OS cache. > > What percentage of RAM is recommended for shared buffers in general? > 40%? 30%? Or, is the general recommendation like "According to the > amount of your data, this much RAM should be left for the kernel > cache. But tha's the story on Linux. It may be different for other > OSes."? > Hmm, if it is so, it sounds hard for system designers/administrators > to judge. If you use linux, try the following settings: 1. Decrease /proc/sys/vm/dirty_ratio and dirty_background_ratio. 2. Increasewal_buffers to redule WAL flushing. 3. Set wal_sync_method to open_sync; O_SYNC is faster then fsync(). 4. Separatedata and WAL files into different partitions or disks. I suppose 1 is important for you, because kernel will not write dirty buffers until 10% of buffers become dirty in default settings. You have large memory (8GB), but small data set (800MB). So kernel almost never writes buffers not in checkpoints. Accumulate dirty buffers are written at a burst in fsync(). We would be happy if we would be free from a difficult combination of tuning. If you have *idea for improvements*, please suggest it. I think we've already understood *problem itself*. Regards, --- ITAGAKI Takahiro NTT Open Source Software Center
Hello, Itagaki-san,

Thank you for an interesting piece of information.

From: "ITAGAKI Takahiro" <itagaki.takahiro@oss.ntt.co.jp>
> If you use Linux, try the following settings:
> 1. Decrease /proc/sys/vm/dirty_ratio and dirty_background_ratio.
> 2. Increase wal_buffers to reduce WAL flushing.
> 3. Set wal_sync_method to open_sync; O_SYNC is faster than fsync().
> 4. Separate data and WAL files into different partitions or disks.
>
> I suppose 1 is important for you, because the kernel will not write dirty
> buffers until 10% of its buffers become dirty with the default settings.
> You have a lot of memory (8GB) but a small data set (800MB), so the kernel
> almost never writes buffers outside checkpoints, and the accumulated dirty
> buffers are written in a burst at fsync().

I'll show the results of this tuning to share information with people who don't have experience of this kind. The numbers shown below are the tps when running "pgbench -c16 -t100 postgres" five times in succession.

(1) Default case (shown again for comparison and as a reminder)
The bgwriter_* and checkpoint_* settings are the defaults. wal_buffers and wal_sync_method are also set to their defaults (64kB and fdatasync respectively).

235 80 226 77 240

(2) Default + WAL 1MB case
The configuration is the same as case (1) except that wal_buffers is set to 1024kB.

302 328 82 330 85

This is a better improvement than I expected.

(3) Default + wal_sync_method=open_sync case
The configuration is the same as case (1) except that wal_sync_method is set to open_sync.

162 67 176 67 164

Too bad compared to case (2). Do you know the reason?

(4) (2)+(3) case

322 350 85 321 84

This is good, too.

(5) (4) + /proc/sys/vm/dirty* tuning
dirty_background_ratio is changed from 10 to 1, and dirty_ratio is changed from 40 to 4.

308 349 84 349 84

The tuning of the kernel cache doesn't appear to bring a performance improvement in my env. Does the kernel still wait too long before it starts flushing dirty buffers because the cache is large? If so, increasingly available RAM may cause trouble more frequently in the near future. Do the dirty_*_ratio settings accept values less than 1?

BTW, in case (1), the best response time of a transaction was 6 milliseconds. On the other hand, the worst response time was 13 seconds.

> We would be happy if we could be free from a difficult combination
> of tuning. If you have *ideas for improvements*, please suggest them.
> I think we've already understood the *problem itself*.

I agree with you. Let's make the ideas more concrete, doing some experimentation.
On Wed, 20 Dec 2006, Inaam Rana wrote: > Talking of bgwriter_* parameters I think we are missing a crucial > internal counter i.e. number of dirty pages. How much work bgwriter has > to do at each wakeup call should be a function of total buffers and > currently dirty buffers. This is actually a question I'd been meaning to throw out myself to this list. How hard would it be to add an internal counter to the buffer management scheme that kept track of the current number of dirty pages? I've been looking at the bufmgr code lately trying to figure out how to insert one as part of building an auto-tuning bgwriter, but it's unclear to me how I'd lock such a resource properly and scalably. I have a feeling I'd be inserting a single-process locking bottleneck into that code with any of the naive implementations I considered. The main problem I've been seeing is also long waits stuck behind a slow fsync on Linux. What I've been moving toward testing is an approach slightly different from the proposals here. What if all the database page writes (background writer, buffer eviction, or checkpoint scan) were counted and periodic fsync requests send to the bgwriter based on that? For example, when I know I have a battery-backed caching controller that will buffer 64MB worth of data for me, if I forced a fsync after every 6000 8K writes, no single fsync would get stuck waiting for the disk to write for longer than I'd like. Give the admin a max_writes_before_sync parameter, make the default of 0 work just like the current behavior, and off you go; a simple tunable that doesn't require a complicated scheme to implement or break anybody's existing setup. Combined with a properly tuned background writer, that would solve the issues I've been running into. It would even make the problem of Linux caching too many writes until checkpoint time go away (I know how to eliminate that by adjusting caching policy, but I have to be root to do it; a DBA should be able to work around that issue even if they don't have access to the kernel tunables.) While I'm all for testing to prove me wrong, my gut feel is that going all the way to sync writes a la Oracle is a doomed approach, particularly on low-end hardware where they're super expensive. Following The Oracle Way is a good roadmap for a lot of things, but I wouldn't put building a lean enough database to run on modest hardware on that list. You can do sync writes with perfectly good performance on systems with a good battery-backed cache, but I think you'll get creamed in comparisons against MySQL on IDE disks if you start walking down that path; since right now a fair comparison with similar logging behavior is an even match there, that's a step backwards. Also on the topic of sync writes to the database proper: wouldn't using O_DIRECT for those potentially counter-productive? I was under the impressions that one of the behaviors counted on by Postgres was that data evicted from its buffer cache, eventually intended for writing to disk, was still kept around for a bit in the OS buffer cache. A subsequent read because the data was needed again might find the data already in the OS buffer, therefore avoiding an actual disk read; that substantially reduces the typical penalty for the database engine making a bad choice on what to evict. I fear a move to direct writes would put more pressure on the LRU implementation to be very smart, and that's code that you really don't want to be more complicated. 
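To make the write-counting scheme above concrete, here is a rough illustration. The parameter name max_writes_before_sync and the helper function are taken from the description in this message, not from any existing patch, and error handling is omitted:

#include <unistd.h>

/* Illustration only: after every page write, bump a counter; once it
 * reaches the threshold, fsync the file and reset.  A threshold of 0
 * keeps the current behaviour (sync only at checkpoint time). */
static int writes_since_sync = 0;
static int max_writes_before_sync = 6000;   /* ~48MB of 8K pages, comfortably
                                             * under a 64MB controller cache */

static void
write_dirty_page(int fd, const void *page, size_t page_size, off_t offset)
{
    (void) pwrite(fd, page, page_size, offset);

    if (max_writes_before_sync > 0 &&
        ++writes_since_sync >= max_writes_before_sync)
    {
        (void) fsync(fd);           /* small, bounded flush */
        writes_since_sync = 0;
    }
}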
-- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
From: "Takayuki Tsunakawa" <tsunakawa.takay@jp.fujitsu.com> > (5) (4) + /proc/sys/vm/dirty* tuning > dirty_background_ratio is changed from 10 to 1, and dirty_ratio is > changed from 40 to 4. > > 308 349 84 349 84 Sorry, I forgot to include the result when using Itagaki-san's patch. The patch showd the following tps for case (5). 323 350 340 59 225 The best response time was 4 msec, and the worst one was 16 seconds. ----- Original Message ----- From: "Takayuki Tsunakawa" <tsunakawa.takay@jp.fujitsu.com> To: "ITAGAKI Takahiro" <itagaki.takahiro@oss.ntt.co.jp> Cc: <pgsql-hackers@postgresql.org> Sent: Friday, December 22, 2006 3:20 PM Subject: Re: [HACKERS] Load distributed checkpoint > Hello, Itagaki-san, > > Thank you for an interesting piece of information. > > From: "ITAGAKI Takahiro" <itagaki.takahiro@oss.ntt.co.jp> >> If you use linux, try the following settings: >> 1. Decrease /proc/sys/vm/dirty_ratio and dirty_background_ratio. >> 2. Increase wal_buffers to redule WAL flushing. >> 3. Set wal_sync_method to open_sync; O_SYNC is faster then fsync(). >> 4. Separate data and WAL files into different partitions or disks. >> >> I suppose 1 is important for you, because kernel will not write > dirty >> buffers until 10% of buffers become dirty in default settings. >> You have large memory (8GB), but small data set (800MB). So kernel >> almost never writes buffers not in checkpoints. Accumulate dirty > buffers >> are written at a burst in fsync(). > > I'll show the results of this tuning to share information with people > who don't have experience of this kind. > The numbers shown below are the tps when running "pgbench -c16 -t100 > postgres" five times in succession. > > (1) Default case(this is show again for comparison and reminder) > The bgwriter_* and checkpoint_* are set to those defaults. > wal_buffers and wal_sync_method are also set to those defaults (64kB > and fdatasync respectively.) > > 235 80 226 77 240 > > > (2) Default + WAL 1MB case > The configuration is the same as case (1) except that wal_buffers is > set to 1024kB. > > 302 328 82 330 85 > > This is better improvement than I expected. > > > (3) Default + wal_sync_method=open_sync case > The configuration is the same as case (1) except that wal_sync_method > is set to open_sync. > > 162 67 176 67 164 > > Too bad compared to case (2). Do you know the reason? > > > (4) (2)+(3) case > > 322 350 85 321 84 > > This is good, too. > > > (5) (4) + /proc/sys/vm/dirty* tuning > dirty_background_ratio is changed from 10 to 1, and dirty_ratio is > changed from 40 to 4. > > 308 349 84 349 84 > > The tuning of kernel cache doesn't appear to bring performance > improvement in my env. The kernel still waits too long before it > starts flushing dirty buffers because the cache is large? If so, > increasingly available RAM may cause trouble more frequently in the > near future. Do the dirty_*_ratio accept values less than 1? > > BTW, in case (1), the best response time of a transaction was 6 > milliseconds. On the other hand, the worst response time was 13 > seconds. > > >> We would be happy if we would be free from a difficult combination >> of tuning. If you have *idea for improvements*, please suggest it. >> I think we've already understood *problem itself*. > > I agree with you. Let's make the ideas more concrete, doing some > experimentations. > > > > > ---------------------------(end of broadcast)--------------------------- > TIP 7: You can help support the PostgreSQL project by donating at > > http://www.postgresql.org/about/donate >
On 12/22/06, Takayuki Tsunakawa <tsunakawa.takay@jp.fujitsu.com> wrote:
Which IO scheduler (elevator) are you using?

From: "Takayuki Tsunakawa" <tsunakawa.takay@jp.fujitsu.com>
> (5) (4) + /proc/sys/vm/dirty* tuning
> dirty_background_ratio is changed from 10 to 1, and dirty_ratio is
> changed from 40 to 4.
>
> 308 349 84 349 84
Sorry, I forgot to include the result when using Itagaki-san's patch.
The patch showed the following tps for case (5).
323 350 340 59 225
The best response time was 4 msec, and the worst one was 16 seconds.
--
Inaam Rana
EnterpriseDB http://www.enterprisedb.com
From: Inaam Rana
> Which IO scheduler (elevator) are you using?

Elevator? Sorry, I'm not familiar with the kernel implementation, so I don't know what it is. My Linux distribution is Red Hat Enterprise Linux 4.0 for AMD64/EM64T, and the kernel is 2.6.9-42.ELsmp. I probably haven't changed any kernel settings, except for the IPC settings needed to run PostgreSQL.
"Takayuki Tsunakawa" <tsunakawa.takay@jp.fujitsu.com> wrote: > (1) Default case(this is show again for comparison and reminder) > 235 80 226 77 240 > (2) Default + WAL 1MB case > 302 328 82 330 85 > (3) Default + wal_sync_method=open_sync case > 162 67 176 67 164 > (4) (2)+(3) case > 322 350 85 321 84 > (5) (4) + /proc/sys/vm/dirty* tuning > 308 349 84 349 84 (3) is very strange. Your machine seems to be too restricted by WAL so that other factors cannot be measured properly. I'll send results on my machine. - Pentium4 3.6GHz with HT / 3GB RAM / Windows XP :-) - shared_buffers=1GB - wal_sync_method = open_datasync - wal_buffers = 1MB - checkpoint_segments = 16 - checkpoint_timeout = 5min I repeated "pgbench -c16 -t500 -s50" and picked up results around checkpoints. [HEAD] ... 560.8 373.5 <- checkpoint is here 570.8 ... [with patch] ... 562.0 528.4 <- checkpoint (fsync) is here 547.0 ... Regards, --- ITAGAKI Takahiro NTT Open Source Software Center
From: "Greg Smith" <gsmith@gregsmith.com> > This is actually a question I'd been meaning to throw out myself to this > list. How hard would it be to add an internal counter to the buffer > management scheme that kept track of the current number of dirty pages? > I've been looking at the bufmgr code lately trying to figure out how to > insert one as part of building an auto-tuning bgwriter, but it's unclear > to me how I'd lock such a resource properly and scalably. I have a > feeling I'd be inserting a single-process locking bottleneck into that > code with any of the naive implementations I considered. To put it in an extreme way, how about making bgwriter count the dirty buffers periodically scanning all the buffers? Do you know the book "Principles of Transaction Processing"? Jim Gray was one of the reviewers of this book. http://www.amazon.com/gp/aa.html?HMAC=&CartId=&Operation=ItemLookup&&ItemId=1558604154&ResponseGroup=Request,Large,Variations&bStyle=aaz.jpg&MerchantId=All&isdetail=true&bsi=Books&logo=foo&Marketplace=us&AssociateTag=pocketpc In chapter 8, the author describes fuzzy checkpoint combined with two-checkpoint approach. In his explanation, recovery manager (which would be bgwriter in PostgreSQL) scans the buffers and records the list of dirty buffers at each checkpoint. This won't need any locking in PostgreSQL if I understand correctly. Then, the recovery manager performs the next checkpoint after writing those dirty buffers. In two-checkpoint approach, crash recovery starts redoing from the second to last checkpoint. Two-checkpoint is described in Jim Gray's book, too. But they don't refer to how the recovery manager tunes the speed of writing. > slightly different from the proposals here. What if all the database page > writes (background writer, buffer eviction, or checkpoint scan) were > counted and periodic fsync requests send to the bgwriter based on that? > For example, when I know I have a battery-backed caching controller that > will buffer 64MB worth of data for me, if I forced a fsync after every > 6000 8K writes, no single fsync would get stuck waiting for the disk to > write for longer than I'd like. That seems interesting. > You can do sync > writes with perfectly good performance on systems with a good > battery-backed cache, but I think you'll get creamed in comparisons > against MySQL on IDE disks if you start walking down that path; since > right now a fair comparison with similar logging behavior is an even match > there, that's a step backwards. I wonder what characteristics SATA disks have compared to IDE. Recent PCs are equiped with SATA disks, aren't they? What do you feel your approach compares to MySQL on IDE disks? > Also on the topic of sync writes to the database proper: wouldn't using > O_DIRECT for those potentially counter-productive? I was under the > impressions that one of the behaviors counted on by Postgres was that data > evicted from its buffer cache, eventually intended for writing to disk, > was still kept around for a bit in the OS buffer cache. A subsequent read > because the data was needed again might find the data already in the OS > buffer, therefore avoiding an actual disk read; that substantially reduces > the typical penalty for the database engine making a bad choice on what to > evict. I fear a move to direct writes would put more pressure on the LRU > implementation to be very smart, and that's code that you really don't > want to be more complicated. I'm worried about this, too.
> (3) is very strange. Your machine seems to be too restricted
> by WAL, so that other factors cannot be measured properly.

Right... It takes as long as 15 seconds to fsync a 1GB file. It's strange. This is a borrowed PC server, so the disk may be RAID 5? However, the WAL disk and DB disks show the same throughput. I'll investigate. I may have to find another machine.

> - Pentium4 3.6GHz with HT / 3GB RAM / Windows XP :-)

Oh, Windows. Maybe the fsync() problem Itagaki-san pointed out does not exist there. BTW, your env is showing an attractive result, isn't it?
> > If you use Linux, try the following settings:
> > 1. Decrease /proc/sys/vm/dirty_ratio and dirty_background_ratio.

You will need to pair this with the bgwriter_* settings, else too few pages are written to the OS in between checkpoints.

> > 2. Increase wal_buffers to reduce WAL flushing.

You will want the possibility of single group writes to be able to reach 256kB. The default is thus not enough when you have enough RAM. You also want enough so that new txns don't need to wait for an empty buffer (that is only freed by a write).

> > 3. Set wal_sync_method to open_sync; O_SYNC is faster than fsync().

O_SYNC's only advantage over fdatasync is that it saves a system call, since it still passes through the OS cache, but the disadvantage is that it does not let the OS group writes. Thus it is more susceptible to too few wal_buffers. What you want is O_DIRECT + enough wal_buffers to allow 256kB writes.

> > 4. Separate data and WAL files into different partitions or disks.

While this is generally suggested, I somehow doubt the validity when you only have a few disk spindles. If e.g. you only have 2-3 (mirrored) disks I wouldn't do it (at least on average 70/30 read-write systems).

Andreas
On Thu, 2006-12-21 at 18:46 +0900, ITAGAKI Takahiro wrote: > "Takayuki Tsunakawa" <tsunakawa.takay@jp.fujitsu.com> wrote: > > > > If you use Linux, it has very unpleased behavior in fsync(); It locks all > > > metadata of the file being fsync-ed. We have to wait for the completion of > > > fsync when we do read(), write(), and even lseek(). > > > > Oh, really, what an evil fsync is! Yes, I sometimes saw a backend > > waiting for lseek() to complete when it committed. But why does the > > backend which is syncing WAL/pg_control have to wait for syncing the > > data file? They are, not to mention, different files, and WAL and > > data files are stored on separate disks. > > Backends call lseek() in planning, so they have to wait fsync() to > the table that they will access. Even if all of data in the file is in > the cache, lseek() conflict with fsync(). You can see a lot of backends > are waiting in planning phase in checkpoints, not executing phase. It isn't clear to me why you are doing planning during a test at all. If you are doing replanning during test execution then the real performance problem will be the planning, not the fact that the fsync stops planning from happening. Prepared queries are only replanned manually, so the chances of replanning during a checkpoint are fairly low. So although it sounds worrying, I'm not sure that we'll want to alter the use of lseek during planning - though there may be other arguments also. I have also seen cases where the WAL drive, even when separated, appears to spike upwards during a checkpoint. My best current theory, so far untested, is that the WAL and data drives are using the same CFQ scheduler and that the scheduler actively slows down WAL requests when it need not. Mounting the drives as separate block drives with separate schedulers, CFQ for data and Deadline for WAL should help. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
I have a new idea. Rather than increasing write activity as we approach checkpoint, I think there is an easier solution. I am very familiar with the BSD kernel, and it seems they have a similar issue in trying to smooth writes:

http://www.brno.cas.cz/cgi-bin/bsdi-man?proto=1.1&query=update&msection=4&apropos=0

UPDATE(4)                BSD Programmer's Manual                UPDATE(4)

NAME
     update - trickle sync filesystem caches to disk

DESCRIPTION
     At system boot time, the kernel starts filesys_syncer, process 3.
     This process helps protect the integrity of disk volumes by ensuring
     that volatile cached filesystem data are written to disk within the
     vfs.generic.syncdelay interval which defaults to thirty seconds (see
     sysctl(8)). When a vnode is first written it is placed
     vfs.generic.syncdelay seconds down on the trickle sync queue. If it
     still exists and has dirty data when it reaches the top of the queue,
     filesys_syncer writes it to disk. This approach evens out the load on
     the underlying I/O system and avoids writing short-lived files. The
     papers on trickle-sync tend to favor aging based on buffers rather
     than files. However, BSD/OS synchronizes on file age rather than
     buffer age because the data structures are much smaller as there are
     typically far fewer files than buffers. Although this can make the
     I/O bursty when a big file is written to disk, it is still much
     better than the wholesale writes that were being done by the historic
     update process which wrote all dirty data buffers every 30 seconds.
     It also adapts much better to the soft update code which wants to
     control aging to improve performance (inodes age in one third of
     vfs.generic.syncdelay seconds, directories in one half of
     vfs.generic.syncdelay seconds). This ordering ensures that most
     dependencies are gone (e.g., inodes are written when directory
     entries want to go to disk) reducing the amount of work that the soft
     update code needs to do.

I assume other kernels have similar I/O smoothing, so that data sent to the kernel via write() gets to disk within 30 seconds. I assume write() is not our checkpoint performance problem, but the transfer to disk via fsync(). Perhaps a simple solution is to do the write()'s of all dirty buffers as we do now at checkpoint time, but delay 30 seconds and then do fsync() on all the files. The goal here is that during the 30-second delay, the kernel will be forcing data to the disk, so the fsync() we eventually do will only be for the write() of buffers during the 30-second delay, and because we wrote all dirty buffers 30 seconds ago, there shouldn't be too many of them.

I think the basic difference between this and the proposed patch is that we do not put delays in the buffer write() or fsync() phases --- we just put a delay _between_ the phases, and wait for the kernel to smooth it out for us. The kernel certainly knows more about what needs to get to disk, so it seems logical to let it do the I/O smoothing.

---------------------------------------------------------------------------

Bruce Momjian wrote: > > I have thought a while about this and I have some ideas. > > Ideally, we would be able to trickle the sync of individuals blocks > during the checkpoint, but we can't because we rely on the kernel to > sync all dirty blocks that haven't made it to disk using fsync(). We > could trickle the fsync() calls, but that just extends the amount of > data we are writing that has been dirtied post-checkpoint. 
In an ideal > world, we would be able to fsync() only part of a file at a time, and > only those blocks that were dirtied pre-checkpoint, but I don't see that > happening anytime soon (and one reason why many commercial databases > bypass the kernel cache). > > So, in the real world, one conclusion seems to be that our existing > method of tuning the background writer just isn't good enough for the > average user: > > #bgwriter_delay = 200ms # 10-10000ms between rounds > #bgwriter_lru_percent = 1.0 # 0-100% of LRU buffers scanned/round > #bgwriter_lru_maxpages = 5 # 0-1000 buffers max written/round > #bgwriter_all_percent = 0.333 # 0-100% of all buffers scanned/round > #bgwriter_all_maxpages = 5 # 0-1000 buffers max written/round > > These settings control what the bgwriter does, but they do not clearly > relate to the checkpoint timing, which is the purpose of the bgwriter, > and they don't change during the checkpoint interval, which is also less > than ideal. If set to aggressively, it writes too much, and if too low, > the checkpoint does too much I/O. > > We clearly need more bgwriter activity as the checkpoint approaches, and > one that is more auto-tuned, like many of our other parameters. I think > we created these settings to see how they worked in the field, so it > probably time to reevaluate them based on field reports. > > I think the bgwriter should keep track of how far it is to the next > checkpoint, and use that information to increase write activity. > Basically now, during a checkpoint, the bgwriter does a full buffer scan > and fsync's all dirty files, so it changes from the configuration > parameter-defined behavior right to 100% activity. I think it would be > ideal if we could ramp up the writes so that when it is 95% to the next > checkpoint, it can be operating at 95% of the activity it would do > during a checkpoint. > > My guess is if we can do that, we will have much smoother performance > because we have more WAL writes just after checkpoint for newly-dirtied > pages, and the new setup will give us more write activity just before > checkpoint. > > One other idea is for the bgwriter to use O_DIRECT or O_SYNC to avoid > the kernel cache, so we are sure data will be on disk by checkpoint > time. This was avoided in the past because of the expense of > second-guessing the kernel disk I/O scheduling algorithms. > > --------------------------------------------------------------------------- > > Tom Lane wrote: > > "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > > > "Jim C. Nasby" <jim@nasby.net> wrote: > > >> Generally, I try and configure the all* settings so that you'll get 1 > > >> clock-sweep per checkpoint_timeout. It's worked pretty well, but I don't > > >> have any actual tests to back that methodology up. > > > > > We got to these numbers somewhat scientifically. I studied I/O > > > patterns under production load and figured we should be able to handle > > > about 800 writes in per 200 ms without causing problems. I have to > > > admit that I based the percentages and the ratio between "all" and "lru" > > > on gut feel after musing over the documentation. > > > > I like Kevin's settings better than what Jim suggests. If the bgwriter > > only makes one sweep between checkpoints then it's hardly going to make > > any impact at all on the number of dirty buffers the checkpoint will > > have to write. 
The point of the bgwriter is to reduce the checkpoint > > I/O spike by doing writes between checkpoints, and to have any > > meaningful impact on that, you'll need it to make the cycle several times. > > > > Another point here is that you want checkpoints to be pretty far apart > > to minimize the WAL load from full-page images. So again, a bgwriter > > that's only making one loop per checkpoint is not gonna be doing much. > > > > I wonder whether it would be feasible to teach the bgwriter to get more > > aggressive as the time for the next checkpoint approaches? Writes > > issued early in the interval have a much higher probability of being > > wasted (because the page gets re-dirtied later). But maybe that just > > reduces to what Takahiro-san already suggested, namely that > > checkpoint-time writes should be done with the same kind of scheduling > > the bgwriter uses outside checkpoints. We still have the problem that > > the real I/O storm is triggered by fsync() not write(), and we don't > > have a way to spread out the consequences of fsync(). > > > > regards, tom lane > > > > ---------------------------(end of broadcast)--------------------------- > > TIP 9: In versions below 8.0, the planner will ignore your desire to > > choose an index scan if your joining column's datatypes do not > > match > > -- > Bruce Momjian bruce@momjian.us > EnterpriseDB http://www.enterprisedb.com > > + If your life is a hard drive, Christ can be your backup. + > > ---------------------------(end of broadcast)--------------------------- > TIP 4: Have you searched our list archives? > > http://archives.postgresql.org -- Bruce Momjian bruce@momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
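The "write, wait, then fsync" shape of this proposal is simple enough to show as a toy program. The sketch below is only an illustration under the assumption that the kernel's trickle-sync drains most of the dirty data during the pause; the file name, page count, and 30-second figure are placeholders:

#include <fcntl.h>
#include <unistd.h>

/* Toy illustration: hand all the dirty pages to the kernel first, pause
 * so its trickle-sync can flush them, then fsync() what little remains. */
int
main(void)
{
    char page[8192] = {0};
    int  fd = open("/tmp/datafile", O_CREAT | O_WRONLY | O_TRUNC, 0600);
    int  i;

    if (fd < 0)
        return 1;

    for (i = 0; i < 1000; i++)          /* "write dirty buffers" phase */
        (void) write(fd, page, sizeof(page));

    sleep(30);                          /* let the kernel flush in the background */

    (void) fsync(fd);                   /* hopefully now a small, cheap sync */
    close(fd);
    return 0;
}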
On 12/22/06, Takayuki Tsunakawa <tsunakawa.takay@jp.fujitsu.com> wrote:
> From: Inaam Rana
> > Which IO Scheduler (elevator) are you using?
> Elevator? Sorry, I'm not familiar with the kernel implementation, so I don't know what it is. My Linux distribution is Red Hat Enterprise Linux 4.0 for AMD64/EM64T, and the kernel is 2.6.9-42.ELsmp. I probably haven't changed any kernel settings, except for the IPC settings to run PostgreSQL.
There are four IO schedulers in Linux: anticipatory, CFQ (the default), deadline, and noop. For typical OLTP-type loads, deadline is generally recommended. If you are constrained on CPU and you have a good controller, then it's better to use noop.
Deadline attempts to merge requests by maintaining two red black trees in sector sort order and it also ensures that a request is serviced in given time by using FIFO. I don't expect it to do the magic but was wondering that it may dilute the issue of fsync() elbowing out WAL writes.
You can look into /sys/block/<device>/queue/scheduler to see which scheduler you are using.
regards,
inaam
--
Inaam Rana
EnterpriseDB http://www.enterprisedb.com
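On kernels that expose it, the sysfs file Inaam mentions lists the available elevators with the active one in square brackets, e.g. "noop anticipatory deadline [cfq]". A tiny sketch for reading it follows; the device name "sda" is just an example, and, as the next message shows, older kernels such as 2.6.9 may not have this file at all.

/* Minimal sketch: print the I/O elevator reported by sysfs for one device. */
#include <stdio.h>

int
main(void)
{
    FILE *f = fopen("/sys/block/sda/queue/scheduler", "r");
    char  line[256];

    if (f == NULL)
    {
        perror("open scheduler file");
        return 1;
    }
    if (fgets(line, sizeof(line), f) != NULL)
        printf("scheduler: %s", line);
    fclose(f);
    return 0;
}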
"Bruce Momjian" <bruce@momjian.us> writes: > I have a new idea. Rather than increasing write activity as we approach > checkpoint, I think there is an easier solution. I am very familiar > with the BSD kernel, and it seems they have a similar issue in trying to > smooth writes: Just to give a bit of context for this. The traditional mechanism for syncing buffers to disk on BSD which this daemon was a replacement for was to simply call "sync" every 30s. Compared to that this daemon certainly smooths the I/O out over the 30s window... Linux has a more complex solution to this (of course) which has undergone a few generations over time. Older kernels had a user space daemon called bdflush which called an undocumented syscall every 5s. More recent ones have a kernel thread called pdflush. I think both have various mostly undocumented tuning knobs but neither makes any sort of guarantee about the amount of time a dirty buffer might live before being synced. Your thinking is correct but that's already the whole point of bgwriter isn't it? To get the buffers out to the kernel early in the checkpoint interval so that come checkpoint time they're hopefully already flushed to disk. As long as your checkpoint interval is well over 30s only the last 30s (or so, it's a bit fuzzier on Linux) should still be at risk of being pending. I think the main problem with an additional pause in the hopes of getting more buffers synced is that during the 30s pause on a busy system there would be a continual stream of new dirty buffers being created as bgwriter works and other backends need to reuse pages. So when the fsync is eventually called there will still be a large amount of i/o to do. Fundamentally the problem is that fsync is too blunt an instrument. We only need to fsync the buffers we care about, not the entire file. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
On Fri, 22 Dec 2006, Simon Riggs wrote:
> I have also seen cases where the WAL drive, even when separated, appears
> to spike upwards during a checkpoint. My best current theory, so far
> untested, is that the WAL and data drives are using the same CFQ
> scheduler and that the scheduler actively slows down WAL requests when
> it need not. Mounting the drives as separate block devices with separate
> schedulers, CFQ for data and deadline for WAL, should help.

The situation I've been seeing is that the database needs a new block to complete a query and issues a read request to get it, but that read is queued behind the big checkpoint fsync. The client sits there for quite some time waiting for the fsync to finish before it gets the data it needs, and now your trivial SELECT took seconds to complete. It's fairly easy to replicate this problem using pgbench on Linux--I've seen a query sit there for 15 seconds when going out of my way to aggravate the behavior. One of Takayuki's posts here mentioned a worst-case delay of 13 seconds; that's the problem rearing its ugly head.

You may be right that what you're seeing would be solved with more complicated tuning on a per-device basis (which, by the way, isn't available unless you're running a more recent Linux kernel than most distributions ship). You can tune the schedulers all day and not make a lick of difference to what I've been running into; I know, I tried.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
Gregory Stark wrote:
> I think the main problem with an additional pause in the hopes of getting
> more buffers synced is that during the 30s pause on a busy system there
> would be a continual stream of new dirty buffers being created as bgwriter
> works and other backends need to reuse pages. So when the fsync is
> eventually called there will still be a large amount of i/o to do.
> Fundamentally the problem is that fsync is too blunt an instrument. We
> only need to fsync the buffers we care about, not the entire file.

Well, one idea would be for the bgwriter not to do many write()'s between the massive checkpoint write()'s and the fsync()'s. That would cut down on the extra I/O that fsync() would have to do.

The problem I see with making the bgwriter do more writes between checkpoints is the overhead of those scans, and the overhead of doing writes that will be dirtied again before the checkpoint. With the delay-between-stages idea, we don't need to guess how aggressive the bgwriter needs to be --- we can just do the writes, and wait for a while.

On an idle system, would someone dirty a large file, and watch the disk I/O to see how long it takes for the I/O to complete to disk?

In what we have now, we are either having the bgwriter do too much I/O between checkpoints, or guaranteeing an I/O storm during a checkpoint by doing lots of write()'s and then calling fsync() right away. I don't see how we are ever going to get that properly tuned. Would someone code up a patch and test it?

--
Bruce Momjian   bruce@momjian.us
EnterpriseDB    http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +
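Bruce's request above ("dirty a large file, and watch the disk I/O") can also be approximated with a small standalone program instead of dd. The sketch below uses an arbitrary file name, size and nap length: it writes 1GB in 8kB chunks, naps so the kernel's background writeback can drain, then times the final fsync().

/* Rough sketch of the experiment asked for above. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

int
main(void)
{
    const size_t blocksz = 8192;      /* 8 kB, like a Postgres page */
    const long   nblocks = 131072;    /* 131072 * 8 kB = 1 GB */
    const int    napsecs = 30;        /* delay between write and fsync */
    char        *buf = calloc(1, blocksz);
    int          fd = open("dirty.dat", O_CREAT | O_TRUNC | O_WRONLY, 0600);
    time_t       t0;

    if (fd < 0 || buf == NULL)
    {
        perror("setup");
        return 1;
    }

    for (long i = 0; i < nblocks; i++)
        if (write(fd, buf, blocksz) != (ssize_t) blocksz)
        {
            perror("write");
            return 1;
        }

    sleep(napsecs);                   /* let the kernel writeback work */

    t0 = time(NULL);
    if (fsync(fd) != 0)
        perror("fsync");
    printf("fsync took ~%ld seconds after a %d second nap\n",
           (long) (time(NULL) - t0), napsecs);
    close(fd);
    return 0;
}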
Hello, Inaam-san,
> There are four IO schedulers in Linux. Anticipatory, CFQ (default), deadline, and noop. For typical OLTP type loads generally deadline is recommended. If you are constrained on CPU and you have a good controller then its better to use noop.
> Deadline attempts to merge requests by maintaining two red black trees in sector sort order and it also ensures that a request is serviced in given time by using FIFO. I don't expect it to do the magic but was wondering that it may dilute the issue of fsync() elbowing out WAL writes.
> You can look into /sys/block/<device>/queue/scheduler to see which scheduler you are using.
Thank you for your information.
I could only find the following files in /sys/block/<device>/queue/:
iosched
max_hw_sectors_kb
max_sectors_kb
nr_requests
read_ahead_kb
In iosched, the following files exist:
quantum (the content is "4")
queued (the content is "8")
From: "Bruce Momjian" <bruce@momjian.us> > On an idle system, would someone dirty a large file, and watch the disk > I/O to see how long it takes for the I/O to complete to disk? I ran "dd if=/dev/zero of=<file on DB disk> bs=8k count=`expr 1048576 / 8`, that is, writing 1GB file with 8KB write()'s. It took about 175 seconds for the kernel to flush buffers. I'll put the result of "iostat -x 5" which was started at the same time as dd. 175 seconds means 35 rows with nonzero %util * 5 seconds of interval. Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util sde 0.00 981.56 0.00 50.90 0.00 8439.28 0.00 4219.64 165.80 34.38 525.01 6.25 31.82 sde 0.00 25737.00 0.00 118.60 0.00 219820.80 0.00 109910.40 1853.46 6529.74 1587.08 8.43 100.02 sde 0.00 1912.97 0.00 127.74 0.00 16325.75 0.00 8162.87 127.80 8192.67 6502.98 7.81 99.82 sde 0.00 1728.00 0.00 117.80 0.00 14745.60 0.00 7372.80 125.17 8209.36 11227.29 8.49 100.02 sde 0.00 1536.00 0.00 103.80 0.00 13107.20 0.00 6553.60 126.27 8209.10 16729.09 9.64 100.02 sde 0.00 1344.00 0.00 91.60 0.00 11468.80 0.00 5734.40 125.21 8208.69 21481.22 10.92 100.02 sde 0.00 1532.80 0.00 98.00 0.00 13081.60 0.00 6540.80 133.49 8209.34 26583.57 10.21 100.02 sde 0.00 1632.00 0.00 107.00 0.00 13926.40 0.00 6963.20 130.15 8208.89 31662.93 9.35 100.02 sde 0.00 1536.00 0.00 106.40 0.00 13107.20 0.00 6553.60 123.19 8209.66 36443.07 9.40 100.02 sde 0.00 1442.89 0.00 95.59 0.00 12312.63 0.00 6156.31 128.81 8227.23 41446.78 10.49 100.24 sde 0.00 1532.80 0.00 103.00 0.00 13081.60 0.00 6540.80 127.01 8210.77 46606.03 9.71 100.04 sde 0.00 1440.00 0.00 92.00 0.00 12288.00 0.00 6144.00 133.57 8208.82 51421.78 10.87 100.02 sde 0.00 1344.00 0.00 91.80 0.00 11468.80 0.00 5734.40 124.93 8209.86 56524.37 10.90 100.02 sde 0.00 1539.08 0.00 101.00 0.00 13133.47 0.00 6566.73 130.03 8225.59 61477.93 9.92 100.22 sde 0.00 1436.80 0.00 95.40 0.00 12262.40 0.00 6131.20 128.54 8208.88 66566.42 10.48 100.02 sde 0.00 1344.00 0.00 92.40 0.00 11468.80 0.00 5734.40 124.12 8209.47 71466.12 10.82 100.02 sde 0.00 1102.40 0.00 101.00 0.00 9408.38 0.00 4704.19 93.15 8174.36 76538.41 9.88 99.82 sde 0.00 0.00 0.00 89.00 0.00 0.00 0.00 0.00 0.00 7855.72 80795.64 11.24 100.02 sde 0.00 0.00 0.00 91.38 0.00 0.00 0.00 0.00 0.00 7422.53 81823.89 10.97 100.22 sde 0.00 0.00 0.00 96.80 0.00 0.00 0.00 0.00 0.00 6919.07 83194.91 10.33 100.02 sde 0.00 0.00 0.00 91.78 0.00 0.00 0.00 0.00 0.00 6480.85 84657.04 10.92 100.22 sde 0.00 0.00 0.00 92.60 0.00 0.00 0.00 0.00 0.00 5997.22 84749.79 10.80 100.02 sde 0.00 0.00 0.00 91.02 0.00 0.00 0.00 0.00 0.00 5528.97 85345.07 10.97 99.82 sde 0.00 0.00 0.00 102.61 0.00 0.00 0.00 0.00 0.00 5059.61 85057.91 9.77 100.22 sde 0.00 0.00 0.00 93.20 0.00 0.00 0.00 0.00 0.00 4572.57 85284.49 10.73 100.02 sde 0.00 0.00 0.00 98.20 0.00 0.00 0.00 0.00 0.00 4109.15 86086.50 10.21 100.22 sde 0.00 0.00 0.00 91.42 0.00 0.00 0.00 0.00 0.00 3611.72 86405.24 10.92 99.82 sde 0.00 0.00 0.00 100.00 0.00 0.00 0.00 0.00 0.00 3135.62 86292.49 10.00 100.02 sde 0.00 0.00 0.00 100.40 0.00 0.00 0.00 0.00 0.00 2652.63 86609.79 9.96 100.02 sde 0.00 0.00 0.00 92.80 0.00 0.00 0.00 0.00 0.00 2153.69 86168.58 10.78 100.02 sde 0.00 0.00 0.00 88.80 0.00 0.00 0.00 0.00 0.00 1694.74 86275.58 11.26 100.02 sde 0.00 0.00 0.00 98.20 0.00 0.00 0.00 0.00 0.00 1241.41 86708.40 10.19 100.02 sde 0.00 0.00 0.00 95.20 0.00 0.00 0.00 0.00 0.00 747.52 86505.59 10.51 100.02 sde 0.00 0.00 0.00 89.20 0.00 0.00 0.00 0.00 0.00 283.44 86551.11 11.21 100.02 sde 0.00 51.20 0.00 17.20 0.00 449.60 
0.00 224.80 26.14 4.30 61572.65 9.05 15.56 sde 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sde 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sde 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sde 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sde 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Bruce Momjian <bruce@momjian.us> wrote:
> I assume write() is not our checkpoint performance problem, but the
> transfer to disk via fsync(). Perhaps a simple solution is to do the
> write()'s of all dirty buffers as we do now at checkpoint time, but
> delay 30 seconds and then do fsync() on all the files.

I think there are two kinds of platforms with different problems in checkpoints: the problem is in fsync() on one, and in write() on the other. It is complex, depending on the OS, the amount of memory, the disks, write-back caches and so on.

> I think the basic difference between this and the proposed patch is that
> we do not put delays in the buffer write() or fsync() phases --- we just
> put a delay _between_ the phases, and wait for the kernel to smooth it
> out for us. The kernel certainly knows more about what needs to get to
> disk, so it seems logical to let it do the I/O smoothing.

The two proposals do not conflict with each other, and a solution for one platform has no bad effect on the other. Can we employ both of them? I tested your proposal, but it did not work on my write-critical machine. However, if the idea works well on BSD or some other platforms, we would be better off adopting it.

[pgbench results]
 ...
 566.973777
 327.158222 <- (1) write()
 560.773868 <- (2) sleep
 544.106645 <- (3) fsync()
 ...

[changes in code] (This is a bad implementation, because shutdown takes a long time!)

void
FlushBufferPool(void)
{
    BufferSync();                         /* (1) write -- about 20s */

    time_t  start = time(NULL);
    while (time(NULL) - start < 30)       /* (2) sleep -- 30s */
    {
        pg_usleep(BgWriterDelay * 1000L);
        BgBufferSync();
        AbsorbFsyncRequests();
    }

    smgrsync();                           /* (3) fsync -- less than 200ms */
}

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
ITAGAKI Takahiro wrote:
> Bruce Momjian <bruce@momjian.us> wrote:
> > I assume write() is not our checkpoint performance problem, but the
> > transfer to disk via fsync(). Perhaps a simple solution is to do the
> > write()'s of all dirty buffers as we do now at checkpoint time, but
> > delay 30 seconds and then do fsync() on all the files.
>
> I tested your proposal, but it did not work on my write-critical machine.
>
> [pgbench results]
>  ...
>  566.973777
>  327.158222 <- (1) write()
>  560.773868 <- (2) sleep
>  544.106645 <- (3) fsync()

OK, so you are saying that performance dropped only during the write(), and not during the fsync()? Interesting.

I would like to know the results of a few tests just like you reported them above:

 1a) write spread out over 30 seconds
 1b) write with no delay

 2a) sleep(0)
 2b) sleep(30)

 3) fsync

I would like to know the performance at each stage for each combination, e.g. when using 1b, 2a, 3, performance during the write() phase was X, during the sleep it was Y, and during the fsync it was Z. (Of course, sleep(0) has no stage timing.)

--
Bruce Momjian   bruce@momjian.us
EnterpriseDB    http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +
On Fri, 2006-12-22 at 13:53 -0500, Bruce Momjian wrote: > I assume other kernels have similar I/O smoothing, so that data sent to > the kernel via write() gets to disk within 30 seconds. > > I assume write() is not our checkpoint performance problem, but the > transfer to disk via fsync(). Well, its correct to say that the transfer to disk is the source of the problem, but that doesn't only occur when we fsync(). There are actually two disk storms that occur, because of the way the fs cache works. [Ron referred to this effect uplist] Linux 2.6+ will attempt to write to disk any dirty blocks in excess of a certain threshold, the dirty _background_ratio, which defaults to 10% of RAM. So when the checkpoint issues lots of write() calls, we generally exceed the threshold and then begin io storm number 1 to get us back down to the dirty_background_ratio. When we issue the fsync() calls, we then begin io storm number 2, which takes us down close to zero dirty blocks (on a dedicated server). [Thanks to my colleague Richard Kennedy for confirming this via investigation; he's been on leave throughout this discussion, regrettably]. Putting delays in is very simple and does help, however, how much it helps depends upon: - the number of dirty blocks in shared_buffers - the dirty_background_ratio - the number of dirty blocks in each file when we fsync() For example, on a system with a very large RAM, and yet a medium write workload, the dirty_background_ratio may never be exceeded. In that case, all of the I/O happens during storm 2, so . If you set dirty_background_ratio lower, then most of the writes happen during storm 1. During storm 2, the fsync calls write all dirty blocks in a file to disk. In many cases, a few tables/files have all of the writes, so adding a delay between fsyncs doesn't spread out the writes like you would hope it would. Most of the time, storm 1 and storm 2 run together in a continuous stream, but sometimes you see a double peak. There is an overlap of a few seconds between 1 and 2 in many cases. Linux will also write blocks to disk after a period of inactivity, dirty_expire_centisecs which by default is 30 seconds. So putting a delay between storm1 and storm2 should help matters somewhat, but 30 secs is probably almost exactly the wrong number (by chance), though I do like the idea. > Perhaps a simple solution is to do the > write()'s of all dirty buffers as we do now at checkpoint time, but > delay 30 seconds and then do fsync() on all the files. So yes, putting a short delay, say 10 seconds, in at that point should help matters somewhat, sometimes. (But the exact number depends upon how the OS is tuned.) > The goal here is > that during the 30-second delay, the kernel will be forcing data to the > disk, so the fsync() we eventually do will only be for the write() of > buffers during the 30-second delay, and because we wrote all dirty > buffers 30 seconds ago, there shouldn't be too many of them. ...but not for that reason. IMHO the best thing to do is to 1. put a short delay between the write() steps in the checkpoint 2. put a longer delay in between the write() phase and the fsync() phase 3. tune the OS writeback mechanism to help smoothing Either set (1) and (2) as GUCs, or have code that reads the OS settings and acts accordingly. Or alternatively put, both Bruce and Itagaki-san have good ideas. 
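For reference, the Linux writeback knobs Simon refers to live under /proc/sys/vm and can simply be read as text. A minimal sketch follows; it is Linux-specific, and the paths are only expected to exist on 2.6-style kernels.

/* Small helper: print the Linux writeback tunables discussed above. */
#include <stdio.h>

static void
show(const char *path)
{
    FILE *f = fopen(path, "r");
    char  buf[64];

    if (f && fgets(buf, sizeof(buf), f))
        printf("%-40s %s", path, buf);
    if (f)
        fclose(f);
}

int
main(void)
{
    show("/proc/sys/vm/dirty_background_ratio");
    show("/proc/sys/vm/dirty_ratio");
    show("/proc/sys/vm/dirty_expire_centisecs");
    show("/proc/sys/vm/dirty_writeback_centisecs");
    return 0;
}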
> > So, in the real world, one conclusion seems to be that our existing > > method of tuning the background writer just isn't good enough for the > > average user: > > > > #bgwriter_delay = 200ms # 10-10000ms between rounds > > #bgwriter_lru_percent = 1.0 # 0-100% of LRU buffers scanned/round > > #bgwriter_lru_maxpages = 5 # 0-1000 buffers max written/round > > #bgwriter_all_percent = 0.333 # 0-100% of all buffers scanned/round > > #bgwriter_all_maxpages = 5 # 0-1000 buffers max written/round > > > > These settings control what the bgwriter does, but they do not clearly > > relate to the checkpoint timing, which is the purpose of the bgwriter, > > and they don't change during the checkpoint interval, which is also less > > than ideal. If set to aggressively, it writes too much, and if too low, > > the checkpoint does too much I/O. Yes, that's very clear. > > We clearly need more bgwriter activity as the checkpoint approaches I'd put it that we should write a block to disk prior to checkpoint if it appears that it won't be dirtied again if we do so. That doesn't necessarily translate directly into *more* activity. > , and > > one that is more auto-tuned, like many of our other parameters. I think > > we created these settings to see how they worked in the field, so it > > probably time to reevaluate them based on field reports. > > > > I think the bgwriter should keep track of how far it is to the next > > checkpoint, and use that information to increase write activity. > > Basically now, during a checkpoint, the bgwriter does a full buffer scan > > and fsync's all dirty files, so it changes from the configuration > > parameter-defined behavior right to 100% activity. I think it would be > > ideal if we could ramp up the writes so that when it is 95% to the next > > checkpoint, it can be operating at 95% of the activity it would do > > during a checkpoint. > > > > My guess is if we can do that, we will have much smoother performance > > because we have more WAL writes just after checkpoint for newly-dirtied > > pages, and the new setup will give us more write activity just before > > checkpoint. Well, as long as the kernel ignores Postgres and Postgres ignores the kernel, things will never be smooth (literally). If we write more, but are still below the dirty_background_ratio, it won't make the slightest bit of difference. Trying to get trustworthy/explicable performance test reports is already difficult for this reason. IMHO, the best approach would be one that takes into how the OS behaves, so we can work with it. Regrettably, I can't see any way of doing this other than OS-specific code of some shape or form. > > One other idea is for the bgwriter to use O_DIRECT or O_SYNC to avoid > > the kernel cache, so we are sure data will be on disk by checkpoint > > time. This was avoided in the past because of the expense of > > second-guessing the kernel disk I/O scheduling algorithms. Seems like a longer term best approach to me. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
On Mon, 2006-12-18 at 14:47 +0900, Takayuki Tsunakawa wrote: > Hello, Itagaki-san, all > > Sorry for my long mail. I've had trouble in sending this mail because > it's too long for pgsql-hackers to accept (I couldn't find how large > mail is accepted.) So I'm trying to send several times. > Please see the attachment for the content. Your results for fsync are interesting. I've noticed that a checkpoint seems to increase the activity on the WAL drive as well as increasing I/O wait times. That doesn't correspond to any real increase in WAL traffic I'm aware of. Have you tried setting deadline scheduler on the WAL device and CFQ on the data device? That should allow the I/Os to move through different queues and prevent interference. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
On Wed, Dec 27, 2006 at 09:24:06PM +0000, Simon Riggs wrote: > On Fri, 2006-12-22 at 13:53 -0500, Bruce Momjian wrote: > > > I assume other kernels have similar I/O smoothing, so that data sent to > > the kernel via write() gets to disk within 30 seconds. > > > > I assume write() is not our checkpoint performance problem, but the > > transfer to disk via fsync(). > > Well, its correct to say that the transfer to disk is the source of the > problem, but that doesn't only occur when we fsync(). There are actually > two disk storms that occur, because of the way the fs cache works. [Ron > referred to this effect uplist] As someone looking from the outside: fsync only works on one file, so presumably the checkpoint process is opening each file one by one and fsyncing them. Does that make any difference here? Could you adjust the timing here? Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > From each according to his ability. To each according to his ability to litigate.
On Wed, 2006-12-27 at 23:26 +0100, Martijn van Oosterhout wrote: > On Wed, Dec 27, 2006 at 09:24:06PM +0000, Simon Riggs wrote: > > On Fri, 2006-12-22 at 13:53 -0500, Bruce Momjian wrote: > > > > > I assume other kernels have similar I/O smoothing, so that data sent to > > > the kernel via write() gets to disk within 30 seconds. > > > > > > I assume write() is not our checkpoint performance problem, but the > > > transfer to disk via fsync(). > > > > Well, its correct to say that the transfer to disk is the source of the > > problem, but that doesn't only occur when we fsync(). There are actually > > two disk storms that occur, because of the way the fs cache works. [Ron > > referred to this effect uplist] > > As someone looking from the outside: > > fsync only works on one file, so presumably the checkpoint process is > opening each file one by one and fsyncing them. Yes > Does that make any > difference here? Could you adjust the timing here? Thats the hard bit with io storm 2. When you fsync a file you don't actually know how many blocks you're writing, plus there's no way to slow down those writes by putting delays between them (although its possible your controller might know how to do this, I've never heard of one that does). If we put a delay after each fsync, that will space out all the ones that don't need spacing out and do nothing to the ones that most need it. Unfortunately. IMHO there isn't any simple scheme that works all the time, for all OS settings, default configurations and mechanisms. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
Bruce Momjian <bruce@momjian.us> wrote:
> > 566.973777
> > 327.158222 <- (1) write()
> > 560.773868 <- (2) sleep
> > 544.106645 <- (3) fsync()
>
> OK, so you are saying that performance dropped only during the write(),
> and not during the fsync()? Interesting.

Almost, yes, but there is also a small drop during fsync (560 -> 540).

> I would like to know the results of a few tests just like you reported
> them above:
>
> 1a) write spread out over 30 seconds
> 1b) write with no delay
>
> 2a) sleep(0)
> 2b) sleep(30)
>
> 3) fsync
>
> I would like to know the performance at each stage for each combination,
> e.g. when using 1b, 2a, 3, performance during the write() phase was X,
> during the sleep it was Y, and during the fsync it was Z. (Of course,
> sleep(0) has no stage timing.)

I'm thinking about generalizing your idea by adding three parameters (checkpoint_write, checkpoint_naptime and checkpoint_fsync) to control the sleeps in each stage:

1) write() spread out over 'checkpoint_write' seconds.
2) sleep for 'checkpoint_naptime' seconds between write() and fsync().
3) fsync() spread out over 'checkpoint_fsync' seconds.

If all three parameters are zero, checkpoints behave the same as now. If checkpoint_write = checkpoint_timeout and the other two are zero, it is just like my earlier proposal.

As you might expect, I intend the above only for development purposes. Three additional parameters are hard for users to tune. If we can pull proper values out of the tests, we had better set those values as the defaults. I assume we can derive them from the existing checkpoint_timeout.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
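A rough sketch of how the three proposed parameters might pace a checkpoint is below. The buffer- and file-level helpers (NDirtyBuffers, SyncOneDirtyBuffer, NFilesToSync, SyncOneRelationFile) are hypothetical placeholders, not functions from the actual patch; only pg_usleep(), which takes microseconds, is a real PostgreSQL call.

/* Hypothetical sketch of the three-stage pacing described above. */
static void
PacedCheckpoint(int checkpoint_write,      /* seconds for the write() phase */
                int checkpoint_naptime,    /* seconds between the phases */
                int checkpoint_fsync)      /* seconds for the fsync() phase */
{
    int nbuffers = NDirtyBuffers();
    int nfiles   = NFilesToSync();
    int i;

    /* 1. spread the write()s over checkpoint_write seconds */
    for (i = 0; i < nbuffers; i++)
    {
        SyncOneDirtyBuffer(i);
        if (checkpoint_write > 0)
            pg_usleep((checkpoint_write * 1000000L) / nbuffers);
    }

    /* 2. nap so the kernel's writeback can drain what we just wrote */
    if (checkpoint_naptime > 0)
        pg_usleep(checkpoint_naptime * 1000000L);

    /*
     * 3. spread the fsync()s over checkpoint_fsync seconds.  This is
     * per file, so a single huge file still arrives as one burst,
     * which is the limitation discussed in the messages above.
     */
    for (i = 0; i < nfiles; i++)
    {
        SyncOneRelationFile(i);
        if (checkpoint_fsync > 0)
            pg_usleep((checkpoint_fsync * 1000000L) / nfiles);
    }
}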
Gregory Stark wrote:
> "Bruce Momjian" <bruce@momjian.us> writes:
>> I have a new idea. ...the BSD kernel...similar issue...to smooth writes:
> Linux has a more complex solution to this (of course) which has undergone a
> few generations over time. Older kernels had a user space daemon called
> bdflush which called an undocumented syscall every 5s. More recent ones have a
> kernel thread called pdflush. I think both have various mostly undocumented
> tuning knobs but neither makes any sort of guarantee about the amount of time
> a dirty buffer might live before being synced.

Earlier in this thread (around the 7th) there was a discussion of /proc/sys/vm/dirty_expire_centisecs and /proc/sys/vm/dirty_writeback_centisecs, which seem to be the tunables that matter here. Googling suggests that dirty_expire_centisecs specifies that data which has been dirty in memory for longer than this interval "will be written out the next time a pdflush daemon wakes up", and that dirty_writeback_centisecs "expresses the interval between those wakeups". It seems to me that the sum of the two times determines the maximum time before the kernel will start syncing a dirtied page.

The bottom line, though, is that both PostgreSQL and the OS are trying to delay writes in the hope of collapsing them, and the actual delay is the sum of the OS's delay and PostgreSQL's delay. I think Kevin Grittner's experimentation earlier in the thread did indeed suggest that getting writes to the OS faster and letting it handle the collapsing of the writes was an effective method of reducing painful checkpoints.
On Wed, Dec 27, 2006 at 10:54:57PM +0000, Simon Riggs wrote: > On Wed, 2006-12-27 at 23:26 +0100, Martijn van Oosterhout wrote: > > On Wed, Dec 27, 2006 at 09:24:06PM +0000, Simon Riggs wrote: > > > On Fri, 2006-12-22 at 13:53 -0500, Bruce Momjian wrote: > > > > > > > I assume other kernels have similar I/O smoothing, so that data sent to > > > > the kernel via write() gets to disk within 30 seconds. > > > > > > > > I assume write() is not our checkpoint performance problem, but the > > > > transfer to disk via fsync(). > > > > > > Well, its correct to say that the transfer to disk is the source of the > > > problem, but that doesn't only occur when we fsync(). There are actually > > > two disk storms that occur, because of the way the fs cache works. [Ron > > > referred to this effect uplist] > > > > As someone looking from the outside: > > > > fsync only works on one file, so presumably the checkpoint process is > > opening each file one by one and fsyncing them. > > Yes > > > Does that make any > > difference here? Could you adjust the timing here? > > Thats the hard bit with io storm 2. When you fsync a file you don't > actually know how many blocks you're writing, plus there's no way to > slow down those writes by putting delays between them (although its > possible your controller might know how to do this, I've never heard of > one that does). Any controller that sophisticated would likely also have a BBU and write caching, which should greatly reduce the impact of at least the fsync storm... unless you fill the cache. I suspect we might need a way to control how much data we try and push out at a time to avoid that... As for settings, I really like the simplicity of the Oracle system... "Just try to ensure recovery takes about X amount of seconds". I like the idea of a creeping checkpoint even more; only writing a buffer out when we need to checkpoint it makes a lot more sense to me than trying to guess when we'll next dirty a buffer. Such a system would probably also be a lot easier to tune than the current bgwriter, even if we couldn't simplify it all the way to "seconds for recovery". -- Jim Nasby jim@nasby.net EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
ITAGAKI Takahiro wrote: > > Bruce Momjian <bruce@momjian.us> wrote: > > > > 566.973777 > > > 327.158222 <- (1) write() > > > 560.773868 <- (2) sleep > > > 544.106645 <- (3) fsync() > > > > OK, so you are saying that performance dropped only during the write(), > > and not during the fsync()? Interesting. > > Almost yes, but there is a small drop in fsync. (560->540) > > > > I would like to know the > > results of a few tests just like you reported them above: > > > > 1a) write spread out over 30 seconds > > 1b) write with no delay > > > > 2a) sleep(0) > > 2b) sleep(30) > > > > 3) fsync > > > > I would like to know the performance at each stage for each combination, > > e.g. when using 1b, 2a, 3, performance during the write() phase was X, > > during the sleep it was Y, and during the fsync it was Z. (Of course, > > sleep(0) has no stage timing.) > > I'm thinking about generalizing your idea; Adding three parameters > (checkpoint_write, checkpoint_naptime and checkpoint_fsync) > to control sleeps in each stage. > > 1) write() spread out over 'checkpoint_write' seconds. > 2) sleep 'checkpoint_naptime' seconds between write() and fsync(). > 3) fsync() spread out over 'checkpoint_fsync' seconds. > > If three parameter are all zero, checkpoints behave as the same as now. > If checkpoint_write = checkpoint_timeout and other two are zero, > it is just like my proposal before. > > > As you might expect, I intend the above only for development purpose. > Additinal three parameters are hard to use for users. If we can pull out > some proper values from the tests, we'd better to set those values as > default. I assume we can derive them from existing checkpoint_timeout. Great idea, though I wouldn't bother with checkpoint_fsync. I think Simon's previous email spelled out the problems of trying to delay fsyncs() --- in most cases, there will be one file with most of the I/O, and that fsync is going to be the flood. Basically, I think the variability of table access is too great for the fsync delay to ever be tunable by users. To summarize, if we could have fsync() only write the dirty buffers that happened as part of the checkpoint, we could delay the write() for the entire time between checkpoints, but we can't do that, so we have to make it user-tunable. -- Bruce Momjian bruce@momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On Thu, Dec 28, 2006 at 12:50:19PM -0500, Bruce Momjian wrote: > To summarize, if we could have fsync() only write the dirty buffers that > happened as part of the checkpoint, we could delay the write() for the > entire time between checkpoints, but we can't do that, so we have to > make it user-tunable. What about the mmap/msync(?)/munmap idea someone mentioned? -- Jim Nasby jim@nasby.net EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
Jim C. Nasby wrote: > On Thu, Dec 28, 2006 at 12:50:19PM -0500, Bruce Momjian wrote: > > To summarize, if we could have fsync() only write the dirty buffers that > > happened as part of the checkpoint, we could delay the write() for the > > entire time between checkpoints, but we can't do that, so we have to > > make it user-tunable. > > What about the mmap/msync(?)/munmap idea someone mentioned? I see that as similar to using O_DIRECT during checkpoint, which had poor performance. -- Bruce Momjian bruce@momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Bruce Momjian <bruce@momjian.us> writes: > Jim C. Nasby wrote: >> What about the mmap/msync(?)/munmap idea someone mentioned? > I see that as similar to using O_DIRECT during checkpoint, which had > poor performance. That's a complete nonstarter on portability grounds, even if msync gave us the desired semantics, which it doesn't. It's no better than fsync for our purposes. To my mind the problem with fsync is not that it gives us too little control but that it gives too much: we have to specify a particular order of writing out files. What we'd really like is a version of sync(2) that tells us when it's done but doesn't constrain the I/O scheduler's choices at all. Unfortunately there's no such API ... regards, tom lane
Tom Lane wrote: > Bruce Momjian <bruce@momjian.us> writes: > > Jim C. Nasby wrote: > >> What about the mmap/msync(?)/munmap idea someone mentioned? > > > I see that as similar to using O_DIRECT during checkpoint, which had > > poor performance. > > That's a complete nonstarter on portability grounds, even if msync gave > us the desired semantics, which it doesn't. It's no better than fsync > for our purposes. > > To my mind the problem with fsync is not that it gives us too little > control but that it gives too much: we have to specify a particular > order of writing out files. What we'd really like is a version of > sync(2) that tells us when it's done but doesn't constrain the I/O > scheduler's choices at all. Unfortunately there's no such API ... Yea, we used to use sync() but that did all files, not just the PostgreSQL ones. -- Bruce Momjian bruce@momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Tom Lane wrote:
> To my mind the problem with fsync is not that it gives us too little
> control but that it gives too much: we have to specify a particular
> order of writing out files. What we'd really like is a version of
> sync(2) that tells us when it's done but doesn't constrain the I/O
> scheduler's choices at all. Unfortunately there's no such API ...

The problem I see with fsync is that it causes an immediate I/O storm as the OS tries to flush everything out as quickly as possible. But we're not in a hurry. What we'd need is a lazy fsync, that would tell the operating system "let me know when all these dirty buffers are written to disk, but I'm not in a hurry, take your time". It wouldn't change the scheduling of the writes, just inform the caller when they're done.

If we wanted more precise control of the flushing, we could use sync_file_range on Linux, but that's not portable. Nevertheless, I think it would be OK to have an ifdef and use it on platforms that support it, if it gave a benefit.

As a side note, with full_page_writes on, a checkpoint wouldn't actually need to fsync those pages that have been written to WAL after the checkpoint started. Doesn't make much difference in most cases, but we could take that into account if we start taking more control of the flushing.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
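For readers unfamiliar with the call Heikki mentions: sync_file_range() is Linux-only and takes a file range plus flags. Passing only SYNC_FILE_RANGE_WRITE starts writeback of the dirty pages in the range without waiting for it to finish -- close in spirit to the "lazy fsync" wished for above, except that the caller is not notified when the write-out completes and file metadata is not flushed. A minimal sketch, assuming a reasonably recent 2.6 kernel and glibc:

/* Kick off writeback for a file's dirty pages without blocking. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>

int
start_lazy_writeback(int fd)
{
    /* offset 0, nbytes 0 means "from offset to the end of the file" */
    if (sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE) != 0)
    {
        perror("sync_file_range");
        return -1;
    }
    return 0;
}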
On Thu, Dec 28, 2006 at 09:28:48PM +0000, Heikki Linnakangas wrote: > Tom Lane wrote: > >To my mind the problem with fsync is not that it gives us too little > >control but that it gives too much: we have to specify a particular > >order of writing out files. What we'd really like is a version of > >sync(2) that tells us when it's done but doesn't constrain the I/O > >scheduler's choices at all. Unfortunately there's no such API ... > > The problem I see with fsync is that it causes an immediate I/O storm as > the OS tries to flush everything out as quickly as possible. But we're > not in a hurry. What we'd need is a lazy fsync, that would tell the > operating system "let me know when all these dirty buffers are written > to disk, but I'm not in a hurry, take your time". It wouldn't change the > scheduling of the writes, just inform the caller when they're done. > > If we wanted more precise control of the flushing, we could use > sync_file_range on Linux, but that's not portable. Nevertheless, I think > it would be OK to have an ifdef and use it on platforms that support > it, if it gave a benefit. I believe there's something similar for OS X as well. The question is: would it be better to do that, or to just delay calling fsync until the OS has had a chance to write things out. > As a side note, with full_page_writes on, a checkpoint wouldn't actually > need to fsync those pages that have been written to WAL after the > checkpoint started. Doesn't make much difference in most cases, but we > could take that into account if we start taking more control of the > flushing. Hrm, interesting point, but I suspect the window involved there is too small to be worth worrying about. -- Jim Nasby jim@nasby.net EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
"Jim C. Nasby" <jim@nasby.net> writes: > I believe there's something similar for OS X as well. The question is: > would it be better to do that, or to just delay calling fsync until the > OS has had a chance to write things out. A delay is not going to help unless you can suppress additional writes to the file, which I don't think you can unless there's very little going on in the database --- dirty buffers have to get written to make room for other pages, checkpoint in progress or no. regards, tom lane
Tom Lane wrote: > "Jim C. Nasby" <jim@nasby.net> writes: > > I believe there's something similar for OS X as well. The question is: > > would it be better to do that, or to just delay calling fsync until the > > OS has had a chance to write things out. > > A delay is not going to help unless you can suppress additional writes > to the file, which I don't think you can unless there's very little > going on in the database --- dirty buffers have to get written to make > room for other pages, checkpoint in progress or no. I am afraid a delay between write and fsync is the only portable option we have right now --- there is hope that since the check point write, we will not have a huge number of dirty buffers at the start of the checkpoint that need to be written out. -- Bruce Momjian bruce@momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On Fri, Dec 29, 2006 at 09:02:11PM -0500, Bruce Momjian wrote: > Tom Lane wrote: > > "Jim C. Nasby" <jim@nasby.net> writes: > > > I believe there's something similar for OS X as well. The question is: > > > would it be better to do that, or to just delay calling fsync until the > > > OS has had a chance to write things out. > > > > A delay is not going to help unless you can suppress additional writes > > to the file, which I don't think you can unless there's very little > > going on in the database --- dirty buffers have to get written to make > > room for other pages, checkpoint in progress or no. > > I am afraid a delay between write and fsync is the only portable option > we have right now --- there is hope that since the check point write, we > will not have a huge number of dirty buffers at the start of the > checkpoint that need to be written out. We could potentially control that behavior, too. For example, trying to suppress writes to a given file when we're getting ready to checkpoint it (of course, separating checkpointing into a file-by-file operation is a non-trivial change, but it should be possible). -- Jim Nasby jim@nasby.net EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
> > I believe there's something similar for OS X as well. The question is: > > would it be better to do that, or to just delay calling fsync until the > > OS has had a chance to write things out. > > A delay is not going to help unless you can suppress additional writes > to the file, which I don't think you can unless there's very little > going on in the database --- dirty buffers have to get written to make > room for other pages, checkpoint in progress or no. But the checkpoint first writes all dirty pages, so we have more than the usual number of pages that can be replaced without a write. Thus we have a window in which we can wait until dirty pages have to be replaced again. Since the bgwriter is sleeping until the fsync, only pages that have to be replaced will be written. The question is how we time that window. Andreas
Happy new year From: "Simon Riggs" <simon@2ndquadrant.com> > Have you tried setting deadline scheduler on the WAL device and CFQ on > the data device? That should allow the I/Os to move through different > queues and prevent interference. No, I've not tried yet. Inaam-san told me that Linux had a few I/O schedulers but I'm not familiar with them. I'll find information about them (how to change the scheduler settings) and try the same test. ----- Original Message ----- From: "Simon Riggs" <simon@2ndquadrant.com> To: "Takayuki Tsunakawa" <tsunakawa.takay@jp.fujitsu.com> Cc: "ITAGAKI Takahiro" <itagaki.takahiro@oss.ntt.co.jp>; <pgsql-hackers@postgresql.org> Sent: Thursday, December 28, 2006 7:07 AM Subject: Re: [HACKERS] Load distributed checkpoint > On Mon, 2006-12-18 at 14:47 +0900, Takayuki Tsunakawa wrote: >> Hello, Itagaki-san, all >> >> Sorry for my long mail. I've had trouble in sending this mail because >> it's too long for pgsql-hackers to accept (I couldn't find how large >> mail is accepted.) So I'm trying to send several times. >> Please see the attachment for the content. > > Your results for fsync are interesting. > > I've noticed that a checkpoint seems to increase the activity on the WAL > drive as well as increasing I/O wait times. That doesn't correspond to > any real increase in WAL traffic I'm aware of. > > Have you tried setting deadline scheduler on the WAL device and CFQ on > the data device? That should allow the I/Os to move through different > queues and prevent interference. > > -- > Simon Riggs > EnterpriseDB http://www.enterprisedb.com > > >
I wrote:
> I'm thinking about generalizing your idea by adding three parameters
> to control the sleeps in each stage.

I sent a patch to -patches that adds 3+1 GUC parameters for checkpoints. We can use three of them to control the sleeps in each stage of a checkpoint. The last one is an experimental approach that replaces fsync() for finer control.

1. checkpoint_write_duration (default=0, in seconds)
   Sets the duration of the write() phase in checkpoints.
2. checkpoint_nap_duration (default=0, in seconds)
   Sets the duration between the write() and fsync() phases in checkpoints.
3. checkpoint_sync_duration (default=0, in seconds)
   Sets the duration of the fsync() phase in checkpoints.

The 1st parameter spreads out the write()s. If you set checkpoint_write_duration to 90% of checkpoint_timeout, it is just the same as the patch I sent before.

The 2nd is the naptime between the write() and fsync() phases. The kernel's writer might do much of the work if you set it to around 30-60s; that might be useful for some traditional UNIXes, as you say. In contrast, the 1st was the most variable on my machines somehow (Windows and Linux).

The 3rd spreads out the fsync()s. This parameter only works when you have several tables or a very large table (one that consists of several 1GB files), because fsync() operates on a file basis.

Bruce Momjian <bruce@momjian.us> wrote:
> To summarize, if we could have fsync() only write the dirty buffers that
> happened as part of the checkpoint, we could delay the write() for the
> entire time between checkpoints, but we can't do that, so we have to
> make it user-tunable.

Because the 3rd has the above limitation, I added another parameter.

4. checkpoint_sync_size (default=0, in KB)
   Sets the synchronization unit of data files in checkpoints.

It uses sync_file_range or mmap/msync/munmap to divide file synchronization into the specified granularity. I think 16-64MB fits the machines whose performance is restricted by fsync() in checkpoints. The feature is incomplete; for example, sync_file_range does not in fact flush file metadata (it is the equivalent of fdatasync), so we may lose data under the patch. That must be fixed, but I want to measure the advantage first.

I'm interested in which parameter is useful in each environment. Any comments and testing reports will be appreciated.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
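One way the checkpoint_sync_size idea could look, using the same sync_file_range() call shown earlier rather than the mmap/msync variant. This is a sketch, not code from the patch: the chunk size and the pause are illustrative, and, as Itagaki notes above, file metadata is not flushed this way.

/* Sketch: push a file to disk in fixed-size pieces with a pause between. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

int
sync_file_in_chunks(int fd, long sync_size_kb, long pause_usec)
{
    struct stat st;
    off_t       chunk = (off_t) sync_size_kb * 1024;
    off_t       offset;

    if (fstat(fd, &st) != 0)
        return -1;

    for (offset = 0; offset < st.st_size; offset += chunk)
    {
        /* flush just this range, waiting for it to reach disk */
        if (sync_file_range(fd, offset, chunk,
                            SYNC_FILE_RANGE_WAIT_BEFORE |
                            SYNC_FILE_RANGE_WRITE |
                            SYNC_FILE_RANGE_WAIT_AFTER) != 0)
            return -1;
        usleep(pause_usec);            /* spread the bursts out */
    }
    return 0;
}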
No, I've not tried yet. Inaam-san told me that Linux had a few I/O
schedulers but I'm not familiar with them. I'll find information
about them (how to change the scheduler settings) and try the same
test.
I am sorry, your response just slipped by me. The docs for RHEL (I believe you are running RHEL, which has the 2.6.9 kernel) say that it does support a selectable IO scheduler.
http://www.redhat.com/rhel/details/limits/
I am not sure where else to look for scheduler apart from /sys
regards,
inaam
Thread added to TODO list: * Reduce checkpoint performance degredation by forcing data to disk more evenly http://archives.postgresql.org/pgsql-hackers/2006-12/msg00337.php http://archives.postgresql.org/pgsql-hackers/2007-01/msg00079.php --------------------------------------------------------------------------- ITAGAKI Takahiro wrote: > This is a proposal for load distributed checkpoint. > (It is presented on postgresql anniversary summit in last summer.) > > > We offen encounters performance gap during checkpoint. The reason is write > bursts. Storage devices are too overworked in checkpoint, so they can not > supply usual transaction processing. > > Checkpoint consists of the following four steps, and the major performance > problem is 2nd step. All dirty buffers are written without interval in it. > > 1. Query information (REDO pointer, next XID etc.) > 2. Write dirty pages in buffer pool > 3. Flush all modified files > 4. Update control file > > I suggested to write pages with sleeping in 2nd step, using normal activity > of the background writer. It is something like cost-based vacuum delay. > Background writer has two pointers, 'ALL' and 'LRU', indicating where to > write out in buffer pool. We can wait for the ALL clock-hand going around > to guarantee all pages to be written. > > Here is pseudo-code for the proposed method. The internal loop is just the > same as bgwriter's activity. > > PrepareCheckPoint(); -- do step 1 > Reset num_of_scanned_pages by ALL activity; > do { > BgBufferSync(); -- do a part of step 2 > sleep(bgwriter_delay); > } while (num_of_scanned_pages < shared_buffers); > CreateCheckPoint(); -- do step 3 and 4 > > > We may accelerate background writer to reduce works at checkpoint instead of > the method, but it introduces another performance problem; Extra pressure > is always put on the storage devices to keep the number of dirty pages low. > > > I'm working about adjusting the progress of checkpoint to checkpoint timeout > and wal segments limitation automatically to avoid overlap of two checkpoints. > I'll post a patch sometime soon. > > Comments and suggestions welcome. > > Regards, > --- > ITAGAKI Takahiro > NTT Open Source Software Center > > > ---------------------------(end of broadcast)--------------------------- > TIP 6: explain analyze is your friend -- Bruce Momjian bruce@momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On 12/19/06, ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> wrote:
Itagaki,
Did you have a chance to look into this any further? We, at EnterpriseDB, have done some testing on this patch (dbt2 runs), and it looks like we are getting the desired results, particularly when we spread out both the sync and write phases.
-- "Takayuki Tsunakawa" <tsunakawa.takay@jp.fujitsu.com> wrote:
> I performed some simple tests, and I'll show the results below.
> (1) The default case
> 235 80 226 77 240
> (2) No write case
> 242 250 244 253 280
> (3) No checkpoint case
> 229 252 256 292 276
> (4) No fsync() case
> 236 112 215 216 221
> (5) No write by PostgreSQL, but fsync() by another program case
> 9 223 260 283 292
> (6) case (5) + O_SYNC by write_fsync
> 97 114 126 112 125
> (7) O_SYNC case
> 182 103 41 50 74
I posted a patch to PATCHES. Please try it out.
It does write() smoothly, but fsync() in one burst.
I suppose the result will be between (3) and (5).
Inaam Rana
EnterpriseDB http://www.enterprisedb.com
"Inaam Rana" <inaamrana@gmail.com> wrote: > Did you had a chance to look into this any further? We, at EnterpriseDB, > have done some testing on this patch (dbt2 runs) and it looks like we are > getting the desired results, particularly so when we spread out both sync > and write phases. Thank you for testing! Yes, I'm cleaning the patch. I changed configuration parameters to delay each phase in checkpoints from setting absolute times (checkpoint_xxx_duration) to setting relative to checkpoint_timeout (checkpoint_xxx_percent). Delay factors strongly depend on total checkpoint lengths, so I think the relative method is better. I tested the patch on my machine, too. The belows are the results of pgbench with/without the patch. As you see, the patch is not a complete solution -- 12s of response time is not satisfiable yet -- but it was better than other possible settings we could choose now. pgbench -s100 -c16 -t100000 on the machine with 1GB ram and one SCSI drive | A | B | C | D | --------------------------+--------+--------+--------+--------+bgwriter_all_maxpages | 5 | 5 | 60 | 120 |checkpoint_write_percent | 50.0 | 0 | 0 | 0 |checkpoint_nap_percent | 10.0 | 0 | 0 | 0 |checkpoint_sync_percent | 20.0 | 0 | 0 | 0 | --------------------------+--------+--------+--------+--------+pgbench tps | 612.23 | 517.64 | 488.90 | 378.39|response average | 2.50ms | 2.89ms | 3.12ms | 3.99ms |response maximum | 12.23s |123.66s | 55.09s| 36.72s | --------------------------+--------+--------+--------+--------+ A: Proposed configurations with patch B: Default configurations C: Make bgwriter aggressive D: Make bgwriter more aggressive Other configurations: - full_page_writes = off - wal_buffers = 4MB - checkpoint_segments = 32 - checkpoint_timeout = 15min Regards, --- ITAGAKI Takahiro NTT Open Source Software Center
Itagaki, > Thank you for testing! Yes, I'm cleaning the patch. I changed > configuration parameters to delay each phase in checkpoints from setting > absolute times (checkpoint_xxx_duration) to setting relative to > checkpoint_timeout (checkpoint_xxx_percent). Delay factors strongly > depend on total checkpoint lengths, so I think the relative method is > better. Can I have a copy of the patch to add to the Sun testing queue? -- --Josh Josh Berkus PostgreSQL @ Sun San Francisco
Josh Berkus <josh@agliodbs.com> wrote: > Can I have a copy of the patch to add to the Sun testing queue? This is the revised version of the patch. Delay factors in checkpoints can be specified by checkpoint_write_percent, checkpoint_nap_percent and checkpoint_sync_percent. They are relative to checkpoint_timeout. Also, checking of archive_timeout during checkpoints and some error handling routines were added. Regards, --- ITAGAKI Takahiro NTT Open Source Software Center
On 2/26/07, ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> wrote:
One of the issues we had during testing with the original patch was that db stop did not work properly. I think you coded something to perform the shutdown checkpoint immediately, but if a checkpoint is already in progress at that time, it takes its own time to complete.
Does this patch resolve that issue? Also, is it based on pg82stable or HEAD?
regards,
inaam
Josh Berkus <josh@agliodbs.com> wrote:
> Can I have a copy of the patch to add to the Sun testing queue?
This is the revised version of the patch. Delay factors in checkpoints
can be specified by checkpoint_write_percent, checkpoint_nap_percent
and checkpoint_sync_percent. They are relative to checkpoint_timeout.
Also, checking of archive_timeout during checkpoints and some error
handling routines were added.
Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
--
Inaam Rana
EnterpriseDB http://www.enterprisedb.com
"Inaam Rana" <inaamrana@gmail.com> wrote: > One of the issues we had during testing with original patch was db stop not > working properly. I think you coded something to do a stop checkpoint in > immediately but if a checkpoint is already in progress at that time, it > would take its own time to complete. > Does this patch resolve that issue? Yes, I fixed the problem. If a checkpoint by user SQL or shutdown is waiting during an automatic checkpoint, the running checkpoint will be done without any delays. At the worst case, you have to wait two checkpoints, (a running automatic checkpoint and your explicit request) but nothing more of them. > Also, is it based on pg82stable or HEAD? It's based on HEAD. Regards, --- ITAGAKI Takahiro NTT Open Source Software Center