Thread: Controlling Load Distributed Checkpoints
I'm again looking at the way the GUC variables work in the load distributed checkpoints patch. We've discussed them a lot already, but I don't think they're quite right yet.

Write phase
-----------

I like the way the write phase is controlled in general. Writes are throttled so that we spend the specified percentage of the checkpoint interval doing the writes. But we always write at a specified minimum rate to avoid spreading out the writes unnecessarily when there's little work to do.

The original patch uses bgwriter_all_max_pages to set the minimum rate. I think we should have a separate variable, checkpoint_write_min_rate, in KB/s, instead.

Nap phase
---------

This is trickier. The purpose of the sleep between writes and fsyncs is to give the OS a chance to flush the pages to disk at its own pace, hopefully limiting the effect on concurrent activity. The sleep shouldn't last too long, because any concurrent activity can be dirtying and writing more pages, and we might end up fsyncing more than necessary, which is bad for performance. The optimal delay depends on many factors, but I believe it's somewhere between 0-30 seconds on any reasonable system.

In the current patch, the duration of the sleep between the write and sync phases is controlled as a percentage of the checkpoint interval. Given that the optimal delay is in the range of seconds, and checkpoint_timeout can be up to 60 minutes, the useful values of that percentage would be very small, like 0.5% or even less. Furthermore, the optimal value doesn't depend that much on the checkpoint interval; it depends more on your OS and memory configuration. We should therefore give the delay as a number of seconds instead of as a percentage of the checkpoint interval.

Sync phase
----------

This is also tricky. As with the nap phase, we don't want to spend too much time fsyncing, because concurrent activity will write more dirty pages and we might just end up doing more work. And we don't know how much work an fsync performs. The patch uses the file size as a measure of that, but as we discussed, that doesn't necessarily have anything to do with reality: fsyncing a 1GB file with one dirty block isn't any more expensive than fsyncing a file with a single block.

Another problem is the granularity of an fsync. If we fsync a 1GB file that's full of dirty pages, we can't limit the effect on other activity. The best we can do is to sleep between fsyncs, but sleeping more than a few seconds is hardly going to be useful, no matter how bad an I/O storm each fsync causes.

Because of the above, I'm thinking we should ditch the checkpoint_sync_percentage variable, in favor of:

checkpoint_fsync_period   # duration of the fsync phase, in seconds
checkpoint_fsync_delay    # max. sleep between fsyncs, in milliseconds

In all phases, the normal bgwriter activities are performed: lru-cleaning and switching xlog segments if archive_timeout expires. If a new checkpoint request arrives while the previous one is still in progress, we skip all the delays and finish the previous checkpoint as soon as possible.

GUC summary and suggested default values
----------------------------------------

checkpoint_write_percent = 50     # % of checkpoint interval to spread out writes
checkpoint_write_min_rate = 1000  # minimum I/O rate to write dirty buffers at checkpoint (KB/s)
checkpoint_nap_duration = 2       # delay between write and sync phase, in seconds
checkpoint_fsync_period = 30      # duration of the sync phase, in seconds
checkpoint_fsync_delay = 500      # max. delay between fsyncs, in milliseconds

I don't like adding that many GUC variables, but I don't really see a way to tune them automatically. Maybe we could just hard-code the last one, it doesn't seem that critical, but that still leaves us 4 variables.

Thoughts?

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
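[As a rough illustration of the write-phase throttling described above, here is a minimal sketch; the write_next_dirty_buffer() helper and the GUC accesses are made-up names, not identifiers from the actual patch. It picks the higher of the rate needed to finish within the write budget and checkpoint_write_min_rate, and sleeps between page writes to hold that rate.]

    /* Sketch only: write_next_dirty_buffer() and the GUC variables are
     * hypothetical names, not code from the patch. */
    static void
    checkpoint_write_phase(int dirty_pages, double interval_secs)
    {
        double  budget_secs = interval_secs * checkpoint_write_percent / 100.0;
        double  needed_kbs  = dirty_pages * (BLCKSZ / 1024.0) / budget_secs;
        double  rate_kbs    = Max(needed_kbs, checkpoint_write_min_rate);
        long    delay_usec  = (long) ((BLCKSZ / 1024.0) / rate_kbs * 1000000.0);
        int     i;

        for (i = 0; i < dirty_pages; i++)
        {
            write_next_dirty_buffer();  /* write one dirty page */
            pg_usleep(delay_usec);      /* throttle to the chosen I/O rate */
        }
    }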
"Heikki Linnakangas" <heikki@enterprisedb.com> writes: > GUC summary and suggested default values > ---------------------------------------- > checkpoint_write_percent = 50 # % of checkpoint interval to spread out writes > checkpoint_write_min_rate = 1000 # minimum I/O rate to write dirty > buffers at checkpoint (KB/s) I don't understand why this is a min_rate rather than a max_rate. > checkpoint_nap_duration = 2 # delay between write and sync phase, in seconds Not a comment on the choice of guc parameters, but don't we expect useful values of this to be much closer to 30 than 0? I understand it might not be exactly 30. Actually, it's not so much whether there's any write traffic to the data files during the nap that matters, it's whether there's more traffic during the nap than during the 30s or so prior to the nap. As long as it's a steady-state condition it shouldn't matter how long we wait, should it? > checkpoint_fsync_period = 30 # duration of the sync phase, in seconds > checkpoint_fsync_delay = 500 # max. delay between fsyncs -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki@enterprisedb.com> writes:

> GUC summary and suggested default values
> ----------------------------------------
> checkpoint_write_percent = 50     # % of checkpoint interval to spread out writes
> checkpoint_write_min_rate = 1000  # minimum I/O rate to write dirty buffers at checkpoint (KB/s)
> checkpoint_nap_duration = 2       # delay between write and sync phase, in seconds
> checkpoint_fsync_period = 30      # duration of the sync phase, in seconds
> checkpoint_fsync_delay = 500      # max. delay between fsyncs

> I don't like adding that many GUC variables, but I don't really see a
> way to tune them automatically.

If we don't know how to tune them, how will the users know? Having to add that many variables to control one feature says to me that we don't understand the feature. Perhaps what we need is to think about how it can auto-tune itself.

			regards, tom lane
On Wed, 6 Jun 2007, Tom Lane wrote:

> If we don't know how to tune them, how will the users know?

I can tell you a good starting set for them to use on a Linux system, but you first have to let me know how much memory is in the OS buffer cache, the typical I/O rate the disks can support, how many buffers are expected to be written out by BGW/other backends at heaviest load, and the current setting for /proc/sys/vm/dirty_background_ratio. It's not a coincidence that there are patches applied to 8.3 or in the queue to measure all of the Postgres internals involved in that computation; I've been picking away at the edges of this problem. Getting this sort of tuning right takes that level of information about the underlying system.

If there's a way to internally auto-tune the values this patch operates on (which I haven't found despite months of trying), it would be in the form of some sort of measurement/feedback loop based on how fast data is being written out. There really are way too many things involved to try and tune it based on anything else; the underlying OS/hardware mechanisms that determine how this will go are complicated enough that it might as well be a black box for most people.

One of the things I've been fiddling with the design of is a testing program that simulates database activity at checkpoint time under load. I think running some tests like that is the most straightforward way to generate useful values for these tunables; it's much harder to try and determine them from within the backends because there's so much going on to keep track of.

I view the LDC mechanism as being in the same state right now as the background writer: there are a lot of complicated knobs to tweak, they all do *something* useful for someone, and eliminating them will require a data-collection process across a much wider sample of data than can be collected quickly. If I had to guess how this will end up, I'd expect there to be more knobs in LDC than everyone would like for the 8.3 release, along with fairly verbose logging of what is happening at checkpoint time (that's why I've been nudging development in that area, along with making logs easier to aggregate). Collect up enough of that information, and then you're in a position to talk about useful automatic tuning--right around the 8.4 timeframe, I suspect.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
On Wed, 6 Jun 2007, Heikki Linnakangas wrote:

> The original patch uses bgwriter_all_max_pages to set the minimum rate. I
> think we should have a separate variable, checkpoint_write_min_rate, in KB/s,
> instead.

Completely agreed. There shouldn't be any coupling with the background writer parameters, which may be set for a completely different set of priorities than the checkpoint has. I have to look at this code again to see why it's a min_rate instead of a max; that seems a little weird.

> Nap phase: We should therefore give the delay as a number of seconds
> instead of as a percentage of checkpoint interval.

Again, the setting here should be completely decoupled from another GUC like the interval. My main complaint with the original form of this patch was how much it tried to synchronize the process with the interval; since I don't even have a system where that value is set to something, because it's all segment based instead, that whole idea was incompatible. The original patch tried to spread the load out as evenly as possible over the time available. I much prefer thinking in terms of getting it done as quickly as possible while trying to bound the I/O storm.

> And we don't know how much work an fsync performs. The patch uses the file
> size as a measure of that, but as we discussed that doesn't necessarily have
> anything to do with reality. fsyncing a 1GB file with one dirty block isn't
> any more expensive than fsyncing a file with a single block.

On top of that, if you have a system with a write cache, the time an fsync takes can greatly depend on how full it is at the time, which there is no way to measure or even model easily.

Is there any way to track how many dirty blocks went into each file during the checkpoint write? That's your best bet for guessing how long the fsync will take.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
Greg Smith wrote:
> On Wed, 6 Jun 2007, Heikki Linnakangas wrote:
>
>> The original patch uses bgwriter_all_max_pages to set the minimum
>> rate. I think we should have a separate variable,
>> checkpoint_write_min_rate, in KB/s, instead.
>
> Completely agreed. There shouldn't be any coupling with the background
> writer parameters, which may be set for a completely different set of
> priorities than the checkpoint has. I have to look at this code again
> to see why it's a min_rate instead of a max; that seems a little weird.

It's a min rate because it never writes slower than that, and it can write faster if the next checkpoint is due so soon that we wouldn't otherwise finish before it's time to start the next one. (Or to be precise, before the next checkpoint is closer than 100-(checkpoint_write_percent)% of the checkpoint interval.)

>> Nap phase: We should therefore give the delay as a number of seconds
>> instead of as a percentage of checkpoint interval.
>
> Again, the setting here should be completely decoupled from another GUC
> like the interval. My main complaint with the original form of this
> patch was how much it tried to synchronize the process with the interval;
> since I don't even have a system where that value is set to something,
> because it's all segment based instead, that whole idea was incompatible.

checkpoint_segments is taken into account as well as checkpoint_timeout. I used the term "checkpoint interval" to mean the real interval at which the checkpoints occur, whether it's because of segments or timeout.

> The original patch tried to spread the load out as evenly as possible
> over the time available. I much prefer thinking in terms of getting it
> done as quickly as possible while trying to bound the I/O storm.

Yeah, the checkpoint_min_rate allows you to do that. So there's two extreme ways you can use LDC:

1. Finish the checkpoint as soon as possible, without disturbing other activity too much. Set checkpoint_write_percent to a high number, and set checkpoint_min_rate to define "too much".

2. Disturb other activity as little as possible, as long as the checkpoint finishes in a reasonable time. Set checkpoint_min_rate to a low number, and checkpoint_write_percent to define "reasonable time".

Are both interesting use cases, or is it enough to cater for just one of them? I think 2 is easier to tune. Defining the min_rate properly can be difficult and depends a lot on your hardware and application, but a default value of say 50% for checkpoint_write_percent to tune for use case 2 should work pretty well for most people.

In any case, the checkpoint had better finish before it's time to start another one. Or would you rather delay the next checkpoint, and let the checkpoint take as long as it takes to finish at the min_rate?

>> And we don't know how much work an fsync performs. The patch uses the
>> file size as a measure of that, but as we discussed that doesn't
>> necessarily have anything to do with reality. fsyncing a 1GB file with
>> one dirty block isn't any more expensive than fsyncing a file with a
>> single block.
>
> On top of that, if you have a system with a write cache, the time an
> fsync takes can greatly depend on how full it is at the time, which
> there is no way to measure or even model easily.
>
> Is there any way to track how many dirty blocks went into each file
> during the checkpoint write? That's your best bet for guessing how long
> the fsync will take.

I suppose it's possible, but the OS has hopefully started flushing them to disk almost as soon as we started the writes, so even that isn't a very good measure.

On a Linux system, one way to model it is that the OS flushes dirty buffers to disk at the same rate as we write them, but delayed by dirty_expire_centisecs. That should hold if the writes are spread out enough. Then the amount of dirty buffers in the OS cache at the end of the write phase is roughly constant, as long as the write phase lasts longer than dirty_expire_centisecs. If we take a nap of dirty_expire_centisecs after the write phase, the fsyncs should be effectively no-ops, except that they will flush any other writes the bgwriter lru-sweep and other backends performed during the nap.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
On a fine day, Wed, 2007-06-06 at 11:03, Tom Lane wrote:
> Heikki Linnakangas <heikki@enterprisedb.com> writes:
> > GUC summary and suggested default values
> > ----------------------------------------
> > checkpoint_write_percent = 50     # % of checkpoint interval to spread out writes
> > checkpoint_write_min_rate = 1000  # minimum I/O rate to write dirty buffers at checkpoint (KB/s)
> > checkpoint_nap_duration = 2       # delay between write and sync phase, in seconds
> > checkpoint_fsync_period = 30      # duration of the sync phase, in seconds
> > checkpoint_fsync_delay = 500      # max. delay between fsyncs
>
> > I don't like adding that many GUC variables, but I don't really see a
> > way to tune them automatically.
>
> If we don't know how to tune them, how will the users know?

He talked about doing it _automatically_. If the knobs are available, it will be possible to determine "good" values even by brute-force performance testing, given enough time and manpower.

> Having to
> add that many variables to control one feature says to me that we don't
> understand the feature.

The feature has lots of complex dependencies on things outside postgres, so learning to understand it takes time. Having the knobs available helps, as more people are willing to do turn-the-knobs-and-test vs. recompile-and-test.

> Perhaps what we need is to think about how it can auto-tune itself.

Sure.

-------------------
Hannu Krosing
Thinking about this whole idea a bit more, it occurred to me that the current approach of write all, then fsync all, is really a historical artifact of the fact that we used to use the system-wide sync call instead of fsyncs to flush the pages to disk. That might not be the best way to do things in the new load-distributed-checkpoint world.

How about interleaving the writes with the fsyncs?

1. Scan all shared buffers, and build a list of all files with dirty pages, and the buffers belonging to them

2. foreach(file in list)
   {
       foreach(buffer belonging to file)
       {
           write();
           sleep();    /* to throttle the I/O rate */
       }
       sleep();        /* to give the OS a chance to flush the writes at its own pace */
       fsync();
   }

This would spread out the fsyncs in a natural way, making the knob to control the duration of the sync phase unnecessary.

At some point we'll also need to fsync all files that have been modified since the last checkpoint, but don't have any dirty buffers in the buffer cache. I think it's a reasonable assumption that fsyncing those files doesn't generate a lot of I/O. Since the writes were made some time ago, the OS has likely already flushed them to disk.

Doing the 1st phase of just scanning the buffers to see which ones are dirty also effectively implements the optimization of not writing buffers that were dirtied after the checkpoint start. And grouping the writes per file gives the OS a better chance to group the physical writes.

One problem is that currently the segmentation of relations into 1GB files is handled at a low level inside md.c, and we don't really have any visibility into that in the buffer manager. ISTM that some changes to the smgr interfaces would be needed for this to work well, though just doing it on a relation per relation basis would also be better than the current approach.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
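[To make the per-file grouping above a little more concrete, here is a sketch of the loop in C; the FileWrites structure, the write_buffer() helper, and the delay variables are all assumed for illustration and are not taken from any posted patch.]

    /* Sketch only: one entry per file that has dirty buffers. */
    typedef struct
    {
        int     fd;         /* open descriptor for this 1GB segment */
        int     nbuffers;   /* number of dirty buffers in this file */
        int    *buf_ids;    /* the dirty buffers themselves */
    } FileWrites;

    static void
    checkpoint_interleaved(FileWrites *files, int nfiles,
                           long write_delay_usec, long nap_usec)
    {
        int     f, i;

        for (f = 0; f < nfiles; f++)
        {
            for (i = 0; i < files[f].nbuffers; i++)
            {
                write_buffer(files[f].buf_ids[i]);  /* hypothetical helper */
                pg_usleep(write_delay_usec);        /* throttle the I/O rate */
            }
            pg_usleep(nap_usec);    /* let the OS flush at its own pace */
            fsync(files[f].fd);     /* sync just this file before moving on */
        }
    }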
Heikki Linnakangas <heikki@enterprisedb.com> writes: > Thinking about this whole idea a bit more, it occured to me that the > current approach to write all, then fsync all is really a historical > artifact of the fact that we used to use the system-wide sync call > instead of fsyncs to flush the pages to disk. That might not be the best > way to do things in the new load-distributed-checkpoint world. > How about interleaving the writes with the fsyncs? I don't think it's a historical artifact at all: it's a valid reflection of the fact that we don't know enough about disk layout to do low-level I/O scheduling. Issuing more fsyncs than necessary will do little except guarantee a less-than-optimal scheduling of the writes. regards, tom lane
Tom Lane wrote: > Heikki Linnakangas <heikki@enterprisedb.com> writes: >> Thinking about this whole idea a bit more, it occured to me that the >> current approach to write all, then fsync all is really a historical >> artifact of the fact that we used to use the system-wide sync call >> instead of fsyncs to flush the pages to disk. That might not be the best >> way to do things in the new load-distributed-checkpoint world. > >> How about interleaving the writes with the fsyncs? > > I don't think it's a historical artifact at all: it's a valid reflection > of the fact that we don't know enough about disk layout to do low-level > I/O scheduling. Issuing more fsyncs than necessary will do little > except guarantee a less-than-optimal scheduling of the writes. I'm not proposing to issue any more fsyncs. I'm proposing to change the ordering so that instead of first writing all dirty buffers and then fsyncing all files, we'd write all buffers belonging to a file, fsync that file only, then write all buffers belonging to next file, fsync, and so forth. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki@enterprisedb.com> writes: > Tom Lane wrote: >> I don't think it's a historical artifact at all: it's a valid reflection >> of the fact that we don't know enough about disk layout to do low-level >> I/O scheduling. Issuing more fsyncs than necessary will do little >> except guarantee a less-than-optimal scheduling of the writes. > I'm not proposing to issue any more fsyncs. I'm proposing to change the > ordering so that instead of first writing all dirty buffers and then > fsyncing all files, we'd write all buffers belonging to a file, fsync > that file only, then write all buffers belonging to next file, fsync, > and so forth. But that means that the I/O to different files cannot be overlapped by the kernel, even if it would be more efficient to do so. regards, tom lane
Tom Lane wrote:
> Heikki Linnakangas <heikki@enterprisedb.com> writes:
>> Tom Lane wrote:
>>> I don't think it's a historical artifact at all: it's a valid reflection
>>> of the fact that we don't know enough about disk layout to do low-level
>>> I/O scheduling. Issuing more fsyncs than necessary will do little
>>> except guarantee a less-than-optimal scheduling of the writes.
>
>> I'm not proposing to issue any more fsyncs. I'm proposing to change the
>> ordering so that instead of first writing all dirty buffers and then
>> fsyncing all files, we'd write all buffers belonging to a file, fsync
>> that file only, then write all buffers belonging to the next file, fsync,
>> and so forth.
>
> But that means that the I/O to different files cannot be overlapped by
> the kernel, even if it would be more efficient to do so.

True. On the other hand, if we issue writes in essentially random order, we might fill the kernel buffers with random blocks and the kernel needs to flush them to disk as almost random I/O. If we did the writes in groups, the kernel has a better chance at coalescing them.

I tend to agree that if the goal is to finish the checkpoint as quickly as possible, the current approach is better. In the context of load distributed checkpoints, however, it's unlikely the kernel can do any significant overlapping since we're trickling the writes anyway.

Do we need both strategies?

I'm starting to feel we should give up on smoothing the fsyncs and distribute the writes only, for 8.3. As we get more experience with that and its shortcomings, we can enhance our checkpoints further in 8.4.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
On Thu, 7 Jun 2007, Heikki Linnakangas wrote:

> So there's two extreme ways you can use LDC:
> 1. Finish the checkpoint as soon as possible, without disturbing other
> activity too much
> 2. Disturb other activity as little as possible, as long as the
> checkpoint finishes in a reasonable time.
> Are both interesting use cases, or is it enough to cater for just one of
> them? I think 2 is easier to tune.

The motivation for the (1) case is that you've got a system that's dirtying the buffer cache very fast in normal use, where even the background writer is hard pressed to keep the buffer pool clean. The checkpoint is the most powerful and efficient way to clean up many dirty buffers out of such a buffer cache in a short period of time so that you're back to having room to work in again. In that situation, since there are many buffers to write out, you'll also be suffering greatly from fsync pauses. Being able to synchronize writes a little better with the underlying OS to smooth those out is a huge help.

I'm completely biased because of the workloads I've been dealing with recently, but I consider (2) so much easier to tune for that it's barely worth worrying about. If your system is so underloaded that you can let the checkpoints take their own sweet time, I'd ask if you have enough going on that you're suffering very much from checkpoint performance issues anyway. I'm used to being in a situation where if you don't push out checkpoint data as fast as physically possible, you end up fighting with the client backends for write bandwidth once the LRU point moves past where the checkpoint has written out to already. I'm not sure how much always running the LRU background writer will improve that situation.

> On a Linux system, one way to model it is that the OS flushes dirty buffers
> to disk at the same rate as we write them, but delayed by
> dirty_expire_centisecs. That should hold if the writes are spread out enough.

If they're really spread out, sure. There is congestion avoidance code inside the Linux kernel that makes dirty_expire_centisecs not quite work the way it is described under load. All you can say in the general case is that when dirty_expire_centisecs has passed, the kernel badly wants to write the buffers out as quickly as possible; that could still be many seconds after the expiration time on a busy system, or on one with slow I/O.

On every system I've ever played with Postgres write performance on, I discovered that the memory-based parameters like dirty_background_ratio were really driving write behavior, and I almost ignore the expire timeout now. Plotting the "Dirty:" value in /proc/meminfo as you're running tests is extremely informative for figuring out what Linux is really doing underneath the database writes.

The influence of the congestion code is why I made the comment about watching how long writes are taking to gauge how fast you can dump data onto the disks. When you're suffering from one of the congestion mechanisms, the initial writes start blocking, even before the fsync. That behavior is almost undocumented outside of the relevant kernel source code.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
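[For anyone who wants to follow that /proc/meminfo suggestion, a trivial stand-alone monitor like the following is enough to log the Dirty: line once a second while a benchmark runs. This is my own sketch, not a tool posted in this thread; Linux only.]

    /* dirtymon.c -- print the kernel's Dirty: counter once per second. */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int
    main(void)
    {
        char    line[128];

        for (;;)
        {
            FILE   *f = fopen("/proc/meminfo", "r");

            if (f == NULL)
                return 1;
            while (fgets(line, sizeof(line), f) != NULL)
            {
                if (strncmp(line, "Dirty:", 6) == 0)
                    fputs(line, stdout);    /* e.g. "Dirty:   123456 kB" */
            }
            fclose(f);
            fflush(stdout);
            sleep(1);
        }
    }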
"Greg Smith" <gsmith@gregsmith.com> writes: > I'm completely biased because of the workloads I've been dealing with recently, > but I consider (2) so much easier to tune for that it's barely worth worrying > about. If your system is so underloaded that you can let the checkpoints take > their own sweet time, I'd ask if you have enough going on that you're suffering > very much from checkpoint performance issues anyway. I'm used to being in a > situation where if you don't push out checkpoint data as fast as physically > possible, you end up fighting with the client backends for write bandwidth once > the LRU point moves past where the checkpoint has written out to already. I'm > not sure how much always running the LRU background writer will improve that > situation. I think you're working from a faulty premise. There's no relationship between the volume of writes and how important the speed of checkpoint is. In either scenario you should assume a system that is close to the max i/o bandwidth. The only question is which task the admin would prefer take the hit for maxing out the bandwidth, the transactions or the checkpoint. You seem to have imagined that letting the checkpoint take longer will slow down transactions. In fact that's precisely the effect we're trying to avoid. Right now we're seeing tests where Postgres stops handling *any* transactions for up to a minute. In virtually any real world scenario that would simply be unacceptable. That one-minute outage is a direct consequence of trying to finish the checkpoint as quick as possible. If we spread it out then it might increase the average i/o load if you sum it up over time, but then you just need a faster i/o controller. The only scenario where you would prefer the absolute lowest i/o rate summed over time would be if you were close to maxing out your i/o bandwidth, couldn't buy a faster controller, and response time was not a factor, only sheer volume of transactions processed mattered. That's a much less common scenario than caring about the response time. The flip side of having to worry about response time buying a faster controller doesn't even help. It would shorten the duration of the checkpoint but not eliminate it. A 30-second outage every half hour is just as unacceptable as a 1-minute outage every half hour. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
On Thu, 7 Jun 2007, Gregory Stark wrote:

> You seem to have imagined that letting the checkpoint take longer will slow
> down transactions.

And you seem to have imagined that I have so much spare time that I'm just making stuff up to entertain myself and sow confusion. I observed some situations where delaying checkpoints too long ends up slowing down both transaction rate and response time, using earlier variants of the LDC patch and code with similar principles I wrote. I'm trying to keep the approach used here out of the worst of the corner cases I ran into, or at least to make it possible for people in those situations to have some ability to tune out of the bad spots. I am unfortunately not free to disclose all those test results, and since that project is over I can't see how the current LDC compares to what I tested at the time.

I plainly stated I had a bias here, one that's not even close to the average case. My concern here was that Heikki would end up optimizing in a direction where a really wide spread across the active checkpoint interval was strongly preferred. I wanted to offer some suggestions on the type of situation where that might not be true, but where a different tuning of LDC would still be an improvement over the current behavior. There are some tuning knobs there that I don't want to see go away until there's been a wider range of tests to prove they aren't effective.

> Right now we're seeing tests where Postgres stops handling *any* transactions
> for up to a minute. In virtually any real world scenario that would simply be
> unacceptable.

No doubt; I've seen things get close to that bad myself, both on the high and low end. I collided with the issue in a situation of "maxing out your i/o bandwidth, couldn't buy a faster controller" at one point, which is what kicked off my working in this area. It turned out there were still some software tunables left that pulled the worst case down to the 2-5 second range instead. With more checkpoint_segments to decrease the frequency, that was just enough to make the problem annoying rather than crippling. But after that, I could easily imagine a different application scenario where the behavior you describe is the best case.

This is really a serious issue with the current design of the database, one that merely changes instead of going away completely if you throw more hardware at it. I'm perversely glad to hear this is torturing more people than just me, as it improves the odds the situation will improve.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
> This is really a serious issue with the current design of the database,
> one that merely changes instead of going away completely if you throw
> more hardware at it. I'm perversely glad to hear this is torturing more
> people than just me as it improves the odds the situation will improve.

It tortures pretty much any high velocity postgresql db, of which there are more and more every day.

Joshua D. Drake

--
=== The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive PostgreSQL solutions since 1997
http://www.commandprompt.com/
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL Replication: http://www.commandprompt.com/products/
All, This brings up another point. With the increased number of .conf options, the file is getting hard to read again. I'd like to do another reorganization, but I don't really want to break people's diff scripts. Should I worry about that? --Josh
Thread: Re: .conf File Organization WAS: Controlling Load Distributed Checkpoints
From: "Joshua D. Drake"
Josh Berkus wrote:
> All,
>
> This brings up another point. With the increased number of .conf
> options, the file is getting hard to read again. I'd like to do another
> reorganization, but I don't really want to break people's diff scripts.
> Should I worry about that?

As a point of feedback, autovacuum and vacuum should be together.

Joshua D. Drake

--
=== The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive PostgreSQL solutions since 1997
http://www.commandprompt.com/
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL Replication: http://www.commandprompt.com/products/
Josh Berkus <josh@agliodbs.com> writes: > This brings up another point. With the increased number of .conf > options, the file is getting hard to read again. I'd like to do another > reorganization, but I don't really want to break people's diff scripts. Do you have a better organizing principle than what's there now? regards, tom lane
Greg Smith wrote:
> On Thu, 7 Jun 2007, Heikki Linnakangas wrote:
>
>> So there's two extreme ways you can use LDC:
>> 1. Finish the checkpoint as soon as possible, without disturbing other
>> activity too much
>> 2. Disturb other activity as little as possible, as long as the
>> checkpoint finishes in a reasonable time.
>> Are both interesting use cases, or is it enough to cater for just one
>> of them? I think 2 is easier to tune.
>
> The motivation for the (1) case is that you've got a system that's
> dirtying the buffer cache very fast in normal use, where even the
> background writer is hard pressed to keep the buffer pool clean. The
> checkpoint is the most powerful and efficient way to clean up many dirty
> buffers out of such a buffer cache in a short period of time so that
> you're back to having room to work in again. In that situation, since
> there are many buffers to write out, you'll also be suffering greatly
> from fsync pauses. Being able to synchronize writes a little better
> with the underlying OS to smooth those out is a huge help.

ISTM the bgwriter just isn't working hard enough in that scenario. Assuming we get the lru autotuning patch in 8.3, do you think there's still merit in using the checkpoints that way?

> I'm completely biased because of the workloads I've been dealing with
> recently, but I consider (2) so much easier to tune for that it's barely
> worth worrying about. If your system is so underloaded that you can let
> the checkpoints take their own sweet time, I'd ask if you have enough
> going on that you're suffering very much from checkpoint performance
> issues anyway. I'm used to being in a situation where if you don't push
> out checkpoint data as fast as physically possible, you end up fighting
> with the client backends for write bandwidth once the LRU point moves
> past where the checkpoint has written out to already. I'm not sure how
> much always running the LRU background writer will improve that situation.

I'd think it eliminates the problem. Assuming we keep the LRU cleaning running as usual, I don't see how writing faster during checkpoints could ever be beneficial for concurrent activity. The more you write, the less bandwidth there's available for others. Doing the checkpoint as quickly as possible might be slightly better for average throughput, but that's a different matter.

> On every system I've ever played with Postgres write performance on, I
> discovered that the memory-based parameters like dirty_background_ratio
> were really driving write behavior, and I almost ignore the expire
> timeout now. Plotting the "Dirty:" value in /proc/meminfo as you're
> running tests is extremely informative for figuring out what Linux is
> really doing underneath the database writes.

Interesting. I haven't touched any of the kernel parameters yet in my tests. It seems we need to try different parameters and see how the dynamics change. But we must also keep in mind that the average DBA doesn't change any settings, and might not even be able or allowed to. That means the defaults should work reasonably well without tweaking the OS settings.

> The influence of the congestion code is why I made the comment about
> watching how long writes are taking to gauge how fast you can dump data
> onto the disks. When you're suffering from one of the congestion
> mechanisms, the initial writes start blocking, even before the fsync.
> That behavior is almost undocumented outside of the relevant kernel
> source code.

Yeah, that's controlled by dirty_ratio, if I've understood the parameters correctly. If we spread out the writes enough, we shouldn't hit that limit or congestion. That's the point of the patch.

Do you have time / resources to do testing? You've clearly spent a lot of time on this, and I'd be very interested to see some actual numbers from your tests with various settings.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
On Fri, Jun 08, 2007 at 09:50:49AM +0100, Heikki Linnakangas wrote: > dynamics change. But we must also keep in mind that average DBA doesn't > change any settings, and might not even be able or allowed to. That > means the defaults should work reasonably well without tweaking the OS > settings. Do you mean "change the OS settings" or something else? (I'm not sure it's true in any case, because shared memory kernel settings have to be fiddled with in many instances, but I thought I'd ask for clarification.) A -- Andrew Sullivan | ajs@crankycanuck.ca Users never remark, "Wow, this software may be buggy and hard to use, but at least there is a lot of code underneath." --Damien Katz
Andrew Sullivan wrote: > On Fri, Jun 08, 2007 at 09:50:49AM +0100, Heikki Linnakangas wrote: > >> dynamics change. But we must also keep in mind that average DBA doesn't >> change any settings, and might not even be able or allowed to. That >> means the defaults should work reasonably well without tweaking the OS >> settings. > > Do you mean "change the OS settings" or something else? (I'm not > sure it's true in any case, because shared memory kernel settings > have to be fiddled with in many instances, but I thought I'd ask for > clarification.) Yes, that's what I meant. An average DBA is not likely to change OS settings. You're right on the shmmax setting, though. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Fri, 8 Jun 2007, Andrew Sullivan wrote: > Do you mean "change the OS settings" or something else? (I'm not > sure it's true in any case, because shared memory kernel settings > have to be fiddled with in many instances, but I thought I'd ask for > clarification.) In a situation where a hosting provider of some sort is providing PostgreSQL, they should know that parameters like SHMMAX need to be increased before customers can create a larger installation. You'd expect they'd take care of that as part of routine server setup. What wouldn't be reasonable is to expect them to tune obscure parts of the kernel just for your application. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
On Fri, Jun 08, 2007 at 10:33:50AM -0400, Greg Smith wrote: > they'd take care of that as part of routine server setup. What wouldn't > be reasonable is to expect them to tune obscure parts of the kernel just > for your application. Well, I suppose it'd depend on what kind of hosting environment you're in (if I'm paying for dedicated hosting, you better believe I'm going to insist they tune the kernel the way I want), but you're right that in shared hosting for $25/mo, it's not going to happen. A -- Andrew Sullivan | ajs@crankycanuck.ca "The year's penultimate month" is not in truth a good way of saying November. --H.W. Fowler
Andrew Sullivan wrote: > On Fri, Jun 08, 2007 at 10:33:50AM -0400, Greg Smith wrote: > > they'd take care of that as part of routine server setup. What wouldn't > > be reasonable is to expect them to tune obscure parts of the kernel just > > for your application. > > Well, I suppose it'd depend on what kind of hosting environment > you're in (if I'm paying for dedicated hosting, you better believe > I'm going to insist they tune the kernel the way I want), but you're > right that in shared hosting for $25/mo, it's not going to happen. And consider other operating systems that don't have the same knobs. We should tune as best we can first without kernel knobs. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On Thu, Jun 07, 2007 at 10:16:25AM -0400, Tom Lane wrote: > Heikki Linnakangas <heikki@enterprisedb.com> writes: > > Thinking about this whole idea a bit more, it occured to me that the > > current approach to write all, then fsync all is really a historical > > artifact of the fact that we used to use the system-wide sync call > > instead of fsyncs to flush the pages to disk. That might not be the best > > way to do things in the new load-distributed-checkpoint world. > > > How about interleaving the writes with the fsyncs? > > I don't think it's a historical artifact at all: it's a valid reflection > of the fact that we don't know enough about disk layout to do low-level > I/O scheduling. Issuing more fsyncs than necessary will do little > except guarantee a less-than-optimal scheduling of the writes. If we extended relations by more than 8k at a time, we would know a lot more about disk layout, at least on filesystems with a decent amount of free space. -- Jim Nasby decibel@decibel.org EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
Jim C. Nasby wrote: > On Thu, Jun 07, 2007 at 10:16:25AM -0400, Tom Lane wrote: >> Heikki Linnakangas <heikki@enterprisedb.com> writes: >>> Thinking about this whole idea a bit more, it occured to me that the >>> current approach to write all, then fsync all is really a historical >>> artifact of the fact that we used to use the system-wide sync call >>> instead of fsyncs to flush the pages to disk. That might not be the best >>> way to do things in the new load-distributed-checkpoint world. >>> How about interleaving the writes with the fsyncs? >> I don't think it's a historical artifact at all: it's a valid reflection >> of the fact that we don't know enough about disk layout to do low-level >> I/O scheduling. Issuing more fsyncs than necessary will do little >> except guarantee a less-than-optimal scheduling of the writes. > > If we extended relations by more than 8k at a time, we would know a lot > more about disk layout, at least on filesystems with a decent amount of > free space. I doubt it makes that much difference. If there was a significant amount of fragmentation, we'd hear more complaints about seq scan performance. The issue here is that we don't know which relations are on which drives and controllers, how they're striped, mirrored etc. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki@enterprisedb.com> wrote:

> True. On the other hand, if we issue writes in essentially random order,
> we might fill the kernel buffers with random blocks and the kernel needs
> to flush them to disk as almost random I/O. If we did the writes in
> groups, the kernel has better chance at coalescing them.

If the kernel can treat sequential writes better than random writes, is it worth sorting dirty buffers in block order per file at the start of checkpoints? Here is the pseudo code:

    buffers_to_be_written =
        SELECT buf_id, tag FROM BufferDescriptors
         WHERE (flags & BM_DIRTY) != 0
         ORDER BY tag.rnode, tag.blockNum;

    for { buf_id, tag } in buffers_to_be_written:
        if BufferDescriptors[buf_id].tag == tag:
            FlushBuffer(&BufferDescriptors[buf_id])

We can also avoid writing buffers newly dirtied after the checkpoint was started with this method.

> I tend to agree that if the goal is to finish the checkpoint as quickly
> as possible, the current approach is better. In the context of load
> distributed checkpoints, however, it's unlikely the kernel can do any
> significant overlapping since we're trickling the writes anyway.

Some kernels or storage subsystems treat all I/Os too fairly, so that user transactions waiting for reads are blocked by checkpoint writes. It is unavoidable behavior though, but we can split writes in small batches.

> I'm starting to feel we should give up on smoothing the fsyncs and
> distribute the writes only, for 8.3. As we get more experience with that
> and its shortcomings, we can enhance our checkpoints further in 8.4.

I agree with the writes-only distribution for 8.3. The new parameters introduced by it (checkpoint_write_percent and checkpoint_write_min_rate) will continue to be alive without major changes in the future, but other parameters seem to be volatile.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
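[Inside the bgwriter this could amount to little more than snapshotting the dirty buffer tags and running qsort() on them. The sketch below only illustrates that sorting step; the ToWrite struct and the surrounding collection code are assumptions for the example, not taken from any posted patch.]

    /* Sort a snapshot of dirty buffers by (rnode, blockNum). */
    typedef struct
    {
        int         buf_id;
        BufferTag   tag;        /* copied while holding the buffer header lock */
    } ToWrite;

    static int
    towrite_cmp(const void *a, const void *b)
    {
        const ToWrite *x = (const ToWrite *) a;
        const ToWrite *y = (const ToWrite *) b;
        int     c = memcmp(&x->tag.rnode, &y->tag.rnode, sizeof(RelFileNode));

        if (c != 0)
            return c;
        if (x->tag.blockNum < y->tag.blockNum)
            return -1;
        if (x->tag.blockNum > y->tag.blockNum)
            return 1;
        return 0;
    }

    /* ... after filling to_write[0..n-1] from BufferDescriptors: */
    qsort(to_write, n, sizeof(ToWrite), towrite_cmp);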
On Mon, 11 Jun 2007, ITAGAKI Takahiro wrote:

> If the kernel can treat sequential writes better than random writes, is
> it worth sorting dirty buffers in block order per file at the start of
> checkpoints?

I think it has the potential to improve things. There are three obvious and one subtle argument against it I can think of:

1) Extra complexity for something that may not help. This would need some good, robust benchmarking improvements to justify its use.

2) Block number ordering may not reflect actual order on disk. While true, it's got to be better correlated with it than writing at random.

3) The OS disk elevator should be dealing with this issue, particularly because it may really know the actual disk ordering.

Here's the subtle thing: by writing in the same order the LRU scan occurs in, you are writing dirty buffers in the optimal fashion to eliminate client backend writes during BufferAlloc. This makes the checkpoint a really effective LRU clearing mechanism. Writing in block order will change that.

I spent some time trying to optimize the elevator part of this operation, since I knew that on the system I was using block order was actual order. I found that under Linux, the behavior of the pdflush daemon that manages dirty memory had a more serious impact on writing behavior at checkpoint time than playing with the elevator scheduling method did. The way pdflush works actually has several interesting implications for how to optimize this patch. For example, how writes get blocked when the dirty memory reaches certain thresholds means that you may not get the full benefit of the disk elevator at checkpoint time the way most would expect.

Since much of that was basically undocumented, I had to write my own analysis of the actual workings, which is now available at http://www.westnet.com/~gsmith/content/linux-pdflush.htm I hope that anyone who wants more information about how Linux kernel parameters like dirty_background_ratio actually work, and how they impact the writing strategy, will find that article uniquely helpful.

> Some kernels or storage subsystems treat all I/Os too fairly so that
> user transactions waiting for reads are blocked by checkpoint writes.

In addition to that (which I've seen happen quite a bit), in the Linux case another fairness issue is that the code that handles writes allows a single process writing a lot of data to block writes for everyone else. That means that in addition to being blocked on actual reads, if a client backend starts a write in order to complete a buffer allocation to hold new information, that can grind to a halt because of the checkpoint process as well.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
ITAGAKI Takahiro wrote:
> Heikki Linnakangas <heikki@enterprisedb.com> wrote:
>
>> True. On the other hand, if we issue writes in essentially random order,
>> we might fill the kernel buffers with random blocks and the kernel needs
>> to flush them to disk as almost random I/O. If we did the writes in
>> groups, the kernel has better chance at coalescing them.
>
> If the kernel can treat sequential writes better than random writes,
> is it worth sorting dirty buffers in block order per file at the start
> of checkpoints? Here is the pseudo code:
>
>     buffers_to_be_written =
>         SELECT buf_id, tag FROM BufferDescriptors
>          WHERE (flags & BM_DIRTY) != 0
>          ORDER BY tag.rnode, tag.blockNum;
>
>     for { buf_id, tag } in buffers_to_be_written:
>         if BufferDescriptors[buf_id].tag == tag:
>             FlushBuffer(&BufferDescriptors[buf_id])
>
> We can also avoid writing buffers newly dirtied after the checkpoint was
> started with this method.

That's worth testing, IMO. Probably won't happen for 8.3, though.

>> I tend to agree that if the goal is to finish the checkpoint as quickly
>> as possible, the current approach is better. In the context of load
>> distributed checkpoints, however, it's unlikely the kernel can do any
>> significant overlapping since we're trickling the writes anyway.
>
> Some kernels or storage subsystems treat all I/Os too fairly so that user
> transactions waiting for reads are blocked by checkpoint writes. It is
> unavoidable behavior though, but we can split writes in small batches.

That's really the heart of our problems. If the kernel had support for prioritizing the normal backend activity and LRU cleaning over the checkpoint I/O, we wouldn't need to throttle the I/O ourselves. The kernel has the best knowledge of what it can and can't do, and how busy the I/O subsystems are. Recent Linux kernels have some support for read I/O priorities, but not for writes. I believe the best long term solution is to add that support to the kernel, but it's going to take a long time until that's universally available, and we have a lot of platforms to support.

>> I'm starting to feel we should give up on smoothing the fsyncs and
>> distribute the writes only, for 8.3. As we get more experience with that
>> and its shortcomings, we can enhance our checkpoints further in 8.4.
>
> I agree with the writes-only distribution for 8.3. The new parameters
> introduced by it (checkpoint_write_percent and checkpoint_write_min_rate)
> will continue to be alive without major changes in the future, but other
> parameters seem to be volatile.

I'm going to start testing with just distributing the writes. Let's see how far that gets us.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Tom,

> Do you have a better organizing principle than what's there now?

It's mostly detail stuff: putting VACUUM and Autovac together, breaking up some subsections that now have too many options in them into smaller groups. Client Connection Defaults has somehow become a catchall section for *any* USERSET variable, regardless of purpose. I'd like to trim it back down and assign some of those variables to appropriate sections.

On a more hypothetical basis, I was thinking of adding a section at the top with the 7-9 most common options that people *need* to set; this would make postgresql.conf much more accessible, but it would result in duplicate options, which might cause some issues.

--
Josh Berkus
PostgreSQL @ Sun
San Francisco
Josh Berkus <josh@agliodbs.com> writes: > On the more hypothetical basis I was thinking of adding a section at the top > with the 7-9 most common options that people *need* to set; this would make > PostgreSQL.conf much more accessable but would result in duplicate options > which might cause some issues. Doesn't sound like a good idea, but maybe there's a case for a comment there saying "these are the most important ones to look at"? regards, tom lane
Tom, > Doesn't sound like a good idea, but maybe there's a case for a comment > there saying "these are the most important ones to look at"? Yeah, probably need to do that. Seems user-unfriendly, but loading a foot gun by having some options appear twice in the file seems much worse. I'll also add some notes on how to set these values. -- Josh Berkus PostgreSQL @ Sun San Francisco
On Sun, Jun 10, 2007 at 08:49:24PM +0100, Heikki Linnakangas wrote:
> Jim C. Nasby wrote:
>> On Thu, Jun 07, 2007 at 10:16:25AM -0400, Tom Lane wrote:
>>> Heikki Linnakangas <heikki@enterprisedb.com> writes:
>>>> Thinking about this whole idea a bit more, it occurred to me that the
>>>> current approach to write all, then fsync all is really a historical
>>>> artifact of the fact that we used to use the system-wide sync call
>>>> instead of fsyncs to flush the pages to disk. That might not be the best
>>>> way to do things in the new load-distributed-checkpoint world.
>>>> How about interleaving the writes with the fsyncs?
>>> I don't think it's a historical artifact at all: it's a valid reflection
>>> of the fact that we don't know enough about disk layout to do low-level
>>> I/O scheduling. Issuing more fsyncs than necessary will do little
>>> except guarantee a less-than-optimal scheduling of the writes.
>>
>> If we extended relations by more than 8k at a time, we would know a lot
>> more about disk layout, at least on filesystems with a decent amount of
>> free space.
>
> I doubt it makes that much difference. If there was a significant amount
> of fragmentation, we'd hear more complaints about seq scan performance.
>
> The issue here is that we don't know which relations are on which drives
> and controllers, how they're striped, mirrored etc.

Actually, isn't pre-allocation one of the tricks that Greenplum uses to get its seqscan performance?

--
Jim Nasby decibel@decibel.org
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
Heikki Linnakangas wrote:
> Jim C. Nasby wrote:
>> On Thu, Jun 07, 2007 at 10:16:25AM -0400, Tom Lane wrote:
>>> Heikki Linnakangas <heikki@enterprisedb.com> writes:
>>>> Thinking about this whole idea a bit more, it occurred to me that the
>>>> current approach to write all, then fsync all is really a historical
>>>> artifact of the fact that we used to use the system-wide sync call
>>>> instead of fsyncs to flush the pages to disk. That might not be the
>>>> best way to do things in the new load-distributed-checkpoint world.
>>>> How about interleaving the writes with the fsyncs?
>>> I don't think it's a historical artifact at all: it's a valid reflection
>>> of the fact that we don't know enough about disk layout to do low-level
>>> I/O scheduling. Issuing more fsyncs than necessary will do little
>>> except guarantee a less-than-optimal scheduling of the writes.
>>
>> If we extended relations by more than 8k at a time, we would know a lot
>> more about disk layout, at least on filesystems with a decent amount of
>> free space.
>
> I doubt it makes that much difference. If there was a significant amount
> of fragmentation, we'd hear more complaints about seq scan performance.

OTOH, extending a relation that uses N pages by something like min(ceil(N/1024), 1024) pages might help some filesystems to avoid fragmentation, and hardly introduce any waste (about 0.1% in the worst case). So if it's not too hard to do, it might be worthwhile, even if it turns out that most filesystems deal well with the current allocation pattern.

greetings, Florian Pflug
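[Florian's growth rule is simple to express in code; the tiny sketch below is my own illustration of it, with a minimum of one added page thrown in as an extra assumption for the empty-relation case.]

    /* Pages to add when a relation currently has n pages:
     * min(ceil(n/1024), 1024), i.e. grow by roughly 0.1% of the current
     * size, capped at 1024 pages (8 MB with 8 kB blocks). */
    static unsigned int
    extension_pages(unsigned int n)
    {
        unsigned int grow = (n + 1023) / 1024;  /* ceil(n / 1024) */

        if (grow < 1)
            grow = 1;       /* assumption: always add at least one page */
        if (grow > 1024)
            grow = 1024;
        return grow;
    }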
>>> If we extended relations by more than 8k at a time, we would know a lot
>>> more about disk layout, at least on filesystems with a decent amount of
>>> free space.
>>
>> I doubt it makes that much difference. If there was a significant amount
>> of fragmentation, we'd hear more complaints about seq scan performance.
>>
>> The issue here is that we don't know which relations are on which drives
>> and controllers, how they're striped, mirrored etc.
>
> Actually, isn't pre-allocation one of the tricks that Greenplum uses to
> get its seqscan performance?

My tests here show that, at least on reiserfs, after a few hours of benchmark torture (this represents several million write queries), table files become significantly fragmented. I believe the table and index files get extended more or less simultaneously and end up somehow a bit mixed up on disk. Seq scan perf suffers. reiserfs doesn't have an excellent fragmentation behaviour... NTFS is worse than hell in this respect. So, pre-alloc could be a good idea. Brutal defrag (cp /var/lib/postgresql to somewhere and back) gets seq scan perf back to disk throughput.

Also, by the way, InnoDB uses a BTree-organized table. The advantage is that data is always clustered on the primary key (which means you have to use something as your primary key that isn't necessarily "natural", you have to choose it to get good clustering, and you can't always do it right, so it somehow, in the end, sucks rather badly). Anyway, seq scan on InnoDB is very slow because, as the btree grows (just like postgres indexes), pages are split and scanning the pages in btree order becomes a mess of seeks. So, seq scan in InnoDB is very very slow unless periodic OPTIMIZE TABLE is applied. (A caveat to the postgres TODO item "implement automatic table clustering"...)
Greg Smith <gsmith@gregsmith.com> wrote:
> On Mon, 11 Jun 2007, ITAGAKI Takahiro wrote:
>> If the kernel can treat sequential writes better than random writes, is
>> it worth sorting dirty buffers in block order per file at the start of
>> checkpoints?

I wrote and tested the attached sorted-writes patch based on Heikki's ldc-justwrites-1.patch. There was an obvious performance win on the OLTP workload.

    tests                      | pgbench | DBT-2 response time (avg/90%/max)
   ----------------------------+---------+-----------------------------------
    LDC only                   | 181 tps | 1.12 / 4.38 / 12.13 s
    + BM_CHECKPOINT_NEEDED(*)  | 187 tps | 0.83 / 2.68 / 9.26 s
    + Sorted writes            | 224 tps | 0.36 / 0.80 / 8.11 s

    (*) Don't write buffers that were dirtied after starting the checkpoint.

    machine : 2GB-ram, SCSI*4 RAID-5
    pgbench : -s400 -t40000 -c10 (about 5GB of database)
    DBT-2   : 60WH (about 6GB of database)

> I think it has the potential to improve things. There are three obvious
> and one subtle argument against it I can think of:
>
> 1) Extra complexity for something that may not help. This would need some
> good, robust benchmarking improvements to justify its use.

Exactly. I think we need a discussion board for I/O performance issues. Can I use the Developers Wiki for this purpose? Since performance graphs and result tables are important for the discussion, it might be better than the mailing lists, which are text-based.

> 2) Block number ordering may not reflect actual order on disk. While
> true, it's got to be better correlated with it than writing at random.
> 3) The OS disk elevator should be dealing with this issue, particularly
> because it may really know the actual disk ordering.

Yes, both are true. However, I think there is a pretty high correlation between those orderings. In addition, we should use the filesystem to assure those orderings correspond to each other. For example, pre-allocation of files might help us, as has often been discussed.

> Here's the subtle thing: by writing in the same order the LRU scan occurs
> in, you are writing dirty buffers in the optimal fashion to eliminate
> client backend writes during BufferAlloc. This makes the checkpoint a
> really effective LRU clearing mechanism. Writing in block order will
> change that.

The issue will probably go away after we have LDC, because it writes LRU buffers during checkpoints.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
"PFC" <lists@peufeu.com> writes: > Anyway, seq-scan on InnoDB is very slow because, as the btree grows (just > like postgres indexes) pages are split and scanning the pages in btree order > becomes a mess of seeks. So, seq scan in InnoDB is very very slow unless > periodic OPTIMIZE TABLE is applied. (caveat to the postgres TODO item > "implement automatic table clustering"...) Heikki already posted a patch which goes a long way towards implementing what I think this patch refers to: trying to maintaining the cluster ordering on updates and inserts. It does it without changing the basic table structure at all. On updates and inserts it consults the indexam of the clustered index to ask if for a suggested block. If the index's suggested block has enough free space then the tuple is put there. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
"ITAGAKI Takahiro" <itagaki.takahiro@oss.ntt.co.jp> writes: > Exactly. I think we need a discussion board for I/O performance issues. > Can I use Developers Wiki for this purpose? Since performance graphs and > result tables are important for the discussion, so it might be better > than mailing lists, that are text-based. I would suggest keeping the discussion on mail and including links to refer to charts and tables in the wiki. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
ITAGAKI Takahiro wrote:
> Greg Smith <gsmith@gregsmith.com> wrote:
>> On Mon, 11 Jun 2007, ITAGAKI Takahiro wrote:
>>> If the kernel can treat sequential writes better than random writes, is
>>> it worth sorting dirty buffers in block order per file at the start of
>>> checkpoints?
>
> I wrote and tested the attached sorted-writes patch based on Heikki's
> ldc-justwrites-1.patch. There was an obvious performance win on the OLTP
> workload.
>
>  tests                     | pgbench | DBT-2 response time (avg/90%/max)
> ---------------------------+---------+-----------------------------------
>  LDC only                  | 181 tps | 1.12 / 4.38 / 12.13 s
>  + BM_CHECKPOINT_NEEDED(*) | 187 tps | 0.83 / 2.68 /  9.26 s
>  + Sorted writes           | 224 tps | 0.36 / 0.80 /  8.11 s
>
> (*) Don't write buffers that were dirtied after starting the checkpoint.
>
> machine : 2GB-ram, SCSI*4 RAID-5
> pgbench : -s400 -t40000 -c10 (about 5GB of database)
> DBT-2   : 60WH (about 6GB of database)

Wow, I didn't expect that much gain from the sorted writes. How was LDC
configured?

>> 3) The OS disk elevator should be dealing with this issue, particularly
>> because it may really know the actual disk ordering.

Yeah, but we don't give the OS that much chance to coalesce writes when we
spread them out.

>> Here's the subtle thing: by writing in the same order the LRU scan occurs
>> in, you are writing dirty buffers in the optimal fashion to eliminate
>> client backend writes during BufferAlloc. This makes the checkpoint a
>> really effective LRU clearing mechanism. Writing in block order will
>> change that.
>
> The issue will probably go away after we have LDC, because it writes LRU
> buffers during checkpoints.

I think so too.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Thu, 14 Jun 2007, ITAGAKI Takahiro wrote:
> I think we need a discussion board for I/O performance issues. Can I use
> the Developers Wiki for this purpose? Performance graphs and result tables
> are important for the discussion, so it might work better than the
> text-based mailing lists.

I started pushing some of my stuff over there recently to make it easier to
edit, so other people can expand it with their expertise.
http://developer.postgresql.org/index.php/Buffer_Cache%2C_Checkpoints%2C_and_the_BGW
is what I've done so far on this particular topic.

What I would like to see on the Wiki first are pages devoted to how to run
the common benchmarks people use for useful performance testing. A recent
thread on one of the lists reminded me how easy it is to get worthless
results out of DBT2 if you don't have any guidance on that. I've already got
a stack of documentation about how to wrestle with pgbench and am generating
more.

The problem with using the Wiki as the main focus is that when you get to the
point that you want to upload detailed test results, that interface really
isn't appropriate for it. For example, in the last day I've collected data
from about 400 short test runs that generated 800 graphs. It's all organized
as HTML so you can drill down into the specific tests that executed oddly.
Heikki's DBT2 results are similar; not as many files, because he's running
longer tests, but the navigation is even more complicated. There is no way to
easily put that type and level of information into the Wiki page. You really
just need a web server to copy the results onto. Then the main problem you
have to be concerned about is a repeat of the OSDL situation, where all the
results just disappear if their hosting sponsor goes away.

--
* Greg Smith  gsmith@gregsmith.com  http://www.gregsmith.com  Baltimore, MD
On Thu, 2007-06-14 at 16:39 +0900, ITAGAKI Takahiro wrote:
> Greg Smith <gsmith@gregsmith.com> wrote:
>
> > On Mon, 11 Jun 2007, ITAGAKI Takahiro wrote:
> > > If the kernel can treat sequential writes better than random writes, is
> > > it worth sorting dirty buffers in block order per file at the start of
> > > checkpoints?
>
> I wrote and tested the attached sorted-writes patch based on Heikki's
> ldc-justwrites-1.patch. There was an obvious performance win on the OLTP
> workload.
>
>  tests                     | pgbench | DBT-2 response time (avg/90%/max)
> ---------------------------+---------+-----------------------------------
>  LDC only                  | 181 tps | 1.12 / 4.38 / 12.13 s
>  + BM_CHECKPOINT_NEEDED(*) | 187 tps | 0.83 / 2.68 /  9.26 s
>  + Sorted writes           | 224 tps | 0.36 / 0.80 /  8.11 s
>
> (*) Don't write buffers that were dirtied after starting the checkpoint.
>
> machine : 2GB-ram, SCSI*4 RAID-5
> pgbench : -s400 -t40000 -c10 (about 5GB of database)
> DBT-2   : 60WH (about 6GB of database)

I'm very surprised by the BM_CHECKPOINT_NEEDED results. What percentage of
writes has been saved by doing that? We would expect only a small percentage
of blocks, so that shouldn't make a significant difference. I thought we
discussed this before, about a year ago. It would be easy to get that wrong
and avoid writing a block that had been re-dirtied after the start of the
checkpoint but was already dirty beforehand. How long was the write phase of
the checkpoint, and how long between checkpoints?

I can see the sorted writes having an effect because the OS may not receive
blocks within a sufficient time window to fully optimise them. That effect
would grow with increasing sizes of shared_buffers and decrease with the size
of the controller cache. How big was the shared_buffers setting? What OS
scheduler are you using? The effect would be greatest when using Deadline.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
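[For readers puzzling over the BM_CHECKPOINT_NEEDED line in the table, a minimal sketch of the idea, not the actual bufmgr code; locking and the real BufferDesc layout are omitted. At checkpoint start, every buffer that is dirty right then gets the flag, and the write loop only touches flagged buffers.]

```c
#include <stddef.h>

#define BM_DIRTY                (1 << 0)
#define BM_CHECKPOINT_NEEDED    (1 << 1)

typedef struct BufferDesc
{
    unsigned    flags;
    /* ... buffer tag, locks, etc. omitted ... */
} BufferDesc;

/*
 * Phase 1: at the start of the checkpoint, remember which buffers are dirty
 * *right now*.  (The real code does this under the buffer header lock.)
 */
static void
mark_buffers_for_checkpoint(BufferDesc *bufs, size_t nbufs)
{
    size_t      i;

    for (i = 0; i < nbufs; i++)
    {
        if (bufs[i].flags & BM_DIRTY)
            bufs[i].flags |= BM_CHECKPOINT_NEEDED;
    }
}

/*
 * Phase 2: the write loop only flushes buffers carrying the flag.  A page
 * that first became dirty after phase 1 has BM_DIRTY but not
 * BM_CHECKPOINT_NEEDED, so it is skipped and left for the next checkpoint.
 */
static int
buffer_needs_checkpoint_write(const BufferDesc *buf)
{
    return (buf->flags & (BM_DIRTY | BM_CHECKPOINT_NEEDED)) ==
           (BM_DIRTY | BM_CHECKPOINT_NEEDED);
}
```

Note that a buffer which was already dirty at checkpoint start and is re-dirtied later still carries the flag and is still written; only pages that first become dirty after the mark pass are deferred, which is the distinction Simon is asking to have confirmed.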
On 6/14/07, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Thu, 2007-06-14 at 16:39 +0900, ITAGAKI Takahiro wrote:
> > Greg Smith <gsmith@gregsmith.com> wrote:
> >
> > > On Mon, 11 Jun 2007, ITAGAKI Takahiro wrote:
> > > > If the kernel can treat sequential writes better than random writes, is
> > > > it worth sorting dirty buffers in block order per file at the start of
> > > > checkpoints?
> >
> > I wrote and tested the attached sorted-writes patch based on Heikki's
> > ldc-justwrites-1.patch. There was an obvious performance win on the OLTP
> > workload.
> >
> >  tests                     | pgbench | DBT-2 response time (avg/90%/max)
> > ---------------------------+---------+-----------------------------------
> >  LDC only                  | 181 tps | 1.12 / 4.38 / 12.13 s
> >  + BM_CHECKPOINT_NEEDED(*) | 187 tps | 0.83 / 2.68 /  9.26 s
> >  + Sorted writes           | 224 tps | 0.36 / 0.80 /  8.11 s
> >
> > (*) Don't write buffers that were dirtied after starting the checkpoint.
> >
> > machine : 2GB-ram, SCSI*4 RAID-5
> > pgbench : -s400 -t40000 -c10 (about 5GB of database)
> > DBT-2   : 60WH (about 6GB of database)
>
> I'm very surprised by the BM_CHECKPOINT_NEEDED results. What percentage of
> writes has been saved by doing that? We would expect only a small percentage
> of blocks, so that shouldn't make a significant difference. I thought we
> discussed this before, about a year ago. It would be easy to get that wrong
> and avoid writing a block that had been re-dirtied after the start of the
> checkpoint but was already dirty beforehand. How long was the write phase of
> the checkpoint, and how long between checkpoints?
>
> I can see the sorted writes having an effect because the OS may not receive
> blocks within a sufficient time window to fully optimise them. That effect
> would grow with increasing sizes of shared_buffers and decrease with the
> size of the controller cache. How big was the shared_buffers setting? What
> OS scheduler are you using? The effect would be greatest when using Deadline.

Linux has some instrumentation that might be useful for this testing:

echo 1 > /proc/sys/vm/block_dump

This will have the kernel log all physical I/O (disable syslog writing to
disk before turning it on if you don't want the system to blow up).

Certainly the OS elevator should be working well enough not to see that much
of an improvement. Perhaps frequent fsync behavior is having an unintended
interaction with the elevator? ... It might be worthwhile to contact some
Linux kernel developers and see if there is some misunderstanding.
On Thu, 14 Jun 2007, Gregory Maxwell wrote:
> Linux has some instrumentation that might be useful for this testing:
> echo 1 > /proc/sys/vm/block_dump

That bit was developed for tracking down what was spinning the hard drive up
out of power-saving mode, and I was under the impression that such a rough
facility isn't all that useful here. I just tried to track down again where
I got that impression from, and I think it was this thread:
http://linux.slashdot.org/comments.pl?sid=231817&cid=18832379

It mentions general issues with figuring out who was responsible for a write
and specifically mentions how you'll have to reconcile two different paths if
fsync is mixed in. Not saying it won't work; it's just obvious that using the
block_dump output isn't a simple job.

(For anyone who would like an intro to this feature, try
http://www.linuxjournal.com/node/7539/print and
http://toadstool.se/journal/2006/05/27/monitoring-filesystem-activity-under-linux-with-block_dump )

--
* Greg Smith  gsmith@gregsmith.com  http://www.gregsmith.com  Baltimore, MD
> >  tests                     | pgbench | DBT-2 response time (avg/90%/max)
> > ---------------------------+---------+-----------------------------------
> >  LDC only                  | 181 tps | 1.12 / 4.38 / 12.13 s
> >  + BM_CHECKPOINT_NEEDED(*) | 187 tps | 0.83 / 2.68 /  9.26 s
> >  + Sorted writes           | 224 tps | 0.36 / 0.80 /  8.11 s
> >
> > (*) Don't write buffers that were dirtied after starting the checkpoint.
> >
> > machine : 2GB-ram, SCSI*4 RAID-5
> > pgbench : -s400 -t40000 -c10 (about 5GB of database)
> > DBT-2   : 60WH (about 6GB of database)
>
> I'm very surprised by the BM_CHECKPOINT_NEEDED results. What
> percentage of writes has been saved by doing that? We would
> expect only a small percentage of blocks, so that
> shouldn't make a significant difference. I thought we

Wouldn't pages that are dirtied during the checkpoint also usually be rather
hot? If we lock one of those for writing, the chances are high that a client
needs to wait for the lock. An OS write call should usually be very fast, but
when the I/O gets bottlenecked it might easily become slower.

Probably the recent result, that it saves ~53% of the writes, is sufficient
explanation though.

Very nice results :-) Looks like we want all of it, including the sort.

Andreas
"Simon Riggs" <simon@2ndquadrant.com> wrote: > > tests | pgbench | DBT-2 response time (avg/90%/max) > > ---------------------------+---------+----------------------------------- > > LDC only | 181 tps | 1.12 / 4.38 / 12.13 s > > + BM_CHECKPOINT_NEEDED(*) | 187 tps | 0.83 / 2.68 / 9.26 s > > + Sorted writes | 224 tps | 0.36 / 0.80 / 8.11 s > > I'm very surprised by the BM_CHECKPOINT_NEEDED results. What percentage > of writes has been saved by doing that? > How long was the write phase of the checkpoint, how long > between checkpoints? > > I can see the sorted writes having an effect because the OS may not > receive blocks within a sufficient time window to fully optimise them. > That effect would grow with increasing sizes of shared_buffers and > decrease with size of controller cache. How big was the shared buffers > setting? What OS scheduler are you using? The effect would be greatest > when using Deadline. I didn't tune OS parameters, used default values. In terms of cache amounts, postgres buffers were larger than kernel write pool and controller cache. that's why the OS could not optimise writes enough in checkpoint, I think. - 200MB <- RAM * dirty_background_ratio - 128MB <- Controller cache - 2GB <- postgres shared_buffers I forget to gather detail I/O information in the tests. I'll retry it and report later. RAM 2GB Controller cache 128MB shared_buffers 1GB checkpoint_timeout = 15min checkpoint_write_percent = 50.0 RHEL4 (Linux 2.6.9-42.0.2.EL) vm.dirty_background_ratio = 10 vm.dirty_ratio = 40 vm.dirty_expire_centisecs = 3000 vm.dirty_writeback_centisecs = 500 Using cfq io scheduler Regards, --- ITAGAKI Takahiro NTT Open Source Software Center
On Fri, 2007-06-15 at 18:33 +0900, ITAGAKI Takahiro wrote:
> I didn't tune OS parameters; I used default values. In terms of cache
> amounts, postgres buffers were larger than the kernel write pool and the
> controller cache. That's why the OS could not optimise writes enough during
> the checkpoint, I think.
>
> - 200MB <- RAM * dirty_background_ratio
> - 128MB <- controller cache
> - 2GB   <- postgres shared_buffers
>
> I forgot to gather detailed I/O information in the tests.
> I'll retry it and report later.
>
> RAM                      2GB
> Controller cache         128MB
> shared_buffers           1GB
> checkpoint_timeout       = 15min
> checkpoint_write_percent = 50.0
>
> RHEL4 (Linux 2.6.9-42.0.2.EL)
> vm.dirty_background_ratio    = 10
> vm.dirty_ratio               = 40
> vm.dirty_expire_centisecs    = 3000
> vm.dirty_writeback_centisecs = 500
> Using the cfq I/O scheduler

Sounds like sorting the buffers before checkpoint is going to be a win once
we go above roughly ~128MB. We can do a simple test on NBuffers, rather than
have a sort_blocks_at_checkpoint (!) GUC. But it does seem there is a win for
larger settings of shared_buffers.

Does performance go up in the non-sorted case if we make shared_buffers
smaller? Sounds like it might. We should check that first.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
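[A simple test on NBuffers along those lines could be as small as the sketch below; the 128MB cutoff is only the controller-cache figure mentioned in this thread, not a tuned value.]

```c
#define BLCKSZ          8192
#define SORT_THRESHOLD  (128 * 1024 * 1024)     /* ~128MB, illustrative cutoff */

/*
 * nbuffers is the shared_buffers setting expressed in BLCKSZ blocks.
 * Below roughly the size of a typical controller cache the OS elevator can
 * probably reorder the writes itself; above it, sort in the server.
 */
static int
should_sort_checkpoint_writes(long nbuffers)
{
    return (long long) nbuffers * BLCKSZ > SORT_THRESHOLD;
}
```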
Added to TODO:

* Consider sorting writes during checkpoint

  http://archives.postgresql.org/pgsql-hackers/2007-06/msg00541.php

---------------------------------------------------------------------------

ITAGAKI Takahiro wrote:
> Greg Smith <gsmith@gregsmith.com> wrote:
>
> > On Mon, 11 Jun 2007, ITAGAKI Takahiro wrote:
> > > If the kernel can treat sequential writes better than random writes, is
> > > it worth sorting dirty buffers in block order per file at the start of
> > > checkpoints?
>
> I wrote and tested the attached sorted-writes patch based on Heikki's
> ldc-justwrites-1.patch. There was an obvious performance win on the OLTP
> workload.
>
>  tests                     | pgbench | DBT-2 response time (avg/90%/max)
> ---------------------------+---------+-----------------------------------
>  LDC only                  | 181 tps | 1.12 / 4.38 / 12.13 s
>  + BM_CHECKPOINT_NEEDED(*) | 187 tps | 0.83 / 2.68 /  9.26 s
>  + Sorted writes           | 224 tps | 0.36 / 0.80 /  8.11 s
>
> (*) Don't write buffers that were dirtied after starting the checkpoint.
>
> machine : 2GB-ram, SCSI*4 RAID-5
> pgbench : -s400 -t40000 -c10 (about 5GB of database)
> DBT-2   : 60WH (about 6GB of database)
>
> > I think it has the potential to improve things. There are three obvious
> > and one subtle argument against it I can think of:
> >
> > 1) Extra complexity for something that may not help. This would need some
> > good, robust benchmarking improvements to justify its use.
>
> Exactly. I think we need a discussion board for I/O performance issues.
> Can I use the Developers Wiki for this purpose? Performance graphs and
> result tables are important for the discussion, so it might work better
> than the text-based mailing lists.
>
> > 2) Block number ordering may not reflect actual order on disk. While
> > true, it's got to be better correlated with it than writing at random.
> > 3) The OS disk elevator should be dealing with this issue, particularly
> > because it may really know the actual disk ordering.
>
> Yes, both are true. However, I think there is a pretty high correlation
> between those orderings. In addition, we should use the filesystem to ensure
> those orderings correspond to each other. For example, pre-allocation of
> files might help us, as has often been discussed.
>
> > Here's the subtle thing: by writing in the same order the LRU scan occurs
> > in, you are writing dirty buffers in the optimal fashion to eliminate
> > client backend writes during BufferAlloc. This makes the checkpoint a
> > really effective LRU clearing mechanism. Writing in block order will
> > change that.
>
> The issue will probably go away after we have LDC, because it writes LRU
> buffers during checkpoints.
>
> Regards,
> ---
> ITAGAKI Takahiro
> NTT Open Source Software Center
>
> [ Attachment, skipping... ]
>
> ---------------------------(end of broadcast)---------------------------
> TIP 2: Don't 'kill -9' the postmaster

--
Bruce Momjian <bruce@momjian.us>  http://momjian.us
EnterpriseDB  http://postgres.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +