Thread: sorted writes for checkpoints
One of the items on the Wiki ToDo list is sorted writes for checkpoints. The consensus seemed to be that this should be done by adding hook(s) into the main code, and then a contrib module to work with those hooks. Is there an existing contrib module that one could best look to for inspiration on how to go about doing this? I have the sorted checkpoint working under a guc, but don't know where to start on converting it to a contrib module instead.

Cheers,

Jeff
Excerpts from Jeff Janes's message of vie oct 29 00:00:24 -0300 2010:

> One of the items on the Wiki ToDo list is sorted writes for
> checkpoints. The consensus seemed to be that this should be done by
> adding hook(s) into the main code, and then a contrib module to work
> with those hooks. Is there an existing contrib module that one could
> best look to for inspiration on how to go about doing this? I have
> the sorted checkpoint working under a guc, but don't know where to
> start on converting it to a contrib module instead.

Hmm, see contrib/auto_explain?

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
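For anyone looking for the shape Álvaro means: auto_explain is a loadable module whose _PG_init() saves whatever function was previously installed in a backend hook and puts its own function in its place, chaining to the old one. A minimal sketch of that pattern applied here might look like the code below. Note that checkpoint_sort_hook and its signature are purely hypothetical -- no such hook exists in the backend, and adding one is exactly the core change the ToDo item is about.

    /* Sketch of a contrib module following the auto_explain pattern.
     * checkpoint_sort_hook and CheckpointSortHook are hypothetical; a
     * core patch would have to introduce them before this could build. */
    #include "postgres.h"
    #include "fmgr.h"

    PG_MODULE_MAGIC;

    void _PG_init(void);

    /* hypothetical hook that a core patch would have to provide */
    typedef void (*CheckpointSortHook) (int *buf_ids, int nbufs);
    extern CheckpointSortHook checkpoint_sort_hook;

    static CheckpointSortHook prev_checkpoint_sort_hook = NULL;

    static void
    my_checkpoint_sort(int *buf_ids, int nbufs)
    {
        /* reorder buf_ids here, e.g. by relfilenode and block number */

        /* chain to any previously installed hook, as auto_explain does */
        if (prev_checkpoint_sort_hook)
            prev_checkpoint_sort_hook(buf_ids, nbufs);
    }

    void
    _PG_init(void)
    {
        prev_checkpoint_sort_hook = checkpoint_sort_hook;
        checkpoint_sort_hook = my_checkpoint_sort;
    }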
On 29.10.2010 06:00, Jeff Janes wrote:

> One of the items on the Wiki ToDo list is sorted writes for
> checkpoints. The consensus seemed to be that this should be done by
> adding hook(s) into the main code, and then a contrib module to work
> with those hooks. Is there an existing contrib module that one could
> best look to for inspiration on how to go about doing this? I have
> the sorted checkpoint working under a guc, but don't know where to
> start on converting it to a contrib module instead.

I don't think it's a good idea to have this as a hook. Bgwriter shouldn't need to load external code, and checkpoint robustness shouldn't depend on user-written code. IIRC Tom Lane didn't even like pallocing the memory for the list of dirty pages at checkpoint time, because that might cause an out-of-memory error. Calling a function in a contrib module is much, much worse.

Simon's argument in the thread that the todo item points to (http://archives.postgresql.org/pgsql-patches/2008-07/msg00123.php) is basically that we don't know what the best algorithm is yet and benchmarking is a lot of work, so let's just let people do whatever they feel like until we settle on the best approach. I think we need to bite the bullet and do some benchmarking, and commit one carefully vetted patch to the backend.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Fri, Oct 29, 2010 at 3:23 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

> Simon's argument in the thread that the todo item points to
> (http://archives.postgresql.org/pgsql-patches/2008-07/msg00123.php) is
> basically that we don't know what the best algorithm is yet and benchmarking
> is a lot of work, so let's just let people do whatever they feel like until
> we settle on the best approach. I think we need to bite the bullet and do
> some benchmarking, and commit one carefully vetted patch to the backend.

When I submitted the patch, I tested it on a disk-based RAID-5 machine:
http://archives.postgresql.org/pgsql-hackers/2007-06/msg00541.php
But there were no additional benchmarking reports at that time. We still need benchmarking before we re-examine the feature. For example, SSDs and SSD RAID were not popular at that time, but now they might be worth considering.

I think patching the core directly is enough for the first round of testing, and we can decide the interface according to the results. If one algorithm wins in all cases, we could just include it in the core, and then extensibility would not be needed.

--
Itagaki Takahiro
On Fri, Oct 29, 2010 at 2:58 AM, Itagaki Takahiro
<itagaki.takahiro@gmail.com> wrote:

> On Fri, Oct 29, 2010 at 3:23 PM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com> wrote:
>> Simon's argument in the thread that the todo item points to
>> (http://archives.postgresql.org/pgsql-patches/2008-07/msg00123.php) is
>> basically that we don't know what the best algorithm is yet and benchmarking
>> is a lot of work, so let's just let people do whatever they feel like until
>> we settle on the best approach. I think we need to bite the bullet and do
>> some benchmarking, and commit one carefully vetted patch to the backend.
>
> When I submitted the patch, I tested it on a disk-based RAID-5 machine:
> http://archives.postgresql.org/pgsql-hackers/2007-06/msg00541.php
> But there were no additional benchmarking reports at that time. We still
> need benchmarking before we re-examine the feature. For example, SSDs and
> SSD RAID were not popular at that time, but now they might be worth considering.

There are really two separate things here:

(1) trying to do all the writes to file A before you start doing writes to file B, and
(2) trying to write out blocks to each file in ascending logical block number order

I'm much more convinced of the value of #1 than I am of the value of #2. If we do #1, we can then spread out the checkpoint fsyncs in a meaningful way without fearing that we'll need to fsync the same file a second time for the same checkpoint. We've gotten some pretty specific reports of problems in this area recently, so it seems likely that there is some value to be had there. On the other hand, #2 is only a win if sorting the blocks in numerical order causes the OS to write them in a better order than it would otherwise have done. We've had recent reports that our block-at-a-time relation extension policy is leading to severe fragmentation on certain filesystems, so I'm a bit skeptical about the value of this (though, of course, that can be overturned if we can collect meaningful evidence).

> I think patching the core directly is enough for the first round of
> testing, and we can decide the interface according to the results.
> If one algorithm wins in all cases, we could just include it in the
> core, and then extensibility would not be needed.

I agree with this, and with Heikki's remarks also.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
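To make the two orderings concrete, here is a stand-alone sketch (not backend code; DirtyBlock is a made-up stand-in for the real buffer tag) of a comparator that yields both at once: the relfilenode comparison groups all of one file's writes together, which is #1, and the block-number tiebreak within a file gives #2.

    /* Stand-alone illustration of the two orderings described above.
     * DirtyBlock is an invented stand-in for the backend's buffer tag. */
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct DirtyBlock
    {
        unsigned int relfilenode;   /* which relation file the block belongs to */
        unsigned int blocknum;      /* logical block number within that file */
    } DirtyBlock;

    static int
    cmp_dirty_block(const void *a, const void *b)
    {
        const DirtyBlock *x = (const DirtyBlock *) a;
        const DirtyBlock *y = (const DirtyBlock *) b;

        /* (1): finish all writes to one file before starting the next */
        if (x->relfilenode != y->relfilenode)
            return (x->relfilenode < y->relfilenode) ? -1 : 1;
        /* (2): within a file, ascending logical block number */
        if (x->blocknum != y->blocknum)
            return (x->blocknum < y->blocknum) ? -1 : 1;
        return 0;
    }

    int
    main(void)
    {
        DirtyBlock blocks[] = {{17, 42}, {12, 7}, {17, 3}, {12, 8}};
        int         n = sizeof(blocks) / sizeof(blocks[0]);

        qsort(blocks, n, sizeof(DirtyBlock), cmp_dirty_block);

        for (int i = 0; i < n; i++)
            printf("rel %u block %u\n", blocks[i].relfilenode, blocks[i].blocknum);
        return 0;
    }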
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

> Simon's argument in the thread that the todo item points to
> (http://archives.postgresql.org/pgsql-patches/2008-07/msg00123.php) is
> basically that we don't know what the best algorithm is yet and
> benchmarking is a lot of work, so let's just let people do whatever they
> feel like until we settle on the best approach. I think we need to bite
> the bullet and do some benchmarking, and commit one carefully vetted
> patch to the backend.

Yeah, I tend to agree. We've used hooks in the past to allow people to add on non-critical functionality. Fooling with the behavior of checkpoints is far from noncritical. Furthermore, it's really hard to see what a sane hook API would even look like. As Robert comments, part of any win here would likely come from controlling the timing of fsyncs, not just writes. Controlling all that at arm's length from the code that actually does it seems likely to be messy and inefficient.

Another point is that I don't see any groundswell of demand out there for custom checkpoint algorithms. If we did manage to create a hook API, how likely is it there would ever be more than one plugin?

regards, tom lane
Itagaki Takahiro wrote:

> When I submitted the patch, I tested it on a disk-based RAID-5 machine:
> http://archives.postgresql.org/pgsql-hackers/2007-06/msg00541.php
> But there were no additional benchmarking reports at that time. We still
> need benchmarking before we re-examine the feature. For example, SSDs and
> SSD RAID were not popular at that time, but now they might be worth considering.

I did multiple rounds of benchmarking on that, but none of it showed any improvement, so I didn't bother reporting the results in detail. I have recently figured out why the performance testing I did of that earlier patch probably failed to produce useful results on my system back then, though. It relates to trivia around how ext3 handles fsync that's well understood now (the whole cache flushes out when one comes in), but wasn't back then.

We have a working set of patches here that both rewrite the checkpoint logic to avoid several larger problems with how it works now and add instrumentation that makes it possible to directly measure and graph whether methods such as sorting writes provide any improvement to the process or not. My hope is to have those all ready for initial submission as part of CommitFest 2010-11, as my main feature addition toward improving 9.1. I have a bunch of background information about this that I'm presenting at PGWest next week, after which I'll start populating the wiki with more details and begin packaging the code too. I had hoped to revisit the checkpoint sorting details after that.

Jeff or yourself are welcome to try your own tests in that area; I could use the help. But I think my measurement patches will help you with that considerably once I release them in another couple of weeks. Seeing a graph of sync latency for each file is very informative for figuring out whether a change did something useful, more so than just staring at total TPS results. Such latency graphs are what I've recently started producing here, with some server-side changes that then feed into gnuplot.

The idea of making something like the sorting logic into a pluggable hook seems like a waste of time to me, particularly given that the earlier implementation really needed a dedicated block of shared memory allocated for it to work well, IMHO (and I believe that's still the case). That area isn't where the real problems are here anyway, especially on large-memory systems. How the sync logic works is the increasingly troublesome part of the checkpoint code, because the problem it has to deal with grows proportionately to the size of the write cache on the system. Typical production servers I deal with have about 8X as much RAM now as they did in 2007 when I last investigated write sorting. Regular hard drives sure haven't gotten 8X faster since then, and battery-backed caches (which used to have enough memory to absorb a large portion of a checkpoint burst) have at best doubled in size.

--
Greg Smith, 2ndQuadrant US
greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services and Support
www.2ndQuadrant.us
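The instrumentation Greg describes hadn't been posted yet, but the basic idea of timing individual sync calls is simple to illustrate. The toy below is only a guess at the general approach, not his patch: it times an fsync on each file named on the command line and prints one gnuplot-friendly line per file; the real version would wrap the sync calls the checkpoint code already issues.

    /* Rough illustration of timing individual fsync calls, one output line
     * per file, in a form that can be fed straight to gnuplot.  A sketch of
     * the idea only, not the instrumentation patch described above. */
    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/time.h>

    static double
    elapsed_ms(struct timeval start, struct timeval end)
    {
        return (end.tv_sec - start.tv_sec) * 1000.0 +
               (end.tv_usec - start.tv_usec) / 1000.0;
    }

    int
    main(int argc, char **argv)
    {
        for (int i = 1; i < argc; i++)
        {
            struct timeval start, end;
            int         fd = open(argv[i], O_WRONLY);

            if (fd < 0)
                continue;
            gettimeofday(&start, NULL);
            fsync(fd);
            gettimeofday(&end, NULL);
            close(fd);
            /* "<seconds-since-epoch> <latency-ms> <file>", easy to plot */
            printf("%ld %.3f %s\n", (long) end.tv_sec,
                   elapsed_ms(start, end), argv[i]);
        }
        return 0;
    }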
On Fri, Oct 29, 2010 at 6:17 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Oct 29, 2010 at 2:58 AM, Itagaki Takahiro
> <itagaki.takahiro@gmail.com> wrote:
>> On Fri, Oct 29, 2010 at 3:23 PM, Heikki Linnakangas
>> <heikki.linnakangas@enterprisedb.com> wrote:
>>> Simon's argument in the thread that the todo item points to
>>> (http://archives.postgresql.org/pgsql-patches/2008-07/msg00123.php) is
>>> basically that we don't know what the best algorithm is yet and benchmarking
>>> is a lot of work, so let's just let people do whatever they feel like until
>>> we settle on the best approach. I think we need to bite the bullet and do
>>> some benchmarking, and commit one carefully vetted patch to the backend.
>>
>> When I submitted the patch, I tested it on a disk-based RAID-5 machine:
>> http://archives.postgresql.org/pgsql-hackers/2007-06/msg00541.php
>> But there were no additional benchmarking reports at that time. We still
>> need benchmarking before we re-examine the feature. For example, SSDs and
>> SSD RAID were not popular at that time, but now they might be worth considering.
>
> There are really two separate things here:
>
> (1) trying to do all the writes to file A before you start doing
> writes to file B, and
> (2) trying to write out blocks to each file in ascending logical block
> number order
>
> I'm much more convinced of the value of #1 than I am of the value of
> #2. If we do #1, we can then spread out the checkpoint fsyncs in a
> meaningful way without fearing that we'll need to fsync the same file
> a second time for the same checkpoint.

If the OS/FS is behaving such that it is important to spread out fsyncs, then wouldn't that same behavior also make it less important to avoid fsyncing the same file a second time?

If the OS is squirreling away a preposterous amount of dirty buffers in its cache, and then panicking to dump them all when it gets the fsync, then I think you would need to spread out the fsyncs within a file, and not just between files.

> We've gotten some pretty
> specific reports of problems in this area recently, so it seems likely
> that there is some value to be had there. On the other hand, #2 is
> only a win if sorting the blocks in numerical order causes the OS to
> write them in a better order than it would otherwise have done.

Assuming the ordering is useful, the only way the OS can do as good a job as the checkpoint code can is if the OS stores the entire checkpoint's worth of data as dirty blocks and doesn't start writing until an fsync comes in. This strikes me as a pathologically configured OS/FS. (And it would explain problems with fsyncs.)

> We've
> had recent reports that our block-at-a-time relation extension policy
> is leading to severe fragmentation on certain filesystems, so I'm a
> bit skeptical about the value of this (though, of course, that can be
> overturned if we can collect meaningful evidence).

Some filesystems are better about that than others. It would probably depend on the exact workload, and pgbench would probably favor large contiguous extents to an unrealistic degree. So I don't know the best way to gather that evidence.

Cheers,

Jeff
On Sat, Nov 6, 2010 at 7:25 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> There are really two separate things here:
>>
>> (1) trying to do all the writes to file A before you start doing
>> writes to file B, and
>> (2) trying to write out blocks to each file in ascending logical block
>> number order
>>
>> I'm much more convinced of the value of #1 than I am of the value of
>> #2. If we do #1, we can then spread out the checkpoint fsyncs in a
>> meaningful way without fearing that we'll need to fsync the same file
>> a second time for the same checkpoint.
>
> If the OS/FS is behaving such that it is important to spread out
> fsyncs, then wouldn't that same behavior also make it less important
> to avoid fsyncing the same file a second time?
>
> If the OS is squirreling away a preposterous amount of dirty buffers
> in its cache, and then panicking to dump them all when it gets the
> fsync, then I think you would need to spread out the fsyncs within a
> file, and not just between files.

Well, presumably there's some amount of dirty data that the OS can write out in one shot without causing a perceptible stall (otherwise we're hosed no matter what). The question is where that threshold is. With one fsync per file, we'll try to write at most 1 GB at once, and often less. As you say, there's a possibility that that's still too much, but right now we could be trying to dump 30+ GB to disk if the OS has a lot of dirty pages in the buffer cache, so getting down to 1 GB or less at a time should be a big improvement even if it doesn't solve the problem completely.

>> We've gotten some pretty
>> specific reports of problems in this area recently, so it seems likely
>> that there is some value to be had there. On the other hand, #2 is
>> only a win if sorting the blocks in numerical order causes the OS to
>> write them in a better order than it would otherwise have done.
>
> Assuming the ordering is useful, the only way the OS can do as good a
> job as the checkpoint code can is if the OS stores the entire
> checkpoint's worth of data as dirty blocks and doesn't start writing
> until an fsync comes in. This strikes me as a pathologically
> configured OS/FS. (And it would explain problems with fsyncs.)

The OS would only need to store and reorder one file's worth of blocks, if we wrote the data for one file and called fsync, wrote the data for another file and called fsync, etc.

>> We've
>> had recent reports that our block-at-a-time relation extension policy
>> is leading to severe fragmentation on certain filesystems, so I'm a
>> bit skeptical about the value of this (though, of course, that can be
>> overturned if we can collect meaningful evidence).
>
> Some filesystems are better about that than others. It would probably
> depend on the exact workload, and pgbench would probably favor large
> contiguous extents to an unrealistic degree. So I don't know the best
> way to gather that evidence.

Well, the basic idea would be to try some different workloads on different filesystems and try to get some feeling for how often sorting by block number wins and how often it loses. But as I say, I think the biggest problem is that we're often trying to write too much dirty data to disk at once.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
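A toy illustration of the write pattern Robert is arguing for: push out one file's worth of blocks, fsync that file, then move on, so that no single fsync ever has more than roughly one 1 GB segment of writes behind it. The file names, block lists, and 8 kB block size below are invented for the example; this is a sketch of the idea, not checkpoint code.

    /* Toy version of "write each file's blocks, then fsync it before
     * moving on".  File names and block size are stand-ins. */
    #include <stdio.h>
    #include <string.h>
    #include <fcntl.h>
    #include <unistd.h>

    #define BLCKSZ 8192

    static void
    flush_one_file(const char *path, const unsigned int *blocks, int nblocks)
    {
        char        buf[BLCKSZ];
        int         fd = open(path, O_WRONLY | O_CREAT, 0600);

        if (fd < 0)
            return;
        memset(buf, 0, sizeof(buf));
        for (int i = 0; i < nblocks; i++)
            pwrite(fd, buf, BLCKSZ, (off_t) blocks[i] * BLCKSZ);
        /* sync this file now, while at most one file's writes are pending,
         * instead of syncing every file at the end of the checkpoint */
        fsync(fd);
        close(fd);
    }

    int
    main(void)
    {
        unsigned int file_a_blocks[] = {3, 7, 8};
        unsigned int file_b_blocks[] = {0, 1};

        flush_one_file("file_a", file_a_blocks, 3);
        flush_one_file("file_b", file_b_blocks, 2);
        return 0;
    }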
Jeff Janes wrote:

> Assuming the ordering is useful, the only way the OS can do as good a
> job as the checkpoint code can is if the OS stores the entire
> checkpoint's worth of data as dirty blocks and doesn't start writing
> until an fsync comes in. This strikes me as a pathologically
> configured OS/FS. (And it would explain problems with fsyncs.)

This can be exactly the situation with ext3 on Linux, which I believe is one reason the write sorting patch didn't go anywhere the last time it came up--that's certainly what I tested it on. The slides for my talk "Righting Your Writes" are now up at http://projects.2ndquadrant.com/talks and an example showing this is on page 9.

I'm hoping to get the 3 patches shown in action or described in that talk submitted to the list before the next CommitFest. You really need timing of individual sync calls to figure out what's going on here, and what happens is completely dependent on the filesystem.

--
Greg Smith 2ndQuadrant US
greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services and Support
www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
On Sun, Nov 7, 2010 at 4:13 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> Jeff Janes wrote:
>
>> Assuming the ordering is useful, the only way the OS can do as good a
>> job as the checkpoint code can is if the OS stores the entire
>> checkpoint's worth of data as dirty blocks and doesn't start writing
>> until an fsync comes in. This strikes me as a pathologically
>> configured OS/FS. (And it would explain problems with fsyncs.)
>
> This can be exactly the situation with ext3 on Linux, which I believe is one
> reason the write sorting patch didn't go anywhere the last time it came
> up--that's certainly what I tested it on.

Interesting. I think the default mount options for ext3 are to do a journal sync at least every 5 seconds, which should also flush out dirty OS buffers, preventing them from building up to such an extent. Is that default being changed here, or does it simply not work the way I think it does?

Thanks for the link to the slides.

Cheers,

Jeff