From: Robert Haas
Subject: Re: sorted writes for checkpoints
Msg-id: AANLkTikGKKLFhYF1HQGGwV1-BpHHEaecLXqYFh9AAzL_@mail.gmail.com
In response to: Re: sorted writes for checkpoints (Jeff Janes <jeff.janes@gmail.com>)
List: pgsql-hackers
On Sat, Nov 6, 2010 at 7:25 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> There are really two separate things here:
>>
>> (1) trying to do all the writes to file A before you start doing
>> writes to file B, and
>> (2) trying to write out blocks to each file in ascending logical
>> block number order
>>
>> I'm much more convinced of the value of #1 than I am of the value of
>> #2.  If we do #1, we can then spread out the checkpoint fsyncs in a
>> meaningful way without fearing that we'll need to fsync the same file
>> a second time for the same checkpoint.
>
> If the OS/FS is behaving such that it is important to spread out
> fsyncs, then wouldn't that same behavior also make it less important
> to avoid fsyncing a second time?
>
> If the OS is squirreling away a preposterous amount of dirty buffers
> in its cache, and then panicking to dump them all when it gets the
> fsync, then I think you would need to spread out the fsyncs within a
> file, and not just between files.

Well, presumably, there's some amount of dirty data that the OS can
write out in one shot without causing a perceptible stall (otherwise
we're hosed no matter what).  The question is where that threshold is.
With one fsync per file, we'll try to write at most 1 GB at once, and
often less.  As you say, there's a possibility that that's still too
much, but right now we could be trying to dump 30+ GB to disk if the
OS has a lot of dirty pages in the buffer cache, so getting down to
1 GB or less at a time should be a big improvement even if it doesn't
solve the problem completely.

>> We've gotten some pretty specific reports of problems in this area
>> recently, so it seems likely that there is some value to be had
>> there.  On the other hand, #2 is only a win if sorting the blocks in
>> numerical order causes the OS to write them in a better order than
>> it would otherwise have done.
>
> Assuming the ordering is useful, the only way the OS can do as good a
> job as the checkpoint code can is if the OS stores the entire
> checkpoint's worth of data as dirty blocks and doesn't start writing
> until an fsync comes in.  This strikes me as a pathologically
> configured OS/FS.  (And would explain problems with fsyncs.)

The OS would only need to store and reorder one file's worth of
blocks if we wrote the data for one file and called fsync, wrote the
data for another file and called fsync, and so on.

>> We've had recent reports that our block-at-a-time relation extension
>> policy is leading to severe fragmentation on certain filesystems, so
>> I'm a bit skeptical about the value of this (though, of course, that
>> can be overturned if we can collect meaningful evidence).
>
> Some FS are better about that than others.  It would probably depend
> on the exact workload, and pgbench would probably favor large
> contiguous extents to an unrealistic degree.  So I don't know the
> best way to gather that evidence.

Well, the basic idea would be to try some different workloads on
different filesystems and try to get some feeling for how often
sorting by block number wins and how often it loses.  But as I say, I
think the biggest problem is that we're often trying to write too much
dirty data to disk at once.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
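
A minimal sketch of the per-file write-and-fsync strategy discussed
above, assuming a hypothetical DirtyBuffer array, a fixed 8 KB block
size, and an arbitrary 100 ms pause between fsyncs; it is not
PostgreSQL source, just an illustration of the ordering: sort dirty
blocks by (file, block number), write each file's blocks in ascending
order, and issue one fsync per file so the kernel is only ever asked
to flush about one segment's worth of dirty data at a time.

/*
 * Illustrative sketch only, not PostgreSQL code.  Group dirty blocks
 * by file, write each file's blocks in ascending block-number order,
 * fsync that file, then pause before moving to the next file so the
 * fsyncs are spread out.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLCKSZ 8192             /* assumed page size for the sketch */

typedef struct DirtyBuffer      /* hypothetical representation */
{
    char        path[64];       /* file the block belongs to */
    long        blocknum;       /* logical block number within the file */
    char        data[BLCKSZ];   /* page contents */
} DirtyBuffer;

/* Order by file first, then by block number within the file. */
static int
cmp_dirty(const void *a, const void *b)
{
    const DirtyBuffer *x = a;
    const DirtyBuffer *y = b;
    int         c = strcmp(x->path, y->path);

    if (c != 0)
        return c;
    return (x->blocknum > y->blocknum) - (x->blocknum < y->blocknum);
}

static void
checkpoint_write(DirtyBuffer *bufs, size_t nbufs)
{
    qsort(bufs, nbufs, sizeof(DirtyBuffer), cmp_dirty);

    for (size_t i = 0; i < nbufs;)
    {
        const char *path = bufs[i].path;
        int         fd = open(path, O_WRONLY);

        if (fd < 0)
        {
            perror(path);
            return;
        }

        /* Write every dirty block of this file, in ascending order. */
        for (; i < nbufs && strcmp(bufs[i].path, path) == 0; i++)
        {
            if (pwrite(fd, bufs[i].data, BLCKSZ,
                       bufs[i].blocknum * (off_t) BLCKSZ) != BLCKSZ)
                perror("pwrite");
        }

        /*
         * One fsync per file: at most one file's worth of dirty data
         * is flushed here, rather than the whole checkpoint's worth.
         */
        if (fsync(fd) != 0)
            perror("fsync");
        close(fd);

        /* Pause between fsyncs to spread the I/O (interval arbitrary). */
        usleep(100 * 1000);
    }
}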