Re: sorted writes for checkpoints - Mailing list pgsql-hackers

From Robert Haas
Subject Re: sorted writes for checkpoints
Date
Msg-id AANLkTikGKKLFhYF1HQGGwV1-BpHHEaecLXqYFh9AAzL_@mail.gmail.com
In response to Re: sorted writes for checkpoints  (Jeff Janes <jeff.janes@gmail.com>)
List pgsql-hackers
On Sat, Nov 6, 2010 at 7:25 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> There are really two separate things here:
>>
>> (1) trying to do all the writes to file A before you start doing
>> writes to file B, and
>> (2) trying to write out blocks to each file in ascending logical block
>> number order
>>
>> I'm much more convinced of the value of #1 than I am of the value of
>> #2.  If we do #1, we can then spread out the checkpoint fsyncs in a
>> meaningful way without fearing that we'll need to fsync the same file
>> a second time for the same checkpoint.
>
> If the OS/FS is behaving such that it is important to spread out
> fsyncs, then wouldn't that same behavior also make it less important
> to avoid fsyncing a second time?
>
> If the OS is squirreling away a preposterous amount of dirty buffers
> in its cache, and then panicking to dump them all when it gets the
> fsync, then I think you would need to spread out the fsyncs within a
> file, and not just between files.

Well, presumably, there's some amount of dirty data that the OS can
write out in one shot without causing a perceptible stall (otherwise
we're hosed no matter what).  The question is where that threshold is.
With one fsync per file, we'll try to write at most 1 GB at once, and
often less.  As you say, there's a possibility that that's still too
much, but right now we could be trying to dump 30+ GB to disk if the
OS has a lot of dirty pages in the buffer cache, so getting down to 1
GB or less at a time should be a big improvement even if it doesn't
solve the problem completely.
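
For what it's worth, to make #1 and #2 concrete, what I have in mind is
roughly the following: sort the checkpoint's to-be-written buffers so
that all blocks for one relation file come out together (point #1), in
ascending block order within each file (point #2).  This is a hand-wavy
sketch only; the struct and field names below are made up, not the real
bufmgr ones.

#include <stdlib.h>

/* Made-up descriptor for a dirty buffer queued by the checkpointer. */
typedef struct CkptSortItem
{
    unsigned int relfile_id;   /* identifies the underlying relation file */
    unsigned int blocknum;     /* logical block number within that file */
    int          buf_id;       /* index into the shared buffer pool */
} CkptSortItem;

/* Order by file first (#1), then by block number within the file (#2). */
static int
ckpt_buf_cmp(const void *a, const void *b)
{
    const CkptSortItem *x = (const CkptSortItem *) a;
    const CkptSortItem *y = (const CkptSortItem *) b;

    if (x->relfile_id != y->relfile_id)
        return (x->relfile_id < y->relfile_id) ? -1 : 1;
    if (x->blocknum != y->blocknum)
        return (x->blocknum < y->blocknum) ? -1 : 1;
    return 0;
}

/* qsort(items, nitems, sizeof(CkptSortItem), ckpt_buf_cmp); */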

>> We've gotten some pretty
>> specific reports of problems in this area recently, so it seems likely
>> that there is some value to be had there.  On the other hand, #2 is
>> only a win if sorting the blocks in numerical order causes the OS to
>> write them in a better order than it would otherwise have done.
>
> Assuming the ordering is useful, the only way the OS can do as good a
> job as the checkpoint code can, is if the OS stores the entire
> checkpoint worth of data as dirty blocks and doesn't start writing
> until an fsync comes in.  This strikes me as a pathologically
> configured OS/FS.  (And would explain problems with fsyncs)

The OS would only need to store and reorder one file's worth of
blocks, if we wrote the data for one file and called fsync, wrote the
data for another file and called fsync, etc.
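
In other words, something like this, continuing the made-up sketch
above (open_relation_file and write_one_block are stand-ins for the
real smgr/bufmgr machinery, not actual functions):

#include <unistd.h>

/* Made-up helpers standing in for the real smgr/bufmgr calls. */
extern int  open_relation_file(unsigned int relfile_id);
extern void write_one_block(int fd, const CkptSortItem *item);

/*
 * Walk the sorted list, write out one file's dirty blocks, then fsync
 * that file before touching the next one.  At any given moment the OS
 * is holding at most one file's worth of checkpoint writes.
 */
static void
write_sorted_checkpoint(CkptSortItem *items, int nitems)
{
    int i = 0;

    while (i < nitems)
    {
        unsigned int cur_file = items[i].relfile_id;
        int          fd = open_relation_file(cur_file);

        while (i < nitems && items[i].relfile_id == cur_file)
            write_one_block(fd, &items[i++]);

        fsync(fd);              /* one fsync per file */
        close(fd);
    }
}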

>> We've
>> had recent reports that our block-at-a-time relation extension policy
>> is leading to severe fragmentation on certain filesystems, so I'm a
>> bit skeptical about the value of this (though, of course, that can be
>> overturned if we can collect meaningful evidence).
>
> Some FS are better about that than others.  It would probably
> depend on the exact workload, and pgbench would probably favor large
> contiguous extents to an unrealistic degree.  So I don't know the best
> way to gather that evidence.

Well, the basic idea would be to try some different workloads on
different filesystems and try to get some feeling for how often
sorting by block number wins and how often it loses.  But as I say I
think the biggest problem is that we're often trying to write too much
dirty data to disk at once.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

