On 08/30/2014 09:45 PM, Andres Freund wrote:
> On 2014-08-30 14:16:10 -0400, Tom Lane wrote:
>> Andres Freund <andres@2ndquadrant.com> writes:
>>> On 2014-08-30 13:50:40 -0400, Tom Lane wrote:
>>>> A possible compromise is to sort a limited number of
>>>> buffers -- say, collect a few thousand dirty buffers then sort, dump and
>>>> fsync them, repeat as needed.
>>
>>> Yea, that's what I suggested nearby. But I don't really like it, because
>>> it robs us of the chance to fsync() a relfilenode immediately after
>>> having synced all its buffers.
>>
>> Uh, how so exactly? You could still do that. Yeah, you might fsync a rel
>> once per sort-group and not just once per checkpoint, but it's not clear
>> that that's a loss as long as the group size isn't tiny.
>
> Because it wouldn't have the benefit of syncing the minimal amount of
> data anymore. If lots of other relfilenodes have been synced in between,
> the amount of newly dirtied pages in the OS's buffer cache (written by
> backends, bgwriter) for an individual relfilenode is much higher.

I wonder how much of the benefit from sorting comes from sorting the
pages within each file, and how much just from grouping all the writes
to each file together. In other words, how much difference is there
between full sorting, fsyncing after each file, and the crude patch I
posted earlier?

If we're going to fsync between each file, there's no need to sort all
the buffers at once. It's enough to pick one file as the target - like
in my crude patch - and sort only the buffers for that file. Then fsync
that file and move on to the next file. That requires scanning the
buffers multiple times, but I think that's OK.
- Heikki