On Wednesday, January 11, 2012 03:14:31 AM Robert Haas wrote:
> Greg Smith muttered a while ago about wanting to do something with
> sync_file_range to improve checkpoint behavior on Linux. I thought he
> was talking about trying to sync only the range of blocks known to be
> dirty, which didn't seem like a very exciting idea, but after looking
> at the man page for sync_file_range, I think I understand what he was
> really going for: sync_file_range allows you to hint the Linux kernel
> that you'd like it to clean a certain set of pages. I further recall
> from Greg's previous comments that in the scenarios he's seen,
> checkpoint I/O spikes are caused not so much by the data written out
> by the checkpoint itself but from the other dirty data in the kernel
> buffer cache. Based on that, I whipped up the attached patch, which,
> if sync_file_range is available, simply iterates through everything
> that will eventually be fsync'd before beginning the write phase and
> tells the Linux kernel to put them all under write-out.
I played around with this before, and my problem was that sync_file_range is not
really a hint. It actually starts writeback *directly* and only returns once
the I/O has been placed in the queue (at least that's the way it was back then),
which very quickly leads to it blocking all the time...
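
To make the blocking concern concrete, here is a minimal sketch (not the code
from the attached patch) that issues the write-out request for a whole file and
times how long the call takes to return. The helper name start_writeback_timed
and its return convention are hypothetical; the only real API used is Linux's
sync_file_range() with SYNC_FILE_RANGE_WRITE. The timing is there only to
observe how long the call stalls on a given system.

```c
#define _GNU_SOURCE             /* exposes sync_file_range() in <fcntl.h> */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>             /* mkstemp(), for a caller creating a test file */
#include <time.h>
#include <unistd.h>             /* write(), close(), unlink() for the caller */

/*
 * Hypothetical helper, not taken from the patch: ask the kernel to start
 * write-out of all dirty pages of fd and report how long the call took.
 *
 * Returns 0 on success, 1 if sync_file_range is not available on this
 * platform or kernel, and -1 on any other error (errno is left set).
 */
static int
start_writeback_timed(int fd, double *elapsed_ms)
{
    assert(fd >= 0 && elapsed_ms != NULL);
    *elapsed_ms = 0.0;

#ifdef __linux__
    struct timespec t0, t1;
    int             rc;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    /* offset 0 with nbytes 0 means "from the start through end of file" */
    rc = sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    *elapsed_ms = (t1.tv_sec - t0.tv_sec) * 1000.0 +
                  (t1.tv_nsec - t0.tv_nsec) / 1e6;

    if (rc == 0)
        return 0;
    return (errno == ENOSYS) ? 1 : -1;
#else
    /* No sync_file_range here; the caller just relies on the later fsync. */
    return 1;
#endif
}
```

Calling this on each file that will later be fsync'd, with a lot of dirty data
in the kernel cache, and logging elapsed_ms should show whether the call
returns promptly or stalls while the request queue is full.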
Andres