Heikki Linnakangas <hlinnakangas@vmware.com> writes:
> On 01/15/2014 07:50 AM, Dave Chinner wrote:
>> FWIW [and I know you're probably sick of hearing this by now], but
>> the blk-io throttling works almost perfectly with applications that
>> use direct IO.....
> For checkpoint writes, direct I/O actually would be reasonable.
> Bypassing the OS cache is a good thing in that case - we don't want the
> written pages to evict other pages from the OS cache, as we already have
> them in the PostgreSQL buffer cache.
But in exchange for that, we'd have to deal with selecting an order in
which to write pages that's appropriate for the filesystem layout,
whatever else is happening on the system, etc etc. We don't want to
build an I/O scheduler, IMO, but we'd have to.
> Writing one page at a time with O_DIRECT from a single process might be
> quite slow, so we'd probably need to use writev() or asynchronous I/O to
> work around that.
Yeah, and if the system has multiple spindles, we'd need to be issuing
multiple O_DIRECT writes concurrently, no?
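Just to sketch what that might look like: below is a rough, untested
illustration (Linux-specific, using libaio, link with -laio; the function
name and constants are made up for this example, not anything in our code)
of queueing a batch of page-sized O_DIRECT writes concurrently and waiting
for them. Note the alignment O_DIRECT demands of buffers and file offsets;
all error checking is omitted:

#define _GNU_SOURCE             /* for O_DIRECT */
#include <fcntl.h>
#include <libaio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLCKSZ   8192           /* page size, for illustration only */
#define NWRITES  16             /* writes to keep in flight */

static void
write_batch_direct(const char *path)
{
    int              fd = open(path, O_WRONLY | O_DIRECT);
    io_context_t     ctx = 0;
    struct iocb      cbs[NWRITES];
    struct iocb     *cbps[NWRITES];
    void            *bufs[NWRITES];
    struct io_event  events[NWRITES];

    io_setup(NWRITES, &ctx);            /* create a kernel AIO context */

    for (int i = 0; i < NWRITES; i++)
    {
        /*
         * O_DIRECT wants buffers and offsets aligned, typically to 512B
         * or the filesystem block size; align to the page size to be safe.
         */
        posix_memalign(&bufs[i], BLCKSZ, BLCKSZ);
        memset(bufs[i], 0, BLCKSZ);     /* stand-in for a dirty page image */

        io_prep_pwrite(&cbs[i], fd, bufs[i], BLCKSZ, (long long) i * BLCKSZ);
        cbps[i] = &cbs[i];
    }

    /* Hand all the writes to the kernel at once, so it can reorder/merge. */
    io_submit(ctx, NWRITES, cbps);

    /* Wait for every write to complete. */
    io_getevents(ctx, NWRITES, NWRITES, events, NULL);

    for (int i = 0; i < NWRITES; i++)
        free(bufs[i]);
    io_destroy(ctx);
    close(fd);
}

Even in this toy form, the scheduling questions (how many writes to keep
in flight, per spindle, per tablespace) land right back in our lap.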
What we'd really like for checkpointing is to hand the kernel a boatload
(several GB) of dirty pages and say "how about you push all this to disk
over the next few minutes, in whatever way seems optimal given the storage
hardware and system situation. Let us know when you're done." Right now,
because there's no way to negotiate such behavior, we're reduced to having
to dribble out the pages (in what's very likely a non-optimal order) and
hope that the kernel is neither too lazy nor too aggressive about cleaning
dirty pages in its caches.
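The nearest thing to that today, AFAIK, is sync_file_range(): after
write()ing a batch of pages through the OS cache, you can at least ask the
kernel to start writeback on that range without waiting for it. A minimal
Linux-only sketch (hypothetical helper name, no error handling):

#define _GNU_SOURCE
#include <fcntl.h>

/*
 * Ask the kernel to begin flushing an already-written range to disk,
 * without blocking for completion.  This only initiates writeback for
 * one range of one file; the ordering and pacing across the whole
 * checkpoint remain our problem.
 */
static void
hint_writeback(int fd, off_t offset, off_t nbytes)
{
    sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);
}

That's still a long way from "here's several GB, write it out however
suits the hardware".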
regards, tom lane