Re: Why we are going to have to go DirectIO - Mailing list pgsql-hackers
From | KONDO Mitsumasa |
---|---|
Subject | Re: Why we are going to have to go DirectIO |
Date | |
Msg-id | 52A55D87.4040700@lab.ntt.co.jp |
In response to | Re: Why we are going to have to go DirectIO (Greg Stark <stark@mit.edu>) |
List | pgsql-hackers |
(2013/12/05 23:42), Greg Stark wrote:
> On Thu, Dec 5, 2013 at 8:35 AM, KONDO Mitsumasa
> <kondo.mitsumasa@lab.ntt.co.jp> wrote:
>> Yes. And using something efficiently DirectIO is more difficult than
>> BufferedIO.
>> If we change write() flag with direct IO in PostgreSQL, it will execute
>> hardest ugly randomIO.
>
> Using DirectIO presumes you're using libaio or threads to implement
> prefetching and asynchronous I/O scheduling.
>
> I think in the long term there are only two ways to go here. Either a)
> we use DirectIO and implement an I/O scheduler in Postgres or b) We
> use mmap and use new system calls to give the kernel all the
> information Postgres has available to it to control the I/O scheduler.

I agree with part of method (b). As others have said, I don't think the mmap
API is intended for controlling I/O. I think posix_fadvise(),
sync_file_range() and fallocate() are an easier way to realize a better I/O
scheduler in Postgres. These system calls do not cause data corruption at all,
and we can simply use the existing implementation. They affect only
performance.

My survey of posix_fadvise() and sync_file_range() is below. The rules are
simple.
# Almost all of my explanation is written in the Linux man pages :-)

* Optimize readahead in the OS [ posix_fadvise() ]
  These options are mainly for read performance.

  - POSIX_FADV_SEQUENTIAL
     -> The OS readahead parameter becomes its maximum.
  - POSIX_FADV_RANDOM
     -> The OS does not use readahead. It can take file cache access
        frequency into account and use the file cache efficiently.
  - POSIX_FADV_NORMAL
     -> The OS readahead parameter is tuned dynamically for each situation.
        If you cannot decide on a disk access strategy, you can select this
        option. It should work well in most cases.

* Control dirty or clean buffers in the OS [ posix_fadvise() and sync_file_range() ]
  These options are for controlling write and read performance through the OS
  file cache.

  - POSIX_FADV_DONTNEED
     -> Drop the file cache. If it is dirty, write it to disk and then drop
        it. If it isn't dirty, it is simply dropped from the OS file cache.
  - sync_file_range()
     -> If you want to write dirty buffers to disk but keep them in the OS
        file cache, you can use this system call. It can also control the
        amount of data written.
  - POSIX_FADV_NOREUSE
     -> If you think the file cache will not be needed again, you can set
        this option. The file cache will be dropped soon.
  - POSIX_FADV_WILLNEED
     -> If you think the file cache will be important, you can set this
        option. The file cache will tend to remain in the OS file cache.

That's all.

The OS kernel cannot perfectly predict the I/O pattern of each piece of
middleware, so it is optimized with general heuristic algorithms. I think that
is the right approach. However, PostgreSQL can predict I/O patterns in parts
of the planner, executor and checkpointer, so we had better set the optimal
posix_fadvise() flag or call sync_file_range() before/after executing the
ordinary I/O system calls. I think that would be a good way to control and
schedule I/O without unreliable implementations.

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center
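
As an illustration of the proposal above, here is a minimal C sketch of how a caller might issue these hints around ordinary buffered I/O. It is not taken from PostgreSQL: the `io_pattern` enum and the helper names `hint_read_pattern`, `start_writeback` and `drop_from_cache` are hypothetical, and only `posix_fadvise()`, `sync_file_range()` and `fdatasync()` are real system interfaces. The kernel treats the advice values as hints, so the actual effect depends on the kernel version and filesystem. Because `sync_file_range()` is Linux-specific, the sketch assumes a fallback to `fdatasync()` where it is not available.

```c
#define _GNU_SOURCE             /* exposes sync_file_range() on Linux/glibc */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical labels for an access pattern predicted by the caller
 * (for example by a planner or executor). */
typedef enum {
    IO_PATTERN_UNKNOWN,
    IO_PATTERN_SEQUENTIAL,
    IO_PATTERN_RANDOM
} io_pattern;

/* Translate a predicted pattern into a readahead hint for the whole file.
 * Returns 0 on success or an errno value, as posix_fadvise() does. */
int hint_read_pattern(int fd, io_pattern pattern)
{
    int advice = POSIX_FADV_NORMAL;   /* let the kernel decide */

    assert(fd >= 0);
    if (pattern == IO_PATTERN_SEQUENTIAL)
        advice = POSIX_FADV_SEQUENTIAL;
    else if (pattern == IO_PATTERN_RANDOM)
        advice = POSIX_FADV_RANDOM;

    /* offset 0 with len 0 means "from the start to the end of the file" */
    return posix_fadvise(fd, 0, 0, advice);
}

/* Ask the kernel to start writing back dirty pages in the given range
 * while leaving them in the page cache. Returns 0 or an errno value. */
int start_writeback(int fd, off_t offset, off_t nbytes)
{
#ifdef SYNC_FILE_RANGE_WRITE
    if (sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE) != 0)
        return errno;
    return 0;
#else
    /* Assumed fallback: flush the data of the whole file. This is coarser
     * than a range flush and blocks until the write completes. */
    (void) offset;
    (void) nbytes;
    return fdatasync(fd) != 0 ? errno : 0;
#endif
}

/* Tell the kernel that cached pages in the given range are no longer
 * needed. Returns 0 or an errno value. */
int drop_from_cache(int fd, off_t offset, off_t nbytes)
{
    return posix_fadvise(fd, offset, nbytes, POSIX_FADV_DONTNEED);
}
```

A caller would typically invoke `hint_read_pattern()` once after opening a file and before the first `read()`, and call `start_writeback()` after a batch of `write()` calls so that the amount of dirty data handed to the kernel at once stays bounded. Note that `SYNC_FILE_RANGE_WRITE` alone only initiates write-out and gives no durability guarantee, so a later `fsync()` is still needed where durability matters.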