Re: Why we are going to have to go DirectIO - Mailing list pgsql-hackers

From KONDO Mitsumasa
Subject Re: Why we are going to have to go DirectIO
Msg-id 52A55D87.4040700@lab.ntt.co.jp
In response to Re: Why we are going to have to go DirectIO  (Greg Stark <stark@mit.edu>)
List pgsql-hackers
(2013/12/05 23:42), Greg Stark wrote:
> On Thu, Dec 5, 2013 at 8:35 AM, KONDO Mitsumasa
> <kondo.mitsumasa@lab.ntt.co.jp> wrote:
>> Yes. And using something efficiently DirectIO is more difficult than
>> BufferedIO.
>> If we change write() flag with direct IO in PostgreSQL, it will execute
>> hardest ugly randomIO.
>
> Using DirectIO presumes you're using libaio or threads to implement
> prefetching and asynchronous I/O scheduling.
>
> I think in the long term there are only two ways to go here. Either a)
> we use DirectIO and implement an I/O scheduler in Postgres or b) We
> use mmap and use new system calls to give the kernel all the
> information Postgres has available to it to control the I/O scheduler.
I agree with part of method (b). As others have said, I don't think the mmap
API is meant for controlling I/O. I think posix_fadvise(), sync_file_range()
and fallocate() are an easier way to get better I/O scheduling in Postgres.
These system calls do not cause data corruption at all, and we can just use
the existing implementations. They affect only performance.

Here is my survey of posix_fadvise() and sync_file_range(). The rules are simple.
#Almost all of my explanation is written in the Linux man pages :-)

* Optimize readahead in OS [ posix_fadvise() ]
  These options are mainly for read performance.

  - POSIX_FADV_SEQUENTIAL flag
    -> The OS readahead window is enlarged (on Linux, to twice the default
       size).
  - POSIX_FADV_RANDOM flag
    -> Don't use OS readahead. This avoids filling the file cache with pages
       that will not be read, so the file cache is used more efficiently.
  - POSIX_FADV_NORMAL flag
    -> The OS optimizes readahead dynamically for each situation. If we
       cannot judge the disk access strategy, we can select this option. It
       should work well in almost all cases.
 

* Control dirty or clean buffers in OS [ posix_fadvise() and sync_file_range() ]
  These options control write and read performance through the OS file
  cache.

  - POSIX_FADV_DONTNEED
    -> Drop the file cache. Dirty pages have to be written to disk before
       they can be dropped (the Linux man page recommends fsync() or
       fdatasync() first to be sure they are released). Clean pages are
       simply dropped from the OS file cache.
  - sync_file_range()
    -> If you want to write dirty buffers to disk but keep them in the OS
       file cache, you can select this system call. It can also control the
       amount of data written.
  - POSIX_FADV_NOREUSE
    -> If you think the file cache will not be needed again, you can set
       this option. It hints that the data is accessed only once, so the
       file cache can be dropped soon.
  - POSIX_FADV_WILLNEED
    -> If you think the file cache will be important, you can set this
       option. The kernel starts reading the range into the file cache, so
       the data tends to be in the OS file cache when it is needed.
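
The two write-side calls above can be combined as in the following sketch. The function names are hypothetical; sync_file_range() is Linux-specific and needs _GNU_SOURCE with glibc. Because the Linux man page says pages not yet written out are unaffected by POSIX_FADV_DONTNEED, this sketch flushes the range first and only then asks for it to be dropped. Note that sync_file_range() does not flush file metadata, so it is assumed here to be a writeback-pacing tool and not a replacement for fsync().

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE               /* needed for sync_file_range() on glibc */
#endif
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>

/*
 * Write the dirty pages of [offset, offset + nbytes) to disk and wait for
 * completion, while leaving the pages in the OS file cache.
 * Returns 0 on success or an errno value.
 */
int flush_range_keep_cached(int fd, off_t offset, off_t nbytes)
{
    unsigned int flags = SYNC_FILE_RANGE_WAIT_BEFORE
                       | SYNC_FILE_RANGE_WRITE
                       | SYNC_FILE_RANGE_WAIT_AFTER;

    if (sync_file_range(fd, offset, nbytes, flags) != 0)
        return errno;
    return 0;
}

/*
 * Flush the range as above, then hint that the now-clean pages are no
 * longer needed so the kernel may drop them from the file cache.
 * Returns 0 on success or an error number.
 */
int flush_range_and_drop(int fd, off_t offset, off_t nbytes)
{
    int rc = flush_range_keep_cached(fd, offset, nbytes);

    if (rc != 0)
        return rc;
    /* posix_fadvise() returns the error number directly. */
    return posix_fadvise(fd, offset, nbytes, POSIX_FADV_DONTNEED);
}
```

Calling flush_range_keep_cached() on small ranges as they are written is one way to limit how much dirty data accumulates before a later fsync().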
 


That's all.

The OS kernel cannot perfectly predict the I/O pattern of each piece of
middleware, so it is optimized with general heuristic algorithms. I think that
is the right approach. However, PostgreSQL can predict I/O patterns in parts of
the planner, executor and checkpointer, so we had better set the optimum
posix_fadvise() flag or call sync_file_range() before/after executing the
ordinary I/O system calls. I think this would be a good method of I/O control
and scheduling without unreliable implementations.
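
To illustrate "advise before the ordinary I/O call", here is a small sketch in C: when the list of blocks to be read is known in advance, POSIX_FADV_WILLNEED is issued for each of them, and the blocks are then read with plain pread(). The function names and SKETCH_BLOCK_SIZE are assumptions made for this example (8192 is chosen to match PostgreSQL's default block size); this is not taken from the PostgreSQL source.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* exposes posix_fadvise() and pread() */
#endif
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

#define SKETCH_BLOCK_SIZE 8192    /* assumed block size for this example */

/*
 * Ask the kernel to start reading the listed blocks into the file cache.
 * Returns how many hints were accepted; a rejected hint is not fatal,
 * because the later pread() still works without it.
 */
size_t prefetch_blocks(int fd, const unsigned *blocks, size_t nblocks)
{
    size_t issued = 0;

    for (size_t i = 0; i < nblocks; i++) {
        off_t off = (off_t) blocks[i] * SKETCH_BLOCK_SIZE;

        if (posix_fadvise(fd, off, SKETCH_BLOCK_SIZE, POSIX_FADV_WILLNEED) == 0)
            issued++;
    }
    return issued;
}

/* Ordinary buffered read of one block; returns bytes read or -1 on error. */
ssize_t read_block(int fd, unsigned blockno, char *buf)
{
    return pread(fd, buf, SKETCH_BLOCK_SIZE, (off_t) blockno * SKETCH_BLOCK_SIZE);
}
```

The reads themselves are unchanged buffered I/O, so correctness does not depend on the hints; they only give the kernel a chance to overlap the disk reads with other work.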

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center


