Re: Linux kernel impact on PostgreSQL performance - Mailing list pgsql-hackers
From | Mel Gorman |
---|---|
Subject | Re: Linux kernel impact on PostgreSQL performance |
Date | |
Msg-id | 20140115100844.GG4963@suse.de |
In response to | Re: Linux kernel impact on PostgreSQL performance (Jeff Janes <jeff.janes@gmail.com>) |
Responses | Re: Linux kernel impact on PostgreSQL performance |
List | pgsql-hackers |
On Tue, Jan 14, 2014 at 09:30:19AM -0800, Jeff Janes wrote:
> > > What's not so simple, is figuring out what policy to use. Remember,
> > > you cannot tell the kernel to put some page in its page cache without
> > > reading it or writing it. So, once you make the kernel forget a page,
> > > evicting it from shared buffers becomes quite expensive.
> >
> > posix_fadvise(POSIX_FADV_WILLNEED) is meant to cover this case by
> > forcing readahead.
>
> But telling the kernel to forget a page, then telling it to read it in
> again from disk because it might be needed again in the near future is
> itself very expensive. We would need to hand the page to the kernel so it
> has it without needing to go to disk to get it.
>

Yes, this is the unnecessary IO cost I was thinking of.

> > If you evict it prematurely then you do get kinda
> > screwed because you pay the IO cost to read it back in again even if you
> > had enough memory to cache it. Maybe this is the type of kernel-postgres
> > interaction that is annoying you.
> >
> > If you don't evict, the kernel eventually steps in and evicts the wrong
> > thing. If you do evict and it was unnecessarily you pay an IO cost.
> >
> > That could be something we look at. There are cases buried deep in the
> > VM where pages get shuffled to the end of the LRU and get tagged for
> > reclaim as soon as possible. Maybe you need access to something like
> > that via posix_fadvise to say "reclaim this page if you need memory but
> > leave it resident if there is no memory pressure" or something similar.
> > Not exactly sure what that interface would look like or offhand how it
> > could be reliably implemented.
> >
>
> I think the "reclaim this page if you need memory but leave it resident if
> there is no memory pressure" hint would be more useful for temporary
> working files than for what was being discussed above (shared buffers).
>
> When I do work that needs large temporary files, I often see physical
> write IO spike but physical read IO does not. I interpret that to mean
> that the temporary data is being written to disk to satisfy either
> dirty_expire_centisecs or dirty_*bytes, but the data remains in the FS
> cache and so disk reads are not needed to satisfy it. So a hint that says
> "this file will never be fsynced so please ignore dirty_*bytes and
> dirty_expire_centisecs.

It would be good to know if dirty_expire_centisecs or dirty ratio|bytes
were the problem here.

An interface that forces a dirty page to stay dirty regardless of the
global system would be a major hazard. It potentially allows the creator
of the temporary file to stall all other processes dirtying pages for an
unbounded period of time.

I proposed in another part of the thread a hint for open inodes to have
the background writer thread ignore dirty pages belonging to that inode.
Dirty limits and fsync would still be obeyed. It might also be workable
for temporary files but the proposal could be full of holes.

Your alternative here is to create a private anonymous mapping as they
are not subject to dirty limits. This is only a sensible option if the
temporary data is guaranteed to be relatively small. If the shared
buffers, page cache and your temporary data exceed the size of RAM then
data will get discarded or your temporary data will get pushed to swap
and performance will hit the floor.

FWIW, the performance of some IO "benchmarks" used to depend on whether
they could create, write and delete files before any of the data actually
hit the disk -- pretty much exactly the type of behaviour you are looking
for.

-- 
Mel Gorman
SUSE Labs
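
For illustration only, the private anonymous mapping alternative mentioned
in the message could be sketched roughly as below. This is a minimal sketch
and not anything PostgreSQL does: the helper name make_scratch_buffer and
the MAX_SCRATCH_BYTES cap are hypothetical, since the message only says the
temporary data must be "relatively small". It assumes a Linux or other Unix
host where Python's mmap module exposes MAP_PRIVATE and MAP_ANONYMOUS.

```python
import mmap

# Hypothetical cap, chosen only for this sketch. The message gives no number;
# it only warns that large temporary data would end up in swap.
MAX_SCRATCH_BYTES = 64 * 1024 * 1024


def make_scratch_buffer(size_bytes):
    """Return a private anonymous mapping for small temporary data."""
    if not 0 < size_bytes <= MAX_SCRATCH_BYTES:
        raise ValueError("scratch size must be between 1 and MAX_SCRATCH_BYTES")
    # fileno -1 means there is no backing file. MAP_PRIVATE | MAP_ANONYMOUS
    # asks for process-private memory, so there are no dirty file pages for
    # the kernel to write back; under memory pressure it can only go to swap.
    return mmap.mmap(-1, size_bytes, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)


if __name__ == "__main__":
    buf = make_scratch_buffer(1024 * 1024)
    buf[:9] = b"temp data"   # write scratch data as if it were a temp file
    print(buf[:9])
    buf.close()              # unmapping discards the data without any disk IO
```

The size check is there because of the caveat in the message: once shared
buffers, page cache and the temporary data together exceed RAM, the mapping
gets pushed to swap, so a caller would presumably fall back to an ordinary
temporary file above some threshold.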