Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance - Mailing list pgsql-hackers
From | Heikki Linnakangas |
---|---|
Subject | Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance |
Date | |
Msg-id | 52D56DE1.6070009@vmware.com |
In response to | Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance (Tom Lane <tgl@sss.pgh.pa.us>) |
List | pgsql-hackers |
On 01/14/2014 06:08 PM, Tom Lane wrote:
> Trond Myklebust <trondmy@gmail.com> writes:
>> On Jan 14, 2014, at 10:39, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> "Don't be aggressive" isn't good enough. The prohibition on early write
>>> has to be absolute, because writing a dirty page before we've done
>>> whatever else we need to do results in a corrupt database. It has to
>>> be treated like a write barrier.
>
>> Then why are you dirtying the page at all? It makes no sense to tell the kernel “we’re changing this page in the pagecache, but we don’t want you to change it on disk”: that’s not consistent with the function of a page cache.
>
> As things currently stand, we dirty the page in our internal buffers,
> and we don't write it to the kernel until we've written and fsync'd the
> WAL data that needs to get to disk first. The discussion here is about
> whether we could somehow avoid double-buffering between our internal
> buffers and the kernel page cache.

To be honest, I think the impact of double buffering in real-life applications is greatly exaggerated. If you follow the usual guideline and configure shared_buffers to 25% of available RAM, at worst you're wasting 25% of RAM to double buffering. That's significant, but it's not the end of the world, and it's a problem that can be compensated for by simply buying more RAM.

Of course, if someone can come up with an easy way to solve that, that'd be great, but if it means giving up other advantages that we get from relying on the OS page cache, then -1 from me. The usual response to "why don't you just use O_DIRECT?" is that it'd require reimplementing a lot of I/O infrastructure, but that misses an IMHO more important point: it would require setting shared_buffers a lot higher to get the same level of performance you get today. That has a number of problems:

1. It becomes a lot more important to tune shared_buffers correctly. Set it too low, and you're not taking advantage of all the RAM available. Set it too high, and you'll start swapping, totally killing performance. I can already hear consultants rubbing their hands, waiting for the rush of customers that will need expert help to determine the optimal shared_buffers setting.

2. Memory spent on the buffer cache can't be used for other things. For example, an index build can temporarily allocate several gigabytes of memory; if that memory is allocated to the shared buffer cache, it can't be used for that purpose. Yeah, we could change that, and allow borrowing pages from the shared buffer cache for other purposes, but that means more work and more code.

3. Memory used for the shared buffer cache can't be used by other processes (without swapping). It becomes a lot harder to be a good citizen on a system that's not entirely dedicated to PostgreSQL.

So not only would we need to re-implement I/O infrastructure, we'd also need to make memory management a lot smarter and a lot more flexible. We'd need a lot more information on what else is running on the system and how badly they need memory.

> I personally think there is no chance of using mmap for that; the
> semantics of mmap are pretty much dictated by POSIX and they don't work
> for this.

Agreed. It would be possible to use mmap() for pages that are not modified, though. When you're not modifying, you could mmap() the data you need, and bypass the PostgreSQL buffer cache that way. The interaction with the buffer cache becomes complicated, because you couldn't use the buffer cache's locks etc., and some pages might have a newer version in the buffer cache than on disk, but it might be doable.

- Heikki
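
The read-only mmap() idea in the last paragraph can be illustrated with a minimal sketch. This is not PostgreSQL code: it is Python rather than C, the helper names `read_page_via_mmap` and `make_demo_file` are hypothetical, and the 8192-byte page size is assumed to match PostgreSQL's default block size. It only shows that a read-only mapping hands back page contents straight from the kernel page cache, with no private copy kept in an application buffer pool. It does nothing about the hard part mentioned above, namely pages whose newest version lives only in shared buffers.

```python
import mmap
import os
import tempfile

PAGE_SIZE = 8192  # assumption: PostgreSQL's default block size


def read_page_via_mmap(path, page_no, page_size=PAGE_SIZE):
    """Return the bytes of one page of `path` using a read-only mapping.

    Hypothetical helper: the caller must already know that the page has
    no newer, unwritten version in an application-level buffer cache.
    """
    with open(path, "rb") as f:
        file_size = os.fstat(f.fileno()).st_size
        offset = page_no * page_size
        if offset + page_size > file_size:
            raise ValueError("page lies beyond the end of the file")
        # ACCESS_READ gives a read-only mapping backed by the kernel
        # page cache; length 0 maps the whole file.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            return m[offset:offset + page_size]


def make_demo_file(n_pages, page_size=PAGE_SIZE):
    """Create a temporary file where page i is filled with byte value i."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        for i in range(n_pages):
            f.write(bytes([i]) * page_size)
    return path


if __name__ == "__main__":
    demo_path = make_demo_file(3)
    try:
        page = read_page_via_mmap(demo_path, 1)
        print(len(page), page[:4])
    finally:
        os.remove(demo_path)
```

Slicing the mapping copies the page out, which keeps the sketch simple; a real reader would more likely work on the mapped memory in place, and would need its own answer to the locking question raised above.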