Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance - Mailing list pgsql-hackers
From:            Dave Chinner
Subject:         Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance
Msg-id:          20140114010946.GA3431@dastard
In response to:  Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance (Greg Stark <stark@mit.edu>)
Responses:       Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance
                 Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance
List:            pgsql-hackers
On Mon, Jan 13, 2014 at 09:29:02PM +0000, Greg Stark wrote:
> On Mon, Jan 13, 2014 at 9:12 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > For one, postgres doesn't use mmap for files (and can't without major
> > new interfaces). Frequently mmap()/madvise()/munmap()ing 8kb chunks has
> > horrible consequences for performance/scalability - very quickly you
> > contend on locks in the kernel.
>
> I may as well dump this in this thread. We've discussed this in person
> a few times, including at least once with Ted T'so when he visited
> Dublin last year.
>
> The fundamental conflict is that the kernel understands better the
> hardware and other software using the same resources, Postgres
> understands better its own access patterns. We need to either add
> interfaces so Postgres can teach the kernel what it needs about its
> access patterns or add interfaces so Postgres can find out what it
> needs to know about the hardware context.

In my experience applications don't need to know anything about the
underlying storage hardware - all they need is for someone to tell
them the optimal IO size and alignment to use.

> The more ambitious and interesting direction is to let Postgres tell
> the kernel what it needs to know to manage everything. To do that we
> would need the ability to control when pages are flushed out. This is
> absolutely necessary to maintain consistency. Postgres would need to
> be able to mark pages as unflushable until some point in time in the
> future when the journal is flushed. We discussed various ways that
> interface could work but it would be tricky to keep it low enough
> overhead to be workable.

IMO, the concept of allowing userspace to pin dirty page cache pages
in memory is just asking for trouble. Apart from the obvious memory
reclaim and OOM issues, some filesystems won't be able to move their
journals forward until the data is flushed. i.e. ordered mode data
writeback on ext3 will have all sorts of deadlock issues that result
from pinning pages and then issuing fsync() on another file which
will block waiting for the pinned pages to be flushed.

Indeed, what happens if you do pin_dirty_pages(fd); .... fsync(fd);?
If fsync() blocks because there are pinned pages, and there's no
other thread to unpin them, then that code just deadlocked. If
fsync() doesn't block and skips the pinned pages, then we haven't
done an fsync() at all, and so violated the expectation that users
have that after fsync() returns their data is safe on disk. And if
we return an error to fsync(), then what the hell does the user do
if it is some other application we don't know about that has pinned
the pages? And if the kernel unpins them after some time, then we
just violated the application's consistency guarantees....

Hmmmm. What happens if the process crashes after pinning the dirty
pages? How do we even know what process pinned the dirty pages so we
can clean up after it? What happens if the same page is pinned by
multiple processes? What happens on truncate/hole punch if the
partial pages in the range that need to be zeroed and written are
pinned? What happens if we do direct IO to a range with pinned,
unflushable pages in the page cache?

These are all complex corner cases that are introduced by allowing
applications to pin dirty pages in memory. I've only spent a few
minutes coming up with these, and I'm sure there's more of them. As
such, I just don't see that allowing userspace to pin dirty page
cache pages in memory being a workable solution.
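To make that ordering concrete, here is a minimal single-threaded
sketch of the pin-then-fsync sequence being discussed. Note that
pin_dirty_pages() is purely hypothetical - no such interface exists
in Linux - so it is stubbed as a no-op below; the sketch only
illustrates the call sequence that has no good answer once fsync()
is reached, not any real behaviour:

	/*
	 * Sketch of the problem ordering only. pin_dirty_pages() is a
	 * hypothetical interface - nothing like it exists in Linux - so
	 * it is stubbed here purely to show the call sequence.
	 */
	#include <fcntl.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>

	static int pin_dirty_pages(int fd)	/* hypothetical, stubbed */
	{
		(void)fd;
		return 0;
	}

	int main(void)
	{
		char buf[8192] = { 0 };
		int fd = open("datafile", O_CREAT | O_RDWR, 0600);

		if (fd < 0 || write(fd, buf, sizeof(buf)) != sizeof(buf)) {
			perror("write");
			exit(1);
		}

		pin_dirty_pages(fd);	/* pages now "unflushable" */

		/*
		 * Single-threaded caller, nothing will ever unpin:
		 *  - if fsync() waits for pinned pages -> deadlock
		 *  - if fsync() skips pinned pages     -> data not durable
		 *  - if fsync() returns an error       -> caller can't fix it
		 */
		if (fsync(fd) != 0)
			perror("fsync");

		close(fd);
		return 0;
	}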
> The less exciting, more conservative option would be to add kernel
> interfaces to teach Postgres about things like raid geometries. Then

/sys/block/<dev>/queue/* contains all the information that is
exposed to filesystems to optimise layout for storage geometry. Some
filesystems can already expose the relevant parts of this
information to userspace, others don't.

What I think we really need to provide is a generic interface
similar to the old XFS_IOC_DIOINFO ioctl that can be used to expose
IO characteristics to applications in a simple, easy to gather
manner. Something like:

	struct io_info {
		u64 minimum_io_size;	  /* sector size */
		u64 maximum_io_size;	  /* currently 2GB */
		u64 optimal_io_size;	  /* stripe unit/width */
		u64 optimal_io_alignment; /* stripe unit/width */
		u64 mem_alignment;	  /* PAGE_SIZE */
		u32 queue_depth;	  /* max IO concurrency */
	};

> Postgres could use directio and decide to do prefetching based on the
> raid geometry,

Underlying storage array raid geometry and optimal IO sizes for the
filesystem may be different. Hence you want what the filesystem
considers optimal, not what the underlying storage is configured
with. Indeed, a filesystem might be able to supply per-file IO
characteristics depending on where it is located in the filesystem
(think tiered storage)....

> how much available i/o bandwidth and iops is available,
> etc.

The kernel doesn't really know what a device is capable of - it can
only measure what the current IO workload is achieving - and it can
change based on the IO workload characteristics. Hence applications
can track this as well as the kernel does if they need this
information for any reason.

> Reimplementing i/o schedulers and all the rest of the work that the

Nobody needs to reimplement IO schedulers in userspace. Direct IO
still goes through the block layers where all that merging and IO
scheduling occurs.

> kernel provides inside Postgres just seems like something outside our
> competency and that none of us is really excited about doing.

That argument goes both ways - providing fine-grained control over
the page cache contents to userspace doesn't get me excited, either.
In fact, it scares the living daylights out of me. It's complex,
it's fragile and it introduces constraints into everything we do in
the kernel. Any one of those reasons is grounds for saying no to a
proposal, but this idea hits the trifecta....

I'm not saying that O_DIRECT is easy or perfect, but it seems to me
to be a more robust, secure, maintainable and simpler solution than
trying to give applications direct control over complex internal
kernel structures and algorithms.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
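As an illustration of the /sys/block/<dev>/queue/* point above, the
block layer's IO geometry hints are already readable from sysfs
today. A minimal sketch follows; the device name "sda" is only an
example, and a real application would first have to work out which
device actually backs its data files:

	/*
	 * Read a few of the IO geometry hints the block layer exports
	 * under /sys/block/<dev>/queue/. "sda" is an example device.
	 */
	#include <stdio.h>

	static long read_queue_limit(const char *dev, const char *name)
	{
		char path[256];
		long val = -1;
		FILE *f;

		snprintf(path, sizeof(path),
			 "/sys/block/%s/queue/%s", dev, name);
		f = fopen(path, "r");
		if (f) {
			if (fscanf(f, "%ld", &val) != 1)
				val = -1;
			fclose(f);
		}
		return val;
	}

	int main(void)
	{
		const char *dev = "sda";	/* example device */

		printf("logical_block_size: %ld\n",
		       read_queue_limit(dev, "logical_block_size"));
		printf("minimum_io_size:    %ld\n",
		       read_queue_limit(dev, "minimum_io_size"));
		printf("optimal_io_size:    %ld\n",
		       read_queue_limit(dev, "optimal_io_size"));
		return 0;
	}

This only reports what the block device advertises; as noted above,
what the filesystem considers optimal for a given file may differ.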