Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance (summary v2 2014-1-17) - Mailing list pgsql-hackers
From: Mel Gorman
Subject: Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance (summary v2 2014-1-17)
Date:
Msg-id: 20140117163148.GA4963@suse.de
In response to: Re: Linux kernel impact on PostgreSQL performance (summary v1 2014-1-15) (Mel Gorman <mgorman@suse.de>)
Responses: Re: Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance (summary v2 2014-1-17)
List: pgsql-hackers
On Wed, Jan 15, 2014 at 02:14:08PM +0000, Mel Gorman wrote:
> > One assumption would be that Postgres is perfectly happy with the current
> > kernel behaviour in which case our discussion here is done.
>
> It has been demonstrated that this statement was farcical. The thread is
> massive just from interaction with the LSF/MM program committee. I'm hoping
> that there will be Postgres representation at LSF/MM this year to bring
> the issues to a wider audience. I expect that LSF/MM can only commit to
> one person attending the whole summit due to limited seats but we could
> be more flexible for the Postgres track itself so informal meetings can
> be arranged for the evenings and at collab summit.

We still have not decided on a person that can definitely attend but we'll
get back to that shortly. I wanted to revise the summary mail so that there
is a record that can be easily digested without trawling through the
archives. As before, if I missed something important, prioritised poorly
or emphasised incorrectly then shout at me.

On testing of modern kernels
----------------------------

Josh Berkus claims that most people are using Postgres with 2.6.19 and
consequently there may be poor awareness of recent kernel developments.
This is a disturbingly large window of opportunity for problems to have
been introduced.

Minimally, Postgres has concerns about IO-related stalls which may or may
not exist in current kernels. There were indications that large writes
starve reads. There have been variants of this style of bug in the past
but it is unclear what the exact shape of this problem is and whether
IO-less dirty throttling affected it.

It is possible that Postgres was burned in the past by data being written
back from reclaim context in low memory situations. That would have looked
like massive stalls with drops in IO throughput, but it was fixed in
relatively recent kernels. Any data on historical tests would be helpful.
Alternatively, a pgbench-based reproduction test could potentially be used
by people in the kernel community that track performance over time and
have access to a suitable testing rig.

Postgres bug reports and LKML
-----------------------------

It is claimed that LKML does not welcome bug reports, but it is less clear
what the basis of this claim is. Is it because the reports are ignored? A
possible explanation is that they are simply getting lost in the LKML
noise and there would be better luck if the bug report was cc'd to a
specific subsystem list. A second possibility is that the bug report is
against an old kernel, and unless it is reproduced on a recent kernel the
report will be ignored. Finally, it is possible that there is not enough
data available to debug the problem. The worst explanation is that to date
the problem has not been fixable but the details of this have been lost
and are now unknown. Is it possible that some of these bug reports can be
refreshed so that at least there is a chance they get addressed?

Apparently there were changes to the reclaim algorithms that crippled
performance without any sysctls to control the behaviour. The problem may
be compounded by the introduction of adaptive replacement cache in the
shape of the thrash detection patches currently being reviewed. Postgres
investigated the use of ARC in the past and ultimately abandoned it.
Details are in the archives
(http://www.postgresql.org/search/?m=1&q=arc&l=1&d=-1&s=r). I have not
read them, just noting they exist for future reference.
Sysctls to control VM behaviour are not popular, as such tuning parameters
are often used as an excuse not to fix the underlying problem properly.
Would it be possible to describe a test case that shows 2.6.19 performing
well and a modern kernel failing? That would give the VM people a concrete
basis to work from to either fix the problem or identify exactly what
sysctls are required to make this work. I am confident that any bug
related to VM reclaim in this area has been lost. At least, I recall no
instances of it being discussed on linux-mm and it has not featured at
LSF/MM during the last years.

IO Scheduling
-------------

Kevin Grittner has stated that it is known that the DEADLINE and NOOP
schedulers perform better than any alternatives for most database loads.
It would be desirable to quantify this for some test case and see whether
the default scheduler can be made to cope in some way.

The deadline scheduler makes sense to a large extent though. Postgres is
sensitive to large latencies due to IO write spikes. It is at least
plausible that deadline would give more deterministic behaviour for
parallel reads in the presence of large writes, assuming there were no
ordering problems between the reads/writes and the underlying filesystem.

For reference, these IO spikes can be massive. If the shared buffer is
completely dirtied in a short space of time then it could be 20-25% of
RAM being dirtied and requiring writeback in typical configurations.
There have been cases where this was worked around by limiting the size
of the shared buffer to something small enough that it can be written
back quickly. There are other tuning options available, such as altering
when dirty background writing starts within the kernel, but that will
not help if the dirtying happens in a very short space of time. Dave
Chinner described the considerations as follows:

    There's no absolute rule here, but the threshold for background
    writeback needs to consider the amount of dirty data being
    generated, the rate at which it can be retired and the checkpoint
    period the application is configured with. i.e. it needs to be slow
    enough to not cause serious read IO perturbations, but still fast
    enough that it avoids peaks at synchronisation points. And most
    importantly, it needs to be fast enough that it can complete
    writeback of all the dirty data in a checkpoint before the next
    checkpoint is triggered.

    In general, I find that threshold to be somewhere around 2-5s worth
    of data writeback - enough to keep a good amount of write combining
    and the IO pipeline full as work is done, but no more.

    e.g. if your workload results in writeback rates of 500MB/s, then
    I'd be setting the dirty limit somewhere around 1-2GB as an initial
    guess. It's basically a simple trade off of buffering space for
    writeback latency. Some applications perform well with increased
    buffering space (e.g. 10-20s of writeback) while others perform
    better with extremely low writeback latency (e.g. 0.5-1s).

Some of this may have been addressed in recent changes with IO-less
dirty throttling. When considering stalls related to excessive IO it
will be important to check whether the kernel was later than 3.2 and
what the underlying filesystem was.
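As a concrete reading of Dave Chinner's rule of thumb above, here is a
minimal sketch that sets vm.dirty_background_bytes to a few seconds worth
of writeback. The 500MB/s rate and 3 second window are illustrative
assumptions, not tuned advice, and the same effect is normally achieved
with sysctl from a shell:

    #include <stdio.h>

    int main(void)
    {
        long long writeback_rate = 500LL * 1024 * 1024; /* assumed: storage retires ~500MB/s */
        long long buffer_seconds = 3;                   /* middle of the quoted 2-5s window */
        long long threshold = writeback_rate * buffer_seconds;

        /* vm.dirty_background_bytes: background writeback starts once
         * this much dirty data accumulates. Requires root. */
        FILE *f = fopen("/proc/sys/vm/dirty_background_bytes", "w");
        if (f == NULL) {
            perror("dirty_background_bytes");
            return 1;
        }
        fprintf(f, "%lld\n", threshold);
        return fclose(f) == 0 ? 0 : 1;
    }

With the assumed numbers this lands at roughly 1.5GB, consistent with the
1-2GB initial guess in the quote.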
Again, it really should be possible to demonstrate this with a test case,
one driven by pgbench maybe? The workload would generate a bunch of test
data, dirty a large percentage of it and try to sync. Metrics would be
the average read-only query latency when reading in parallel to the
write, average latencies from the underlying storage, IO queue lengths
etc., comparing the default IO scheduler with deadline or noop.

NUMA Optimisations
------------------

The primary one that showed up was zone_reclaim_mode. Enabling that
parameter is a disaster for many workloads and apparently Postgres is one
of them. It might be time to revisit leaving it disabled by default and
explicitly requiring that NUMA-aware workloads that are correctly
partitioned enable it. Otherwise, NUMA considerations are not that much
of a concern right now.

Direct IO, buffered IO, double buffering and wishlists
------------------------------------------------------

The general position of Postgres is that the kernel knows more about
storage geometries and IO scheduling than an application can or should
know. It would be preferred to have interfaces that allow Postgres to
give hints to the kernel about how and when data should be written back.
The alternative is exposing details of the underlying storage to
userspace so Postgres can implement a full IO scheduler using direct IO.
It has been asserted on the kernel side that the optimal IO size and
alignment is the most important detail and should be all the information
that is required in the majority of cases. While some database vendors
have this option, the Postgres community do not have the resources to
implement something of this magnitude. They have also tried direct IO in
the past in the areas where it should have mattered and had mixed
results. I can understand Postgres' preference for using the kernel to
handle these details for them. They are a cross-platform application and
the kernel should not be washing its hands of the problem and hiding
behind direct IO as a solution. Ted Ts'o summarised the issues as:

    The high order bit is what's the right thing to do when database
    programmers come to kernel engineers saying, we want to do <FOO>
    and the performance sucks. Do we say, "Use O_DIRECT, dummy",
    notwithstanding Linus's past comments on the issue? Or do we have
    some general design principles that we tell database engineers that
    they should do for better performance, and then all developers for
    all of the file systems can then try to optimize for a set of new
    API's, or recommended ways of using the existing API's?
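For context on what the direct IO alternative demands of an application,
a minimal sketch of an O_DIRECT write. The file name and the 4096-byte
alignment are assumptions for illustration; the real alignment must be
discovered per device and filesystem:

    #define _GNU_SOURCE    /* for O_DIRECT */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        const size_t align  = 4096; /* assumed logical block size */
        const size_t iosize = 8192; /* one 8kB Postgres-sized page */
        void *buf;

        /* O_DIRECT requires the buffer, file offset and length to all
         * be suitably aligned or the IO fails with EINVAL. */
        if (posix_memalign(&buf, align, iosize) != 0)
            return 1;
        memset(buf, 0, iosize);

        int fd = open("datafile", O_WRONLY | O_CREAT | O_DIRECT, 0600);
        if (fd < 0)
            return 1;

        ssize_t n = pwrite(fd, buf, iosize, 0);

        close(fd);
        free(buf);
        return n == (ssize_t) iosize ? 0 : 1;
    }

Every such detail the application gets right here is a detail the page
cache would otherwise have handled, which is the crux of the debate.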
In an effort to avoid depending on direct IO there were some proposals
and/or wishlist items. These are listed in order of likelihood to be
implemented and usefulness to Postgres.

1. Hint to asynchronously queue writeback now in preparation for an
   fsync in the near future. Postgres dirties a large amount of data and
   asks the kernel to push it to disk over the next few minutes.
   Postgres is still required to fsync later but the fsync time should
   be minimised. vm.dirty_writeback_centisecs is unreliable for this.
   One possibility would be an fadvise call that queues the data for
   writeback by a flusher thread now and returns immediately (see the
   first sketch after this list).

2. Hint that a page is a prime candidate for reclaim, but only if there
   is reclaim pressure. This avoids a problem where fadvise(DONTNEED)
   discards a page only to have a read/write or WILLNEED hint
   immediately read it back in again (see the second sketch after this
   list). The requirements are similar to the volatile range hinting,
   but Postgres does not use mmap() currently and would need a
   file-descriptor based interface. Robert Haas had some concerns with
   the general concept and described them thusly:

    This is an interesting idea but it stinks of impracticality.
    Essentially when the last buffer pin on a page is dropped we'd have
    to mark it as discardable, and then the next person wanting to pin
    it would have to check whether it's still there. But the system
    call overhead of calling vrange() every time the last pin on a page
    was dropped would probably hose us.

    Well, I guess it could be done lazily: make periodic sweeps through
    shared_buffers, looking for pages that haven't been touched in a
    while, and vrange() them. That's quite a bit of new mechanism, but
    in theory it could work out to a win. vrange() would have to scale
    well to millions of separate ranges, though. Will it? And a lot
    depends on whether the kernel makes the right decision about
    whether to chunk data from our vrange() vs. any other page it could
    have reclaimed.

3. Hint that a page should be dropped immediately when IO completes.
   There is already something like this buried in the kernel internals
   and sometimes called "immediate reclaim" which comes into play when
   pages are being invalidated. It should just be a case of
   investigating whether that is visible to userspace, if not why not,
   and exposing it in a semi-sensible fashion.

4. 8kB atomic write with OS support to avoid writing full page images
   in the WAL. This is a feature that is likely to be delivered anyway
   and one that Postgres is interested in.

5. Only write back some pages if explicitly synced or if dirty limits
   are violated. Jeff Janes states that he has problems with large
   temporary files that generate IO spikes when the data starts hitting
   the platter, even though the data does not need to be preserved. Jim
   Nasby agreed and commented that he "also frequently see this, and it
   has an even larger impact if pgsql_tmp is on the same filesystem as
   WAL. Which *theoretically* shouldn't matter with a BBU controller,
   except that when the kernel suddenly decides your *temporary* data
   needs to hit the media you're screwed." One proposal that may
   address this is:

    Allow a process with an open fd to hint that pages managed by this
    inode will have dirty-sticky pages. Pages will be ignored by dirty
    background writing unless there is an fsync call or dirty page
    limits are hit. The hint is cleared when no process has the file
    open.
6. Only write back pages if explicitly synced. Postgres has strict write
   ordering requirements. In the words of Tom Lane -- "As things
   currently stand, we dirty the page in our internal buffers, and we
   don't write it to the kernel until we've written and fsync'd the WAL
   data that needs to get to disk first" (see the third sketch after
   this list). mmap() would avoid double buffering but it offers no
   control over write ordering, which is a show-stopper. As Andres
   Freund described it:

    Postgres' durability works by guaranteeing that our journal entries
    (called WAL := Write Ahead Log) are written & synced to disk before
    the corresponding entries of tables and indexes reach the disk.
    That also allows to group together many random-writes into a few
    contiguous writes fdatasync()ed at once. Only during a
    checkpointing phase the big bulk of the data is then (slowly, in
    the background) synced to disk. I don't see how that's doable with
    holding all pages in mmap()ed buffers.

   There are also concerns that there would be an absurd number of
   mappings. The problem with this sort of dirty pinning interface is
   that it can deadlock the kernel if all the dirty pages in the system
   cannot be written back by the kernel. James Bottomley stated:

    No, I'm sorry, that's never going to be possible. No user space
    application has all the facts. If we give you an interface to force
    unconditional holding of dirty pages in core you'll livelock the
    system eventually because you made a wrong decision to hold too
    many dirty pages.

   However, it was very clearly stated that the write ordering is
   critical. If the kernel breaks the requirement then the database can
   get trashed in the event of a power failure.

   This led to a discussion on write barriers, which the kernel uses
   internally, but there are scaling concerns both with the number of
   constraints that would exist and the requirement that Postgres use
   mapped buffers. There were few solid conclusions on this. It would
   need major reworking on all sides and it would hand control of
   system safety to userspace, which is going to cause layering
   violations. This whole idea may be a bust but it is still worth
   recording. Greg Stark outlined the motivation best as follows:

    Ted Ts'o was concerned this would all be a massive layering
    violation and I have to admit that's a huge risk. It would take
    some clever API engineering to come up with a clean set of
    primitives to express the kind of ordering guarantees we need
    without being too tied to Postgres's specific implementation. The
    reason I think it's more interesting though is that Postgres's
    journalling and checkpointing architecture is pretty bog-standard
    CS stuff and there are hundreds or thousands of pieces of software
    out there that do pretty much the same work and trying to do it
    efficiently with fsync or O_DIRECT is like working with both hands
    tied to your feet.

7. Allow a userspace process to insert data into the kernel page cache
   without marking the page dirty. This would allow the application to
   request that the OS use the application's copy of the data as page
   cache if it does not have a copy already. The difficulty here is
   that the application has no way of knowing if something else has
   altered the underlying file in the meantime via something like
   direct IO. Granted, such activity has probably corrupted the
   database already, but initial reactions are that this is not a safe
   interface and there are coherency concerns.

   Dave Chinner asked "why, exactly, do you even need the kernel page
   cache here?" when Postgres already knows how and when data should be
   written back to disk. The answer boiled down to "To let the kernel
   do the job that it is good at, namely managing the write-back of
   dirty buffers to disk and managing (possible) read-ahead pages".
   Postgres has some ordering requirements but it does not want to be
   responsible for all cache replacement and IO scheduling. Hannu
   Krosing summarised it best as:

    Again, as said above the linux file system is doing fine. What we
    want is a few ways to interact with it to let it do even better
    when working with Postgres by telling it some stuff it otherwise
    would have to second guess and by sometimes giving it back some
    cache pages which were copied away for potential modifying but
    ended up clean in the end.

    And let the linux kernel decide if and how long to keep these pages
    in its cache using its superior knowledge of the disk subsystem and
    about what else is going on in the system in general.

8. Allow copy-on-write of page-cache pages to anonymous memory. This
   would limit the double RAM usage to some extent. It's not as simple
   as having a MAP_PRIVATE mapping of a file-backed page because
   presumably they want this data in a shared buffer shared between
   Postgres processes. The implementation details of something like
   this are hairy because it's mmap()-like but not mmap(), as it does
   not have the same writeback semantics due to the write ordering
   requirements Postgres has for database integrity.

   Completely nuts, and this was not mentioned on the list, but
   arguably one could try implementing something like this as a
   character device that allows MAP_SHARED, with ioctls controlling
   what file and offset backs pages within the mapping. A new mapping
   would be forced resident and read-only. A write would COW the page.
   It's a crazy way of doing something like this but it avoids a lot of
   overhead. Even considering the stupid solution might make the
   general solution a bit more obvious.

   For reference, Tom Lane comprehensively described the problems with
   mmap at
   http://www.postgresql.org/message-id/17515.1389715715@sss.pgh.pa.us

   There were some variants of how something like this could be
   achieved but no finalised proposal at the time of writing.

9. Hint that a page in an anonymous buffer is a copy of a page cache
   page and invalidate the page cache page on COW. This limits the
   amount of double buffering. It's in as a low priority item as it's
   unclear if it's really necessary, and I also suspect the
   implementation would be very heavy because of the amount of
   information we'd have to track in the kernel.
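For item 1, the proposed fadvise call does not exist, but the nearest
existing Linux interface is sync_file_range(). A minimal sketch, with a
hypothetical helper name:

    #define _GNU_SOURCE    /* for sync_file_range */
    #include <fcntl.h>
    #include <unistd.h>

    /* Hypothetical helper: ask the kernel to start writing fd's dirty
     * pages now, so the fsync at the next checkpoint finds little left
     * to do. SYNC_FILE_RANGE_WRITE queues writeback and returns without
     * waiting; offset 0 with nbytes 0 means "to the end of the file". */
    static int begin_writeback_early(int fd)
    {
        return sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);
    }

    /* ... minutes later, at checkpoint time, the usual fdatasync(fd)
     * still provides the durability guarantee but should stall less. */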
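For item 2, today's blunt instrument can be seen directly. A sketch
(helper name is mine) of the unconditional discard that makes DONTNEED a
poor fit for "reclaim this only under pressure":

    #include <fcntl.h>
    #include <unistd.h>

    /* POSIX_FADV_DONTNEED drops clean cached pages immediately,
     * regardless of memory pressure, so a later access must read them
     * back from disk. Dirty pages are skipped, hence the fdatasync()
     * first. */
    static void drop_cached_range(int fd, off_t offset, off_t len)
    {
        (void) fdatasync(fd);
        (void) posix_fadvise(fd, offset, len, POSIX_FADV_DONTNEED);
    }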
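For item 6, a sketch of the write ordering Tom Lane describes, with
hypothetical function and parameter names: the WAL record must be durable
before the page it protects is handed to the kernel, and the data page
itself is only fsync'd later at checkpoint time:

    #include <sys/types.h>
    #include <unistd.h>

    static int write_page_wal_first(int wal_fd, const void *wal_rec,
                                    size_t wal_len, int data_fd,
                                    const void *page, size_t page_len,
                                    off_t page_off)
    {
        if (write(wal_fd, wal_rec, wal_len) != (ssize_t) wal_len)
            return -1;
        if (fdatasync(wal_fd) < 0)   /* WAL reaches disk first */
            return -1;
        /* Only now may the dirty page be pushed to the kernel. */
        if (pwrite(data_fd, page, page_len, page_off) != (ssize_t) page_len)
            return -1;
        return 0;
    }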
It is important to note in general that Postgres has a problem with some
files being written back too aggressively and other files not being
written back aggressively enough. Temp files for purposes such as sorting
should have writeback deferred as long as possible. Data file writes that
must complete before portions of the WAL can be discarded should begin
writeback early so the final fsync does not stall for too long. As Dave
Chinner says:

    IOWs, there are two very different IO and caching requirements in
    play here and tuning the kernel for one actively degrades the
    performance of the other.

Robert Haas categorised the IO patterns as follows:

- WAL files are written (and sometimes read) sequentially and fsync'd
  very frequently and it's always good to write the data out to disk as
  soon as possible.

- Temp files are written and read sequentially and never fsync'd. They
  should only be written to disk when memory pressure demands it (but
  they are a good candidate when that situation comes up).

- Data files are read and written randomly. They are fsync'd at
  checkpoint time; between checkpoints, it's best not to write them
  sooner than necessary, but when the checkpoint arrives, they all need
  to get out to the disk without bringing the system to a standstill.

At LSF/MM last year there was a discussion on whether userspace should
hint that files are "hot" or "cold" so the underlying layers could decide
to relocate some data to faster storage. I tuned out a bit during the
discussion and did not track what happened with it since, but I guess
that any developments of that sort would be of interest to the Postgres
community.

Some of these wishlist items still need polish but could potentially be
discussed further at LSF/MM with a wider audience, as well as on the
lists. Then, in a world of unicorns and ponies, it's a case of picking
some of these hinting wishlists, seeing what it takes to implement them
in the kernel and testing with a suitably patched version of Postgres
running a test case driven by something (pgbench presumably).

--
Mel Gorman
SUSE Labs