Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance - Mailing list pgsql-hackers
From | Gavin Flower |
---|---|
Subject | Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance |
Date | |
Msg-id | 52D58A00.3040802@archidevsys.co.nz |
In response to | Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance (Dave Chinner <david@fromorbit.com>) |
Responses |
Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance
Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance |
List | pgsql-hackers |
<div class="moz-cite-prefix">On 14/01/14 14:09, Dave Chinner wrote:<br /></div><blockquote cite="mid:20140114010946.GA3431@dastard" type="cite"><pre wrap="">On Mon, Jan 13, 2014 at 09:29:02PM +0000, Greg Stark wrote: </pre><blockquote type="cite"><pre wrap="">On Mon, Jan 13, 2014 at 9:12 PM, Andres Freund <a class="moz-txt-link-rfc2396E" href="mailto:andres@2ndquadrant.com">&lt;andres@2ndquadrant.com&gt;</a> wrote: </pre></blockquote></blockquote> [...]<br /><blockquote cite="mid:20140114010946.GA3431@dastard" type="cite"><blockquote type="cite"></blockquote><blockquote type="cite"><pre wrap="">The more ambitious and interesting direction is to let Postgres tell the kernel what it needs to know to manage everything. To do that we would need the ability to control when pages are flushed out. This is absolutely necessary to maintain consistency. Postgres would need to be able to mark pages as unflushable until some point in time in the future when the journal is flushed. We discussed various ways that interface could work but it would be tricky to keep it low enough overhead to be workable. </pre></blockquote><pre wrap=""> IMO, the concept of allowing userspace to pin dirty page cache pages in memory is just asking for trouble. Apart from the obvious memory reclaim and OOM issues, some filesystems won't be able to move their journals forward until the data is flushed. i.e. ordered mode data writeback on ext3 will have all sorts of deadlock issues that result from pinning pages and then issuing fsync() on another file which will block waiting for the pinned pages to be flushed. Indeed, what happens if you do pin_dirty_pages(fd); .... fsync(fd);? If fsync() blocks because there are pinned pages, and there's no other thread to unpin them, then that code just deadlocked. If fsync() doesn't block and skips the pinned pages, then we haven't done an fsync() at all, and so violated the expectation that users have that after fsync() returns their data is safe on disk. 
And if we return an error to fsync(), then what the hell does the user do if it is some other application we don't know about that has pinned the pages? And if the kernel unpins them after some time, then we just violated the application's consistency guarantees.... </pre></blockquote> [...]<br /><br /> What if Postgres could tell the kernel how strongly it wanted to hold on to the pages? <br /><br /> Say a byte (this is arbitrary, it could be a single hint bit which meant "please, Please, PLEASE don't flush, if that is okay with you Mr Kernel..."), so strength would be S = (unsigned byte value)/256, so 0 &lt;= S &lt; 1.<br /><br /><tt>S = 0 flush now.</tt><tt><br /></tt><tt>0 &lt; S &lt; 1</tt><tt> flush if the 'need' is greater than S<br /></tt><tt>S = 1 never flush (note a value of 1 cannot occur, as max S = 255/256)</tt><br /><br /> Postgres could use low non-zero S values if it thinks that pages <i>might</i> still be useful later, and very high values when it is <i>more certain</i>. I am sure Postgres must sometimes know when some pages are more important to hold onto than others, hence my feeling that S should be more than one bit.<br /><br /> The kernel might simply flush pages starting at ones with low values of S, working upwards until it has freed enough memory to resolve its memory pressure. So an explicit numerical value of 'need' (as implied above) is not required. Also, any practical implementation would not use 'S' as a float/double, but use integer values for 'S' &amp; 'need' - assuming that 'need' did have to be an actual value, which I suspect would not be required.<br /><br /> This way the kernel is free to flush all such pages when sufficient need arises - yet usually, when there is sufficient memory, the pages will be held unflushed.<br /><br /><br /> Cheers,<br /> Gavin<br />
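The flush ordering proposed above (flush from low S upwards until the memory pressure is resolved) can be sketched roughly as follows. This is only a userspace illustration, not kernel code and not an existing interface: `DirtyPage`, `flush_under_pressure` and `pages_needed` are hypothetical names, and the hint is kept as the raw byte (0..255) rather than the fraction S.

```python
from dataclasses import dataclass


@dataclass
class DirtyPage:
    """Hypothetical dirty page carrying the application's hold-strength hint."""
    page_id: int
    strength: int          # raw hint byte, 0..255; S = strength / 256
    flushed: bool = False


def flush_under_pressure(pages, pages_needed):
    """Flush pages in ascending order of strength.

    Pages with strength 0 are always flushed ("flush now"). Pages with a
    non-zero strength are flushed only while fewer than `pages_needed`
    pages have been flushed so far. Returns the flushed ids in flush order.
    """
    flushed_ids = []
    for page in sorted(pages, key=lambda p: p.strength):
        if page.strength != 0 and len(flushed_ids) >= pages_needed:
            break  # enough memory reclaimed; stronger hints are honoured
        page.flushed = True
        flushed_ids.append(page.page_id)
    return flushed_ids


if __name__ == "__main__":
    pages = [
        DirtyPage(1, 200),
        DirtyPage(2, 0),
        DirtyPage(3, 50),
        DirtyPage(4, 120),
    ]
    print(flush_under_pressure(pages, pages_needed=2))  # prints [2, 3]
```

Note that no explicit 'need' value appears: the loop simply stops once enough pages have been flushed, and with `pages_needed` equal to the number of pages every page is flushed regardless of its hint.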