Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance - Mailing list pgsql-hackers

From Gavin Flower
Subject Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance
Date
Msg-id 52D58A00.3040802@archidevsys.co.nz
In response to Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance  (Dave Chinner <david@fromorbit.com>)
Responses Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance
Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance
List pgsql-hackers
<div class="moz-cite-prefix">On 14/01/14 14:09, Dave Chinner wrote:<br /></div><blockquote
cite="mid:20140114010946.GA3431@dastard" type="cite"><pre wrap="">On Mon, Jan 13, 2014 at 09:29:02PM +0000, Greg Stark wrote:
</pre><blockquote type="cite"><pre wrap="">On Mon, Jan 13, 2014 at 9:12 PM, Andres Freund <a
class="moz-txt-link-rfc2396E" href="mailto:andres@2ndquadrant.com"><andres@2ndquadrant.com></a> wrote:
 
</pre></blockquote></blockquote> [...]<br /><blockquote cite="mid:20140114010946.GA3431@dastard"
type="cite"><blockquote type="cite"></blockquote><blockquote type="cite"><pre wrap="">The more ambitious and interesting direction is to let Postgres tell
the kernel what it needs to know to manage everything. To do that we
would need the ability to control when pages are flushed out. This is
absolutely necessary to maintain consistency. Postgres would need to
be able to mark pages as unflushable until some point in time in the
future when the journal is flushed. We discussed various ways that
interface could work but it would be tricky to keep it low enough
overhead to be workable.
</pre></blockquote><pre wrap="">
IMO, the concept of allowing userspace to pin dirty page cache
pages in memory is just asking for trouble. Apart from the obvious
memory reclaim and OOM issues, some filesystems won't be able to
move their journals forward until the data is flushed. i.e. ordered
mode data writeback on ext3 will have all sorts of deadlock issues
that result from pinning pages and then issuing fsync() on another
file which will block waiting for the pinned pages to be flushed.

Indeed, what happens if you do pin_dirty_pages(fd); .... fsync(fd);?
If fsync() blocks because there are pinned pages, and there's no
other thread to unpin them, then that code just deadlocked. If
fsync() doesn't block and skips the pinned pages, then we haven't
done an fsync() at all, and so violated the expectation that users
have that after fsync() returns their data is safe on disk. And if
we return an error to fsync(), then what the hell does the user do
if it is some other application we don't know about that has pinned
the pages? And if the kernel unpins them after some time, then we
just violated the application's consistency guarantees....

</pre></blockquote> [...]<br /><br /> What if Postgres could tell the kernel how strongly it wanted to hold on to
the pages? <br /><br /> Say a byte (this is arbitrary; it could be a single hint bit which meant "please, Please, PLEASE
don't flush, if that is okay with you Mr Kernel..."), so strength would be S = (unsigned byte value)/256, so 0 <= S
< 1.<br /><br /><tt>S = 0      flush now.</tt><tt><br /></tt><tt>0 < S < 1</tt><tt>  flush if the 'need' is
greater than S<br /></tt><tt>S = 1      never flush (note a value of 1 cannot occur, as max S = 255/256)</tt><br
/><br /> Postgres could use low non-zero S values if it thinks that pages <i>might</i> still be useful later, and very
high values when it is <i>more certain</i>.  I am sure Postgres must sometimes know when some pages are more important
to hold onto than others, hence my feeling that S should be more than one bit.<br /><br /> The kernel might simply flush
pages starting at the ones with low values of S, working upwards until it has freed enough memory to resolve its memory
pressure. So an explicit numerical value of 'need' (as implied above) is not required.  Also, any practical
implementation would not use 'S' as a float/double, but would use integer values for 'S' & 'need' - assuming that 'need'
did have to be an actual value, which I suspect would not be required.<br /><br /> This way the kernel is free to flush
all such pages when sufficient need arises - yet usually, when there is sufficient memory, the pages will be held
unflushed.<br /><br /><br /> Cheers,<br /> Gavin<br />
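The flush ordering described above can be sketched as a small user-space simulation. This is only an illustration under stated assumptions, not a kernel interface: `DirtyPage`, `pages_to_flush`, `should_flush` and `PAGE_SIZE` are hypothetical names, and the strength hint is taken to be the integer byte 0..255 (so S = strength/256).

```python
from dataclasses import dataclass

PAGE_SIZE = 4096  # bytes; assumed page size for this simulation only


@dataclass
class DirtyPage:
    page_id: int
    strength: int  # hypothetical hint byte 0..255; S = strength / 256


def pages_to_flush(pages, bytes_needed):
    """Pick pages to flush, weakest hint first, until enough memory is freed.

    No numeric 'need' is compared against S here: the ordering alone decides.
    """
    chosen = []
    freed = 0
    # sorted() is stable, so pages with equal strength keep their input order.
    for page in sorted(pages, key=lambda p: p.strength):
        # strength 0 means "flush now", regardless of memory pressure.
        if page.strength != 0 and freed >= bytes_needed:
            break
        chosen.append(page.page_id)
        freed += PAGE_SIZE
    return chosen


def should_flush(strength, need):
    """Threshold variant in integer arithmetic: flush if need > S.

    Both values are assumed to share the same 0..255 scale; 0 always flushes.
    """
    return strength == 0 or need > strength
```

For example, given pages with strengths 200, 0, 10 and 255, a request to free two pages' worth of memory would select the strength-0 page and the strength-10 page, leaving the two strongly held pages unflushed. A strength of 255 is never exceeded by a 'need' on the same scale, which corresponds to the "S = 1 cannot occur" note above.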
