Re: [Lsf-pc] Re: Linux kernel impact on PostgreSQL performance (summary v2 2014-1-17) - Mailing list pgsql-hackers

From Mel Gorman
Subject Re: [Lsf-pc] Re: Linux kernel impact on PostgreSQL performance (summary v2 2014-1-17)
Date
Msg-id 20140117183425.GC4963@suse.de
In response to Re: Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance (summary v2 2014-1-17)  (Andres Freund <andres@2ndquadrant.com>)
Responses Re: [Lsf-pc] Re: Linux kernel impact on PostgreSQL performance (summary v2 2014-1-17)
List pgsql-hackers
On Fri, Jan 17, 2014 at 06:14:37PM +0100, Andres Freund wrote:
> Hi Mel,
> 
> On 2014-01-17 16:31:48 +0000, Mel Gorman wrote:
> > Direct IO, buffered IO, double buffering and wishlists
> > ------------------------------------------------------
> >    3. Hint that a page should be dropped immediately when IO completes.
> >       There is already something like this buried in the kernel internals
> >       and sometimes called "immediate reclaim" which comes into play when
> >       pages are bgin invalidated. It should just be a case of investigating
> >       if that is visible to userspace, if not why not and do it in a
> >       semi-sensible fashion.
> 
> "bgin invalidated"?
> 

s/bgin/being/

I admit that "invalidated" in this context is very vague and I did
not explain myself. This paragraph should remind anyone familiar with
VM internals of what happens when invalidate_mapping_pages calls
deactivate_page and how PageReclaim pages are treated by both page reclaim
and the end_page_writeback handler. It's similar but not identical to what
Postgres wants and is a reasonable starting position for an implementation.
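
For illustration only, the closest thing userspace can do today is roughly
the following. This is a sketch under assumptions, not the proposed hint:
BLOCK_SIZE and write_and_drop are made-up names, and it relies on the
existing POSIX_FADV_DONTNEED advice, which as far as I know skips pages that
are still dirty, hence the flush before the advice.

```python
import os
import tempfile

BLOCK_SIZE = 8192  # assumed database block size, illustration only


def write_and_drop(fd, offset, data):
    """Write data at offset, flush it, then hint that the cached copy is unneeded.

    Hypothetical helper: approximates "drop when IO completes" with calls
    that already exist, at the cost of waiting for the writeback.
    """
    written = os.pwrite(fd, data, offset)
    # Dirty pages are not dropped by DONTNEED, so wait for writeback first.
    os.fdatasync(fd)
    if hasattr(os, "posix_fadvise"):  # not available on every platform
        os.posix_fadvise(fd, offset, written, os.POSIX_FADV_DONTNEED)
    return written


if __name__ == "__main__":
    fd, path = tempfile.mkstemp()
    try:
        print(write_and_drop(fd, 0, b"\0" * BLOCK_SIZE))
    finally:
        os.close(fd)
        os.unlink(path)
```

The synchronous fdatasync is exactly the cost a real "drop on IO completion"
hint would avoid.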

> Generally, +1 on the capability to achieve such a behaviour from
> userspace.
> 
> >    7. Allow userspace process to insert data into the kernel page cache
> >       without marking the page dirty. This would allow the application
> >       to request that the OS use the application copy of data as page
> >       cache if it does not have a copy already. The difficulty here
> >       is that the application has no way of knowing if something else
> >       has altered the underlying file in the meantime via something like
> >       direct IO. Granted, such activity has probably corrupted the database
> >       already but initial reactions are that this is not a safe interface
> >       and there are coherency concerns.
> 
> I was one of the people suggesting that capability in this thread (after
> pondering it in the back of my mind for quite some time), and I
> first thought it would never be acceptable for pretty much those
> reasons.
> But on second thought I don't think that line of argument makes too much
> sense. If such an API would require write permissions on the file -
> which it surely would - it wouldn't allow an application to do anything
> it previously wasn't able to.
> And I don't see the dangers of concurrent direct IO as anything
> new. Right now the page's contents reside in userspace memory and aren't
> synced in any way with either the page cache or the actual on disk
> state. And afaik there are already several data races if a file is
> modified and read both via the page cache and direct io.
> 

All of this is true.  The objections may not hold up over time and it may
seem much more reasonable when/if the easier stuff is addressed.

> The scheme that'd allow us is the following:
> When postgres reads a data page, it will continue to first look up the
> page in its shared buffers, if it's not there, it will perform a page
> cache backed read, but instruct that read to immediately remove the page
> from the page cache afterwards (new API, posix_fadvise(), or whatever).
> As long as it's in shared_buffers, postgres will not need to issue new
> reads, so there's no benefit in keeping it in the page cache.
> If the page is dirtied, it will be written out normally, telling the
> kernel to forget about caching the page (using 3) or possibly direct
> io).
> When a page in postgres's buffers (which wouldn't be set to very large
> values) isn't needed anymore and *not* dirty, it will seed the kernel
> page cache with the current data.
> 

Ordinarily the page from the initial read could be discarded with fadvise,
but the later write would cause the data to be read back in again, which is a
waste. The details of avoiding that re-read are tricky from a core kernel
perspective because ordinarily the kernel at that point does not know if
the write is a full complete aligned write of an underlying filesystem
structure or not.  It may need a different write path which potentially
leads into needing changes to the address_space operations on a filesystem
basis -- that would get messy and be a Linux-specific extension. I have
not researched this properly at all, I could be way off but I have a
feeling the details get messy.
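
As a rough sketch of the "discard with fadvise" half of this (assumptions:
read_and_drop and BLOCK_SIZE are hypothetical names, and the advice is only
a hint that the kernel is free to ignore):

```python
import os
import tempfile

BLOCK_SIZE = 8192  # assumed database block size, illustration only


def read_and_drop(fd, offset, length=BLOCK_SIZE):
    """Read one block through the page cache, then advise the kernel to drop it.

    Hypothetical helper: the caller keeps the returned bytes in its own
    buffer pool, so the page cache copy is treated as redundant.
    """
    data = os.pread(fd, length, offset)
    if hasattr(os, "posix_fadvise"):  # not available on every platform
        os.posix_fadvise(fd, offset, length, os.POSIX_FADV_DONTNEED)
    return data


if __name__ == "__main__":
    fd, path = tempfile.mkstemp()
    try:
        os.pwrite(fd, b"a" * BLOCK_SIZE, 0)
        print(len(read_and_drop(fd, 0)))
    finally:
        os.close(fd)
        os.unlink(path)
```

This covers only the read side. It does nothing about the re-read that a
later write of the same block may trigger.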

> Now, such a scheme wouldn't likely be zero-copy, but it would avoid
> double buffering.

It wouldn't be zero copy because minimally the data needs to be handed
over to the filesystem for writing to the disk and the interface for that is
offset,length based, not page based. Maybe sometimes it will be zero copy
but it would be a filesystem-specific thing.

> I think the cost of buffer copying has been overstated
> in this thread... The major advantage is that all of that could easily be
> implemented in a very localized manner, without hurting other OSs, and it
> could easily degrade on kernels not providing that capability, which
> would surely be the majority of installations for the next couple of
> years.
> 
> So, I think such an interface would be hugely beneficial - and I'd be
> surprised if other applications couldn't reuse it. And I don't think
> it'd be all that hard to implement on the kernel side?
> 

Unfortunately I think this does get messy from a kernel perspective because
we are not guaranteed in the *general* case that we're dealing with a full
page write. As before, I have not researched this properly, so I'll
update the summary at some stage in case someone can put in the proper
research and see a decent solution.
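
To make the "full page write" condition concrete, here is a minimal sketch
of the check involved. PAGE_SIZE and covers_whole_pages are assumed names
for illustration, not kernel code; the real page or block size depends on
the architecture and filesystem.

```python
PAGE_SIZE = 4096  # assumed page size, illustration only


def covers_whole_pages(offset, length, page_size=PAGE_SIZE):
    """Return True if the write [offset, offset + length) touches only whole pages.

    Only in that case could a page be populated from the caller's data
    without first reading the old contents from disk.
    """
    return length > 0 and offset % page_size == 0 and length % page_size == 0


if __name__ == "__main__":
    print(covers_whole_pages(0, 8192))    # aligned 8K block
    print(covers_whole_pages(100, 8192))  # unaligned start
```

An aligned 8K database block passes the check, but an arbitrary
offset,length write does not, and the generic write path has to cope
with both.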

> >       Dave Chinner asked "why, exactly, do you even need the kernel page
> >       cache here?"  when Postgres already knows how and when data should
> >       be written back to disk. The answer boiled down to "To let kernel do
> >       the job that it is good at, namely managing the write-back of dirty
> >       buffers to disk and to manage (possible) read-ahead pages". Postgres
> >       has some ordering requirements but it does not want to be responsible
> >       for all cache replacement and IO scheduling. Hannu Krosing summarised
> >       it best as
> 
> The other part is that using the page cache for the majority of warm,
> but not burning hot pages, allows the kernel to much more sensibly adapt
> to concurrent workloads requiring memory in some form or other (possibly
> giving it to other VMs when mostly idle and such).
> 
> >    8. Allow copy-on-write of page-cache pages to anonymous. This would limit
> >       the double ram usage to some extent. It's not as simple as having a
> >       MAP_PRIVATE mapping of a file-backed page because presumably they want
> >       this data in a shared buffer shared between Postgres processes. The
> >       implementation details of something like this are hairy because it's
> >       mmap()-like but not mmap() as it does not have the same writeback
> >       semantics due to the write ordering requirements Postgres has for
> >       database integrity.
> 
> >    9. Hint that a page in an anonymous buffer is a copy of a page cache
> >        page and invalidate the page cache page on COW. This limits the
> >        amount of double buffering. It's in as a low priority item as it's
> >        unclear if it's really necessary and also I suspect the implementation
> >        would be very heavy because of the amount of information we'd have
> >        to track in the kernel.
> > 
> 
> I don't see this kind of proposal going anywhere. The amount of
> changes to postgres and the kernel sounds prohibitive to me, besides the
> utter crumminess.
> 

Agreed. I'm including them just because they were discussed. Someone
else might read it and think "that is a terrible idea but what might work
instead is ...."

-- 
Mel Gorman
SUSE Labs


