On 2014-01-17 18:34:25 +0000, Mel Gorman wrote:
> > The scheme that'd allow us is the following:
> > When postgres reads a data page, it will continue to first look up the
> > page in its shared buffers, if it's not there, it will perform a page
> > cache backed read, but instruct that read to immediately remove the page
> > from the page cache afterwards (new API, posix_fadvise(), or whatever).
> > As long as it's in shared_buffers, postgres will not need to issue new
> > reads, so there's no benefit in keeping it in the page cache.
> > If the page is dirtied, it will be written out normally, telling the
> > kernel to forget about caching the page (using 3) or possibly direct
> > io).
> > When a page in postgres's buffers (which wouldn't be set to very large
> > values) isn't needed anymore and *not* dirty, it will seed the kernel
> > page cache with the current data.
> >
>
> Ordinarily the initial read page could be discarded with fadvise but
> the later write would cause the data to be read back in again which is a
> waste. The details of avoiding that re-read are tricky from a core kernel
> perspective because ordinarily the kernel at that point does not know if
> the write is a full complete aligned write of an underlying filesystem
> structure or not. It may need a different write path which potentially
> leads into needing changes to the address_space operations on a filesystem
> basis -- that would get messy and be a Linux-specific extension. I have
> not researched this properly at all, I could be way off but I have a
> feeling the details get messy.

Hm. This surprises me a bit - and I bet it does hurt postgres
noticeably if that's the case, since the most frequently modified buffers
will only be written out to the OS once every checkpoint but never be
read in. So they are unlikely to be hot enough to stay cached under
cache pressure.

So this would be a generally beneficial feature - and I doubt it's only
postgres that'd benefit.
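
To make the read and write-out steps of the quoted scheme concrete, here is
a minimal sketch using only calls that exist today (pread(), pwrite(),
fdatasync(), posix_fadvise()). The function names read_block_nocache() and
write_block_nocache() are made up for illustration; this is not what
postgres currently does, and BLCKSZ is just set to the default 8kB block
size. The fdatasync() per block is only there because DONTNEED leaves dirty
pages alone; a real implementation would presumably batch that.

```c
#define _XOPEN_SOURCE 600          /* exposes pread, pwrite, posix_fadvise */
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

#define BLCKSZ 8192                /* assumed block size (postgres default) */

/*
 * Hypothetical helper: read one block, then hint that the kernel may drop
 * its cached copy. Returns 0 on success, -1 on error or short read.
 */
static int
read_block_nocache(int fd, off_t blkno, char *buf)
{
    off_t   off = blkno * BLCKSZ;

    if (pread(fd, buf, BLCKSZ, off) != BLCKSZ)
        return -1;

    /* Purely advisory; a failed hint is not treated as an error. */
    (void) posix_fadvise(fd, off, BLCKSZ, POSIX_FADV_DONTNEED);
    return 0;
}

/*
 * Hypothetical helper: write one block, flush it, then hint that the
 * kernel may drop it. Dirty pages are not discarded by DONTNEED, hence
 * the flush before the hint.
 */
static int
write_block_nocache(int fd, off_t blkno, const char *buf)
{
    off_t   off = blkno * BLCKSZ;

    if (pwrite(fd, buf, BLCKSZ, off) != BLCKSZ)
        return -1;
    if (fdatasync(fd) != 0)
        return -1;

    (void) posix_fadvise(fd, off, BLCKSZ, POSIX_FADV_DONTNEED);
    return 0;
}
```

Note that none of this covers the last step of the scheme, seeding the
page cache with a clean page on eviction from shared_buffers; as far as I
know there is no existing interface for that, which is the part that would
need a new API.
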
> > Now, such a scheme wouldn't likely be zero-copy, but it would avoid
> > double buffering.
>
> It wouldn't be zero copy because minimally the data needs to be handed
> over the filesystem for writing to the disk and the interface for that is
> offset,length based, not page based. Maybe sometimes it will be zero copy
> but it would be a filesystem-specific thing.

Exactly.

Greetings,

Andres Freund

-- 
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services