Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance - Mailing list pgsql-hackers

From Stephen Frost
Subject Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance
Msg-id 20140115184113.GK2686@tamriel.snowman.net
In response to Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance  (Claudio Freire <klaussfreire@gmail.com>)
Responses Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance
List pgsql-hackers
* Claudio Freire (klaussfreire@gmail.com) wrote:
> But, still, the implementation is very similar to what postgres needs:
> sharing a physical page for two distinct logical pages, efficiently,
> with efficient copy-on-write.

Agreed, except that KSM seems like it'd be slow/lazy about it and I'm
guessing there's a reason the pagecache isn't included normally..

> So it'd be just a matter of removing that limitation regarding page
> cache and shared pages.

Any idea why that limitation is there?

> If you asked me, I'd implement it as copy-on-write on the page cache
> (not the user page). That ought to be low-overhead.

Not entirely sure I'm following this- if it's a shared page, it doesn't
matter who starts writing to it; as soon as that happens, it needs to get
copied.  Perhaps you mean that the application should keep the
"original" and that the page-cache should get the "copy" (or, really,
perhaps just forget about the page existing at that point- we won't want
it again...).

Would that be a way to go, perhaps?  This does go back to the "make it
act like mmap, but not *be* mmap", but the idea would be:

open(..., O_ZEROCOPY_READ)
read() - Goes to PG's shared buffers, pagecache and PG share the page
page fault (PG writes to it) - pagecache forgets about the page
write() / fsync() - operate as normal
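
For illustration only: O_ZEROCOPY_READ is a proposed flag, not something
the kernel provides. The nearest existing analogue to the "share until
written" step is probably a MAP_PRIVATE file mapping, where the process
and the pagecache share the physical page until the first write fault.
The sketch below assumes a Linux/POSIX system; the function name
cow_demo and its path argument are made up for this example. One
difference from the proposal: with MAP_PRIVATE the pagecache keeps the
clean original and the process gets the copy, rather than the pagecache
forgetting the page.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical demo, not the proposed O_ZEROCOPY_READ interface.
 * Writes a small file, maps it MAP_PRIVATE, dirties the mapping, and
 * checks that the file (as seen through the pagecache) is unchanged.
 * Returns 1 if copy-on-write behaved as expected, 0 if not, -1 on error.
 */
int cow_demo(const char *path)
{
    const char original[] = "clean page";
    char from_file[sizeof(original)] = {0};
    char *page;
    int result = -1;
    int fd;

    assert(path != NULL);

    fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    if (write(fd, original, sizeof(original)) != (ssize_t) sizeof(original))
        goto out;

    /* Until the first write, this mapping is backed by the pagecache page. */
    page = mmap(NULL, sizeof(original), PROT_READ | PROT_WRITE,
                MAP_PRIVATE, fd, 0);
    if (page == MAP_FAILED)
        goto out;

    /* Write fault: the kernel gives this process its own copy of the page. */
    page[0] = 'D';

    /* Reading through the file descriptor still sees the original bytes. */
    if (pread(fd, from_file, sizeof(original), 0) == (ssize_t) sizeof(original))
        result = (memcmp(from_file, original, sizeof(original)) == 0
                  && page[0] == 'D');

    munmap(page, sizeof(original));
out:
    close(fd);
    unlink(path);
    return result;
}
```

Calling cow_demo() with a scratch path such as "/tmp/cow_demo.bin"
should return 1 on a system where private mappings behave as described.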

The differences here from O_DIRECT are that the pagecache will keep the
page while clean (absolutely valuable from PG's perspective- we might
have to evict the page from shared buffers sooner than the kernel does),
and the write()'s happen at the kernel's pace, allowing for
write-combining, etc, until an fsync() happens, of course.
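
As a rough sketch of the "write() / fsync() operate as normal" half, the
function below does an ordinary buffered write followed by fsync(). It
assumes a POSIX system, and the name buffered_write_then_sync is
invented for this example. The point is only that, without O_DIRECT,
write() hands the data to the pagecache and the kernel is free to delay
and combine the writeback until fsync() forces it out.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: buffered write, then an explicit durability point.
 * Returns 0 on success, -1 on error.
 */
int buffered_write_then_sync(const char *path, const char *data)
{
    size_t len;
    int fd;

    assert(path != NULL && data != NULL);
    len = strlen(data);

    /* No O_DIRECT here, so the write goes through the pagecache. */
    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    /* Returns once the data is in the pagecache; writeback may come later. */
    if (write(fd, data, len) != (ssize_t) len) {
        close(fd);
        return -1;
    }

    /* Blocks until the kernel has pushed the dirty data to stable storage. */
    if (fsync(fd) != 0) {
        close(fd);
        return -1;
    }

    return close(fd);
}
```

With O_DIRECT the same write would bypass the pagecache entirely and
would generally need suitably aligned buffers, offsets, and lengths.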

This isn't the "big win" of dealing with I/O issues during checkpoints
that we'd like to see, but it certainly feels like it'd be an
improvement over the current double-buffering situation at least.

Thanks,
    Stephen
