Thread: Linux kernel impact on PostgreSQL performance

From:
Mel Gorman
Date:

Hi,

I'm the chair for Linux Storage, Filesystem and Memory Management Summit 2014
(LSF/MM). A CFP was sent out last month (https://lwn.net/Articles/575681/)
that you may have seen already.

In recent years we have had at least one topic that was shared between
all three tracks that was led by a person outside of the usual kernel
development community. I am checking if the PostgreSQL community
would be willing to volunteer someone to lead a topic discussing
PostgreSQL performance with recent kernels or to highlight regressions
or future developments you feel are potentially a problem. With luck
someone suitable is already travelling to the collaboration summit
(http://events.linuxfoundation.org/events/collaboration-summit) and it
would not be too inconvenient to drop in for LSF/MM as well.

There are two reasons why I'm suggesting this. First, PostgreSQL was the
basis of a test used to highlight a scheduler problem around kernel 3.6
but otherwise in my experience it is rare that PostgreSQL is part of a
bug report.  I am skeptical this particular bug report was a typical use
case for PostgreSQL (pgbench, read-only, many threads, very small in-memory
database). I wonder why reports related to PostgreSQL are not more common.
One assumption would be that PostgreSQL is perfectly happy with the current
kernel behaviour in which case our discussion here is done.

This brings me to the second reason -- there is evidence
that the PostgreSQL community is not happy with the current
direction of kernel development. The most obvious example is this thread
http://postgresql.1045698.n5.nabble.com/Why-we-are-going-to-have-to-go-DirectIO-td5781471.html
but I suspect there are others. The thread alleges that the kernel community
are in the business of pushing hackish changes into the IO stack without
much thought or testing, although the linked article describes a VM and not
a storage problem. I'm not here to debate the kernel's regression testing
or development methodology but LSF/MM is one place where a large number
of people involved with the IO layers will be attending.  If you have a
concrete complaint then here is a soap box.

Does the PostgreSQL community have a problem with recent kernels,
particularly with respect to the storage, filesystem or memory management
layers? If yes, do you have some data that can highlight this and can you
volunteer someone to represent your interests to the kernel community? Are
current developments in the IO layer counter to the PostgreSQL requirements?
If so, what developments, why are they a problem, do you have a suggested
alternative or some idea of what we should watch out for? The track topic
would be up to you but just as a hint, we'd need something a lot more
concrete than "you should test more".

-- 
Mel Gorman
SUSE Labs



From:
Josh Berkus
Date:

Mel,

> I'm the chair for Linux Storage, Filesystem and Memory Management Summit 2014
> (LSF/MM). A CFP was sent out last month (https://lwn.net/Articles/575681/)
> that you may have seen already.
> 
> In recent years we have had at least one topic that was shared between
> all three tracks that was led by a person outside of the usual kernel
> development community. I am checking if the PostgreSQL community
> would be willing to volunteer someone to lead a topic discussing
> PostgreSQL performance with recent kernels or to highlight regressions
> or future developments you feel are potentially a problem. With luck
> someone suitable is already travelling to the collaboration summit
> (http://events.linuxfoundation.org/events/collaboration-summit) and it
> would not be too inconvenient to drop in for LSF/MM as well.

We can definitely get someone there.  I'll certainly be there; I'm
hoping to get someone who has closer involvement with our kernel
interaction as well.

> There are two reasons why I'm suggesting this. First, PostgreSQL was the
> basis of a test used to highlight a scheduler problem around kernel 3.6
> but otherwise in my experience it is rare that PostgreSQL is part of a
> bug report.  I am skeptical this particular bug report was a typical use
> case for PostgreSQL (pgbench, read-only, many threads, very small in-memory
> database). I wonder why reports related to PostgreSQL are not more common.
> One assumption would be that PostgreSQL is perfectly happy with the current
> kernel behaviour in which case our discussion here is done.

To be frank, it's because most people are still running on 2.6.19, and
as a result are completely unaware of recent developments.  Second,
because there's no obvious place to complain to ... lkml doesn't welcome
bug reports, and where else do you go?

> Does the PostgreSQL community have a problem with recent kernels,
> particularly with respect to the storage, filesystem or memory management
> layers? If yes, do you have some data that can highlight this and can you
> volunteer someone to represent your interests to the kernel community? 

Yes, and yes.

> Are
> current developments in the IO layer counter to the PostgreSQL requirements?
> If so, what developments, why are they a problem, do you have a suggested
> alternative or some idea of what we should watch out for? 

Mostly the issue is changes to the IO scheduler which improve one use
case at the expense of others, or set defaults which emphasize desktop
hardware over server hardware.

What also came up with the recent change to LRU is that the Postgres
community apparently has more experience than the Linux community with
buffer-clearing algorithms, and we ought to share that.

> The track topic
> would be up to you but just as a hint, we'd need something a lot more
> concrete than "you should test more".

How about "don't add major IO behavior changes with no
backwards-compatibility switches"?  ;-)

Seriously, one thing I'd like to get out of Collab would be a reasonable
regimen for testing database performance on Linux kernels.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



From:
Kevin Grittner
Date:

Josh Berkus <> wrote:

>> Does the PostgreSQL community have a problem with recent
>> kernels, particularly with respect to the storage, filesystem or
>> memory management layers?

> How about "don't add major IO behavior changes with no
> backwards-compatibility switches"?  ;-)

I notice, Josh, that you didn't mention the problems many people
have run into with Transparent Huge Page defrag and with NUMA
access.  Is that because there *are* configuration options that
allow people to get decent performance once the issue is diagnosed?
It seems like maybe there could be a better way to give a heads-up
on hazards in a new kernel to the database world, but I don't know
quite what that would be.  For all I know, it is already available
if you know where to look.

> Seriously, one thing I'd like to get out of Collab would be a
> reasonable regimen for testing database performance on Linux
> kernels.

... or perhaps you figure this is what would bring such issues to
the community's attention before people are bitten in production
environments?

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



From:
Josh Berkus
Date:

On 01/13/2014 10:51 AM, Kevin Grittner wrote:
>> How about "don't add major IO behavior changes with no
>> backwards-compatibility switches"?  ;-)
> 
> I notice, Josh, that you didn't mention the problems many people
> have run into with Transparent Huge Page defrag and with NUMA
> access.  Is that because there *are* configuration options that
> allow people to get decent performance once the issue is diagnosed?
> It seems like maybe there could be a better way to give a heads-up
> on hazards in a new kernel to the database world, but I don't know
> quite what that would be.  For all I know, it is already available
> if you know where to look.

Well, it's the lack of sysctl options which takes the 2Q change from
"annoyance" to "potential disaster".  We can't ever get away from the
possibility that the Postgres use-case might be the minority use-case,
and we might have to use non-default options.  It's when those options
aren't present *at all* that we're stuck.

However, I agree that a worthwhile thing to talk about is having some
better channel to notify the Postgres (and other DB) communities about
major changes to IO and Memory management.

Wanna go to Collab?

>> Seriously, one thing I'd like to get out of Collab would be a
>> reasonable regimen for testing database performance on Linux
>> kernels.
> 
> ... or perhaps you figure this is what would bring such issues to
> the community's attention before people are bitten in production
> environments?

That, too.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



From:
Robert Haas
Date:

On Mon, Jan 13, 2014 at 1:51 PM, Kevin Grittner <> wrote:
> I notice, Josh, that you didn't mention the problems many people
> have run into with Transparent Huge Page defrag and with NUMA
> access.

Amen to that.  Actually, I think NUMA can be (mostly?) fixed by
setting zone_reclaim_mode; is there some other problem besides that?
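
For anyone who hits this, the knob Robert mentions is a plain sysctl; the
usual workaround (exact defaults vary by kernel version, so treat this as a
sketch) looks like:

```shell
# Disable zone reclaim so allocations spill over to remote NUMA nodes
# instead of the kernel aggressively evicting local page cache, which
# tends to hurt cache-heavy workloads like databases.
sysctl -w vm.zone_reclaim_mode=0

# Equivalent one-shot form:
echo 0 > /proc/sys/vm/zone_reclaim_mode

# To persist across reboots, add to /etc/sysctl.conf:
#   vm.zone_reclaim_mode = 0
```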

The other thing that comes to mind is the kernel's caching behavior.
We've talked a lot over the years about the difficulties of getting
the kernel to write data out when we want it to and to not write data
out when we don't want it to.  When it writes data back to disk too
aggressively, we get lousy throughput because the same page can get
written more than once when caching it for longer would have allowed
write-combining.  When it doesn't write data to disk aggressively
enough, we get huge latency spikes at checkpoint time when we call
fsync() and the kernel says "uh, what? you wanted that data *on the
disk*? sorry boss!" and then proceeds to destroy the world by starving
the rest of the system for I/O for many seconds or minutes at a time.
We've made some desultory attempts to use sync_file_range() to improve
things here, but I'm not sure that's really the right tool, and if it
is we don't know how to use it well enough to obtain consistent
positive results.

On a related note, there's also the problem of double-buffering.  When
we read a page into shared_buffers, we leave a copy behind in the OS
buffers, and similarly on write-out.  It's very unclear what to do
about this, since the kernel and PostgreSQL don't have intimate
knowledge of what each other are doing, but it would be nice to solve
somehow.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



From:
Claudio Freire
Date:

On Mon, Jan 13, 2014 at 5:15 PM, Robert Haas <> wrote:
> On a related note, there's also the problem of double-buffering.  When
> we read a page into shared_buffers, we leave a copy behind in the OS
> buffers, and similarly on write-out.  It's very unclear what to do
> about this, since the kernel and PostgreSQL don't have intimate
> knowledge of what each other are doing, but it would be nice to solve
> somehow.


There you have a much harder algorithmic problem.

You can basically control duplication with fadvise and WONTNEED. The
problem here is not the kernel and whether or not it allows postgres
to be smart about it. The problem is... what kind of smarts
(algorithm) to use.



From:
Jim Nasby
Date:

On 1/13/14, 2:19 PM, Claudio Freire wrote:
> On Mon, Jan 13, 2014 at 5:15 PM, Robert Haas <> wrote:
>> On a related note, there's also the problem of double-buffering.  When
>> we read a page into shared_buffers, we leave a copy behind in the OS
>> buffers, and similarly on write-out.  It's very unclear what to do
>> about this, since the kernel and PostgreSQL don't have intimate
>> knowledge of what each other are doing, but it would be nice to solve
>> somehow.
>
>
> There you have a much harder algorithmic problem.
>
> You can basically control duplication with fadvise and WONTNEED. The
> problem here is not the kernel and whether or not it allows postgres
> to be smart about it. The problem is... what kind of smarts
> (algorithm) to use.

Isn't this a fairly simple matter of when we read a page into shared buffers
tell the kernel to forget that page? And a corollary to that for when we
dump a page out of shared_buffers (here kernel, please put this back into
your cache).
 
-- 
Jim C. Nasby, Data Architect                       
512.569.9461 (cell)                         http://jim.nasby.net



From:
Claudio Freire
Date:

On Mon, Jan 13, 2014 at 5:23 PM, Jim Nasby <> wrote:
> On 1/13/14, 2:19 PM, Claudio Freire wrote:
>>
>> On Mon, Jan 13, 2014 at 5:15 PM, Robert Haas <>
>> wrote:
>>>
>>> On a related note, there's also the problem of double-buffering.  When
>>> we read a page into shared_buffers, we leave a copy behind in the OS
>>> buffers, and similarly on write-out.  It's very unclear what to do
>>> about this, since the kernel and PostgreSQL don't have intimate
>>> knowledge of what each other are doing, but it would be nice to solve
>>> somehow.
>>
>>
>>
>> There you have a much harder algorithmic problem.
>>
>> You can basically control duplication with fadvise and WONTNEED. The
>> problem here is not the kernel and whether or not it allows postgres
>> to be smart about it. The problem is... what kind of smarts
>> (algorithm) to use.
>
>
> Isn't this a fairly simple matter of when we read a page into shared buffers
> tell the kernel to forget that page? And a corollary to that for when we
> dump a page out of shared_buffers (here kernel, please put this back into
> your cache).


That's my point. In terms of kernel-postgres interaction, it's fairly simple.

What's not so simple, is figuring out what policy to use. Remember,
you cannot tell the kernel to put some page in its page cache without
reading it or writing it. So, once you make the kernel forget a page,
evicting it from shared buffers becomes quite expensive.



From:
Jim Nasby
Date:

On 1/13/14, 2:27 PM, Claudio Freire wrote:
> On Mon, Jan 13, 2014 at 5:23 PM, Jim Nasby <> wrote:
>> On 1/13/14, 2:19 PM, Claudio Freire wrote:
>>>
>>> On Mon, Jan 13, 2014 at 5:15 PM, Robert Haas <>
>>> wrote:
>>>>
>>>> On a related note, there's also the problem of double-buffering.  When
>>>> we read a page into shared_buffers, we leave a copy behind in the OS
>>>> buffers, and similarly on write-out.  It's very unclear what to do
>>>> about this, since the kernel and PostgreSQL don't have intimate
>>>> knowledge of what each other are doing, but it would be nice to solve
>>>> somehow.
>>>
>>>
>>>
>>> There you have a much harder algorithmic problem.
>>>
>>> You can basically control duplication with fadvise and WONTNEED. The
>>> problem here is not the kernel and whether or not it allows postgres
>>> to be smart about it. The problem is... what kind of smarts
>>> (algorithm) to use.
>>
>>
>> Isn't this a fairly simple matter of when we read a page into shared buffers
>> tell the kernel to forget that page? And a corollary to that for when we
>> dump a page out of shared_buffers (here kernel, please put this back into
>> your cache).
>
>
> That's my point. In terms of kernel-postgres interaction, it's fairly simple.
>
> What's not so simple, is figuring out what policy to use. Remember,
> you cannot tell the kernel to put some page in its page cache without
> reading it or writing it. So, once you make the kernel forget a page,
> evicting it from shared buffers becomes quite expensive.

Well, if we were to collaborate with the kernel community on this then
presumably we can do better than that for eviction... even to the extent of
"here's some data from this range in this file. It's (clean|dirty). Put it
in your cache. Just trust me on this."
 
-- 
Jim C. Nasby, Data Architect                       
512.569.9461 (cell)                         http://jim.nasby.net



From:
Claudio Freire
Date:

On Mon, Jan 13, 2014 at 5:32 PM, Jim Nasby <> wrote:
>>
>> That's my point. In terms of kernel-postgres interaction, it's fairly
>> simple.
>>
>> What's not so simple, is figuring out what policy to use. Remember,
>> you cannot tell the kernel to put some page in its page cache without
>> reading it or writing it. So, once you make the kernel forget a page,
>> evicting it from shared buffers becomes quite expensive.
>
>
> Well, if we were to collaborate with the kernel community on this then
> presumably we can do better than that for eviction... even to the extent of
> "here's some data from this range in this file. It's (clean|dirty). Put it
> in your cache. Just trust me on this."


If I had a kernel developer hat, I'd put it on to say: I don't think
allowing that last bit is wise for a kernel.

It would violate oh-so-many separation rules and open an oh-so-big can-o-worms.



From:
Andres Freund
Date:

On 2014-01-13 15:15:16 -0500, Robert Haas wrote:
> On Mon, Jan 13, 2014 at 1:51 PM, Kevin Grittner <> wrote:
> > I notice, Josh, that you didn't mention the problems many people
> > have run into with Transparent Huge Page defrag and with NUMA
> > access.
> 
> Amen to that.  Actually, I think NUMA can be (mostly?) fixed by
> setting zone_reclaim_mode; is there some other problem besides that?

I think that fixes some of the worst instances, but I've seen machines
spending horrible amounts of CPU (& BUS) time in page reclaim
nonetheless. If I analyzed it correctly it's in RAM << working set
workloads where RAM is pretty large and most of it is used as page
cache. The kernel ends up spending a huge percentage of time finding and
potentially defragmenting pages when looking for victim buffers.

> On a related note, there's also the problem of double-buffering.  When
> we read a page into shared_buffers, we leave a copy behind in the OS
> buffers, and similarly on write-out.  It's very unclear what to do
> about this, since the kernel and PostgreSQL don't have intimate
> knowledge of what each other are doing, but it would be nice to solve
> somehow.

I've wondered before if there wouldn't be a chance for postgres to say
"my dear OS, that the file range 0-8192 of file x contains y, no need to
reread" and do that when we evict a page from s_b but I never dared to
actually propose that to kernel people...

Greetings,

Andres Freund

-- 
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



From:
Jim Nasby
Date:

On 1/13/14, 2:37 PM, Claudio Freire wrote:
> On Mon, Jan 13, 2014 at 5:32 PM, Jim Nasby <> wrote:
>>>
>>> That's my point. In terms of kernel-postgres interaction, it's fairly
>>> simple.
>>>
>>> What's not so simple, is figuring out what policy to use. Remember,
>>> you cannot tell the kernel to put some page in its page cache without
>>> reading it or writing it. So, once you make the kernel forget a page,
>>> evicting it from shared buffers becomes quite expensive.
>>
>>
>> Well, if we were to collaborate with the kernel community on this then
>> presumably we can do better than that for eviction... even to the extent of
>> "here's some data from this range in this file. It's (clean|dirty). Put it
>> in your cache. Just trust me on this."
>
>
> If I had a kernel developer hat, I'd put it on to say: I don't think
> allowing that last bit is wise for a kernel.
>
> It would violate oh-so-many separation rules and open an oh-so-big can-o-worms.

Yeah, if it were me I'd probably want to keep a hash of the page and its
address and only accept putting a page back into the kernel if it matched
my hash. Otherwise you'd just have to treat it as a write.
 
-- 
Jim C. Nasby, Data Architect                       
512.569.9461 (cell)                         http://jim.nasby.net



From:
Andres Freund
Date:

On 2014-01-13 15:53:36 -0500, Trond Myklebust wrote:
> > I've wondered before if there wouldn't be a chance for postgres to say
> > "my dear OS, that the file range 0-8192 of file x contains y, no need to
> > reread" and do that when we evict a page from s_b but I never dared to
> > actually propose that to kernel people...
> 
> O_DIRECT was specifically designed to solve the problem of double buffering
> between applications and the kernel. Why are you not able to use that in
> these situations?
 

Because we like to let the OS handle part of postgres' caching. For
one, it makes servers with several applications/databases much more
realistic without seriously overallocating memory; for another, doing it
all ourselves would mean a huge chunk of platform-dependent code to get
good performance everywhere.
The above was explicitly not to avoid double buffering but to move a
buffer away from postgres' own buffers to the kernel's buffers once it's
not 100% clear we need it in buffers anymore.

Part of the reason this is being discussed is that people previously
suggested going the direct IO route, and some people (most prominently
J. Corbet in http://archives.postgresql.org/message-id/20131204083345.31c60dd1%40lwn.net
and others) disagreed because that goes the route of reinventing
storage layers everywhere without improving the common codepaths.

Greetings,

Andres Freund

-- 
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



From:
Robert Haas
Date:

On Mon, Jan 13, 2014 at 3:53 PM, Trond Myklebust <> wrote:
> O_DIRECT was specifically designed to solve the problem of double buffering
> between applications and the kernel. Why are you not able to use that in
> these situations?
 

O_DIRECT was apparently designed by a deranged monkey on some serious
mind-controlling substances.  But don't take it from me, I have it on
good authority:

http://yarchive.net/comp/linux/o_direct.html

One might even say the best authority.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



From:
Jeff Janes
Date:

On Mon, Jan 13, 2014 at 12:32 PM, Jim Nasby <> wrote:
> On 1/13/14, 2:27 PM, Claudio Freire wrote:
>> On Mon, Jan 13, 2014 at 5:23 PM, Jim Nasby <> wrote:
>>> On 1/13/14, 2:19 PM, Claudio Freire wrote:
>>>> On Mon, Jan 13, 2014 at 5:15 PM, Robert Haas <> wrote:
>>>>> On a related note, there's also the problem of double-buffering.  When
>>>>> we read a page into shared_buffers, we leave a copy behind in the OS
>>>>> buffers, and similarly on write-out.  It's very unclear what to do
>>>>> about this, since the kernel and PostgreSQL don't have intimate
>>>>> knowledge of what each other are doing, but it would be nice to solve
>>>>> somehow.
>>>>
>>>> There you have a much harder algorithmic problem.
>>>>
>>>> You can basically control duplication with fadvise and WONTNEED. The
>>>> problem here is not the kernel and whether or not it allows postgres
>>>> to be smart about it. The problem is... what kind of smarts
>>>> (algorithm) to use.
>>>
>>> Isn't this a fairly simple matter of when we read a page into shared
>>> buffers tell the kernel to forget that page? And a corollary to that
>>> for when we dump a page out of shared_buffers (here kernel, please put
>>> this back into your cache).
>>
>> That's my point. In terms of kernel-postgres interaction, it's fairly
>> simple.
>>
>> What's not so simple, is figuring out what policy to use.

I think the above is pretty simple for both interaction (allow us to inject
a clean page into the file page cache) and policy (forget it after you hand
it to us, then remember it again when we hand it back to you clean).  And I
think it would pretty likely be an improvement over what we currently do.
But I think it is probably the wrong way to get the improvement.  I think
the real problem is that we don't trust ourselves to manage more of the
memory ourselves.

As far as I know, we still don't have a publicly disclosable and readily
reproducible test case for the reports of performance degradation when we
have more than 8GB in shared_buffers.  If we had one of those, we could
likely reduce the double buffering problem by fixing our own scalability
issues and therefore taking responsibility for more of the data ourselves.



>> Remember,
>> you cannot tell the kernel to put some page in its page cache without
>> reading it or writing it. So, once you make the kernel forget a page,
>> evicting it from shared buffers becomes quite expensive.
>
> Well, if we were to collaborate with the kernel community on this then
> presumably we can do better than that for eviction... even to the extent
> of "here's some data from this range in this file. It's (clean|dirty).
> Put it in your cache. Just trust me on this."

Which, in the case of it being clean, amounts to "Here is data we don't
want in memory any more because we think it is cold.  But we don't trust
ourselves, so please hold on to it anyway."  That might be a tough sell to
the kernel people.

 Cheers,

Jeff



From:
James Bottomley
Date:

On Mon, 2014-01-13 at 14:32 -0600, Jim Nasby wrote:
> On 1/13/14, 2:27 PM, Claudio Freire wrote:
> > On Mon, Jan 13, 2014 at 5:23 PM, Jim Nasby <> wrote:
> >> On 1/13/14, 2:19 PM, Claudio Freire wrote:
> >>>
> >>> On Mon, Jan 13, 2014 at 5:15 PM, Robert Haas <>
> >>> wrote:
> >>>>
> >>>> On a related note, there's also the problem of double-buffering.  When
> >>>> we read a page into shared_buffers, we leave a copy behind in the OS
> >>>> buffers, and similarly on write-out.  It's very unclear what to do
> >>>> about this, since the kernel and PostgreSQL don't have intimate
> >>>> knowledge of what each other are doing, but it would be nice to solve
> >>>> somehow.
> >>>
> >>>
> >>>
> >>> There you have a much harder algorithmic problem.
> >>>
> >>> You can basically control duplication with fadvise and WONTNEED. The
> >>> problem here is not the kernel and whether or not it allows postgres
> >>> to be smart about it. The problem is... what kind of smarts
> >>> (algorithm) to use.
> >>
> >>
> >> Isn't this a fairly simple matter of when we read a page into shared buffers
> >> tell the kernel to forget that page? And a corollary to that for when we
> >> dump a page out of shared_buffers (here kernel, please put this back into
> >> your cache).
> >
> >
> > That's my point. In terms of kernel-postgres interaction, it's fairly simple.
> >
> > What's not so simple, is figuring out what policy to use. Remember,
> > you cannot tell the kernel to put some page in its page cache without
> > reading it or writing it. So, once you make the kernel forget a page,
> > evicting it from shared buffers becomes quite expensive.
> 
> Well, if we were to collaborate with the kernel community on this then
> presumably we can do better than that for eviction... even to the
> extent of "here's some data from this range in this file. It's (clean|
> dirty). Put it in your cache. Just trust me on this."

This should be the madvise() interface (with MADV_WILLNEED and
MADV_DONTNEED). Is there something in that interface that is
insufficient?

James





From:
Trond Myklebust
Date:

On Jan 13, 2014, at 15:40, Andres Freund <> wrote:

> On 2014-01-13 15:15:16 -0500, Robert Haas wrote:
>> On Mon, Jan 13, 2014 at 1:51 PM, Kevin Grittner <> wrote:
>>> I notice, Josh, that you didn't mention the problems many people
>>> have run into with Transparent Huge Page defrag and with NUMA
>>> access.
>>
>> Amen to that.  Actually, I think NUMA can be (mostly?) fixed by
>> setting zone_reclaim_mode; is there some other problem besides that?
>
> I think that fixes some of the worst instances, but I've seen machines
> spending horrible amounts of CPU (& BUS) time in page reclaim
> nonetheless. If I analyzed it correctly it's in RAM << working set
> workloads where RAM is pretty large and most of it is used as page
> cache. The kernel ends up spending a huge percentage of time finding and
> potentially defragmenting pages when looking for victim buffers.
>
>> On a related note, there's also the problem of double-buffering.  When
>> we read a page into shared_buffers, we leave a copy behind in the OS
>> buffers, and similarly on write-out.  It's very unclear what to do
>> about this, since the kernel and PostgreSQL don't have intimate
>> knowledge of what each other are doing, but it would be nice to solve
>> somehow.
>
> I've wondered before if there wouldn't be a chance for postgres to say
> "my dear OS, that the file range 0-8192 of file x contains y, no need to
> reread" and do that when we evict a page from s_b but I never dared to
> actually propose that to kernel people...

O_DIRECT was specifically designed to solve the problem of double buffering
between applications and the kernel. Why are you not able to use that in
these situations?

Cheers,  Trond


From:
Kevin Grittner
Date:

Josh Berkus <> wrote:

> Wanna go to Collab?

I don't think that works out for me, but thanks for suggesting it.

I'd be happy to brainstorm with anyone who does go about issues to
discuss; although the ones I keep running into have already been
mentioned.

Regarding the problems others have mentioned, there are a few
features that might be a very big plus for us.  Additional ways of
hinting pages might be very useful.  If we had a way to specify how
many dirty pages were cached in PostgreSQL, the OS would count
those for calculations for writing dirty pages, and we could avoid
the "write avalanche" which is currently so tricky to avoid without
causing repeated writes to the same page.  Or perhaps instead a way
to hint a page as dirty so that the OS could not only count those,
but discard the obsolete data from its cache if it is not already
dirty at the OS level, and lower the write priority if it is dirty
(to improve the odds of collapsing multiple writes).  If there was
a way to use DONTNEED or something similar with the ability to
rescind it if the page still happened to be in the OS cache,
that might help for when we discard a still-clean page from our
buffers.  And I seem to have a vague memory of there being cases
where the OS is first reading pages when we ask to write them,
which seems like avoidable I/O.  (I'm not sure about that one,
though.)

Also, something like THP support should really have sysctl support
rather than requiring people to put echo commands into scripts and
tie those into runlevel changes.  That's pretty ugly for something
which has turned out to be necessary so often.
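
For reference, the echo-into-a-script workaround being described looks
roughly like this (sysfs paths as on mainline kernels of the time; some
distros relocate them, so treat this as an example):

```shell
# Stop the synchronous THP defrag stalls while keeping huge pages,
# or disable THP altogether.
echo never > /sys/kernel/mm/transparent_hugepage/defrag
# or, more drastically:
echo never > /sys/kernel/mm/transparent_hugepage/enabled

# There is no sysctl for these, so they have to be reapplied on every
# boot, e.g. from rc.local or an init script.
```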

I don't get too excited about changes to the default schedulers --
it's been pretty widely known for a long time that DEADLINE or NOOP
perform better than any alternatives for most database loads.
Anyone with a job setting up Linux machines to be used for database
servers should know to cover that.  As long as those two don't get
broken, I'm good.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



From:
Andres Freund
Date:

On 2014-01-13 12:34:35 -0800, James Bottomley wrote:
> On Mon, 2014-01-13 at 14:32 -0600, Jim Nasby wrote:
> > Well, if we were to collaborate with the kernel community on this then
> > presumably we can do better than that for eviction... even to the
> > extent of "here's some data from this range in this file. It's (clean|
> > dirty). Put it in your cache. Just trust me on this."
> 
> This should be the madvise() interface (with MADV_WILLNEED and
> MADV_DONTNEED) is there something in that interface that is
> insufficient?

For one, postgres doesn't use mmap for files (and can't without major
new interfaces). Frequently mmap()/madvise()/munmap()ing 8kb chunks has
horrible consequences for performance/scalability - very quickly you
contend on locks in the kernel.
Also, that will mark that page dirty, which isn't what we want in this
case. One major use case is transplanting a page coming from postgres'
buffers into the kernel's buffer cache, because the latter has a much
better chance of properly allocating system resources across the
independent applications that are running.

Oh, and the kernel's page-cache management while far from perfect,
actually scales much better than postgres'.

Greetings,

Andres Freund

-- 
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



From:
Greg Stark
Date:

On Mon, Jan 13, 2014 at 9:12 PM, Andres Freund <> wrote:
> For one, postgres doesn't use mmap for files (and can't without major
> new interfaces). Frequently mmap()/madvise()/munmap()ing 8kb chunks has
> horrible consequences for performance/scalability - very quickly you
> contend on locks in the kernel.


I may as well dump this in this thread. We've discussed this in person
a few times, including at least once with Ted Ts'o when he visited
Dublin last year.

The fundamental conflict is that the kernel understands the hardware and
the other software using the same resources better, while Postgres
understands its own access patterns better. We need to either add
interfaces so Postgres can teach the kernel what it needs about its
access patterns or add interfaces so Postgres can find out what it
needs to know about the hardware context.

The more ambitious and interesting direction is to let Postgres tell
the kernel what it needs to know to manage everything. To do that we
would need the ability to control when pages are flushed out. This is
absolutely necessary to maintain consistency. Postgres would need to
be able to mark pages as unflushable until some point in time in the
future when the journal is flushed. We discussed various ways that
interface could work but it would be tricky to keep it low enough
overhead to be workable.

The less exciting, more conservative option would be to add kernel
interfaces to teach Postgres about things like raid geometries. Then
Postgres could use directio and decide to do prefetching based on the
raid geometry, how much i/o bandwidth and how many iops are available,
etc.

Reimplementing i/o schedulers and all the rest of the work that the
kernel provides inside Postgres just seems like something outside our
competency and that none of us is really excited about doing.

-- 
greg



From:
Josh Berkus
Date:

Everyone,

I am looking for one or more hackers to go to Collab with me to discuss
this.  If you think that might be you, please let me know and I'll look
for funding for your travel.


-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



From:
Mel Gorman
Date:

On Mon, Jan 13, 2014 at 03:15:16PM -0500, Robert Haas wrote:
> On Mon, Jan 13, 2014 at 1:51 PM, Kevin Grittner <> wrote:
> > I notice, Josh, that you didn't mention the problems many people
> > have run into with Transparent Huge Page defrag and with NUMA
> > access.
> 

Ok, there are at least three potential problems there that you may or
may not have run into.

First, THP when it was first introduced was a bit of a disaster. In 3.0,
it was *very* heavy handed and would trash the system reclaiming memory
to satisfy an allocation. When it did this, it would also write back a
bunch of data and block on it to boot. It was not the smartest move of
all time but it was improved over time and in some cases the patches were
also backported by 3.0.101. This is a problem that should have been
alleviated over time.

The general symptoms of the problem would be massive stalls and
monitoring the /proc/PID/stack of interesting processes would show it to
be somewhere in do_huge_pmd_anonymous_page -> alloc_pages_nodemask ->
try_to_free_pages -> migrate_pages or something similar. You may have
worked around it by disabling THP with a command line switch or
/sys/kernel/mm/transparent_hugepage/enabled in the past.
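The THP setting mentioned above can be inspected directly from userspace; a small sketch (it assumes the sysfs file exists, which it does only on THP-capable kernels):

```python
def thp_mode(path="/sys/kernel/mm/transparent_hugepage/enabled"):
    """Return the active THP mode ('always', 'madvise' or 'never'),
    or None if THP is not available on this kernel.  The file reads
    as e.g. '[always] madvise never', with the active mode in
    brackets."""
    try:
        with open(path) as f:
            text = f.read()
    except OSError:
        return None
    return text[text.index("[") + 1 : text.index("]")]
```

Systems that worked around the early stalls by disabling THP would report `never` here.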

This is "not meant to happen" any more or at least it has been a while
since a bug was filed against me in this area. There are corner cases
though. If the underlying filesystem is NFS, the problem might still be
experienced.

That is the simple case.

You might have also hit the case where THP pages filled with zeros did not
use the zero page. That would have looked like a larger footprint than
anticipated and led to another range of problems. This has since been
addressed, but maybe not recently enough. It's less likely this is your
problem though, as I expect you actually use your buffers rather than
leaving them filled with zeros.

You mention NUMA, but that problem is trickier to figure out without more
context.  THP can cause unexpected interleaving between NUMA nodes. Memory
that would have been local on a 4K page boundary becomes remote accesses
when THP is enabled and performance would be hit (maybe 3-5% depending on
the machine). It's not the only possibility though. If memory was being
used sparsely and THP was in use then the overall memory footprint may be
higher than it should be. This potentially would cause allocations to spill
over to remote nodes while kswapd wakes up to reclaim local memory. That
would lead to weird buffer aging inversion problems. This is a hell of a
lot of guessing though and we'd need a better handle on the reproduction
case to pin it down.

> Amen to that.  Actually, I think NUMA can be (mostly?) fixed by
> setting zone_reclaim_mode; is there some other problem besides that?
> 

Really?

zone_reclaim_mode is often a complete disaster unless the workload is
partitioned to fit within NUMA nodes. On older kernels enabling it would
sometimes cause massive stalls. I'm actually very surprised to hear it
fixes anything and would be interested in hearing more about what sort
of circumstances would convince you to enable that thing.
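For reference, zone_reclaim_mode is a bitmask whose bits are documented in Documentation/sysctl/vm.txt; a small decoder, as a sketch:

```python
# Bit values from Documentation/sysctl/vm.txt
RECLAIM_ZONE = 1   # run zone reclaim when a zone runs out of memory
RECLAIM_WRITE = 2  # allow writing out dirty pages during zone reclaim
RECLAIM_SWAP = 4   # allow swapping pages during zone reclaim

def decode_zone_reclaim(mode):
    """Decode a vm.zone_reclaim_mode value into its component flags."""
    return {
        "reclaim": bool(mode & RECLAIM_ZONE),
        "writeback": bool(mode & RECLAIM_WRITE),
        "swap": bool(mode & RECLAIM_SWAP),
    }
```

The partitioned-workload setting would be mode 1: reclaim locally, but never write back or swap to stay on-node.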

> The other thing that comes to mind is the kernel's caching behavior.
> We've talked a lot over the years about the difficulties of getting
> the kernel to write data out when we want it to and to not write data
> out when we don't want it to. 

Is sync_file_range() broken?

> When it writes data back to disk too
> aggressively, we get lousy throughput because the same page can get
> written more than once when caching it for longer would have allowed
> write-combining. 

Do you think that is related to dirty_ratio or dirty_writeback_centisecs?
If it's dirty_writeback_centisecs then that would be particularly tricky
because poor interactions there would come down to luck basically.

> When it doesn't write data to disk aggressively
> enough, we get huge latency spikes at checkpoint time when we call
> fsync() and the kernel says "uh, what? you wanted that data *on the
> disk*? sorry boss!" and then proceeds to destroy the world by starving
> the rest of the system for I/O for many seconds or minutes at a time.

Ok, parts of that are somewhat expected. It *may* depend on the
underlying filesystem. Some of them handle fsync better than others. If
you are syncing the whole file though when you call fsync then you are
potentially burned by having to write back dirty_ratio amounts of memory
which could take a substantial amount of time.

> We've made some desultory attempts to use sync_file_range() to improve
> things here, but I'm not sure that's really the right tool, and if it
> is we don't know how to use it well enough to obtain consistent
> positive results.
> 

That implies that sync_file_range() is broken in some fashion we
(or at least I) are not aware of, and that it needs kicking.
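sync_file_range(2) itself is reachable from userspace; a hedged sketch via ctypes, since Python does not wrap it (flag values copied from the man page; Linux-only):

```python
import ctypes
import ctypes.util
import tempfile

libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
libc.sync_file_range.argtypes = [ctypes.c_int, ctypes.c_int64,
                                 ctypes.c_int64, ctypes.c_uint]

# Flag values from sync_file_range(2)
SYNC_FILE_RANGE_WAIT_BEFORE = 1
SYNC_FILE_RANGE_WRITE = 2
SYNC_FILE_RANGE_WAIT_AFTER = 4

def start_writeback(fd, offset, nbytes):
    """Start writeback of a dirty range without waiting for it.
    Note this is NOT a durability barrier: it flushes no metadata
    and no drive cache, so fsync() is still needed at checkpoint."""
    if libc.sync_file_range(fd, offset, nbytes,
                            SYNC_FILE_RANGE_WRITE) != 0:
        raise OSError(ctypes.get_errno(), "sync_file_range")

# illustrative use: dirty one page, then kick off its writeback
with tempfile.NamedTemporaryFile() as tf:
    tf.write(b"x" * 8192)
    tf.flush()
    start_writeback(tf.fileno(), 0, 8192)
```

Spreading calls like this between checkpoints is the "write-behind" use of the interface; whether it behaves consistently is exactly what is being questioned above.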

> On a related note, there's also the problem of double-buffering.  When
> we read a page into shared_buffers, we leave a copy behind in the OS
> buffers, and similarly on write-out.  It's very unclear what to do
> about this, since the kernel and PostgreSQL don't have intimate
> knowledge of what each other are doing, but it would be nice to solve
> somehow.
> 

If it's mapped and clean, you do not need any more than
madvise(MADV_DONTNEED). If you are accessing the data via a file handle,
then I would expect posix_fadvise(POSIX_FADV_DONTNEED). Offhand, I do
not know how it behaved historically but right now it will usually sync
the data and then discard the pages. I say usually because it will not
necessarily sync if the storage is congested and there is no guarantee it
will be discarded. In older kernels, there was a bug where small calls to
posix_fadvise() would not work at all. This was fixed in 3.9.

The flipside is also meant to hold true. If you know data will be needed
in the near future then posix_fadvise(POSIX_FADV_WILLNEED). Glancing at
the implementation it does a forced read-ahead on the range of pages of
interest. It doesn't look like it would block.
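Both hints are available through posix_fadvise(2); a minimal sketch using Python's os wrappers (Linux; the scratch file is illustrative):

```python
import os
import tempfile

PAGE = 8192

def drop_from_page_cache(fd, offset, length):
    """Hint that a clean, already-synced range can be evicted."""
    os.posix_fadvise(fd, offset, length, os.POSIX_FADV_DONTNEED)

def prefetch(fd, offset, length):
    """Hint that a range will be needed soon (forced readahead)."""
    os.posix_fadvise(fd, offset, length, os.POSIX_FADV_WILLNEED)

# illustrative round trip: write, sync, drop, prefetch, re-read
with tempfile.NamedTemporaryFile() as tf:
    tf.write(b"a" * PAGE)
    tf.flush()
    os.fsync(tf.fileno())               # range must be clean first
    drop_from_page_cache(tf.fileno(), 0, PAGE)
    prefetch(tf.fileno(), 0, PAGE)
    assert os.pread(tf.fileno(), PAGE, 0) == b"a" * PAGE
```

As discussed, both calls are hints only: DONTNEED may not discard and WILLNEED may block on metadata or request allocation.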

The completely different approach for double buffering is direct IO but
there may be reasons why you are avoiding that and are unhappy with the
interfaces that are meant to work.

Just from the start, it looks like there are a number of problem areas.
Some may be fixed -- in which case we should identify what fixed it and in
what kernel version, and see whether it can be verified with a test case or
whether we managed to break something else in the process. Other bugs may
still exist because we believe some interface works the way users want when
it is in fact unfit for purpose for some reason.

-- 
Mel Gorman
SUSE Labs



From:
Mel Gorman
Date:

On Mon, Jan 13, 2014 at 06:27:03PM -0200, Claudio Freire wrote:
> On Mon, Jan 13, 2014 at 5:23 PM, Jim Nasby <> wrote:
> > On 1/13/14, 2:19 PM, Claudio Freire wrote:
> >>
> >> On Mon, Jan 13, 2014 at 5:15 PM, Robert Haas <>
> >> wrote:
> >>>
> >>> On a related note, there's also the problem of double-buffering.  When
> >>> we read a page into shared_buffers, we leave a copy behind in the OS
> >>> buffers, and similarly on write-out.  It's very unclear what to do
> >>> about this, since the kernel and PostgreSQL don't have intimate
> >>> knowledge of what each other are doing, but it would be nice to solve
> >>> somehow.
> >>
> >>
> >>
> >> There you have a much harder algorithmic problem.
> >>
> >> You can basically control duplication with fadvise and WONTNEED. The
> >> problem here is not the kernel and whether or not it allows postgres
> >> to be smart about it. The problem is... what kind of smarts
> >> (algorithm) to use.
> >
> >
> > Isn't this a fairly simple matter of when we read a page into shared buffers
> > tell the kernel do forget that page? And a corollary to that for when we
> > dump a page out of shared_buffers (here kernel, please put this back into
> > your cache).
> 
> 
> That's my point. In terms of kernel-postgres interaction, it's fairly simple.
> 
> What's not so simple, is figuring out what policy to use. Remember,
> you cannot tell the kernel to put some page in its page cache without
> reading it or writing it. So, once you make the kernel forget a page,
> evicting it from shared buffers becomes quite expensive.

posix_fadvise(POSIX_FADV_WILLNEED) is meant to cover this case by
forcing readahead. If you evict it prematurely then you do get kinda
screwed because you pay the IO cost to read it back in again even if you
had enough memory to cache it. Maybe this is the type of kernel-postgres
interaction that is annoying you.

If you don't evict, the kernel eventually steps in and evicts the wrong
thing. If you do evict and it was unnecessary, you pay an IO cost.

That could be something we look at. There are cases buried deep in the
VM where pages get shuffled to the end of the LRU and get tagged for
reclaim as soon as possible. Maybe you need access to something like
that via posix_fadvise to say "reclaim this page if you need memory but
leave it resident if there is no memory pressure" or something similar.
Not exactly sure what that interface would look like or offhand how it
could be reliably implemented.

-- 
Mel Gorman
SUSE Labs



From:
Andres Freund
Date:

On 2014-01-13 14:19:56 -0800, James Bottomley wrote:
> >  Frequently mmap()/madvise()/munmap()ing 8kb chunks has
> > horrible consequences for performance/scalability - very quickly you
> > contend on locks in the kernel.
> 
> Is this because of problems in the mmap_sem?

It's been a while since I looked at it, but yes, mmap_sem was part of
it. I also seem to recall the number of IPIs increasing far too much for
it to be practical, but I am not sure anymore.

> > Also, that will mark that page dirty, which isn't what we want in this
> > case.
> 
> You mean madvise (page_addr)?  It shouldn't ... the state of the dirty
> bit should only be updated by actual writes.  Which MADV_ primitive is
> causing the dirty marking, because we might be able to fix it (unless
> there's some weird corner case I don't know about).

Not the madvise() itself, but transplanting the buffer from postgres'
buffers to the mmap() area of the underlying file would, right?

> We also do have a way of transplanting pages: it's called splice.  How
> do the semantics of splice differ from what you need?

Hm. I don't really see how splice would allow us to seed the kernel's
pagecache with content *without* marking the page as dirty in the
kernel.
We don't need zero-copy IO here; the important thing is just to fill the
pagecache with content without a) rereading the page from disk or b)
marking the page as dirty.

> >  One major use case is transplanting a page coming from postgres'
> > buffers into the kernel's buffercache because the latter has a much
> > better chance of properly allocating system resources across independent
> > applications running.
> 
> If you want to share pages between the application and the page cache,
> the only known interface is mmap ... perhaps we can discuss how better
> to improve mmap for you?

I think purely using mmap() is pretty unlikely to work out - there are
just too many constraints about when a page is allowed to be written out
(e.g. it's interlocked with postgres' write ahead log). I also think
that for many practical purposes using mmap() would result in an absurd
number of mappings or mapping way too huge areas; e.g. large btree
indexes are usually accessed in a quite fragmented manner.

> > Oh, and the kernel's page-cache management while far from perfect,
> > actually scales much better than postgres'.
> 
> Well, then, it sounds like the best way forward would be to get
postgres to use the kernel page cache more efficiently.

No arguments there, although working on postgres scalability is a good
idea as well ;)

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



From:
Mel Gorman
Date:

On Mon, Jan 13, 2014 at 11:38:44PM +0100, Jan Kara wrote:
> On Mon 13-01-14 22:26:45, Mel Gorman wrote:
> > The flipside is also meant to hold true. If you know data will be needed
> > in the near future then posix_fadvise(POSIX_FADV_WILLNEED). Glancing at
> > the implementation it does a forced read-ahead on the range of pages of
> > interest. It doesn't look like it would block.
>   That's not quite true. POSIX_FADV_WILLNEED still needs to map logical
> file offsets to physical disk blocks and create IO requests. This happens
> synchronously. So if your disk is congested and relevant metadata is out of
> cache, or we simply run out of free IO requests, POSIX_FADV_WILLNEED can
> block for a significant amount of time.
> 

Umm, yes, you're right. It also potentially stalls allocating the pages
up front even though it will only try and direct reclaim pages once.
That can stall in some circumstances, particularly if there are a number
of processes trying to reclaim memory.

That kinda sucks though. One point of discussion would be to check whether
this is an interface that can be used; if so, whether it is required to
never block; and if so, whether there is something we can do about it --
queue the IO asynchronously if possible, but if the kernel would block then
do not bother. That does mean that fadvise would not guarantee that the
pages will be resident in the future, but that was never the intent of the
interface anyway.

-- 
Mel Gorman
SUSE Labs



From:
Josh Berkus
Date:

On 01/13/2014 02:26 PM, Mel Gorman wrote:
> Really?
> 
> zone_reclaim_mode is often a complete disaster unless the workload is
> partitioned to fit within NUMA nodes. On older kernels enabling it would
> sometimes cause massive stalls. I'm actually very surprised to hear it
> fixes anything and would be interested in hearing more about what sort
> of circumstances would convince you to enable that thing.

So the problem with the default setting is that it pretty much isolates
all FS cache for PostgreSQL to whichever socket the postmaster is
running on, and makes the other FS cache unavailable.  This means that,
for example, if you have two memory banks, then only one of them is
available for PostgreSQL filesystem caching ... essentially cutting your
available cache in half.

And however slow moving cached pages between memory banks is, it's an
order of magnitude faster than moving them from disk.  But this isn't
how the NUMA stuff is configured; it seems to assume that it's less
expensive to get pages from disk than to move them between banks, so
whatever you've got cached on the other bank, it flushes it to disk as
fast as possible.  I understand the goal was to make memory usage local
to the processors stuff was running on, but that includes an implicit
assumption that no individual process will ever want more than one
memory bank worth of cache.

So disabling all of the NUMA optimizations is the way to go for any
workload I personally deal with.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



From:
Hannu Krosing
Date:

On 01/13/2014 09:53 PM, Trond Myklebust wrote:
> On Jan 13, 2014, at 15:40, Andres Freund <> wrote:
>
>> On 2014-01-13 15:15:16 -0500, Robert Haas wrote:
>>> On Mon, Jan 13, 2014 at 1:51 PM, Kevin Grittner <> wrote:
>>>> I notice, Josh, that you didn't mention the problems many people
>>>> have run into with Transparent Huge Page defrag and with NUMA
>>>> access.
>>> Amen to that.  Actually, I think NUMA can be (mostly?) fixed by
>>> setting zone_reclaim_mode; is there some other problem besides that?
>> I think that fixes some of the worst instances, but I've seen machines
>> spending horrible amounts of CPU (& BUS) time in page reclaim
>> nonetheless. If I analyzed it correctly it's in RAM << working set
>> workloads where RAM is pretty large and most of it is used as page
>> cache. The kernel ends up spending a huge percentage of time finding and
>> potentially defragmenting pages when looking for victim buffers.
>>
>>> On a related note, there's also the problem of double-buffering.  When
>>> we read a page into shared_buffers, we leave a copy behind in the OS
>>> buffers, and similarly on write-out.  It's very unclear what to do
>>> about this, since the kernel and PostgreSQL don't have intimate
>>> knowledge of what each other are doing, but it would be nice to solve
>>> somehow.
>> I've wondered before if there wouldn't be a chance for postgres to say
>> "my dear OS, that the file range 0-8192 of file x contains y, no need to
>> reread" and do that when we evict a page from s_b but I never dared to
>> actually propose that to kernel people...
> O_DIRECT was specifically designed to solve the problem of double buffering 
> between applications and the kernel. Why are you not able to use that in these situations?
What is asked for is the opposite of O_DIRECT: a write from a buffer inside
postgresql to the linux *buffercache*, telling linux that it is the same as
what is currently on disk, so don't bother to ever write it back.

This would avoid the current double-buffering between the postgresql and
linux buffer caches while still making use of the linux cache when possible.

The use case is pages that postgresql has moved into its buffer cache but
has not modified. They will at some point be evicted from the postgresql
cache, but it is likely that they will still be needed sometime soon, so
what is required is "writing them back" to the original file -- only they
should not really be written, or marked dirty to be written later, at any
level below the linux cache, as they *already* are on the disk.

It is probably ok to put them in the LRU position as they are "written"
out from postgresql, though it may be better if we get some more control
over where in the LRU order they are placed. It may make sense to place
them based on when they were last read while residing inside the
postgresql cache.

Cheers


-- 
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ




From:
Claudio Freire
Date:

On Mon, Jan 13, 2014 at 7:36 PM, Mel Gorman <> wrote:
> That could be something we look at. There are cases buried deep in the
> VM where pages get shuffled to the end of the LRU and get tagged for
> reclaim as soon as possible. Maybe you need access to something like
> that via posix_fadvise to say "reclaim this page if you need memory but
> leave it resident if there is no memory pressure" or something similar.
> Not exactly sure what that interface would look like or offhand how it
> could be reliably implemented.


I don't see a reason not to make this behavior the default for WONTNEED.



From:
Jim Nasby
Date:

On 1/13/14, 4:44 PM, Andres Freund wrote:
>>> > >  One major use case is transplanting a page coming from postgres'
>>> > >buffers into the kernel's buffercache because the latter has a much
>>> > >better chance of properly allocating system resources across independent
>>> > >applications running.
>> >
>> >If you want to share pages between the application and the page cache,
>> >the only known interface is mmap ... perhaps we can discuss how better
>> >to improve mmap for you?
> I think purely using mmap() is pretty unlikely to work out - there's
> just too many constraints about when a page is allowed to be written out
> (e.g. it's interlocked with postgres' write ahead log). I also think
> that for many practical purposes using mmap() would result in an absurd
> number of mappings or mapping way too huge areas; e.g. large btree
> indexes are usually accessed in a quite fragmented manner.

Which brings up another interesting area^Wcan-of-worms: the database is implementing journaling on top of a filesystem
that's probably also journaling. And it's going to get worse: a Seagate researcher presented at Ricon East last year
that the next generation (or maybe the one after that) of spinning rust will use "shingling", which means that the
drive can't write randomly. So now the drive will ALSO have to journal. And of course SSDs already do this.

So now there's *three* pieces of software all doing the exact same thing, none of which are able to coordinate with
each other.

-- 
Jim C. Nasby, Data Architect                       
512.569.9461 (cell)                         http://jim.nasby.net



From:
Jim Nasby
Date:

On 1/13/14, 3:04 PM, Jeff Janes wrote:
>
> I think the above is pretty simple for both interaction (allow us to inject a clean page into the file page cache)
> and policy (forget it after you hand it to us, then remember it again when we hand it back to you clean).  And I think
> it would pretty likely be an improvement over what we currently do.  But I think it is probably the wrong way to get
> the improvement. I think the real problem is that we don't trust ourselves to manage more of the memory ourselves.
>
> As far as I know, we still don't have a publicly disclosable and readily reproducible test case for the reports of
> performance degradation when we have more than 8GB in shared_buffers. If we had one of those, we could likely reduce
> the double buffering problem by fixing our own scalability issues and therefore taking responsibility for more of the
> data ourselves.

While I agree we need to fix the 8GB limit, we're always going to have a problem here unless we put A LOT of new
abilities into our memory capabilities. Like, for example, stealing memory from shared buffers to support a sort. Or
implementing a system-wide limit on work_mem. Or both.


I would much rather teach the OS and Postgres to work together on memory management than for us to try and re-implement
everything the OS has already done for us.

-- 
Jim C. Nasby, Data Architect                       
512.569.9461 (cell)                         http://jim.nasby.net



From:
Jim Nasby
Date:

On 1/13/14, 4:47 PM, Jan Kara wrote:
> Note to postgres guys: I think you should have a look at the proposed
> 'vrange' system call. The latest posting is here:
> http://www.spinics.net/lists/linux-mm/msg67328.html. It contains a rather
> detailed description of the feature. And if the feature looks good to you,
> you can add your 'me too', plus if anyone would be willing to try that out
> with postgres that would be most welcome (although I understand you might
> not want to burn your time on experimental kernel feature).

I don't think that would help us with buffers unless we switched to MMAP (which is a huge change), but this part is
interesting:

"* Opportunistic freeing of memory that may be quickly reused. Minchan
has done a malloc implementation where free() marks the pages as
volatile, allowing the kernel to reclaim under pressure. This avoids the
unmapping and remapping of anonymous pages on free/malloc."

Postgres has its own memory management on top of malloc that gives us
memory contexts; some of those contexts get destroyed frequently. Allowing
the kernel to reclaim that freed memory in the background might be a
performance win for us.
-- 
Jim C. Nasby, Data Architect                       
512.569.9461 (cell)                         http://jim.nasby.net



From:
James Bottomley
Date:

On Mon, 2014-01-13 at 22:12 +0100, Andres Freund wrote:
> On 2014-01-13 12:34:35 -0800, James Bottomley wrote:
> > On Mon, 2014-01-13 at 14:32 -0600, Jim Nasby wrote:
> > > Well, if we were to collaborate with the kernel community on this then
> > > presumably we can do better than that for eviction... even to the
> > > extent of "here's some data from this range in this file. It's (clean|
> > > dirty). Put it in your cache. Just trust me on this."
> > 
> > This should be the madvise() interface (with MADV_WILLNEED and
> > MADV_DONTNEED) is there something in that interface that is
> > insufficient?
> 
> For one, postgres doesn't use mmap for files (and can't without major
> new interfaces).

I understand, that's why you get double buffering: because we can't
replace a page in the range you give us on read/write.  However, you
don't have to switch entirely to mmap: you can use mmap/madvise
exclusively for cache control and still use read/write (and still pay
the double buffer penalty, of course).  It's only read/write with
directio that would cause problems here (unless you're planning to
switch to DIO?).

>  Frequently mmap()/madvise()/munmap()ing 8kb chunks has
> horrible consequences for performance/scalability - very quickly you
> contend on locks in the kernel.

Is this because of problems in the mmap_sem?

> Also, that will mark that page dirty, which isn't what we want in this
> case.

You mean madvise (page_addr)?  It shouldn't ... the state of the dirty
bit should only be updated by actual writes.  Which MADV_ primitive is
causing the dirty marking, because we might be able to fix it (unless
there's some weird corner case I don't know about).

>  One major use case is transplanting a page coming from postgres'
> buffers into the kernel's buffercache because the latter has a much
> better chance of properly allocating system resources across independent
> applications running.

If you want to share pages between the application and the page cache,
the only known interface is mmap ... perhaps we can discuss how better
to improve mmap for you?

We also do have a way of transplanting pages: it's called splice.  How
do the semantics of splice differ from what you need?

> Oh, and the kernel's page-cache management while far from perfect,
> actually scales much better than postgres'.

Well, then, it sounds like the best way forward would be to get
postgres to use the kernel page cache more efficiently.

James





From:
Trond Myklebust
Date:

On Jan 13, 2014, at 16:03, Robert Haas <> wrote:

> On Mon, Jan 13, 2014 at 3:53 PM, Trond Myklebust <> wrote:
>> O_DIRECT was specifically designed to solve the problem of double buffering between applications and the kernel. Why
>> are you not able to use that in these situations?
>
> O_DIRECT was apparently designed by a deranged monkey on some serious
> mind-controlling substances.  But don't take it from me, I have it on
> good authority:
>
> http://yarchive.net/comp/linux/o_direct.html
>
> One might even say the best authority.

You do realise that is 12-year-old information, right? …and yes, we have added both aio and vectored operations to
O_DIRECT in the meantime.

Meanwhile, no progress has been made on the “non-deranged” interface that authority was advocating.

Cheers, Trond


From:
Jan Kara
Date:

On Mon 13-01-14 22:26:45, Mel Gorman wrote:
> The flipside is also meant to hold true. If you know data will be needed
> in the near future then posix_fadvise(POSIX_FADV_WILLNEED). Glancing at
> the implementation it does a forced read-ahead on the range of pages of
> interest. It doesn't look like it would block.
  That's not quite true. POSIX_FADV_WILLNEED still needs to map logical
file offsets to physical disk blocks and create IO requests. This happens
synchronously. So if your disk is congested and relevant metadata is out of
cache, or we simply run out of free IO requests, POSIX_FADV_WILLNEED can
block for a significant amount of time.
                            Honza
-- 
Jan Kara <>
SUSE Labs, CR



From:
Theodore Ts'o
Date:

The issue with O_DIRECT is actually a much more general issue ---
namely, database programmers who for various reasons decide they
don't want to go down the O_DIRECT route, but then care about
performance.  PostgreSQL is not the only database which has had this
issue.

There are two papers at this year's FAST conference about the "Journaling
of Journal" (JoJ) problem, which has been triggered by the use of SQLite on
android handsets, and its write patterns, some of which some folks
(including myself) have characterized as "abusive".  (As in, when the
database developer says to the kernel developer, "Doctor, doctor, it
hurts when I do that...")

The problem statement for JoJ was introduced in last year's Usenix ATC
conference, I/O Stack Optimizations for Smartphones[1]

[1] https://www.usenix.org/conference/atc13/technical-sessions/presentation/jeong

The high order bit is what's the right thing to do when database
programmers come to kernel engineers saying, we want to do <FOO> and
the performance sucks.  Do we say, "Use O_DIRECT, dummy",
notwithstanding Linus's past comments on the issue?  Or do we have some
general design principles that we tell database engineers that they
should do for better performance, and then all developers for all of
the file systems can then try to optimize for a set of new API's, or
recommended ways of using the existing API's?

Surely the wrong answer is that we do things which encourage people to
create entire new specialized file systems for different databases.
The f2fs file system was essentially created because someone thought
it was easier to create a new file system from scratch instead of trying
to change how SQLite or some other existing database works.
Hopefully we won't have companies using MySQL and PostgreSQL deciding
they need their own mysqlfs and postgresqlfs!  :-)

Cheers,
                - Ted



From:
Jan Kara
Date:

On Mon 13-01-14 22:36:06, Mel Gorman wrote:
> On Mon, Jan 13, 2014 at 06:27:03PM -0200, Claudio Freire wrote:
> > On Mon, Jan 13, 2014 at 5:23 PM, Jim Nasby <> wrote:
> > > On 1/13/14, 2:19 PM, Claudio Freire wrote:
> > >>
> > >> On Mon, Jan 13, 2014 at 5:15 PM, Robert Haas <>
> > >> wrote:
> > >>>
> > >>> On a related note, there's also the problem of double-buffering.  When
> > >>> we read a page into shared_buffers, we leave a copy behind in the OS
> > >>> buffers, and similarly on write-out.  It's very unclear what to do
> > >>> about this, since the kernel and PostgreSQL don't have intimate
> > >>> knowledge of what each other are doing, but it would be nice to solve
> > >>> somehow.
> > >>
> > >>
> > >>
> > >> There you have a much harder algorithmic problem.
> > >>
> > >> You can basically control duplication with fadvise and WONTNEED. The
> > >> problem here is not the kernel and whether or not it allows postgres
> > >> to be smart about it. The problem is... what kind of smarts
> > >> (algorithm) to use.
> > >
> > >
> > > Isn't this a fairly simple matter of when we read a page into shared buffers
> > > tell the kernel do forget that page? And a corollary to that for when we
> > > dump a page out of shared_buffers (here kernel, please put this back into
> > > your cache).
> > 
> > 
> > That's my point. In terms of kernel-postgres interaction, it's fairly simple.
> > 
> > What's not so simple, is figuring out what policy to use. Remember,
> > you cannot tell the kernel to put some page in its page cache without
> > reading it or writing it. So, once you make the kernel forget a page,
> > evicting it from shared buffers becomes quite expensive.
> 
> posix_fadvise(POSIX_FADV_WILLNEED) is meant to cover this case by
> forcing readahead. If you evict it prematurely then you do get kinda
> screwed because you pay the IO cost to read it back in again even if you
> had enough memory to cache it. Maybe this is the type of kernel-postgres
> interaction that is annoying you.
> 
> If you don't evict, the kernel eventually steps in and evicts the wrong
> thing. If you do evict and it was unnecessary, you pay an IO cost.
> 
> That could be something we look at. There are cases buried deep in the
> VM where pages get shuffled to the end of the LRU and get tagged for
> reclaim as soon as possible. Maybe you need access to something like
> that via posix_fadvise to say "reclaim this page if you need memory but
> leave it resident if there is no memory pressure" or something similar.
> Not exactly sure what that interface would look like or offhand how it
> could be reliably implemented.

Well, the kernel-managed user space cache the postgres guys talk about looks
pretty much like what the "volatile range" patches are trying to achieve.
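For reference, the fadvise dance being discussed (WILLNEED before a premature eviction, DONTNEED after copying a page into shared_buffers) might look like the sketch below. This is Linux-specific and illustrative only; note there is currently no hint with the "reclaim this only under memory pressure" semantics Mel describes, which is the gap vrange aims at.

```python
import os
import tempfile

# Sketch only: the file and sizes are made up, and this is Linux-specific
# (os.posix_fadvise needs Python >= 3.3 and a kernel honouring the hints).

def drop_after_copy(fd, offset, length):
    # After copying a page into shared_buffers, hint that the page cache
    # copy is no longer needed (the double-buffering case above).
    os.posix_fadvise(fd, offset, length, os.POSIX_FADV_DONTNEED)

def prefetch_before_evict(fd, offset, length):
    # Before evicting from shared_buffers, force readahead so a re-read
    # is cheap -- the POSIX_FADV_WILLNEED case Mel describes.
    os.posix_fadvise(fd, offset, length, os.POSIX_FADV_WILLNEED)

with tempfile.NamedTemporaryFile() as f:
    f.write(b"x" * 8192)        # one postgres-sized page
    f.flush()
    drop_after_copy(f.fileno(), 0, 8192)
    prefetch_before_evict(f.fileno(), 0, 8192)
```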

Note to postgres guys: I think you should have a look at the proposed
'vrange' system call. The latest posting is here:
http://www.spinics.net/lists/linux-mm/msg67328.html. It contains a rather
detailed description of the feature. And if the feature looks good to you,
you can add your 'me too'. Plus, if anyone would be willing to try it out
with postgres that would be most welcome (although I understand you might
not want to burn your time on an experimental kernel feature).
                            Honza
-- 
Jan Kara <>
SUSE Labs, CR



From:
Trond Myklebust
Date:

On Jan 13, 2014, at 19:03, Hannu Krosing <> wrote:

> On 01/13/2014 09:53 PM, Trond Myklebust wrote:
>> On Jan 13, 2014, at 15:40, Andres Freund <> wrote:
>>
>>> On 2014-01-13 15:15:16 -0500, Robert Haas wrote:
>>>> On Mon, Jan 13, 2014 at 1:51 PM, Kevin Grittner <> wrote:
>>>>> I notice, Josh, that you didn't mention the problems many people
>>>>> have run into with Transparent Huge Page defrag and with NUMA
>>>>> access.
>>>> Amen to that.  Actually, I think NUMA can be (mostly?) fixed by
>>>> setting zone_reclaim_mode; is there some other problem besides that?
>>> I think that fixes some of the worst instances, but I've seen machines
>>> spending horrible amounts of CPU (& BUS) time in page reclaim
>>> nonetheless. If I analyzed it correctly it's in RAM << working set
>>> workloads where RAM is pretty large and most of it is used as page
>>> cache. The kernel ends up spending a huge percentage of time finding and
>>> potentially defragmenting pages when looking for victim buffers.
>>>
>>>> On a related note, there's also the problem of double-buffering.  When
>>>> we read a page into shared_buffers, we leave a copy behind in the OS
>>>> buffers, and similarly on write-out.  It's very unclear what to do
>>>> about this, since the kernel and PostgreSQL don't have intimate
>>>> knowledge of what each other are doing, but it would be nice to solve
>>>> somehow.
>>> I've wondered before if there wouldn't be a chance for postgres to say
>>> "my dear OS, that the file range 0-8192 of file x contains y, no need to
>>> reread" and do that when we evict a page from s_b but I never dared to
>>> actually propose that to kernel people...
>> O_DIRECT was specifically designed to solve the problem of double buffering
>> between applications and the kernel. Why are you not able to use that in these situations?
> What is asked is the opposite of O_DIRECT - a write from a buffer inside
> postgresql to the linux *buffer cache*, telling linux that it is the same
> as what is currently on disk, so don't bother to ever write it back.

I don't understand. Are we talking about mmap()ed files here? Why would
the kernel be trying to write back pages that aren't dirty?

> This would avoid current double-buffering between postgresql and linux
> buffer caches while still making use of linux cache when possible.
>
> The use case is pages that postgresql has moved into its buffer cache
> but which it has not modified. They will at some point be evicted from
> the postgresql cache, but it is likely that they will still be needed
> sometime soon, so what is required is "writing them back" to the
> original file, only they should not really be written - or marked
> dirty to be written later - any further than the linux cache, as they
> *already* are on the disk.
>
> It is probably ok to put them in the LRU position as they are
> "written" out from postgresql, though it may be better if we get some
> more control over where in the LRU order they would be placed. It may
> make sense to place them based on when they were last read while
> residing inside the postgresql cache.
>
> Cheers
>
>
> --
> Hannu Krosing
> PostgreSQL Consultant
> Performance, Scalability and High Availability
> 2ndQuadrant Nordic OÜ




From:
James Bottomley
Date:

On Mon, 2014-01-13 at 21:29 +0000, Greg Stark wrote:
> On Mon, Jan 13, 2014 at 9:12 PM, Andres Freund <> wrote:
> > For one, postgres doesn't use mmap for files (and can't without major
> > new interfaces). Frequently mmap()/madvise()/munmap()ing 8kb chunks has
> > horrible consequences for performance/scalability - very quickly you
> > contend on locks in the kernel.
> 
> 
> I may as well dump this in this thread. We've discussed this in person
> a few times, including at least once with Ted T'so when he visited
> Dublin last year.
> 
> The fundamental conflict is that the kernel understands better the
> hardware and other software using the same resources, Postgres
> understands better its own access patterns. We need to either add
> interfaces so Postgres can teach the kernel what it needs about its
> access patterns or add interfaces so Postgres can find out what it
> needs to know about the hardware context.
> 
> The more ambitious and interesting direction is to let Postgres tell
> the kernel what it needs to know to manage everything. To do that we
> would need the ability to control when pages are flushed out. This is
> absolutely necessary to maintain consistency. Postgres would need to
> be able to mark pages as unflushable until some point in time in the
> future when the journal is flushed. We discussed various ways that
> interface could work but it would be tricky to keep it low enough
> overhead to be workable.

So in this case, the question would be what additional information do we
need to exchange that's not covered by the existing interfaces.  Between
madvise and splice, we seem to have most of what you want; what's
missing?

> The less exciting, more conservative option would be to add kernel
> interfaces to teach Postgres about things like raid geometries. Then
> Postgres could use directio and decide to do prefetching based on the
> raid geometry, how much available i/o bandwidth and iops is available,
> etc.
> 
> Reimplementing i/o schedulers and all the rest of the work that the
> kernel provides inside Postgres just seems like something outside our
> competency and that none of us is really excited about doing.

This would also be a well trodden path ... I believe that some large
database company introduced Direct IO for roughly this purpose.

James





From:
Andres Freund
Date:

On 2014-01-13 17:13:51 -0800, James Bottomley wrote:
> a file into a user provided buffer, thus obtaining a page cache entry
> and a copy in their userspace buffer, then insert the page of the user
> buffer back into the page cache as the page cache page ... that's right,
> isn't it, postgres people?

Pretty much, yes. We'd probably hint (*advise(DONTNEED)) that the page
isn't needed anymore when reading. And we'd normally write if the page
is dirty.

> Effectively you end up with buffered read/write that's also mapped into
> the page cache.  It's a pretty awful way to hack around mmap.

Well, the problem is that you can't really use mmap() for the things we
do. Postgres' durability works by guaranteeing that our journal entries
(called WAL := Write Ahead Log) are written & synced to disk before the
corresponding entries of tables and indexes reach the disk. That also
allows to group together many random-writes into a few contiguous writes
fdatasync()ed at once. Only during a checkpointing phase the big bulk of
the data is then (slowly, in the background) synced to disk.

I don't see how that's doable with holding all pages in mmap()ed
buffers.
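A minimal sketch of that ordering, with made-up file names (this illustrates the constraint, not postgres's actual code): the WAL record must be durable before the table pages it protects are written back.

```python
import os
import tempfile

# Hypothetical sketch of the WAL-before-data ordering described above.

def commit(wal_fd, record):
    os.write(wal_fd, record)   # append the journal entry ...
    os.fdatasync(wal_fd)       # ... and make it durable *first*

def checkpoint(data_fd, offset, page):
    os.pwrite(data_fd, page, offset)  # data pages reach disk later,
    os.fsync(data_fd)                 # grouped and synced at checkpoint time

d = tempfile.mkdtemp()
wal = os.open(os.path.join(d, "wal"),
              os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
data = os.open(os.path.join(d, "table"), os.O_RDWR | os.O_CREAT, 0o600)
commit(wal, b"insert tuple ...\n")
checkpoint(data, 0, b"\x00" * 8192)
os.close(wal)
os.close(data)
```

mmap() cannot express this ordering, since the kernel may write a dirty mapped page back at any moment, which is exactly the problem described above.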

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



From:
Josh Berkus
Date:

On 01/13/2014 05:30 PM, Dave Chinner wrote:
> On Mon, Jan 13, 2014 at 03:24:38PM -0800, Josh Berkus wrote:
> No matter what default NUMA allocation policy we set, there will be
> an application for which that behaviour is wrong. As such, we've had
> tools for setting application specific NUMA policies for quite a few
> years now. e.g:

Yeah, that's why I personally regard the NUMA stuff as just an
information problem; there's an easy configuration variable, and you
can't please everyone (and our project would hardly be one to point
fingers about sub-optimal default configurations).  I was responding to
a question of "what's wrong with the default setting?"

Personally, I have my doubts that the NUMA memory isolation, as
currently implemented, accomplishes what it wants to do.  But that's a
completely different discussion.

The real issue there was that our users had never heard of this change
until suddenly half their RAM became unavailable.  So the solution is
for our project to somehow have these kinds of changes flagged for our
attention so that we can update our docs.  The kernel change list is
quite voluminous, and it's very easy to miss changes of significance in
it.  The easiest way to do this is going to be getting involved in
kernel-database performance testing.

Of course, we are annoyed that we finally removed the main reason to
modify sysctl.conf (SHMMAX), and here we are needing to advise users
about sysctl again.  :-(

I'm much more bothered by the introduction of 2Q logic, since that comes
without a configuration variable to modify its behavior.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



From:
Andres Freund
Date:

On 2014-01-13 10:56:00 -0800, Josh Berkus wrote:
> Well, it was the lack of sysctl options which takes the 2Q change from
> "annoyance" to "potential disaster".  We can't ever get away from the
> possibility that the Postgres use-case might be the minority use-case,
> and we might have to use non-default options.  It's when those options
> aren't present *at all* that we're stuck.

Unless I am missing something the kernel's going further *away* from a
simple 2q system, not the contrary.

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



From:
Josh Berkus
Date:

On 01/13/2014 05:48 PM, Andres Freund wrote:
> On 2014-01-13 10:56:00 -0800, Josh Berkus wrote:
>> Well, it was the lack of sysctl options which takes the 2Q change from
>> "annoyance" to "potential disaster".  We can't ever get away from the
>> possibility that the Postgres use-case might be the minority use-case,
>> and we might have to use non-default options.  It's when those options
>> aren't present *at all* that we're stuck.
> 
> Unless I am missing something the kernel's going further *away* from a
> simple 2q system, not the contrary.

Well, they implemented a 2Q system and deliberately offered no sysctl
variables to modify its behavior.  Now they're talking about
implementing an ARC system -- which we know the perils of -- again,
without any configuration variables in case the default behavior doesn't
work for everyone.  And it's highly unlikely that an ARC which is
designed for desktop and/or file server users -- let alone mobile users
-- is going to be optimal for PostgreSQL out of the box.

In fact, I'd assert that it's flat-out impossible to engineer an ARC
which will work for multiple different use cases without user-level
configuration.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



From:
Hannu Krosing
Date:

On 01/14/2014 03:44 AM, Dave Chinner wrote:
> On Tue, Jan 14, 2014 at 02:26:25AM +0100, Andres Freund wrote:
>> On 2014-01-13 17:13:51 -0800, James Bottomley wrote:
>>> a file into a user provided buffer, thus obtaining a page cache entry
>>> and a copy in their userspace buffer, then insert the page of the user
>>> buffer back into the page cache as the page cache page ... that's right,
>>> isn't it, postgres people?
>> Pretty much, yes. We'd probably hint (*advise(DONTNEED)) that the page
>> isn't needed anymore when reading. And we'd normally write if the page
>> is dirty.
> So why, exactly, do you even need the kernel page cache here? You've
> got direct access to the copy of data read into userspace, and you
> want direct control of when and how the data in that buffer is
> written and reclaimed. Why push that data buffer back into the
> kernel and then have to add all sorts of kernel interfaces to
> control the page you already have control of?
To let the kernel do the job that it is good at, namely managing the
write-back of dirty buffers to disk and managing (possible) read-ahead
pages.

While we do have control of "the page", we do not (and really don't
want to) have control of the complex and varied business of efficiently
reading from and writing to various file systems on possibly very
different disk configurations.

We quite prefer the kernel to take care of it, and generally like how
the kernel manages it.

We have a few suggestions about giving the kernel extra info about the
application's usage patterns for the data.
>
>>> Effectively you end up with buffered read/write that's also mapped into
>>> the page cache.  It's a pretty awful way to hack around mmap.
>> Well, the problem is that you can't really use mmap() for the things we
>> do. Postgres' durability works by guaranteeing that our journal entries
>> (called WAL := Write Ahead Log) are written & synced to disk before the
>> corresponding entries of tables and indexes reach the disk. That also
>> allows to group together many random-writes into a few contiguous writes
>> fdatasync()ed at once. Only during a checkpointing phase the big bulk of
>> the data is then (slowly, in the background) synced to disk.
> Which is the exact algorithm most journalling filesystems use for
> ensuring durability of their metadata updates.  Indeed, here's an
> interesting piece of architecture that you might like to consider:
>
> * Neither XFS and BTRFS use the kernel page cache to back their
>   metadata transaction engines.
But file system code is supposed to know much more about the
underlying disk than a mere application program like postgresql.

We do not want to start duplicating the OS if we can avoid it.

What we would like is to have a way to tell the kernel

1) "here is the modified copy of a file page, it is now safe to write
   it back" - the current 'lazy' write

2) "here is the page, write it back now, before returning success
   to me" - unbuffered write, or write + sync

but we also would like to have

3) "here is the page as it is currently on disk, I may need it soon,
   so keep it together with your other clean pages accessed at time X"
   - this is the non-dirtying write discussed. The page may be in the
   buffer cache, in which case just update its LRU position (to either
   the current time or a time provided by postgresql), or it may not
   be there, in which case put it there if reasonable by its LRU
   position.

And we would like all this to work together with other current linux
kernel goodness of managing the whole disk-side interaction of
efficient reading and writing and managing the buffers :)
> Why not? Because the page cache is too simplistic to adequately
> represent the complex object hierarchies that the filesystems have,
> and so its flat LRU reclaim algorithms and writeback control
> mechanisms are a terrible fit and cause lots of performance issues
> under memory pressure.
Same is true for postgresql - if we just used direct reads and writes
from disk then the performance would be terrible.

We would need to duplicate all the complicated algorithms a file
system uses for good performance if we were to start implementing
that part of the file system ourselves.
> IOWs, the two most complex high performance transaction engines in
> the Linux kernel have moved to fully customised cache and (direct)
> IO implementations because the requirements for scalability and
> performance are far more complex than the kernel page cache
> infrastructure can provide.
And we would like to avoid implementing all this again by delegating
this part of the work to said complex high performance transaction
engines in the Linux kernel.

We do not want to abandon all work on postgresql's business code
and go into file system development mode for the next few years.

Again, as said above, the linux file system is doing fine. What we
want is a few ways to interact with it to let it do even better when
working with postgresql, by telling it some stuff it otherwise would
have to second-guess and by sometimes giving it back some cache
pages which were copied away for potential modification but ended
up clean in the end.

And let the linux kernel decide if and how long to keep these pages
in its cache, using its superior knowledge of the disk subsystem and
of what else is going on in the system in general.

Just food for thought....

We want to have all the performance and complexity provided
by linux, and we would like it to work even better with postgresql by
having a bit more information for its decisions.

We just don't want to re-implement it ;)

-- 
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ




From:
Hannu Krosing
Date:

On 01/13/2014 11:22 PM, James Bottomley wrote:
>
>> The less exciting, more conservative option would be to add kernel
>> interfaces to teach Postgres about things like raid geometries. Then
>> Postgres could use directio and decide to do prefetching based on the
>> raid geometry, how much available i/o bandwidth and iops is available,
>> etc.
>>
>> Reimplementing i/o schedulers and all the rest of the work that the
>> kernel provides inside Postgres just seems like something outside our
>> competency and that none of us is really excited about doing.
> This would also be a well trodden path ... I believe that some large
> database company introduced Direct IO for roughly this purpose.
>
The file systems at that time were much worse than they are now,
so said large companies had no choice but to write their own.

As linux file handling has been much better for most of postgresql's
active development, we have been able to avoid that route and still
have reasonable performance.

What has been pointed out above are some (allegedly
desktop/mobile influenced) decisions which broke good
performance.

Cheers

-- 
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ




From:
Claudio Freire
Date:

On Tue, Jan 14, 2014 at 5:08 AM, Hannu Krosing <> wrote:
> Again, as said above the linux file system is doing fine. What we
> want is a few ways to interact with it to let it do even better when
> working with postgresql by telling it some stuff it otherwise would
> have to second guess and by sometimes giving it back some cache
> pages which were copied away for potential modifying but ended
> up clean in the end.

You don't need new interfaces. Only a slight modification of what
fadvise DONTNEED does.

This insistence on injecting pages from postgres into the kernel is just a
bad idea. At the very least, it still needs postgres to know too much
of the filesystem (block layout) to properly work. Ie: pg must be
required to put entire filesystem-level blocks into the page cache,
since that's how the page cache works. At the very worst, it may
introduce serious security and reliability implications, when
applications can destroy the consistency of the page cache (even if
full access rights are checked, there's still the possibility this
inconsistency might be exploitable).

Simply making fadvise DONTNEED move pages to the head of the LRU (ie:
discard next if you need) should work as expected without all the
complication of the above proposal.



From:
Heikki Linnakangas
Date:

On 01/14/2014 12:26 AM, Mel Gorman wrote:
> On Mon, Jan 13, 2014 at 03:15:16PM -0500, Robert Haas wrote:
>> The other thing that comes to mind is the kernel's caching behavior.
>> We've talked a lot over the years about the difficulties of getting
>> the kernel to write data out when we want it to and to not write data
>> out when we don't want it to.
>
> Is sync_file_range() broke?
>
>> When it writes data back to disk too
>> aggressively, we get lousy throughput because the same page can get
>> written more than once when caching it for longer would have allowed
>> write-combining.
>
> Do you think that is related to dirty_ratio or dirty_writeback_centisecs?
> If it's dirty_writeback_centisecs then that would be particularly tricky
> because poor interactions there would come down to luck basically.

>> When it doesn't write data to disk aggressively
>> enough, we get huge latency spikes at checkpoint time when we call
>> fsync() and the kernel says "uh, what? you wanted that data *on the
>> disk*? sorry boss!" and then proceeds to destroy the world by starving
>> the rest of the system for I/O for many seconds or minutes at a time.
>
> Ok, parts of that are somewhat expected. It *may* depend on the
> underlying filesystem. Some of them handle fsync better than others. If
> you are syncing the whole file though when you call fsync then you are
> potentially burned by having to writeback dirty_ratio amounts of memory
> which could take a substantial amount of time.
>
>> We've made some desultory attempts to use sync_file_range() to improve
>> things here, but I'm not sure that's really the right tool, and if it
>> is we don't know how to use it well enough to obtain consistent
>> positive results.
>
> That implies that either sync_file_range() is broken in some fashion we
> (or at least I) are not aware of and that needs kicking.

Let me try to explain the problem: Checkpoints can cause an I/O spike, 
which slows down other processes.

When it's time to perform a checkpoint, PostgreSQL will write() all 
dirty buffers from the PostgreSQL buffer cache, and finally perform an 
fsync() to flush the writes to disk. After that, we know the data is 
safely on disk.

In older PostgreSQL versions, the write() calls would cause an I/O storm 
as the OS cache quickly fills up with dirty pages, up to dirty_ratio, 
and after that all subsequent write()s block. That's OK as far as the 
checkpoint is concerned, but it significantly slows down queries running 
at the same time. Even a read-only query often needs to write(), to 
evict a dirty page from the buffer cache to make room for a different 
page. We made that less painful by adding sleeps between the write() 
calls, so that they are trickled over a long period of time and 
hopefully stay below dirty_ratio at all times. However, we still have to 
perform the fsync()s after the write()s, and sometimes that still causes 
a similar I/O storm.

The checkpointer is not in a hurry. A checkpoint typically has 10-30 
minutes to finish, before it's time to start the next checkpoint, and 
even if it misses that deadline that's not too serious either. But the 
OS doesn't know that, and we have no way of telling it.

As a quick fix, some sort of a lazy fsync() call would be nice. It would 
behave just like fsync() but it would not change the I/O scheduling at 
all. Instead, it would sleep until all the pages have been flushed to 
disk, at the speed they would've been without the fsync() call.

Another approach would be to give the I/O that the checkpointer process 
initiates a lower priority. This would be slightly preferable, because 
PostgreSQL could then issue the writes() as fast as it can, and have the 
checkpoint finish earlier when there's not much other load. Last I 
looked into this (which was a long time ago), there was no suitable 
priority system for writes, only reads.

- Heikki



From:
Mel Gorman
Date:

On Mon, Jan 13, 2014 at 03:24:38PM -0800, Josh Berkus wrote:
> On 01/13/2014 02:26 PM, Mel Gorman wrote:
> > Really?
> > 
> > zone_reclaim_mode is often a complete disaster unless the workload is
> > partitioned to fit within NUMA nodes. On older kernels enabling it would
> > sometimes cause massive stalls. I'm actually very surprised to hear it
> > fixes anything and would be interested in hearing more about what sort
> > of circumstances would convince you to enable that thing.
> 
> So the problem with the default setting is that it pretty much isolates
> all FS cache for PostgreSQL to whichever socket the postmaster is
> running on, and makes the other FS cache unavailable. 

I'm not being pedantic but the default depends on the NUMA characteristics of
the machine so I need to know if it was enabled or disabled. Some machines
will default zone_reclaim_mode to 0 and others will default it to 1. In my
experience the majority of bugs that involved zone_reclaim_mode were due
to zone_reclaim_mode enabled by default.  If I see a bug that involves
a file-based workload on a NUMA machine with stalls and/or excessive IO
when there is plenty of memory free then zone_reclaim_mode is the first
thing I check.
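A trivial way to check the tunable under discussion (the /proc path is the standard Linux location; the file is absent on kernels without NUMA support):

```python
from pathlib import Path

# Standard Linux location for the tunable; absent on non-NUMA kernels.
ZONE_RECLAIM = Path("/proc/sys/vm/zone_reclaim_mode")

def zone_reclaim_enabled(text):
    # Any nonzero bit enables some form of zone reclaim.
    return int(text.split()[0]) != 0

if ZONE_RECLAIM.exists():
    print("zone_reclaim enabled:",
          zone_reclaim_enabled(ZONE_RECLAIM.read_text()))
```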

I'm guessing from context that in your experience it gets enabled by default
on the machines you care about. This would indeed limit FS cache usage to
the node where the process is initiating IO (postmaster I guess).

> This means that,
> for example, if you have two memory banks, then only one of them is
> available for PostgreSQL filesystem caching ... essentially cutting your
> available cache in half.
> 
> And however slow moving cached pages between memory banks is, it's an
> order of magnitude faster than moving them from disk.  But this isn't
> how the NUMA stuff is configured; it seems to assume that it's less
> expensive to get pages from disk than to move them between banks, so

Yes, this is right. The history behind this "logic" is that it was assumed
NUMA machines would only ever be used for HPC and that the workloads would
always be partitioned to run within NUMA nodes. This has not been the case
for a long time and I would argue that we should leave that thing disabled
by default in all cases. Last time I tried it was met with resistance but
maybe it's time to try again.

> whatever you've got cached on the other bank, it flushes it to disk as
> fast as possible.  I understand the goal was to make memory usage local
> to the processors stuff was running on, but that includes an implicit
> assumption that no individual process will ever want more than one
> memory bank worth of cache.
> 
> So disabling all of the NUMA optimizations is the way to go for any
> workload I personally deal with.
> 

I would hesitate to recommend "all" on the grounds that zone_reclaim_mode
is brain damage and I'd hate to lump all tuning parameters into the same box.

There is an interesting side-line here. If all IO is initiated by one
process in postgres then the memory locality will be sub-optimal.
The consumer of the data may or may not be running on the same
node as the process that read the data from disk. It is possible to
migrate this from user space but the interface is clumsy and assumes the
data is mapped.

Automatic NUMA balancing does not help you here because that thing also
depends on the data being mapped. It does nothing for data accessed via
read/write. There is nothing fundamental that prevents this, it was not
implemented because it was not deemed to be important enough. The amount
of effort spent on addressing this would depend on how important NUMA
locality is for postgres performance.

-- 
Mel Gorman
SUSE Labs



From:
Hannu Krosing
Date:

On 01/14/2014 09:39 AM, Claudio Freire wrote:
> On Tue, Jan 14, 2014 at 5:08 AM, Hannu Krosing <> wrote:
>> Again, as said above the linux file system is doing fine. What we
>> want is a few ways to interact with it to let it do even better when
>> working with postgresql by telling it some stuff it otherwise would
>> have to second guess and by sometimes giving it back some cache
>> pages which were copied away for potential modifying but ended
>> up clean in the end.
> You don't need new interfaces. Only a slight modification of what
> fadvise DONTNEED does.
>
> This insistence in injecting pages from postgres to kernel is just a
> bad idea. 
Do you think it would be possible to map copy-on-write pages
from the linux cache into the postgresql cache?

This would be a step towards solving the double-RAM-usage of pages
that have been read from the system cache into the postgresql cache,
without sacrificing linux read-ahead (which I assume does not happen
when reads bypass the system cache).

And we could write back the copy at the point when it is safe (from
the postgresql perspective) to let the system write it back?

Do you think it is possible to make this work with good performance
for a few million 8kB pages?

> At the very least, it still needs postgres to know too much
> of the filesystem (block layout) to properly work. Ie: pg must be
> required to put entire filesystem-level blocks into the page cache,
> since that's how the page cache works. 
I was thinking more of a simple write() interface with extra
flags/sysctls to tell the kernel "we already have this on disk".
> At the very worst, it may
> introduce serious security and reliability implications, when
> applications can destroy the consistency of the page cache (even if
> full access rights are checked, there's still the possibility this
> inconsistency might be exploitable).
If you allow a write() which just installs clean pages, I cannot see
what extra security concerns there are beyond what a normal
write() can do.


Cheers

-- 
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ




From:
Robert Haas
Date:

On Mon, Jan 13, 2014 at 5:26 PM, Mel Gorman <> wrote:
>> Amen to that.  Actually, I think NUMA can be (mostly?) fixed by
>> setting zone_reclaim_mode; is there some other problem besides that?
>
> Really?
>
> zone_reclaim_mode is often a complete disaster unless the workload is
> partitioned to fit within NUMA nodes. On older kernels enabling it would
> sometimes cause massive stalls. I'm actually very surprised to hear it
> fixes anything and would be interested in hearing more about what sort
> of circumstances would convince you to enable that thing.

By "set" I mean "set to zero".  We've seen multiple instances of
people complaining about large amounts of system memory going unused
because this setting defaulted to 1.

>> The other thing that comes to mind is the kernel's caching behavior.
>> We've talked a lot over the years about the difficulties of getting
>> the kernel to write data out when we want it to and to not write data
>> out when we don't want it to.
>
> Is sync_file_range() broke?

I don't know.  I think a few of us have played with it and not been
able to achieve a clear win.  Whether the problem is with the system
call or the programmer is harder to determine.  I think the problem is
in part that it's not exactly clear when we should call it.  So
suppose we want to do a checkpoint.  What we used to do a long time
ago is write everything, and then fsync it all, and then call it good.
But that produced horrible I/O storms.  So what we do now is do the
writes over a period of time, with sleeps in between, and then fsync
it all at the end, hoping that the kernel will write some of it before
the fsyncs arrive so that we don't get a huge I/O spike.
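The write-with-sleeps-then-fsync pattern described above can be reduced to a short sketch. This is illustrative only, not PostgreSQL's actual checkpointer code; the function name, the single-file scope, and the fixed delay are all assumptions made for the example.

```c
/* Illustrative sketch: trickle write()s with a sleep between each, then
 * fsync() once at the end, hoping the kernel cleans some pages before
 * the final flush.  Not PostgreSQL's real checkpointer. */
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

/* Write nbufs buffers of bufsz bytes, sleeping delay_ms between writes,
 * then issue a single fsync().  Returns 0 on success, -1 on error. */
int checkpoint_spread(int fd, const char *buf, size_t bufsz,
                      int nbufs, long delay_ms)
{
    struct timespec ts = { delay_ms / 1000, (delay_ms % 1000) * 1000000L };

    for (int i = 0; i < nbufs; i++) {
        if (write(fd, buf, bufsz) != (ssize_t) bufsz)
            return -1;
        nanosleep(&ts, NULL);      /* pace the writes over time */
    }
    return fsync(fd);              /* one flush at the end */
}
```

The hope is that by the time the fsync() runs, the kernel's background writeback has already cleaned most of the pages; whether it actually does is the crux of the complaint above.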

And that sorta works, and it's definitely better than doing it all at
full speed, but it's pretty imprecise.  If the kernel doesn't write
enough of the data out in advance, then there's still a huge I/O storm
when we do the fsyncs and everything grinds to a halt.  If it writes
out more data than needed in advance, it increases the total number of
physical writes because we get less write-combining, and that hurts
performance, too.  I basically feel like the I/O scheduler sucks,
though whether it sucks because it's not theoretically possible to do
any better or whether it sucks because of some more tractable reason
is not clear to me.  In an ideal world, when I call fsync() a bunch of
times from one process, other processes on the same machine should not
begin to observe 30+-second (or sometimes 300+-second) times for a read
or write of an 8kB block.  Imagine a hypothetical UNIX-like system
where when one process starts running at 100% CPU, every other process
on the machine gets timesliced in only once per minute.  That's
obviously ridiculous, and yet it's pretty much exactly what happens
with I/O.

>> When it writes data back to disk too
>> aggressively, we get lousy throughput because the same page can get
>> written more than once when caching it for longer would have allowed
>> write-combining.
>
> Do you think that is related to dirty_ratio or dirty_writeback_centisecs?
> If it's dirty_writeback_centisecs then that would be particularly tricky
> because poor interactions there would come down to luck basically.

See above; I think it's related to fsync.

>> When it doesn't write data to disk aggressively
>> enough, we get huge latency spikes at checkpoint time when we call
>> fsync() and the kernel says "uh, what? you wanted that data *on the
>> disk*? sorry boss!" and then proceeds to destroy the world by starving
>> the rest of the system for I/O for many seconds or minutes at a time.
>
> Ok, parts of that are somewhat expected. It *may* depend on the
> underlying filesystem. Some of them handle fsync better than others. If
> you are syncing the whole file though when you call fsync then you are
> potentially burned by having to writeback dirty_ratio amounts of memory
> which could take a substantial amount of time.

Yeah.  ext3 apparently fsyncs the whole filesystem, which is terrible
for throughput, but if you happen to have xlog (which is flushed
regularly) on the same filesystem as the data files (which are flushed
only periodically) then at least you don't have the problem of the
write queue getting too large.  But I think most of our users are on
ext4 at this point, with probably some on XFS and other filesystems.

We track the number of un-fsync'd blocks we've written to each file,
and have gotten desperate enough to think of approaches like - ok,
well if the total number of un-fsync'd blocks in the system exceeds
some threshold, then fsync the file with the most such blocks, not
because we really need the data on disk just yet but so that the write
queue won't get too large for the kernel to deal with.  And I think
there may even be some test results from such crocks showing some
benefit.  But really, I don't understand why we have to baby the
kernel like this.  Ensuring scheduling fairness is a basic job of the
kernel; if we wanted to have to control caching behavior manually, we
could use direct I/O.  Having accepted the double buffering that comes
with NOT using direct I/O, ideally we could let the kernel handle
scheduling and call it good.
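The "fsync the file with the most un-fsync'd blocks" crock described above might look something like the following sketch. The fixed-size table, the threshold, and the function name are invented for illustration; PostgreSQL's real bookkeeping is different.

```c
/* Sketch of the heuristic above: track un-fsync'd block counts per
 * file and, once the total crosses a threshold, flush the file with
 * the most outstanding blocks.  All structure is illustrative. */
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

#define MAX_FILES 64

struct file_state {
    int  fd;
    long unsynced;   /* blocks written since the last fsync */
};

/* If total un-fsync'd blocks exceed `threshold`, fsync the worst
 * offender and reset its counter.  Returns the index flushed, or -1
 * if nothing was flushed (or fsync failed). */
int maybe_flush_worst(struct file_state files[], int n, long threshold)
{
    long total = 0;
    int worst = -1;

    for (int i = 0; i < n; i++) {
        total += files[i].unsynced;
        if (worst < 0 || files[i].unsynced > files[worst].unsynced)
            worst = i;
    }
    if (total <= threshold || worst < 0)
        return -1;
    if (fsync(files[worst].fd) != 0)
        return -1;
    files[worst].unsynced = 0;
    return worst;
}
```

The point is not that the data is needed on disk yet, but that the kernel's write queue stays bounded.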

>> We've made some desultory attempts to use sync_file_range() to improve
>> things here, but I'm not sure that's really the right tool, and if it
>> is we don't know how to use it well enough to obtain consistent
>> positive results.
>
> That implies that either sync_file_range() is broken in some fashion we
> (or at least I) are not aware of and that needs kicking.

So the problem is - when do you call it?  What happens is: before a
checkpoint, we may have already written some blocks to a file.  During
the checkpoint, we're going to write some more.  At the end of the
checkpoint, we'll need all blocks written before and during the
checkpoint to be on disk.  If we call sync_file_range() at the
beginning of the checkpoint, then in theory that should get the ball
rolling, but we may be about to rewrite some of those blocks, or at
least throw some more on the pile.  If we call sync_file_range() near
the end of the checkpoint, just before calling fsync, there's not
enough time for the kernel to reorder I/O to a sufficient degree to do
any good.  What we want, sorta, is to have the kernel start writing it
out just at the right time to get it on disk by the time we're aiming
to complete the checkpoint, but it's not clear exactly how to do that.
We can't just write all the blocks, sync_file_range(), wait, and then
fsync() because the "write all the blocks" step can trigger an I/O
storm if the kernel decides there's too much dirty data.
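For concreteness, one hedged way to slot sync_file_range() into a checkpoint is the pattern below: SYNC_FILE_RANGE_WRITE during the checkpoint to start writeback without blocking, then a WAIT_BEFORE|WRITE|WAIT_AFTER pass plus fsync() at the end. Whether this actually avoids the storm is exactly the open question in this thread; the helper names and the split into two functions are mine.

```c
/* Sketch of sync_file_range() usage around a checkpoint (Linux-only).
 * sync_file_range() does not flush file metadata, hence the final
 * fsync() is still required for durability. */
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* During the checkpoint: start asynchronous writeback for a range we
 * have just write()n, without waiting for it to complete. */
int start_writeback(int fd, off_t offset, off_t nbytes)
{
    return sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);
}

/* At the end: wait for the earlier writeback, push out anything that
 * was redirtied meanwhile, wait again, then fsync() for metadata. */
int finish_checkpoint(int fd, off_t offset, off_t nbytes)
{
    if (sync_file_range(fd, offset, nbytes,
                        SYNC_FILE_RANGE_WAIT_BEFORE |
                        SYNC_FILE_RANGE_WRITE |
                        SYNC_FILE_RANGE_WAIT_AFTER) != 0)
        return -1;
    return fsync(fd);
}
```

The timing problem remains: call start_writeback() too early and the blocks get rewritten; too late and the kernel has no room to reorder.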

I suppose what we really want to do during a checkpoint is write data
into the O/S cache at a rate that matches what the kernel can
physically get down to the disk, and have the kernel schedule those
writes in as timely a fashion as it can without disrupting overall
system throughput too much.  But the feedback mechanisms that exist
today are just too crude for that.  You can easily write() to the
point where the whole system freezes up, or equally wait between
write()s when the system could easily have handled more right away.
And it's very hard to tell how much you can fsync() at once before
performance falls off a cliff.  A certain number of writes get
absorbed by various layers of caching between us and the physical
hardware - and then at some point, they're all full, and further
writes lead to disaster.  But I don't know of any way to assess how
close we are to that point at any given time except to cross it, and at
that point, it's too late.

>> On a related note, there's also the problem of double-buffering.  When
>> we read a page into shared_buffers, we leave a copy behind in the OS
>> buffers, and similarly on write-out.  It's very unclear what to do
>> about this, since the kernel and PostgreSQL don't have intimate
>> knowledge of what each other are doing, but it would be nice to solve
>> somehow.
>
> If it's mapped and clean and you do not need it any more, then
> madvise(MADV_DONTNEED). If you are accessing the data via a file handle,
> then I would expect posix_fadvise(POSIX_FADV_DONTNEED). Offhand, I do
> not know how it behaved historically but right now it will usually sync
> the data and then discard the pages. I say usually because it will not
> necessarily sync if the storage is congested and there is no guarantee it
> will be discarded. In older kernels, there was a bug where small calls to
> posix_fadvise() would not work at all. This was fixed in 3.9.
>
> The flipside is also meant to hold true. If you know data will be needed
> in the near future then posix_fadvise(POSIX_FADV_WILLNEED). Glancing at
> the implementation it does a forced read-ahead on the range of pages of
> interest. It doesn't look like it would block.
>
> The completely different approach for double buffering is direct IO but
> there may be reasons why you are avoiding that and are unhappy with the
> interfaces that are meant to work.
>
> Just from the start, it looks like there are a number of problem areas.
> Some may be fixed -- in which case we should identify what fixed it, what
> kernel version and see can it be verified with a test case or did we
> manage to break something else in the process. Other bugs may still
> exist because we believe some interface works how users want when it is
> in fact unfit for purpose for some reason.

It's all read, not mapped, because we have a need to prevent pages
from being written back to their backing files until WAL is fsync'd,
and there's no way to map a file and modify the page but not let it be
written back to disk until some other event happens.  We've
experimented with don't-need but it's tricky.

Here's an example.  Our write-ahead log files (WAL) are all 16MB;
eventually older files cease to be needed for any purpose, while there
is a continued demand for new files driven by database modifications.
Experimentation some years ago
revealed that it's faster to rename and overwrite the old files than
to remove them and create new ones, so that's what we do.  Ideally
this means that at steady state we're just recycling the files over
and over and never creating or destroying any, though I'm not sure
whether we ever actually achieve that ideal.  However, benchmarking
has shown that making the wrong decision about whether to don't-need
those files has a significant effect on performance.  If there's
enough cache around to keep all the files in memory, then we don't
want to don't-need them because then access will be slow when the old
files are recycled.  If however there is cache pressure then we want
to don't-need them as quickly as possible to make room for other,
higher priority data.

Now that may not really be the kernel's fault; it's a general property
of ring buffers that you want an LRU policy if they fit in cache
and immediate eviction of everything but the active page if they
don't.  But I think it demonstrates the general difficulty of using
posix_fadvise.  Similar cases arise for prefetching: gee, we'd like to
prefetch this data because we're going to use it soon, but if the
system is under enough pressure, the data may get evicted again before
"soon" actually arrives.

Thanks for taking the time to write all of these comments, and listen
to our concerns.  I really appreciate it, whether anything tangible
comes of it or not.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



From:
Kevin Grittner
Date:

First off, I want to give a +1 on everything in the recent posts
from Heikki and Hannu.

Jan Kara <> wrote:

> Now the aging of pages marked as volatile as it is currently
> implemented needn't be perfect for your needs but you still have
> time to influence what gets implemented... Actually developers of
> the vrange() syscall were specifically looking for some ideas
> what to base aging on. Currently I think it is first marked -
> first evicted.

The "first marked - first evicted" seems like what we would want.
The ability to "unmark" and have the page no longer be considered
preferred for eviction would be very nice.  That seems to me like
it would cover the multiple layers of buffering *clean* pages very
nicely (although I know nothing more about vrange() than what has
been said on this thread, so I could be missing something).

The other side of that is related to avoiding multiple writes of the
same page as much as possible, while avoiding write gluts.  The issue
here is that PostgreSQL tries to hang on to dirty pages for as long
as possible before "writing" them to the OS cache, while the OS
tries to avoid writing them to storage for as long as possible
until they reach a (configurable) threshold or are fsync'd.  The
problem is that under various conditions PostgreSQL may need to
write and fsync a lot of dirty pages it has accumulated in a short
time.  That has an "avalanche" effect, creating a "write glut"
which can stall all I/O for a period of many seconds up to a few
minutes.  If the OS was aware of the dirty pages pending write in
the application, and counted those for purposes of calculating when
and how much to write, the glut could be avoided.  Currently,
people configure the PostgreSQL background writer to be very
aggressive, configure a small PostgreSQL shared_buffers setting,
and/or set the OS thresholds low enough to minimize the problem;
but all of these mitigation strategies have their own costs.

A new hint that the application has dirtied a page could be used by
the OS to improve things this way:  When the OS is notified that a
page is dirty, it takes action depending on whether the page is
considered dirty by the OS.  If it is not dirty, the page is
immediately discarded from the OS cache.  It is known that the
application has a modified version of the page that it intends to
write, so the version in the OS cache has no value.  We don't want
this page forcing eviction of vrange()-flagged pages.  If it is
dirty, any write ordering to storage by the OS based on when the
page was written to the OS would be pushed back as far as possible
without crossing any write barriers, in hopes that the writes could
be combined.  Either way, this page is counted toward dirty pages
for purposes of calculating how much to write from the OS to
storage, and the later write of the page doesn't redundantly add to
this number.

The combination of these two changes could boost PostgreSQL
performance quite a bit, at least for some common workloads.

The MMAP approach always seems tempting on first blush, but the
need to "pin" pages and the need to assure that dirty pages are not
written ahead of the WAL-logging of those pages makes it hard to
see how we can use it.  The "pin" means that we need to ensure that
a particular 8KB page remains available for direct reference by all
PostgreSQL processes until it is "unpinned".  The other thing we
would need is the ability to modify a page with a solid assurance
that the modified page would *not* be written to disk until we
authorize it.  The page would remain pinned until we do authorize
write, at which point the changes are available to be written, but
can wait for an fsync or accumulations of sufficient dirty pages to
cross the write threshold.  Next comes the hard part.  The page may
or may not be unpinned after that, and if it remains pinned or is
pinned again, there may be further changes to the page.  While the
prior changes can be written (and *must* be written for an fsync),
these new changes must *not* be until we authorize it.  If MMAP can
be made to handle that, we could probably use it (and some of the
previously-discussed techniques might not be needed), but my
understanding is that there is currently no way to do so.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



From:
Claudio Freire
Date:

On Tue, Jan 14, 2014 at 11:39 AM, Hannu Krosing <> wrote:
> On 01/14/2014 09:39 AM, Claudio Freire wrote:
>> On Tue, Jan 14, 2014 at 5:08 AM, Hannu Krosing <> wrote:
>>> Again, as said above the linux file system is doing fine. What we
>>> want is a few ways to interact with it to let it do even better when
>>> working with postgresql by telling it some stuff it otherwise would
>>> have to second guess and by sometimes giving it back some cache
>>> pages which were copied away for potential modifying but ended
>>> up clean in the end.
>> You don't need new interfaces. Only a slight modification of what
>> fadvise DONTNEED does.
>>
>> This insistence in injecting pages from postgres to kernel is just a
>> bad idea.
> Do you think it would be possible to map copy-on-write pages
> from linux cache to postgresql cache ?
>
> this would be a step in direction of solving the double-ram-usage
> of pages which have not been read from syscache to postgresql
> cache without sacrificing linux read-ahead (which I assume does
> not happen when reads bypass system cache).
>
> and we can write back the copy at the point when it is safe (from
> postgresql perspective)  to let the system write them back ?
>
> Do you think it is possible to make it work with good performance
> for a few million 8kb pages ?

I don't think so. The kernel would need to walk the page mapping on
each page fault, which would incur the cost of a read cache hit on
each page fault.

A buffer-cache hit is still orders of magnitude slower than a regular
page fault, because the process page map is compact and efficient. But
if you bloat it, or if you make the kernel go read the buffer cache on
every fault, it would mean bad performance for plain RAM access, which
I'd venture isn't really a net gain.

That's probably the reason there is no zero-copy read mechanism.
Because you always have to copy from/to the buffer cache anyway.

Of course, this is just OTOMH. Without actually benchmarking, this is
all blabber.

>> At the very worst, it may
>> introduce serious security and reliability implications, when
>> applications can destroy the consistency of the page cache (even if
>> full access rights are checked, there's still the possibility this
>> inconsistency might be exploitable).
> If you allow write() which just writes clean pages, I can not see
> where the extra security concerns are beyond what normal
> write can do.

I've been working on security enough to never dismiss any kind of
system-level inconsistency.

The fact that you can make user-land applications see different data
than kernel-land code has over-reaching consequences that are hard to
ponder.



From:
Robert Haas
Date:

On Tue, Jan 14, 2014 at 3:39 AM, Claudio Freire <> wrote:
> On Tue, Jan 14, 2014 at 5:08 AM, Hannu Krosing <> wrote:
>> Again, as said above the linux file system is doing fine. What we
>> want is a few ways to interact with it to let it do even better when
>> working with postgresql by telling it some stuff it otherwise would
>> have to second guess and by sometimes giving it back some cache
>> pages which were copied away for potential modifying but ended
>> up clean in the end.
>
> You don't need new interfaces. Only a slight modification of what
> fadvise DONTNEED does.

Yeah.  DONTREALLYNEEDALLTHATTERRIBLYMUCH.

> This insistence in injecting pages from postgres to kernel is just a
> bad idea. At the very least, it still needs postgres to know too much
> of the filesystem (block layout) to properly work. Ie: pg must be
> required to put entire filesystem-level blocks into the page cache,
> since that's how the page cache works. At the very worst, it may
> introduce serious security and reliability implications, when
> applications can destroy the consistency of the page cache (even if
> full access rights are checked, there's still the possibility this
> inconsistency might be exploitable).

I agree with all that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



From:
Robert Haas
Date:

On Tue, Jan 14, 2014 at 5:00 AM, Jan Kara <> wrote:
> I thought that instead of injecting pages into pagecache for aging as you
> describe in 3), you would mark pages as volatile (i.e. for reclaim by
> kernel) through vrange() syscall. Next time you need the page, you check
> whether the kernel reclaimed the page or not. If yes, you reload it from
> disk, if not, you unmark it and use it.
>
> Now the aging of pages marked as volatile as it is currently implemented
> needn't be perfect for your needs but you still have time to influence what
> gets implemented... Actually developers of the vrange() syscall were
> specifically looking for some ideas what to base aging on. Currently I
> think it is first marked - first evicted.

This is an interesting idea but it stinks of impracticality.
Essentially when the last buffer pin on a page is dropped we'd have to
mark it as discardable, and then the next person wanting to pin it
would have to check whether it's still there.  But the system call
overhead of calling vrange() every time the last pin on a page was
dropped would probably hose us.

*thinks*

Well, I guess it could be done lazily: make periodic sweeps through
shared_buffers, looking for pages that haven't been touched in a
while, and vrange() them.  That's quite a bit of new mechanism, but in
theory it could work out to a win.  vrange() would have to scale well
to millions of separate ranges, though.  Will it?  And a lot depends
on whether the kernel makes the right decision about whether to evict
data from our vrange() vs. any other page it could have reclaimed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



From:
Tom Lane
Date:

James Bottomley <> writes:
> The current mechanism for coherency between a userspace cache and the
> in-kernel page cache is mmap ... that's the only way you get the same
> page in both currently.

Right.

> glibc used to have an implementation of read/write in terms of mmap, so
> it should be possible to insert it into your current implementation
> without a major rewrite.  The problem I think this brings you is
> uncontrolled writeback: you don't want dirty pages to go to disk until
> you issue a write()

Exactly.

> I think we could fix this with another madvise():
> something like MADV_WILLUPDATE telling the page cache we expect to alter
> the pages again, so don't be aggressive about cleaning them.

"Don't be aggressive" isn't good enough.  The prohibition on early write
has to be absolute, because writing a dirty page before we've done
whatever else we need to do results in a corrupt database.  It has to
be treated like a write barrier.

> The problem is we can't give you absolute control of when pages are
> written back because that interface can be used to DoS the system: once
> we get too many dirty uncleanable pages, we'll thrash looking for memory
> and the system will livelock.

Understood, but that makes this direction a dead end.  We can't use
it if the kernel might decide to write anyway.
        regards, tom lane



From:
Claudio Freire
Date:

On Tue, Jan 14, 2014 at 12:42 PM, Trond Myklebust <> wrote:
>> James Bottomley <> writes:
>>> The current mechanism for coherency between a userspace cache and the
>>> in-kernel page cache is mmap ... that's the only way you get the same
>>> page in both currently.
>>
>> Right.
>>
>>> glibc used to have an implementation of read/write in terms of mmap, so
>>> it should be possible to insert it into your current implementation
>>> without a major rewrite.  The problem I think this brings you is
>>> uncontrolled writeback: you don't want dirty pages to go to disk until
>>> you issue a write()
>>
>> Exactly.
>>
>>> I think we could fix this with another madvise():
>>> something like MADV_WILLUPDATE telling the page cache we expect to alter
>>> the pages again, so don't be aggressive about cleaning them.
>>
>> "Don't be aggressive" isn't good enough.  The prohibition on early write
>> has to be absolute, because writing a dirty page before we've done
>> whatever else we need to do results in a corrupt database.  It has to
>> be treated like a write barrier.
>
> Then why are you dirtying the page at all? It makes no sense to tell
> the kernel "we're changing this page in the page cache, but we don't
> want you to change it on disk": that's not consistent with the
> function of a page cache.


PG doesn't currently.

All that dirtying happens in anonymous shared memory, in pg-specific buffers.

The proposal is to use mmap instead of anonymous shared memory as
pg-specific buffers to avoid the extra copy (mmap would share the page
with both kernel and user space). But that would dirty the page when
written to, because now the kernel has the correspondence between that
specific memory region and the file, and that's forbidden for PG's
usage.

I believe the only option here is for the kernel to implement
zero-copy reads. But that implementation is doomed for the performance
reasons I outlined in an earlier mail. So...



From:
Tom Lane
Date:

Trond Myklebust <> writes:
> On Jan 14, 2014, at 10:39, Tom Lane <> wrote:
>> "Don't be aggressive" isn't good enough.  The prohibition on early write
>> has to be absolute, because writing a dirty page before we've done
>> whatever else we need to do results in a corrupt database.  It has to
>> be treated like a write barrier.

> Then why are you dirtying the page at all? It makes no sense to tell
> the kernel "we're changing this page in the page cache, but we don't
> want you to change it on disk": that's not consistent with the
> function of a page cache.

As things currently stand, we dirty the page in our internal buffers,
and we don't write it to the kernel until we've written and fsync'd the
WAL data that needs to get to disk first.  The discussion here is about
whether we could somehow avoid double-buffering between our internal
buffers and the kernel page cache.

I personally think there is no chance of using mmap for that; the
semantics of mmap are pretty much dictated by POSIX and they don't work
for this.  However, disregarding the fact that the two communities
speaking here don't control the POSIX spec, you could maybe imagine
making it work if *both* pending WAL file contents and data file
contents were mmap'd, and there were kernel APIs allowing us to say
"you can write this mmap'd page if you want, but not till you've written
that mmap'd data over there".  That'd provide the necessary
write-barrier semantics, and avoid the cache coherency question because
all the data visible to the kernel could be thought of as the "current"
filesystem contents, it just might not all have reached disk yet; which
is the behavior of the kernel disk cache already.

I'm dubious that this sketch is implementable with adequate efficiency,
though, because in a live system the kernel would be forced to deal with
a whole lot of active barrier restrictions.  Within Postgres we can
reduce write-ordering tests to a very simple comparison: don't write
this page until WAL is flushed to disk at least as far as WAL sequence
number XYZ.  I think any kernel API would have to be a great deal more
general and thus harder to optimize.
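The write-ordering test Tom describes really is just one comparison per page. The types and names below are illustrative stand-ins (PostgreSQL's actual machinery uses XLogRecPtr and more state), but they show why the in-database check is so cheap compared to any general kernel barrier API.

```c
/* Sketch: a dirty data page records the WAL sequence number (LSN) of
 * the last record that modified it; the page may be handed to write()
 * only once WAL has been durably flushed at least that far. */
#include <assert.h>
#include <stdint.h>

typedef uint64_t lsn_t;

struct data_page {
    lsn_t page_lsn;    /* LSN of the newest WAL record for this page */
    /* ... 8kB of page contents ... */
};

/* flushed_lsn: how far the WAL is known to be durably on disk. */
int page_is_writable(const struct data_page *p, lsn_t flushed_lsn)
{
    return p->page_lsn <= flushed_lsn;
}
```

A kernel-level ordering API would have to express arbitrary inter-page dependencies, which is exactly the generality Tom suspects would be hard to optimize.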

Another difficulty with merging our internal buffers with the kernel
cache is that when we're in the process of applying a change to a page,
there are intermediate states of the page data that should under no
circumstances reach disk (eg, we might need to shuffle records around
within the page).  We can deal with that fairly easily right now by not
issuing a write() while a page change is in progress.  I don't see that
it's even theoretically possible in an mmap'd world; there are no atomic
updates to an mmap'd page that are larger than whatever is an atomic
update for the CPU.
        regards, tom lane



From:
Jan Kara
Date:

On Tue 14-01-14 09:08:40, Hannu Krosing wrote:
> >>> Effectively you end up with buffered read/write that's also mapped into
> >>> the page cache.  It's a pretty awful way to hack around mmap.
> >> Well, the problem is that you can't really use mmap() for the things we
> >> do. Postgres' durability works by guaranteeing that our journal entries
> >> (called WAL := Write Ahead Log) are written & synced to disk before the
> >> corresponding entries of tables and indexes reach the disk. That also
> >> allows to group together many random-writes into a few contiguous writes
> >> fdatasync()ed at once. Only during a checkpointing phase the big bulk of
> >> the data is then (slowly, in the background) synced to disk.
> > Which is the exact algorithm most journalling filesystems use for
> > ensuring durability of their metadata updates.  Indeed, here's an
> > interesting piece of architecture that you might like to consider:
> >
> > * Neither XFS and BTRFS use the kernel page cache to back their
> >   metadata transaction engines.
> But file system code is supposed to know much more about the
> underlying disk than a mere application program like postgresql.
> 
> We do not want to start duplicating OS if we can avoid it.
> 
> What we would like is to have a way to tell the kernel
> 
> 1) "here is the modified copy of file page, it is now safe to write
>     it back" - the current 'lazy' write
> 
> 2) "here is the page, write it back now, before returning success
>     to me" - unbuffered write or write + sync
> 
> but we also would like to have
> 
> 3) "here is the page as it is currently on disk, I may need it soon,
>     so keep it together with your other clean pages accessed at time X"
>     - this is the non-dirtying write discussed
>    
>     the page may be in buffer cache, in which case just update its LRU
>     position (to either current time or time provided by postgresql), or
>     it may not be there, in which case put it there if reasonable by it's
>     LRU position.
> 
> And we would like all this to work together with other current linux
> kernel goodness of managing the whole disk-side interaction of
> efficient reading and writing and managing the buffers :)

So when I was speaking about the proposed vrange() syscall in this thread,
I thought that instead of injecting pages into pagecache for aging as you
describe in 3), you would mark pages as volatile (i.e. for reclaim by
kernel) through vrange() syscall. Next time you need the page, you check
whether the kernel reclaimed the page or not. If yes, you reload it from
disk, if not, you unmark it and use it.

Now the aging of pages marked as volatile as it is currently implemented
needn't be perfect for your needs but you still have time to influence what
gets implemented... Actually developers of the vrange() syscall were
specifically looking for some ideas what to base aging on. Currently I
think it is first marked - first evicted.
                            Honza
-- 
Jan Kara <>
SUSE Labs, CR



From:
Jan Kara
Date:

On Tue 14-01-14 11:11:28, Heikki Linnakangas wrote:
> On 01/14/2014 12:26 AM, Mel Gorman wrote:
> >On Mon, Jan 13, 2014 at 03:15:16PM -0500, Robert Haas wrote:
> >>The other thing that comes to mind is the kernel's caching behavior.
> >>We've talked a lot over the years about the difficulties of getting
> >>the kernel to write data out when we want it to and to not write data
> >>out when we don't want it to.
> >
> >Is sync_file_range() broke?
> >
> >>When it writes data back to disk too
> >>aggressively, we get lousy throughput because the same page can get
> >>written more than once when caching it for longer would have allowed
> >>write-combining.
> >
> >Do you think that is related to dirty_ratio or dirty_writeback_centisecs?
> >If it's dirty_writeback_centisecs then that would be particularly tricky
> >because poor interactions there would come down to luck basically.
> 
> >>When it doesn't write data to disk aggressively
> >>enough, we get huge latency spikes at checkpoint time when we call
> >>fsync() and the kernel says "uh, what? you wanted that data *on the
> >>disk*? sorry boss!" and then proceeds to destroy the world by starving
> >>the rest of the system for I/O for many seconds or minutes at a time.
> >
> >Ok, parts of that are somewhat expected. It *may* depend on the
> >underlying filesystem. Some of them handle fsync better than others. If
> >you are syncing the whole file though when you call fsync then you are
> >potentially burned by having to writeback dirty_ratio amounts of memory
> >which could take a substantial amount of time.
> >
> >>We've made some desultory attempts to use sync_file_range() to improve
> >>things here, but I'm not sure that's really the right tool, and if it
> >>is we don't know how to use it well enough to obtain consistent
> >>positive results.
> >
> >That implies that either sync_file_range() is broken in some fashion we
> >(or at least I) are not aware of and that needs kicking.
> 
> Let me try to explain the problem: Checkpoints can cause an I/O
> spike, which slows down other processes.
> 
> When it's time to perform a checkpoint, PostgreSQL will write() all
> dirty buffers from the PostgreSQL buffer cache, and finally perform
> an fsync() to flush the writes to disk. After that, we know the data
> is safely on disk.
> 
> In older PostgreSQL versions, the write() calls would cause an I/O
> storm as the OS cache quickly fills up with dirty pages, up to
> dirty_ratio, and after that all subsequent write()s block. That's OK
> as far as the checkpoint is concerned, but it significantly slows
> down queries running at the same time. Even a read-only query often
> needs to write(), to evict a dirty page from the buffer cache to
> make room for a different page. We made that less painful by adding
> sleeps between the write() calls, so that they are trickled over a
> long period of time and hopefully stay below dirty_ratio at all
> times.

Hum, I wonder whether you see any difference with reasonably recent
kernels (say newer than 3.2), because those have IO-less dirty throttling.
That means that:
a) the checkpointing thread (or other threads blocked due to the dirty
   limit) won't issue IO on their own but rather wait for the flusher
   thread to do the work.
b) there should be a more noticeable difference between the delay imposed
   on a heavily dirtying thread (i.e. the checkpointing thread) and the
   delay imposed on a lightly dirtying thread (that's what I would expect
   for threads doing occasional page eviction to make room for other
   pages).
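The write()-then-fsync() checkpoint sequence Heikki describes above can be sketched as follows. This is an illustrative Python sketch, not PostgreSQL's actual checkpointer code; the function and variable names are made up, and the 8 kB page size follows the discussion.

```python
import os
import tempfile

def checkpoint(path, dirty_buffers):
    """Write every dirty buffer, then fsync once to force it all to disk.

    dirty_buffers maps byte offsets to 8 kB pages (bytes objects).
    The write()s land in the OS page cache only; the single fsync()
    at the end is where the I/O spike happens.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        for offset, page in sorted(dirty_buffers.items()):
            os.pwrite(fd, page, offset)   # goes to the page cache only
        os.fsync(fd)                      # now everything must reach disk
    finally:
        os.close(fd)

path = os.path.join(tempfile.mkdtemp(), "relfile")
checkpoint(path, {0: b"A" * 8192, 8192: b"B" * 8192})
print(os.path.getsize(path))
```

The trickling described above corresponds to inserting sleeps between the pwrite() calls; the fsync() at the end is unchanged, which is why the storm can reappear there.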

> However, we still have to perform the fsync()s after the
> write()s, and sometimes that still causes a similar I/O storm.

Because there is still quite some dirty data in the page cache, or
because e.g. ext3 has to flush a lot of unrelated dirty data?

> The checkpointer is not in a hurry. A checkpoint typically has 10-30
> minutes to finish, before it's time to start the next checkpoint,
> and even if it misses that deadline that's not too serious either.
> But the OS doesn't know that, and we have no way of telling it.
> 
> As a quick fix, some sort of a lazy fsync() call would be nice. It
> would behave just like fsync() but it would not change the I/O
> scheduling at all. Instead, it would sleep until all the pages have
> been flushed to disk, at the speed they would've been without the
> fsync() call.
> 
> Another approach would be to give the I/O that the checkpointer
> process initiates a lower priority. This would be slightly
> preferable, because PostgreSQL could then issue the writes() as fast
> as it can, and have the checkpoint finish earlier when there's not
> much other load. Last I looked into this (which was a long time
> ago), there was no suitable priority system for writes, only reads.

Well, IO priority works for writes in principle, the trouble is it
doesn't work for writes which end up just in the page cache. Then writeback
of page cache is usually done by flusher thread so that's completely
disconnected from whoever created the dirty data (now I know this is dumb
and long term we want to do something about it so that IO cgroups work
reasonably reliably but it is a tough problem, lots of complexity for not so
great gain...).

However, if you really issue the IO from the thread with low priority, it
will have low priority. So specifically if you call fsync() from a thread
with low IO priority, the flushing done by fsync() will have this low
IO priority.

Similarly if you called sync_file_range() once in a while from a thread
with low IO priority, the flushing IO will have low IO priority.  But I
would be really careful about the periodic sync_file_range() calls - it has
a potential of mixing with writeback from flusher thread and mixing these
two on different parts of a file can lead to bad IO patterns...
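Concretely, Jan's sync_file_range() suggestion (which also approximates the "lazy fsync" Heikki asks for above) might look like the sketch below. This is Linux-only and uses ctypes to reach sync_file_range(2) through glibc; the flag values are hardcoded from <linux/fs.h>, and the function names are illustrative.

```python
import ctypes
import os
import tempfile

# Flag values from <linux/fs.h> (assumption: Linux with glibc).
SYNC_FILE_RANGE_WAIT_BEFORE = 1
SYNC_FILE_RANGE_WRITE = 2
SYNC_FILE_RANGE_WAIT_AFTER = 4

libc = ctypes.CDLL("libc.so.6", use_errno=True)
libc.sync_file_range.argtypes = [ctypes.c_int, ctypes.c_longlong,
                                 ctypes.c_longlong, ctypes.c_uint]

def start_writeback(fd):
    """Kick off writeback of the whole file without waiting for it."""
    if libc.sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE) != 0:
        raise OSError(ctypes.get_errno(), "sync_file_range")

def wait_for_writeback(fd):
    """Later, block until that writeback completed (no metadata sync)."""
    flags = (SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE
             | SYNC_FILE_RANGE_WAIT_AFTER)
    if libc.sync_file_range(fd, 0, 0, flags) != 0:
        raise OSError(ctypes.get_errno(), "sync_file_range")

path = os.path.join(tempfile.mkdtemp(), "data")
fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
os.write(fd, b"x" * 65536)
start_writeback(fd)      # non-blocking: IO proceeds at its own pace
wait_for_writeback(fd)   # the "lazy" part: sleep until it is done
os.close(fd)
print("flushed", os.path.getsize(path))
```

Per the caveat above, calling this periodically on small ranges can interleave badly with the flusher thread's own writeback; treating it as whole-file, occasional calls is the safer pattern.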
                            Honza
-- 
Jan Kara <>
SUSE Labs, CR



From:
Dave Chinner
Date:

On Mon, Jan 13, 2014 at 03:24:38PM -0800, Josh Berkus wrote:
> On 01/13/2014 02:26 PM, Mel Gorman wrote:
> > Really?
> > 
> > zone_reclaim_mode is often a complete disaster unless the workload is
> > partitioned to fit within NUMA nodes. On older kernels enabling it would
> > sometimes cause massive stalls. I'm actually very surprised to hear it
> > fixes anything and would be interested in hearing more about what sort
> > of circumstances would convince you to enable that thing.
> 
> So the problem with the default setting is that it pretty much isolates
> all FS cache for PostgreSQL to whichever socket the postmaster is
> running on, and makes the other FS cache unavailable.  This means that,
> for example, if you have two memory banks, then only one of them is
> available for PostgreSQL filesystem caching ... essentially cutting your
> available cache in half.

No matter what default NUMA allocation policy we set, there will be
an application for which that behaviour is wrong. As such, we've had
tools for setting application specific NUMA policies for quite a few
years now. e.g:

$ man 8 numactl
....
       --interleave=nodes, -i nodes
              Set a memory interleave policy. Memory will be allocated
              using round robin on nodes. When memory cannot be allocated
              on the current interleave target fall back to other nodes.
              Multiple nodes may be specified on --interleave, --membind
              and --cpunodebind.

Cheers,

Dave.
-- 
Dave Chinner




From:
Dave Chinner
Date:

On Tue, Jan 14, 2014 at 02:26:25AM +0100, Andres Freund wrote:
> On 2014-01-13 17:13:51 -0800, James Bottomley wrote:
> > a file into a user provided buffer, thus obtaining a page cache entry
> > and a copy in their userspace buffer, then insert the page of the user
> > buffer back into the page cache as the page cache page ... that's right,
> > isn't it postgres people?
> 
> Pretty much, yes. We'd probably hint (*advise(DONTNEED)) that the page
> isn't needed anymore when reading. And we'd normally write if the page
> is dirty.
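The hint Andres mentions is available today as posix_fadvise(POSIX_FADV_DONTNEED): once PostgreSQL holds its own copy of a clean page, it can ask the kernel to drop the page-cache copy. A minimal sketch (illustrative file names; real usage would only issue the hint when evicting from shared_buffers):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "seg")
fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
os.write(fd, b"p" * 8192)
os.fsync(fd)                          # clean pages can be dropped safely

buf = os.pread(fd, 8192, 0)           # our userspace copy of the page
# Hint: we have our own copy, the kernel may reclaim its cached pages.
os.posix_fadvise(fd, 0, 8192, os.POSIX_FADV_DONTNEED)
os.close(fd)
print(len(buf))
```

Note this only addresses one direction of the double buffering: it drops the kernel's copy, but cannot tell the kernel "this userspace page equals what is on disk", which is the part discussed below.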

So why, exactly, do you even need the kernel page cache here? You've
got direct access to the copy of data read into userspace, and you
want direct control of when and how the data in that buffer is
written and reclaimed. Why push that data buffer back into the
kernel and then have to add all sorts of kernel interfaces to
control the page you already have control of?

> > Effectively you end up with buffered read/write that's also mapped into
> > the page cache.  It's a pretty awful way to hack around mmap.
> 
> Well, the problem is that you can't really use mmap() for the things we
> do. Postgres' durability works by guaranteeing that our journal entries
> (called WAL := Write Ahead Log) are written & synced to disk before the
> corresponding entries of tables and indexes reach the disk. That also
> allows grouping many random writes into a few contiguous writes
> fdatasync()ed at once. Only during a checkpoint phase is the bulk of
> the data then (slowly, in the background) synced to disk.
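The ordering rule described here, the WAL record must be durable before the data page it protects may reach disk, can be sketched as follows (illustrative names, not PostgreSQL's real code):

```python
import os
import tempfile

d = tempfile.mkdtemp()
wal_fd = os.open(os.path.join(d, "wal"), os.O_WRONLY | os.O_CREAT, 0o600)
data_fd = os.open(os.path.join(d, "table"), os.O_WRONLY | os.O_CREAT, 0o600)

def update_page(page_no, new_page, wal_record):
    os.write(wal_fd, wal_record)      # 1. append the WAL record
    os.fdatasync(wal_fd)              # 2. make it durable -- the barrier
    os.pwrite(data_fd, new_page, page_no * 8192)  # 3. only now write the page
    # Step 3 need not be synced yet; a later checkpoint fsync()s it.

update_page(0, b"v2" + b"\0" * 8190, b"UPDATE page 0 -> v2\n")
print("wal bytes:", os.fstat(wal_fd).st_size)
```

The crash-safety argument: if the machine dies between steps 2 and 3, recovery replays the WAL record; if step 3 could reach disk before step 2 completed, a crash would leave a data page with no matching WAL, i.e. corruption.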

Which is the exact algorithm most journalling filesystems use for
ensuring durability of their metadata updates.  Indeed, here's an
interesting piece of architecture that you might like to consider:

* Neither XFS nor BTRFS uses the kernel page cache to back its metadata transaction engine.

Why not? Because the page cache is too simplistic to adequately
represent the complex object hierarchies that the filesystems have,
and so its flat LRU reclaim algorithms and writeback control
mechanisms are a terrible fit and cause lots of performance issues
under memory pressure.

IOWs, the two most complex high performance transaction engines in
the Linux kernel have moved to fully customised cache and (direct)
IO implementations because the requirements for scalability and
performance are far more complex than the kernel page cache
infrastructure can provide.

Just food for thought....

Cheers,

Dave.
-- 
Dave Chinner




From:
James Bottomley
Date:

On Mon, 2014-01-13 at 19:48 -0500, Trond Myklebust wrote:
> On Jan 13, 2014, at 19:03, Hannu Krosing <> wrote:
> 
> > On 01/13/2014 09:53 PM, Trond Myklebust wrote:
> >> On Jan 13, 2014, at 15:40, Andres Freund <> wrote:
> >> 
> >>> On 2014-01-13 15:15:16 -0500, Robert Haas wrote:
> >>>> On Mon, Jan 13, 2014 at 1:51 PM, Kevin Grittner <> wrote:
> >>>>> I notice, Josh, that you didn't mention the problems many people
> >>>>> have run into with Transparent Huge Page defrag and with NUMA
> >>>>> access.
> >>>> Amen to that.  Actually, I think NUMA can be (mostly?) fixed by
> >>>> setting zone_reclaim_mode; is there some other problem besides that?
> >>> I think that fixes some of the worst instances, but I've seen machines
> >>> spending horrible amounts of CPU (& BUS) time in page reclaim
> >>> nonetheless. If I analyzed it correctly it's in RAM << working set
> >>> workloads where RAM is pretty large and most of it is used as page
> >>> cache. The kernel ends up spending a huge percentage of time finding and
> >>> potentially defragmenting pages when looking for victim buffers.
> >>> 
> >>>> On a related note, there's also the problem of double-buffering.  When
> >>>> we read a page into shared_buffers, we leave a copy behind in the OS
> >>>> buffers, and similarly on write-out.  It's very unclear what to do
> >>>> about this, since the kernel and PostgreSQL don't have intimate
> >>>> knowledge of what each other are doing, but it would be nice to solve
> >>>> somehow.
> >>> I've wondered before if there wouldn't be a chance for postgres to say
> >>> "my dear OS, that the file range 0-8192 of file x contains y, no need to
> >>> reread" and do that when we evict a page from s_b but I never dared to
> >>> actually propose that to kernel people...
> >> O_DIRECT was specifically designed to solve the problem of double buffering 
> >> between applications and the kernel. Why are you not able to use that in these situations?
> > What is asked is the opposite of O_DIRECT - the write from a buffer inside
> > postgresql to linux *buffercache* and telling linux that it is the same
> > as what
> > is currently on disk, so don't bother to write it back ever.
> 
> I don’t understand. Are we talking about mmap()ed files here? Why
> would the kernel be trying to write back pages that aren’t dirty?

No ... if I have it right, it's pretty awful: they want to do a read of
a file into a user provided buffer, thus obtaining a page cache entry
and a copy in their userspace buffer, then insert the page of the user
buffer back into the page cache as the page cache page ... that's right,
isn't it postgres people?

Effectively you end up with buffered read/write that's also mapped into
the page cache.  It's a pretty awful way to hack around mmap.

James





From:
Dave Chinner
Date:

On Mon, Jan 13, 2014 at 09:29:02PM +0000, Greg Stark wrote:
> On Mon, Jan 13, 2014 at 9:12 PM, Andres Freund <> wrote:
> > For one, postgres doesn't use mmap for files (and can't without major
> > new interfaces). Frequently mmap()/madvise()/munmap()ing 8kb chunks has
> > horrible consequences for performance/scalability - very quickly you
> > contend on locks in the kernel.
> 
> I may as well dump this in this thread. We've discussed this in person
> a few times, including at least once with Ted T'so when he visited
> Dublin last year.
> 
> The fundamental conflict is that the kernel understands better the
> hardware and other software using the same resources, Postgres
> understands better its own access patterns. We need to either add
> interfaces so Postgres can teach the kernel what it needs about its
> access patterns or add interfaces so Postgres can find out what it
> needs to know about the hardware context.

In my experience applications don't need to know anything about the
underlying storage hardware - all they need is for someone to 
tell them the optimal IO size and alignment to use.

> The more ambitious and interesting direction is to let Postgres tell
> the kernel what it needs to know to manage everything. To do that we
> would need the ability to control when pages are flushed out. This is
> absolutely necessary to maintain consistency. Postgres would need to
> be able to mark pages as unflushable until some point in time in the
> future when the journal is flushed. We discussed various ways that
> interface could work but it would be tricky to keep it low enough
> overhead to be workable.

IMO, the concept of allowing userspace to pin dirty page cache
pages in memory is just asking for trouble. Apart from the obvious
memory reclaim and OOM issues, some filesystems won't be able to
move their journals forward until the data is flushed. i.e. ordered
mode data writeback on ext3 will have all sorts of deadlock issues
that result from pinning pages and then issuing fsync() on another
file which will block waiting for the pinned pages to be flushed.

Indeed, what happens if you do pin_dirty_pages(fd); .... fsync(fd);?
If fsync() blocks because there are pinned pages, and there's no
other thread to unpin them, then that code just deadlocked. If
fsync() doesn't block and skips the pinned pages, then we haven't
done an fsync() at all, and so violated the expectation that users
have that after fsync() returns their data is safe on disk. And if
we return an error to fsync(), then what the hell does the user do
if it is some other application we don't know about that has pinned
the pages? And if the kernel unpins them after some time, then we
just violated the application's consistency guarantees....

Hmmmm.  What happens if the process crashes after pinning the dirty
pages?  How do we even know what process pinned the dirty pages so
we can clean up after it? What happens if the same page is pinned by
multiple processes? What happens on truncate/hole punch if the
partial pages in the range that need to be zeroed and written are
pinned? What happens if we do direct IO to a range with pinned,
unflushable pages in the page cache?

These are all complex corner cases that are introduced by allowing
applications to pin dirty pages in memory. I've only spent a few
minutes coming up with these, and I'm sure there's more of them.
As such, I just don't see that allowing userspace to pin dirty
page cache pages in memory being a workable solution.

> The less exciting, more conservative option would be to add kernel
> interfaces to teach Postgres about things like raid geometries. Then

/sys/block/<dev>/queue/* contains all the information that is
exposed to filesystems to optimise layout for storage geometry.
Some filesystems can already expose the relevant parts of this
information to userspace, others don't.

What I think we really need to provide is a generic interface
similar to the old XFS_IOC_DIOINFO ioctl that can be used to
expose IO characteristics to applications in a simple, easy to
gather manner.  Something like:

struct io_info {
        u64    minimum_io_size;      /* sector size */
        u64    maximum_io_size;      /* currently 2GB */
        u64    optimal_io_size;      /* stripe unit/width */
        u64    optimal_io_alignment; /* stripe unit/width */
        u64    mem_alignment;        /* PAGE_SIZE */
        u32    queue_depth;          /* max IO concurrency */
};

> Postgres could use directio and decide to do prefetching based on the
> raid geometry,

Underlying storage array raid geometry and optimal IO sizes for the
filesystem may be different. Hence you want what the filesystem
considers optimal, not what the underlying storage is configured
with. Indeed, a filesystem might be able to supply per-file IO
characteristics depending on where it is located in the filesystem
(think tiered storage)....

> how much available i/o bandwidth and iops is available,
> etc.

The kernel doesn't really know what a device is capable of - it can
only measure what the current IO workload is achieving - and it can
change based on the IO workload characteristics. Hence applications
can track this as well as the kernel does if they need this
information for any reason.

> Reimplementing i/o schedulers and all the rest of the work that the

Nobody needs to reimplement IO schedulers in userspace. Direct IO
still goes through the block layers where all that merging and
IO scheduling occurs.

> kernel provides inside Postgres just seems like something outside our
> competency and that none of us is really excited about doing.

That argument goes both ways - providing fine-grained control over
the page cache contents to userspace doesn't get me excited, either.
In fact, it scares the living daylights out of me. It's complex,
it's fragile and it introduces constraints into everything we do in
the kernel. Any one of those reasons is grounds for saying no to a
proposal, but this idea hits the trifecta....

I'm not saying that O_DIRECT is easy or perfect, but it seems to me
to be a more robust, secure, maintainable and simpler solution than
trying to give applications direct control over complex internal
kernel structures and algorithms.

Cheers,

Dave.
-- 
Dave Chinner




From:
James Bottomley
Date:

On Tue, 2014-01-14 at 15:39 +0100, Hannu Krosing wrote:
> On 01/14/2014 09:39 AM, Claudio Freire wrote:
> > On Tue, Jan 14, 2014 at 5:08 AM, Hannu Krosing <> wrote:
> >> Again, as said above the linux file system is doing fine. What we
> >> want is a few ways to interact with it to let it do even better when
> >> working with postgresql by telling it some stuff it otherwise would
> >> have to second guess and by sometimes giving it back some cache
> >> pages which were copied away for potential modifying but ended
> >> up clean in the end.
> > You don't need new interfaces. Only a slight modification of what
> > fadvise DONTNEED does.
> >
> > This insistence in injecting pages from postgres to kernel is just a
> > bad idea. 
> Do you think it would be possible to map copy-on-write pages
> from linux cache to postgresql cache ?
> 
> this would be a step in direction of solving the double-ram-usage
> of pages which have not been read from syscache to postgresql
> cache without sacrificing linux read-ahead (which I assume does
> not happen when reads bypass system cache).

The current mechanism for coherency between a userspace cache and the
in-kernel page cache is mmap ... that's the only way you get the same
page in both currently.
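That coherency is easy to observe directly: a MAP_SHARED mapping and ordinary read()/write() on the same file go through the same page-cache page. A small sketch (Python's mmap module wraps the same mmap(2) flags):

```python
import mmap
import os
import tempfile

# Demonstrate that a MAP_SHARED mapping and read()/write() on the same
# file are coherent: both sides see each other's stores immediately,
# because they share the same page-cache page.
path = os.path.join(tempfile.mkdtemp(), "shared")
with open(path, "wb") as f:
    f.write(b"\0" * 8192)                 # one 8 kB "database page"

fd = os.open(path, os.O_RDWR)
m = mmap.mmap(fd, 8192, mmap.MAP_SHARED, mmap.PROT_READ | mmap.PROT_WRITE)

m[0:5] = b"hello"                         # store through the mapping...
via_read = os.pread(fd, 5, 0)             # ...immediately visible to read()

os.pwrite(fd, b"WORLD", 0)                # write() through the fd...
via_map = bytes(m[0:5])                   # ...immediately visible in the map

print(via_read, via_map)
m.close()
os.close(fd)
```

This is exactly why mmap keeps coming up in this thread: it is the one existing interface where the kernel copy and the application copy are the same page, with all the writeback-control drawbacks discussed below.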

glibc used to have an implementation of read/write in terms of mmap, so
it should be possible to insert it into your current implementation
without a major rewrite.  The problem I think this brings you is
uncontrolled writeback: you don't want dirty pages to go to disk until
you issue a write().  I think we could fix this with another madvise():
something like MADV_WILLUPDATE telling the page cache we expect to alter
the pages again, so don't be aggressive about cleaning them.  Plus all
the other issues with mmap() ... but if you can detail those, we might
be able to fix them.

> and we can write back the copy at the point when it is safe (from
> postgresql perspective)  to let the system write them back ?

Using MADV_WILLUPDATE, possibly ... you're still not going to have
absolute control.  The kernel will write back the pages if the dirty
limits are exceeded, for instance, but we could tune it to be useful.

> Do you think it is possible to make it work with good performance
> for a few million 8kb pages ?
> 
> > At the very least, it still needs postgres to know too much
> > of the filesystem (block layout) to properly work. Ie: pg must be
> > required to put entire filesystem-level blocks into the page cache,
> > since that's how the page cache works. 
> I was thinking more of a simple write() interface with extra
> flags/sysctls to tell kernel that "we already have this on disk"
> > At the very worst, it may
> > introduce serious security and reliability implications, when
> > applications can destroy the consistency of the page cache (even if
> > full access rights are checked, there's still the possibility this
> > inconsistency might be exploitable).
> If you allow write() which just writes clean pages, I can not see
> where the extra security concerns are beyond what normal
> write can do.

The problem is we can't give you absolute control of when pages are
written back because that interface can be used to DoS the system: once
we get too many dirty uncleanable pages, we'll thrash looking for memory
and the system will livelock.

James





From:
Trond Myklebust
Date:

On Jan 14, 2014, at 10:39, Tom Lane <> wrote:

> James Bottomley <> writes:
>> The current mechanism for coherency between a userspace cache and the
>> in-kernel page cache is mmap ... that's the only way you get the same
>> page in both currently.
>
> Right.
>
>> glibc used to have an implementation of read/write in terms of mmap, so
>> it should be possible to insert it into your current implementation
>> without a major rewrite.  The problem I think this brings you is
>> uncontrolled writeback: you don't want dirty pages to go to disk until
>> you issue a write()
>
> Exactly.
>
>> I think we could fix this with another madvise():
>> something like MADV_WILLUPDATE telling the page cache we expect to alter
>> the pages again, so don't be aggressive about cleaning them.
>
> "Don't be aggressive" isn't good enough.  The prohibition on early write
> has to be absolute, because writing a dirty page before we've done
> whatever else we need to do results in a corrupt database.  It has to
> be treated like a write barrier.

Then why are you dirtying the page at all? It makes no sense to tell the
kernel “we’re changing this page in the page cache, but we don’t want you
to change it on disk”: that’s not consistent with the function of a page
cache.

>> The problem is we can't give you absolute control of when pages are
>> written back because that interface can be used to DoS the system: once
>> we get too many dirty uncleanable pages, we'll thrash looking for memory
>> and the system will livelock.
>
> Understood, but that makes this direction a dead end.  We can't use
> it if the kernel might decide to write anyway.
>
>             regards, tom lane




From:
Robert Haas
Date:

On Tue, Jan 14, 2014 at 11:44 AM, James Bottomley
<> wrote:
> No, I'm sorry, that's never going to be possible.  No user space
> application has all the facts.  If we give you an interface to force
> unconditional holding of dirty pages in core you'll livelock the system
> eventually because you made a wrong decision to hold too many dirty
> pages.   I don't understand why this has to be absolute: if you advise
> us to hold the pages dirty and we do up until it becomes a choice to
> hold on to the pages or to thrash the system into a livelock, why would
> you ever choose the latter?  And if, as I'm assuming, you never would,
> why don't you want the kernel to make that choice for you?

If you don't understand how write-ahead logging works, this
conversation is going nowhere.  Suffice it to say that the word
"ahead" is not optional.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



From:
Claudio Freire
Date:

On Tue, Jan 14, 2014 at 1:48 PM, Robert Haas <> wrote:
> On Tue, Jan 14, 2014 at 11:44 AM, James Bottomley
> <> wrote:
>> No, I'm sorry, that's never going to be possible.  No user space
>> application has all the facts.  If we give you an interface to force
>> unconditional holding of dirty pages in core you'll livelock the system
>> eventually because you made a wrong decision to hold too many dirty
>> pages.   I don't understand why this has to be absolute: if you advise
>> us to hold the pages dirty and we do up until it becomes a choice to
>> hold on to the pages or to thrash the system into a livelock, why would
>> you ever choose the latter?  And if, as I'm assuming, you never would,
>> why don't you want the kernel to make that choice for you?
>
> If you don't understand how write-ahead logging works, this
> conversation is going nowhere.  Suffice it to say that the word
> "ahead" is not optional.


In essence, if you do flush when you shouldn't, and there is a
hardware failure, or kernel panic, or anything that stops the rest of
the writes from succeeding, your database is kaputt, and you've got to
restore a backup.

Ie: very very bad.



From:
Heikki Linnakangas
Date:

On 01/14/2014 06:08 PM, Tom Lane wrote:
> Trond Myklebust <> writes:
>> On Jan 14, 2014, at 10:39, Tom Lane <> wrote:
>>> "Don't be aggressive" isn't good enough.  The prohibition on early write
>>> has to be absolute, because writing a dirty page before we've done
>>> whatever else we need to do results in a corrupt database.  It has to
>>> be treated like a write barrier.
>
>> Then why are you dirtying the page at all? It makes no sense to tell the
>> kernel “we’re changing this page in the page cache, but we don’t want you
>> to change it on disk”: that’s not consistent with the function of a page
>> cache.
>
> As things currently stand, we dirty the page in our internal buffers,
> and we don't write it to the kernel until we've written and fsync'd the
> WAL data that needs to get to disk first.  The discussion here is about
> whether we could somehow avoid double-buffering between our internal
> buffers and the kernel page cache.

To be honest, I think the impact of double buffering in real-life 
applications is greatly exaggerated. If you follow the usual guideline 
and configure shared_buffers to 25% of available RAM, at worst you're 
wasting 25% of RAM to double buffering. That's significant, but it's not 
the end of the world, and it's a problem that can be compensated by 
simply buying more RAM.

Of course, if someone can come up with an easy way to solve that, that'd 
be great, but if it means giving up other advantages that we get from 
relying on the OS page cache, then -1 from me. The usual response to the 
"why don't you just use O_DIRECT?" is that it'd require reimplementing a 
lot of I/O infrastructure, but misses an IMHO more important point: it 
would require setting shared_buffers a lot higher to get the same level 
of performance you get today. That has a number of problems:

1. It becomes a lot more important to tune shared_buffers correctly. Set 
it too low, and you're not taking advantage of all the RAM available. 
Set it too high, and you'll start swapping, totally killing performance. 
I can already hear consultants rubbing their hands, waiting for the rush 
of customers that will need expert help to determine the optimal 
shared_buffers setting.

2. Memory spent on the buffer cache can't be used for other things. For 
example, an index build can temporarily allocate several gigabytes of 
memory; if that memory is allocated to the shared buffer cache, it can't 
be used for that purpose. Yeah, we could change that, and allow 
borrowing pages from the shared buffer cache for other purposes, but 
that means more work and more code.

3. Memory used for the shared buffer cache can't be used by other 
processes (without swapping). It becomes a lot harder to be a good 
citizen on a system that's not entirely dedicated to PostgreSQL.

So not only would we need to re-implement I/O infrastructure, we'd also 
need to make memory management a lot smarter and a lot more flexible. 
We'd need a lot more information on what else is running on the system 
and how badly they need memory.

> I personally think there is no chance of using mmap for that; the
> semantics of mmap are pretty much dictated by POSIX and they don't work
> for this.

Agreed. It would be possible to use mmap() for pages that are not 
modified, though. When you're not modifying, you could mmap() the data 
you need, and bypass the PostgreSQL buffer cache that way. The 
interaction with the buffer cache becomes complicated, because you 
couldn't use the buffer cache's locks etc., and some pages might have a 
newer version in the buffer cache than on-disk, but it might be doable.

- Heikki



From:
Kevin Grittner
Date:

Claudio Freire <> wrote:
> Robert Haas <> wrote:
>> James Bottomley <> wrote:

>>> I don't understand why this has to be absolute: if you advise
>>> us to hold the pages dirty and we do up until it becomes a
>>> choice to hold on to the pages or to thrash the system into a
>>> livelock, why would you ever choose the latter?

Because the former creates database corruption and the latter does
not.

>>> And if, as I'm assuming, you never would,

That assumption is totally wrong.

>>> why don't you want the kernel to make that choice for you?
>>
>> If you don't understand how write-ahead logging works, this
>> conversation is going nowhere.  Suffice it to say that the word
>> "ahead" is not optional.
>
> In essence, if you do flush when you shouldn't, and there is a
> hardware failure, or kernel panic, or anything that stops the
> rest of the writes from succeeding, your database is kaputt, and
> you've got to restore a backup.
>
> Ie: very very bad.

Yup.  And when that's a few terabytes, you will certainly find
yourself wishing that you had been able to do a recovery up to the
end of the last successfully committed transaction rather than a
restore from backup.

Now, as Tom said, if there was an API to create write boundaries
between particular dirty pages we could leave it to the OS.  Each
WAL record's write would be conditional on the previous one and
each data page write would be conditional on the WAL record for the
last update to the page.  But nobody seems to think that would
yield acceptable performance.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



From:
Robert Haas
Date:

On Tue, Jan 14, 2014 at 11:57 AM, James Bottomley
<> wrote:
> On Tue, 2014-01-14 at 11:48 -0500, Robert Haas wrote:
>> On Tue, Jan 14, 2014 at 11:44 AM, James Bottomley
>> <> wrote:
>> > No, I'm sorry, that's never going to be possible.  No user space
>> > application has all the facts.  If we give you an interface to force
>> > unconditional holding of dirty pages in core you'll livelock the system
>> > eventually because you made a wrong decision to hold too many dirty
>> > pages.   I don't understand why this has to be absolute: if you advise
>> > us to hold the pages dirty and we do up until it becomes a choice to
>> > hold on to the pages or to thrash the system into a livelock, why would
>> > you ever choose the latter?  And if, as I'm assuming, you never would,
>> > why don't you want the kernel to make that choice for you?
>>
>> If you don't understand how write-ahead logging works, this
>> conversation is going nowhere.  Suffice it to say that the word
>> "ahead" is not optional.
>
> No, I do ... you mean the order of write out, if we have to do it, is
> important.  In the rest of the kernel, we do this with barriers which
> causes ordered grouping of I/O chunks.  If we could force a similar
> ordering in the writeout code, is that enough?

Probably not.  There are a whole raft of problems here.  For that to
be any of any use, we'd have to move to mmap()ing each buffer instead
of read()ing them in, and apparently mmap() doesn't scale well to
millions of mappings.  And even if it did, then we'd have a solution
that only works on Linux.  Plus, as Tom pointed out, there are
critical sections where it's not just a question of ordering but in
fact you need to completely hold off writes.

In terms of avoiding double-buffering, here's my thought after reading
what's been written so far.  Suppose we read a page into our buffer
pool.  Until the page is clean, it would be ideal for the mapping to
be shared between the buffer cache and our pool, sort of like
copy-on-write.  That way, if we decide to evict the page, it will
still be in the OS cache if we end up needing it again (remember, the
OS cache is typically much larger than our buffer pool).  But if the
page is dirtied, then instead of copying it, just have the buffer pool
forget about it, because at that point we know we're going to write
the page back out anyway before evicting it.

This would be pretty similar to copy-on-write, except without the
copying.  It would just be forget-from-the-buffer-pool-on-write.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



From:
Robert Haas
Date:

On Tue, Jan 14, 2014 at 12:12 PM, Robert Haas <> wrote:
> In terms of avoiding double-buffering, here's my thought after reading
> what's been written so far.  Suppose we read a page into our buffer
> pool.  Until the page is clean, it would be ideal for the mapping to

Correction: "For so long as the page is clean..."

> be shared between the buffer cache and our pool, sort of like
> copy-on-write.  That way, if we decide to evict the page, it will
> still be in the OS cache if we end up needing it again (remember, the
> OS cache is typically much larger than our buffer pool).  But if the
> page is dirtied, then instead of copying it, just have the buffer pool
> forget about it, because at that point we know we're going to write
> the page back out anyway before evicting it.
>
> This would be pretty similar to copy-on-write, except without the
> copying.  It would just be forget-from-the-buffer-pool-on-write.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



From:
Claudio Freire
Date:

On Tue, Jan 14, 2014 at 2:12 PM, Robert Haas <> wrote:
>
> In terms of avoiding double-buffering, here's my thought after reading
> what's been written so far.  Suppose we read a page into our buffer
> pool.  Until the page is clean, it would be ideal for the mapping to
> be shared between the buffer cache and our pool, sort of like
> copy-on-write.  That way, if we decide to evict the page, it will
> still be in the OS cache if we end up needing it again (remember, the
> OS cache is typically much larger than our buffer pool).  But if the
> page is dirtied, then instead of copying it, just have the buffer pool
> forget about it, because at that point we know we're going to write
> the page back out anyway before evicting it.
>
> This would be pretty similar to copy-on-write, except without the
> copying.  It would just be forget-from-the-buffer-pool-on-write.


But... either copy-on-write or forget-on-write needs a page fault, and
thus a page mapping.

Is a page fault more expensive than copying 8k?

(I really don't know).



From:
Robert Haas
Date:

On Tue, Jan 14, 2014 at 12:15 PM, Claudio Freire <> wrote:
> On Tue, Jan 14, 2014 at 2:12 PM, Robert Haas <> wrote:
>> In terms of avoiding double-buffering, here's my thought after reading
>> what's been written so far.  Suppose we read a page into our buffer
>> pool.  Until the page is clean, it would be ideal for the mapping to
>> be shared between the buffer cache and our pool, sort of like
>> copy-on-write.  That way, if we decide to evict the page, it will
>> still be in the OS cache if we end up needing it again (remember, the
>> OS cache is typically much larger than our buffer pool).  But if the
>> page is dirtied, then instead of copying it, just have the buffer pool
>> forget about it, because at that point we know we're going to write
>> the page back out anyway before evicting it.
>>
>> This would be pretty similar to copy-on-write, except without the
>> copying.  It would just be forget-from-the-buffer-pool-on-write.
>
> But... either copy-on-write or forget-on-write needs a page fault, and
> thus a page mapping.
>
> Is a page fault more expensive than copying 8k?

I don't know either.  I wasn't thinking so much that it would save CPU
time as that it would save memory.  Consider a system with 32GB of
RAM.  If you set shared_buffers=8GB, then in the worst case you've got
25% of your RAM wasted storing pages that already exist, dirtied, in
shared_buffers.  It's easy to imagine scenarios in which that results
in lots of extra I/O, so that the CPU required to do the accounting
comes to seem cheap by comparison.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



From:
Hannu Krosing
Date:

On 01/14/2014 05:44 PM, James Bottomley wrote:
> On Tue, 2014-01-14 at 10:39 -0500, Tom Lane wrote:
>> James Bottomley <> writes:
>>> The current mechanism for coherency between a userspace cache and the
>>> in-kernel page cache is mmap ... that's the only way you get the same
>>> page in both currently.
>> Right.
>>
>>> glibc used to have an implementation of read/write in terms of mmap, so
>>> it should be possible to insert it into your current implementation
>>> without a major rewrite.  The problem I think this brings you is
>>> uncontrolled writeback: you don't want dirty pages to go to disk until
>>> you issue a write()
>> Exactly.
>>
>>> I think we could fix this with another madvise():
>>> something like MADV_WILLUPDATE telling the page cache we expect to alter
>>> the pages again, so don't be aggressive about cleaning them.
>> "Don't be aggressive" isn't good enough.  The prohibition on early write
>> has to be absolute, because writing a dirty page before we've done
>> whatever else we need to do results in a corrupt database.  It has to
>> be treated like a write barrier.
>>
>>> The problem is we can't give you absolute control of when pages are
>>> written back because that interface can be used to DoS the system: once
>>> we get too many dirty uncleanable pages, we'll thrash looking for memory
>>> and the system will livelock.
>> Understood, but that makes this direction a dead end.  We can't use
>> it if the kernel might decide to write anyway.
> No, I'm sorry, that's never going to be possible.  No user space
> application has all the facts.  If we give you an interface to force
> unconditional holding of dirty pages in core you'll livelock the system
> eventually because you made a wrong decision to hold too many dirty
> pages.   I don't understand why this has to be absolute: if you advise
> us to hold the pages dirty and we do up until it becomes a choice to
> hold on to the pages or to thrash the system into a livelock, why would
> you ever choose the latter?  And if, as I'm assuming, you never would,
> why don't you want the kernel to make that choice for you?
The short answer is "crash safety".

A database system worth its name must make sure that all data
reported as stored to clients is there even after crash.

The write-ahead log is the means for that, and WAL files and
data pages have to be written in a certain order to guarantee
consistent recovery after a crash.

-- 
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ




From:
Kevin Grittner
Date:

James Bottomley <> wrote:

> you mean the order of write out, if we have to do it, is
> important.  In the rest of the kernel, we do this with barriers
> which causes ordered grouping of I/O chunks.  If we could force a
> similar ordering in the writeout code, is that enough?

Unless it can be between particular pairs of pages, I don't think
performance could be at all acceptable.  Each data page has an
associated Log Sequence Number reflecting the last Write-Ahead Log
record which records a change to that page, and the referenced WAL
record must be safely persisted before the data page is allowed to
be written.  Currently, when we need to write a dirty page to the
OS, we must ensure that the WAL record is written and fsync'd
first.  We also write a WAL record for each transaction commit and
fsync it before telling the client that the COMMIT
request was successful.  (Well, at least by default; they can
choose to set synchronous_commit to off for some or all
transactions.)  If a write barrier to control this applied to
everything on the filesystem, performance would be horrible.
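The ordering constraint can be sketched like this (a sketch only; the record format is made up and real WAL writes are far more involved):

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define PAGE_SIZE 8192

/* Flush one dirty data page while honouring the write-ahead rule:
 * the WAL record carrying the page's LSN must be durable before the
 * page itself may be handed to the OS.  Returns 0 on success. */
static int flush_page_wal_first(int wal_fd, int data_fd,
                                const char *wal_rec, size_t rec_len,
                                const char page[PAGE_SIZE])
{
    /* 1. WAL record first ... */
    if (write(wal_fd, wal_rec, rec_len) != (ssize_t) rec_len)
        return -1;
    /* 2. ... made durable ... */
    if (fsync(wal_fd) != 0)
        return -1;
    /* 3. ... and only then the data page.  Letting the kernel do
     * step 3 before step 2 is exactly the early write that leaves a
     * corrupt database after a crash. */
    if (write(data_fd, page, PAGE_SIZE) != (ssize_t) PAGE_SIZE)
        return -1;
    return 0;
}
```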

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



From:
Jeff Janes
Date:

On Mon, Jan 13, 2014 at 2:36 PM, Mel Gorman <> wrote:
> On Mon, Jan 13, 2014 at 06:27:03PM -0200, Claudio Freire wrote:
> > On Mon, Jan 13, 2014 at 5:23 PM, Jim Nasby <> wrote:
> > > On 1/13/14, 2:19 PM, Claudio Freire wrote:
> > >> On Mon, Jan 13, 2014 at 5:15 PM, Robert Haas <> wrote:
> > >>> On a related note, there's also the problem of double-buffering.  When
> > >>> we read a page into shared_buffers, we leave a copy behind in the OS
> > >>> buffers, and similarly on write-out.  It's very unclear what to do
> > >>> about this, since the kernel and PostgreSQL don't have intimate
> > >>> knowledge of what each other are doing, but it would be nice to solve
> > >>> somehow.
> > >>
> > >> There you have a much harder algorithmic problem.
> > >>
> > >> You can basically control duplication with fadvise and WONTNEED. The
> > >> problem here is not the kernel and whether or not it allows postgres
> > >> to be smart about it. The problem is... what kind of smarts
> > >> (algorithm) to use.
> > >
> > > Isn't this a fairly simple matter of when we read a page into shared buffers
> > > tell the kernel to forget that page? And a corollary to that for when we
> > > dump a page out of shared_buffers (here kernel, please put this back into
> > > your cache).
> >
> > That's my point. In terms of kernel-postgres interaction, it's fairly simple.
> >
> > What's not so simple is figuring out what policy to use. Remember,
> > you cannot tell the kernel to put some page in its page cache without
> > reading it or writing it. So, once you make the kernel forget a page,
> > evicting it from shared buffers becomes quite expensive.
>
> posix_fadvise(POSIX_FADV_WILLNEED) is meant to cover this case by
> forcing readahead.

But telling the kernel to forget a page, then telling it to read it in
again from disk because it might be needed again in the near future, is
itself very expensive.  We would need to hand the page to the kernel so
it has it without needing to go to disk to get it.

> If you evict it prematurely then you do get kinda
> screwed because you pay the IO cost to read it back in again even if you
> had enough memory to cache it. Maybe this is the type of kernel-postgres
> interaction that is annoying you.
>
> If you don't evict, the kernel eventually steps in and evicts the wrong
> thing. If you do evict and it was unnecessary, you pay an IO cost.
>
> That could be something we look at. There are cases buried deep in the
> VM where pages get shuffled to the end of the LRU and get tagged for
> reclaim as soon as possible. Maybe you need access to something like
> that via posix_fadvise to say "reclaim this page if you need memory but
> leave it resident if there is no memory pressure" or something similar.
> Not exactly sure what that interface would look like or offhand how it
> could be reliably implemented.

I think the "reclaim this page if you need memory but leave it resident
if there is no memory pressure" hint would be more useful for temporary
working files than for what was being discussed above (shared buffers).
When I do work that needs large temporary files, I often see physical
write IO spike but physical read IO does not.  I interpret that to mean
that the temporary data is being written to disk to satisfy either
dirty_expire_centisecs or dirty_*bytes, but the data remains in the FS
cache and so disk reads are not needed to satisfy it.  So a hint that
says "this file will never be fsynced so please ignore dirty_*bytes and
dirty_expire_centisecs.  I will need it again relatively soon (but not
after a reboot), and will read it mostly sequentially, so please don't
evict this without need, but if you do need to then it is a good
candidate" would be good.

Cheers,

Jeff
From:
Robert Haas
Date:

On Tue, Jan 14, 2014 at 12:20 PM, James Bottomley
<> wrote:
> On Tue, 2014-01-14 at 15:15 -0200, Claudio Freire wrote:
>> On Tue, Jan 14, 2014 at 2:12 PM, Robert Haas <> wrote:
>> > In terms of avoiding double-buffering, here's my thought after reading
>> > what's been written so far.  Suppose we read a page into our buffer
>> > pool.  Until the page is clean, it would be ideal for the mapping to
>> > be shared between the buffer cache and our pool, sort of like
>> > copy-on-write.  That way, if we decide to evict the page, it will
>> > still be in the OS cache if we end up needing it again (remember, the
>> > OS cache is typically much larger than our buffer pool).  But if the
>> > page is dirtied, then instead of copying it, just have the buffer pool
>> > forget about it, because at that point we know we're going to write
>> > the page back out anyway before evicting it.
>> >
>> > This would be pretty similar to copy-on-write, except without the
>> > copying.  It would just be forget-from-the-buffer-pool-on-write.
>>
>> But... either copy-on-write or forget-on-write needs a page fault, and
>> thus a page mapping.
>>
>> Is a page fault more expensive than copying 8k?
>>
>> (I really don't know).
>
> A page fault can be expensive, yes ... but perhaps you don't need one.
>
> What you want is a range of memory that's read from a file but treated
> as anonymous for writeout (i.e. written to swap if we need to reclaim
> it). Then at some time later, you want to designate it as written back
> to the file instead so you control the writeout order.  I'm not sure we
> can do this: the separation between file backed and anonymous pages is
> pretty deeply ingrained into the OS, but if it were possible, is that
> what you want?

Doesn't sound exactly like what I had in mind.  What I was suggesting
is an analogue of read() that, if it reads full pages of data to a
page-aligned address, shares the data with the buffer cache until it's
first written instead of actually copying the data.  The pages are
write-protected so that an attempt to write the address range causes a
page fault.  In response to such a fault, the pages become anonymous
memory and the buffer cache no longer holds a reference to the page.
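The fault-on-first-write half of that can be approximated in userspace today with mprotect() and a SIGSEGV handler (a sketch of the mechanism only: mprotect() from a signal handler is not formally async-signal-safe, though it works on Linux, and the actual page-cache sharing would still need kernel support):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

static char *region;
static size_t region_len;
static volatile sig_atomic_t faulted;

/* First write to the read-only region faults; "detach" the page by
 * making the region writable, roughly as the kernel would turn the
 * shared page into private anonymous memory. */
static void on_fault(int sig, siginfo_t *si, void *ctx)
{
    (void) sig; (void) ctx;
    char *addr = (char *) si->si_addr;
    if (addr < region || addr >= region + region_len)
        abort();   /* a genuine crash, not our protected page */
    faulted = 1;
    mprotect(region, region_len, PROT_READ | PROT_WRITE);
}

/* Map an anonymous region read-only so that the first write to it
 * raises SIGSEGV, which on_fault() then resolves; the faulting
 * instruction is retried and succeeds. */
static char *map_write_protected(size_t len)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_fault;
    sa.sa_flags = SA_SIGINFO;
    if (sigaction(SIGSEGV, &sa, NULL) != 0)
        return NULL;

    region_len = len;
    region = mmap(NULL, len, PROT_READ,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return region == MAP_FAILED ? NULL : region;
}
```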

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



From:
Tom Lane
Date:

Robert Haas <> writes:
> On Tue, Jan 14, 2014 at 11:57 AM, James Bottomley
> <> wrote:
>> No, I do ... you mean the order of write out, if we have to do it, is
>> important.  In the rest of the kernel, we do this with barriers which
>> causes ordered grouping of I/O chunks.  If we could force a similar
>> ordering in the writeout code, is that enough?

> Probably not.  There are a whole raft of problems here.  For that to
> be any of any use, we'd have to move to mmap()ing each buffer instead
> of read()ing them in, and apparently mmap() doesn't scale well to
> millions of mappings.

We would presumably mmap whole files, not individual pages (at least
on 64-bit machines; else address space size is going to be a problem).
However, without a fix for the critical-section/atomic-update problem,
the idea's still going nowhere.

> This would be pretty similar to copy-on-write, except without the
> copying.  It would just be forget-from-the-buffer-pool-on-write.

That might possibly work.
        regards, tom lane



From:
James Bottomley
Date:

On Tue, 2014-01-14 at 10:39 -0500, Tom Lane wrote:
> James Bottomley <> writes:
> > The current mechanism for coherency between a userspace cache and the
> > in-kernel page cache is mmap ... that's the only way you get the same
> > page in both currently.
> 
> Right.
> 
> > glibc used to have an implementation of read/write in terms of mmap, so
> > it should be possible to insert it into your current implementation
> > without a major rewrite.  The problem I think this brings you is
> > uncontrolled writeback: you don't want dirty pages to go to disk until
> > you issue a write()
> 
> Exactly.
> 
> > I think we could fix this with another madvise():
> > something like MADV_WILLUPDATE telling the page cache we expect to alter
> > the pages again, so don't be aggressive about cleaning them.
> 
> "Don't be aggressive" isn't good enough.  The prohibition on early write
> has to be absolute, because writing a dirty page before we've done
> whatever else we need to do results in a corrupt database.  It has to
> be treated like a write barrier.
> 
> > The problem is we can't give you absolute control of when pages are
> > written back because that interface can be used to DoS the system: once
> > we get too many dirty uncleanable pages, we'll thrash looking for memory
> > and the system will livelock.
> 
> Understood, but that makes this direction a dead end.  We can't use
> it if the kernel might decide to write anyway.

No, I'm sorry, that's never going to be possible.  No user space
application has all the facts.  If we give you an interface to force
unconditional holding of dirty pages in core you'll livelock the system
eventually because you made a wrong decision to hold too many dirty
pages.   I don't understand why this has to be absolute: if you advise
us to hold the pages dirty and we do up until it becomes a choice to
hold on to the pages or to thrash the system into a livelock, why would
you ever choose the latter?  And if, as I'm assuming, you never would,
why don't you want the kernel to make that choice for you?

James




From:
James Bottomley
Date:

On Tue, 2014-01-14 at 11:48 -0500, Robert Haas wrote:
> On Tue, Jan 14, 2014 at 11:44 AM, James Bottomley
> <> wrote:
> > No, I'm sorry, that's never going to be possible.  No user space
> > application has all the facts.  If we give you an interface to force
> > unconditional holding of dirty pages in core you'll livelock the system
> > eventually because you made a wrong decision to hold too many dirty
> > pages.   I don't understand why this has to be absolute: if you advise
> > us to hold the pages dirty and we do up until it becomes a choice to
> > hold on to the pages or to thrash the system into a livelock, why would
> > you ever choose the latter?  And if, as I'm assuming, you never would,
> > why don't you want the kernel to make that choice for you?
> 
> If you don't understand how write-ahead logging works, this
> conversation is going nowhere.  Suffice it to say that the word
> "ahead" is not optional.

No, I do ... you mean the order of write out, if we have to do it, is
important.  In the rest of the kernel, we do this with barriers which
causes ordered grouping of I/O chunks.  If we could force a similar
ordering in the writeout code, is that enough?

James





From:
James Bottomley
Date:

On Tue, 2014-01-14 at 15:15 -0200, Claudio Freire wrote:
> On Tue, Jan 14, 2014 at 2:12 PM, Robert Haas <> wrote:
> >
> > In terms of avoiding double-buffering, here's my thought after reading
> > what's been written so far.  Suppose we read a page into our buffer
> > pool.  Until the page is clean, it would be ideal for the mapping to
> > be shared between the buffer cache and our pool, sort of like
> > copy-on-write.  That way, if we decide to evict the page, it will
> > still be in the OS cache if we end up needing it again (remember, the
> > OS cache is typically much larger than our buffer pool).  But if the
> > page is dirtied, then instead of copying it, just have the buffer pool
> > forget about it, because at that point we know we're going to write
> > the page back out anyway before evicting it.
> >
> > This would be pretty similar to copy-on-write, except without the
> > copying.  It would just be forget-from-the-buffer-pool-on-write.
> 
> 
> But... either copy-on-write or forget-on-write needs a page fault, and
> thus a page mapping.
> 
> Is a page fault more expensive than copying 8k?
> 
> (I really don't know).

A page fault can be expensive, yes ... but perhaps you don't need one. 

What you want is a range of memory that's read from a file but treated
as anonymous for writeout (i.e. written to swap if we need to reclaim
it).  Then at some time later, you want to designate it as written back
to the file instead so you control the writeout order.  I'm not sure we
can do this: the separation between file backed and anonymous pages is
pretty deeply ingrained into the OS, but if it were possible, is that
what you want?

James





From:
Jeff Janes
Date:

On Mon, Jan 13, 2014 at 6:44 PM, Dave Chinner <> wrote:
> On Tue, Jan 14, 2014 at 02:26:25AM +0100, Andres Freund wrote:
> > On 2014-01-13 17:13:51 -0800, James Bottomley wrote:
> > > a file into a user provided buffer, thus obtaining a page cache entry
> > > and a copy in their userspace buffer, then insert the page of the user
> > > buffer back into the page cache as the page cache page ... that's right,
> > > isn't it postgres people?
> >
> > Pretty much, yes. We'd probably hint (*advise(DONTNEED)) that the page
> > isn't needed anymore when reading. And we'd normally write if the page
> > is dirty.
>
> So why, exactly, do you even need the kernel page cache here?

We don't need it, but it would be nice.

> You've
> got direct access to the copy of data read into userspace, and you
> want direct control of when and how the data in that buffer is
> written and reclaimed. Why push that data buffer back into the
> kernel and then have to add all sorts of kernel interfaces to
> control the page you already have control of?

Say 25% of the RAM is dedicated to the database's shared buffers, and
75% is left to the kernel's judgement.  It sure would be nice if the
kernel had the capability of using some of that 75% for database pages,
if it thought that that was the best use for it.

Which is what we do get now, at the expense of quite a lot of double
buffering (by which I mean, a lot of pages are both in the kernel cache
and the database cache--not just transiently during the copy process,
but for quite a while).  If we had the ability to re-inject clean pages
into the kernel's cache, we would get that benefit without the double
buffering.

Cheers,

Jeff
From:
Claudio Freire
Date:

On Tue, Jan 14, 2014 at 2:17 PM, Robert Haas <> wrote:
> On Tue, Jan 14, 2014 at 12:15 PM, Claudio Freire <> wrote:
>> On Tue, Jan 14, 2014 at 2:12 PM, Robert Haas <> wrote:
>>> In terms of avoiding double-buffering, here's my thought after reading
>>> what's been written so far.  Suppose we read a page into our buffer
>>> pool.  Until the page is clean, it would be ideal for the mapping to
>>> be shared between the buffer cache and our pool, sort of like
>>> copy-on-write.  That way, if we decide to evict the page, it will
>>> still be in the OS cache if we end up needing it again (remember, the
>>> OS cache is typically much larger than our buffer pool).  But if the
>>> page is dirtied, then instead of copying it, just have the buffer pool
>>> forget about it, because at that point we know we're going to write
>>> the page back out anyway before evicting it.
>>>
>>> This would be pretty similar to copy-on-write, except without the
>>> copying.  It would just be forget-from-the-buffer-pool-on-write.
>>
>> But... either copy-on-write or forget-on-write needs a page fault, and
>> thus a page mapping.
>>
>> Is a page fault more expensive than copying 8k?
>
> I don't know either.  I wasn't thinking so much that it would save CPU
> time as that it would save memory.  Consider a system with 32GB of
> RAM.  If you set shared_buffers=8GB, then in the worst case you've got
> 25% of your RAM wasted storing pages that already exist, dirtied, in
> shared_buffers.  It's easy to imagine scenarios in which that results
> in lots of extra I/O, so that the CPU required to do the accounting
> comes to seem cheap by comparison.

Not necessarily: you pay the CPU cost on each page fault (ie: on the
first write to the buffer, at least), every time the page checks into
the shared-buffers level.

It's like a tiered cache.

When promoting is expensive, one must be careful. The traffic to/from
the L0 (shared buffers) and L1 (page cache) will be considerable, even
if everything fits in RAM.

I guess it's the constant battle between inclusive and exclusive caches.



From:
Stephen Frost
Date:

* Claudio Freire () wrote:
> On Tue, Jan 14, 2014 at 2:17 PM, Robert Haas <> wrote:
> > I don't know either.  I wasn't thinking so much that it would save CPU
> > time as that it would save memory.  Consider a system with 32GB of
> > RAM.  If you set shared_buffers=8GB, then in the worst case you've got
> > 25% of your RAM wasted storing pages that already exist, dirtied, in
> > shared_buffers.  It's easy to imagine scenarios in which that results
> > in lots of extra I/O, so that the CPU required to do the accounting
> > comes to seem cheap by comparison.
>
> Not necessarily, you pay the CPU cost on each page fault (ie: first
> write to the buffer at least), each time the page checks into the
> shared buffers level.

I'm really not sure that this is a real issue for us, but if it is,
perhaps having this as an option for each read() call would work..?
That is to say, rather than have this be an open() flag or similar, it's
normal read() with a flags field where we could decide when we want
pages to be write-protected this way and when we don't (perhaps because
we know we're about to write to them).

I'm not 100% sure it'd be easy for us to manage that flag perfectly, but
it's our issue and it'd be on us to deal with as the kernel can't
possibly guess our intentions.

There were concerns brought up earlier that such a zero-copy-read option
wouldn't be performant, though, and I'm curious to hear more about those
and whether we could avoid the performance issues by managing the
zero-copy-read case ourselves as Robert suggests.

Thanks,
    Stephen

From:
Claudio Freire
Date:

On Tue, Jan 14, 2014 at 2:39 PM, Robert Haas <> wrote:
> On Tue, Jan 14, 2014 at 12:20 PM, James Bottomley
> <> wrote:
>> On Tue, 2014-01-14 at 15:15 -0200, Claudio Freire wrote:
>>> On Tue, Jan 14, 2014 at 2:12 PM, Robert Haas <> wrote:
>>> > In terms of avoiding double-buffering, here's my thought after reading
>>> > what's been written so far.  Suppose we read a page into our buffer
>>> > pool.  Until the page is clean, it would be ideal for the mapping to
>>> > be shared between the buffer cache and our pool, sort of like
>>> > copy-on-write.  That way, if we decide to evict the page, it will
>>> > still be in the OS cache if we end up needing it again (remember, the
>>> > OS cache is typically much larger than our buffer pool).  But if the
>>> > page is dirtied, then instead of copying it, just have the buffer pool
>>> > forget about it, because at that point we know we're going to write
>>> > the page back out anyway before evicting it.
>>> >
>>> > This would be pretty similar to copy-on-write, except without the
>>> > copying.  It would just be forget-from-the-buffer-pool-on-write.
>>>
>>> But... either copy-on-write or forget-on-write needs a page fault, and
>>> thus a page mapping.
>>>
>>> Is a page fault more expensive than copying 8k?
>>>
>>> (I really don't know).
>>
>> A page fault can be expensive, yes ... but perhaps you don't need one.
>>
>> What you want is a range of memory that's read from a file but treated
>> as anonymous for writeout (i.e. written to swap if we need to reclaim
>> it). Then at some time later, you want to designate it as written back
>> to the file instead so you control the writeout order.  I'm not sure we
>> can do this: the separation between file backed and anonymous pages is
>> pretty deeply ingrained into the OS, but if it were possible, is that
>> what you want?
>
> Doesn't sound exactly like what I had in mind.  What I was suggesting
> is an analogue of read() that, if it reads full pages of data to a
> page-aligned address, shares the data with the buffer cache until it's
> first written instead of actually copying the data.  The pages are
> write-protected so that an attempt to write the address range causes a
> page fault.  In response to such a fault, the pages become anonymous
> memory and the buffer cache no longer holds a reference to the page.


Yes, that's basically zero-copy reads.

It could be done. The kernel can remap the page to the physical page
holding the shared buffer and mark it read-only, then expire the
buffer and transfer ownership of the page if any page fault happens.

But that incurs:
- Page faults, lots
- Hugely bloated mappings, unless KSM is somehow leveraged for this

And there's a nice bingo. Had forgotten about KSM. KSM could help lots.

I could try to see if madvising shared_buffers as mergeable helps. But
this should be an automatic case of KSM - ie, when reading into a
page-aligned address, the kernel should summarily apply KSM-style
sharing without hinting. The current madvise interface puts the burden
of figuring out what duplicates what on the kernel, but postgres
already knows.
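That experiment is cheap to try (a sketch; MADV_MERGEABLE is Linux-specific, and madvise() fails with EINVAL on kernels built without CONFIG_KSM):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

#define BUF_BYTES (1024 * 1024)

/* Hint that a shared_buffers-like region is a KSM merge candidate:
 * the kernel's ksmd thread then scans it and collapses identical
 * pages into a single copy-on-write page. */
static int mark_mergeable(void *buf, size_t len)
{
    return madvise(buf, len, MADV_MERGEABLE);
}
```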



From:
Jan Kara
Date:

On Tue 14-01-14 06:42:43, Kevin Grittner wrote:
> First off, I want to give a +1 on everything in the recent posts
> from Heikki and Hannu.
> 
> Jan Kara <> wrote:
> 
> > Now the aging of pages marked as volatile as it is currently
> > implemented needn't be perfect for your needs but you still have
> > time to influence what gets implemented... Actually developers of
> > the vrange() syscall were specifically looking for some ideas
> > what to base aging on. Currently I think it is first marked -
> > first evicted.
> 
> The "first marked - first evicted" seems like what we would want. 
> The ability to "unmark" and have the page no longer be considered
> preferred for eviction would be very nice.  That seems to me like
> it would cover the multiple layers of buffering *clean* pages very
> nicely (although I know nothing more about vrange() than what has
> been said on this thread, so I could be missing something).

Here: http://www.spinics.net/lists/linux-mm/msg67328.html is an email
which introduces the syscall.  As you say, it might be a reasonable fit
for your problems with double caching of clean pages.

> The other side of that is related avoiding multiple writes of the
> same page as much as possible, while avoid write gluts.  The issue
> here is that PostgreSQL tries to hang on to dirty pages for as long
> as possible before "writing" them to the OS cache, while the OS
> tries to avoid writing them to storage for as long as possible
> until they reach a (configurable) threshold or are fsync'd.  The
> problem is that a under various conditions PostgreSQL may need to
> write and fsync a lot of dirty pages it has accumulated in a short
> time.  That has an "avalanche" effect, creating a "write glut"
> which can stall all I/O for a period of many seconds up to a few
> minutes.  If the OS was aware of the dirty pages pending write in
> the application, and counted those for purposes of calculating when
> and how much to write, the glut could be avoided.  Currently,
> people configure the PostgreSQL background writer to be very
> aggressive, configure a small PostgreSQL shared_buffers setting,
> and/or set the OS thresholds low enough to minimize the problem;
> but all of these mitigation strategies have their own costs.
> 
> A new hint that the application has dirtied a page could be used by
> the OS to improve things this way:  When the OS is notified that a
> page is dirty, it takes action depending on whether the page is
> considered dirty by the OS.  If it is not dirty, the page is
> immediately discarded from the OS cache.  It is known that the
> application has a modified version of the page that it intends to
> write, so the version in the OS cache has no value.  We don't want
> this page forcing eviction of vrange()-flagged pages.  If it is
> dirty, any write ordering to storage by the OS based on when the
> page was written to the OS would be pushed back as far as possible
> without crossing any write barriers, in hopes that the writes could
> be combined.  Either way, this page is counted toward dirty pages
> for purposes of calculating how much to write from the OS to
> storage, and the later write of the page doesn't redundantly add to
> this number.

The evict-if-clean part is easy.  That could easily be a new fadvise()
option - btw. note that POSIX_FADV_DONTNEED has quite a close meaning.
Only that it also starts writeback on a dirty page if the backing device
isn't congested, which is somewhat contrary to what you want to achieve.
But I'm not sure the eviction would be a clear win, since the filesystem
then has to re-create the mapping from logical file block to disk block
(it is cached in the page) and that potentially needs to go to disk to
fetch the mapping data.

I have a hard time thinking how we would implement pushing back writeback
of a particular page (or better set of pages). When we need to write pages
because we are nearing dirty_bytes limit, we likely want to write these
marked pages anyway to make as many pages freeable as possible. So the only
thing we could do is to ignore these pages during periodic writeback and
I'm not sure that would make a big difference.

Just to get some idea about the sizes - how large are the checkpoints we
are talking about that cause IO stalls?
                            Honza

-- 
Jan Kara <>
SUSE Labs, CR



From:
Gavin Flower
Date:

On 14/01/14 14:09, Dave Chinner wrote:
> On Mon, Jan 13, 2014 at 09:29:02PM +0000, Greg Stark wrote:
>> On Mon, Jan 13, 2014 at 9:12 PM, Andres Freund <> wrote:
[...]
>> The more ambitious and interesting direction is to let Postgres tell
>> the kernel what it needs to know to manage everything. To do that we
>> would need the ability to control when pages are flushed out. This is
>> absolutely necessary to maintain consistency. Postgres would need to
>> be able to mark pages as unflushable until some point in time in the
>> future when the journal is flushed. We discussed various ways that
>> interface could work but it would be tricky to keep it low enough
>> overhead to be workable.
>
> IMO, the concept of allowing userspace to pin dirty page cache
> pages in memory is just asking for trouble. Apart from the obvious
> memory reclaim and OOM issues, some filesystems won't be able to
> move their journals forward until the data is flushed. i.e. ordered
> mode data writeback on ext3 will have all sorts of deadlock issues
> that result from pinning pages and then issuing fsync() on another
> file which will block waiting for the pinned pages to be flushed.
>
> Indeed, what happens if you do pin_dirty_pages(fd); .... fsync(fd);?
> If fsync() blocks because there are pinned pages, and there's no
> other thread to unpin them, then that code just deadlocked. If
> fsync() doesn't block and skips the pinned pages, then we haven't
> done an fsync() at all, and so violated the expectation that users
> have that after fsync() returns their data is safe on disk. And if
> we return an error to fsync(), then what the hell does the user do
> if it is some other application we don't know about that has pinned
> the pages? And if the kernel unpins them after some time, then we
> just violated the application's consistency guarantees....

[...]

What if Postgres could tell the kernel how strongly it wanted to hold on
to the pages?

Say a byte (this is arbitrary; it could be a single hint bit which meant
"please, Please, PLEASE don't flush, if that is okay with you Mr
Kernel..."), so strength would be S = (unsigned byte value)/256, so
0 <= S < 1.

S = 0      flush now.
0 < S < 1  flush if the 'need' is greater than the S
S = 1      never flush (note a value of 1 cannot occur, as max S = 255/256)

Postgres could use low non-zero S values if it thinks that pages *might*
still be useful later, and very high values when it is *more certain*.
I am sure Postgres must sometimes know when some pages are more important
to hold onto than others, hence my feeling that S should be more than one
bit.

The kernel might simply flush pages starting at ones with low values of
S, working upwards until it has freed enough memory to resolve its memory
pressure, so an explicit numerical value of 'need' (as implied above) is
not required.  Also, any practical implementation would not use 'S' as a
float/double, but use integer values for 'S' & 'need' - assuming that
'need' did have to be an actual value, which I suspect would not be
required.

This way the kernel is free to flush all such pages when sufficient need
arises - yet usually, when there is sufficient memory, the pages will be
held unflushed.

Cheers,
Gavin
From:
Robert Haas
Date:

On Tue, Jan 14, 2014 at 1:37 PM, Jan Kara <> wrote:
> Just to get some idea about the sizes - how large are the checkpoints we
> are talking about that cause IO stalls?

Big.  Potentially, we might have dirtied all of shared_buffers and
then started evicting pages from there to the OS buffer pool and
dirtied as much memory as the OS will allow, and then the OS might
have started writeback and filled up all the downstream caches between
the OS and the disk.  And then, just then, the checkpoint hits.

I dunno what a typical checkpoint size is but I don't think you'll be
exaggerating much if you imagine that everything that could possibly
be dirty is.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



From:
Robert Haas
Date:

On Tue, Jan 14, 2014 at 2:03 PM, Gavin Flower
<> wrote:
> Say a byte (this is arbitrary, it could be a single hint bit which meant
> "please, Please, PLEASE don't flush, if that is okay with you Mr
> Kernel..."), so strength would be S = (unsigned byte value)/256, so 0 <= S <
> 1.
>
> S = 0      flush now.
> 0 < S < 1  flush if the 'need' is greater than the S
> S = 1      never flush (note a value of 1 cannot occur, as max S = 255/256)
>
> Postgres could use low non-zero S values if it thinks that pages might still
> be useful later, and very high values when it is more certain.  I am sure
> Postgres must sometimes know when some pages are more important to hold onto
> than others, hence my feeling that S should be more than one bit.
>
> The kernel might simply flush pages starting at ones with low values of S
> working upwards until it has freed enough memory to resolve its memory
> pressure.  So an explicit numerical value of 'need' (as implied above) is
> not required.  Also any practical implementation would not use 'S' as a
> float/double, but use integer values for 'S' & 'need' - assuming that 'need'
> did have to be an actual value, which I suspect would not be required.
>
> This way the kernel is free to flush all such pages, when sufficient need
> arises - yet usually, when there is sufficient memory, the pages will be
> held unflushed.

Well, this just begs the question of what value PG ought to pass as
the parameter.

I think the alternate don't-need semantics (we don't think we need
this but please don't throw it away arbitrarily if there's no memory
pressure) would be a big win.  I don't think we know enough in user
space to be more precise than that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



From:
Jan Kara
Date:

On Tue 14-01-14 10:04:16, Robert Haas wrote:
> On Tue, Jan 14, 2014 at 5:00 AM, Jan Kara <> wrote:
> > I thought that instead of injecting pages into pagecache for aging as you
> > describe in 3), you would mark pages as volatile (i.e. for reclaim by
> > kernel) through vrange() syscall. Next time you need the page, you check
> > whether the kernel reclaimed the page or not. If yes, you reload it from
> > disk, if not, you unmark it and use it.
> >
> > Now the aging of pages marked as volatile as it is currently implemented
> > needn't be perfect for your needs but you still have time to influence what
> > gets implemented... Actually developers of the vrange() syscall were
> > specifically looking for some ideas what to base aging on. Currently I
> > think it is first marked - first evicted.
> 
> This is an interesting idea but it stinks of impracticality.
> Essentially when the last buffer pin on a page is dropped we'd have to
> mark it as discardable, and then the next person wanting to pin it
> would have to check whether it's still there.  But the system call
> overhead of calling vrange() every time the last pin on a page was
> dropped would probably hose us.
> 
> *thinks*
> 
> Well, I guess it could be done lazily: make periodic sweeps through
> shared_buffers, looking for pages that haven't been touched in a
> while, and vrange() them.  That's quite a bit of new mechanism, but in
> theory it could work out to a win.  vrange() would have to scale well
> to millions of separate ranges, though.  Will it?

It is intended to be rather lightweight, so I believe millions should be
OK. But I didn't try :).

> And a lot depends on whether the kernel makes the right decision about
> whether to chuck data from our vrange() vs. any other page it could have
> reclaimed.

I think the intent is to reclaim pages in the following order:
used-once pages -> volatile pages -> active pages, swapping.
                            Honza
-- 
Jan Kara <>
SUSE Labs, CR



From:
Stephen Frost
Date:

* Robert Haas () wrote:
> I dunno what a typical checkpoint size is but I don't think you'll be
> exaggerating much if you imagine that everything that could possibly
> be dirty is.

This is not uncommon for us, at least:

checkpoint complete: wrote 425844 buffers (20.3%); 0 transaction log
file(s) added, 0 removed, 249 recycled; write=175.535 s, sync=17.428 s,
total=196.357 s; sync files=1011, longest=2.675 s, average=0.017 s

That's a checkpoint writing out 20% of 16GB, or over 3GB, and that's
just from one of the four postmasters running- we get this kind of
checkpointing happening on all of them.  All told, it's easy for us to
want to write over 12GB during a single checkpoint period on this box.
(checkpoint_timeout is 5m, checkpoint_completion_target is 0.9).

Thankfully, the box has 256G of RAM and so the shared buffers only use
up 25% of the RAM in the box. :)

I'm sure others could post larger numbers.
Thanks,
    Stephen

From:
Kevin Grittner
Date:

Robert Haas <> wrote:
> Jan Kara <> wrote:
>
>> Just to get some idea about the sizes - how large are the
>> checkpoints we are talking about that cause IO stalls?
>
> Big.

To quantify that, in a production setting we were seeing pauses of
up to two minutes with shared_buffers set to 8GB and default dirty
page settings for Linux, on a machine with 256GB RAM and 512MB
non-volatile cache on the RAID controller.  To eliminate stalls we
had to drop shared_buffers to 2GB (to limit how many dirty pages
could build up out-of-sight from the OS), spread checkpoints to 90%
of allowed time (almost no gap between finishing one checkpoint and
starting the next) and crank up the background writer so that no
dirty page sat unwritten in PostgreSQL shared_buffers for more than
4 seconds. Less aggressive pushing to the OS resulted in the
avalanche of writes I previously described, with the corresponding
I/O stalls.  We approached that incrementally, and that's the point
where stalls stopped occurring.  We did not adjust the OS
thresholds for writing dirty pages, although I know of others who
have had to do so.
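As a rough illustration, the mitigation described above maps to a postgresql.conf fragment along these lines (a sketch; the exact background-writer values needed to keep pages from sitting dirty more than ~4 seconds depend on the write rate):

```
shared_buffers = 2GB                  # limit dirty pages hidden from the OS
checkpoint_timeout = 5min
checkpoint_completion_target = 0.9    # spread writes over 90% of the interval
bgwriter_delay = 200ms                # wake the background writer often
bgwriter_lru_maxpages = 1000          # let it write aggressively
bgwriter_lru_multiplier = 4.0
```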

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



From:
Robert Haas
Date:

On Tue, Jan 14, 2014 at 3:00 PM, James Bottomley
<> wrote:
>> Doesn't sound exactly like what I had in mind.  What I was suggesting
>> is an analogue of read() that, if it reads full pages of data to a
>> page-aligned address, shares the data with the buffer cache until it's
>> first written instead of actually copying the data.
>
> The only way to make this happen is mmap the file to the buffer and use
> MADV_WILLNEED.
>
>>   The pages are
>> write-protected so that an attempt to write the address range causes a
>> page fault.  In response to such a fault, the pages become anonymous
>> memory and the buffer cache no longer holds a reference to the page.
>
> OK, so here I thought of another madvise() call to switch the region to
> anonymous memory.  A page fault works too, of course, it's just that one
> per page in the mapping will be expensive.

I don't think either of these ideas works for us.  We start by
creating a chunk of shared memory that all processes (we do not use
threads) will have mapped at a common address, and we read() and
write() into that chunk.

> Do you care about handling aliases ... what happens if someone else
> reads from the file, or will that never occur?  The reason for asking is
> that it's much easier if someone else mmapping the file gets your
> anonymous memory than we create an alias in the page cache.

All reads and writes go through the buffer pool stored in shared
memory, but any of the processes that have that shared memory region
mapped could be responsible for any individual I/O request.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



From:
Kevin Grittner
Date:

James Bottomley <> wrote:

> About how many files comprise this cache?  Are you thinking it's
> too difficult for every process to map the files?

The shared_buffers area can map anywhere from about 200 files to
millions of files, representing a total space of about 6MB on the
low end to over 100TB on the high end.  For many workloads
performance falls off above a shared_buffers size of about 8GB,
although for data warehousing environments larger sizes sometimes
work out, and to avoid write gluts it must often be limited to 1GB
to 1GB.

Data access is in fixed-sized pages, normally of 8KB each.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



From:
Kevin Grittner
Date:

I wrote:

> to avoid write gluts it must often be limited to 1GB to 1GB.

That should have been "1GB to 2GB."



From:
Kevin Grittner
Date:

James Bottomley <> wrote:

>> We start by creating a chunk of shared memory that all processes
>> (we do not use threads) will have mapped at a common address,
>> and we read() and write() into that chunk.
>
> Yes, that's what I was thinking: it's a cache.  About how many
> files comprise this cache?  Are you thinking it's too difficult
> for every process to map the files?

It occurred to me that I don't remember seeing any indication of
how many processes we're talking about.  There is one process per
database connection, plus some administrative processes, like the
checkpoint process and the background writer.  At the low end,
about 10 processes would be connected to the shared memory.  The
highest I've personally seen is about 3000; I don't know how far
above that people might try to push it.  I always recommend a
connection pool to limit the number of database connections to
something near ((2 * core count) + effective spindle count), since
that's where I typically see best performance; but people don't
always do that.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



From:
Kevin Grittner
Date:

Dave Chinner <> wrote:

> Essentially, changing dirty_background_bytes, dirty_bytes and
> dirty_expire_centiseconds to be much smaller should make the
> kernel start writeback much sooner and so you shouldn't have to
> limit the amount of buffers the application has to prevent major
> fsync triggered stalls...

Is there any "rule of thumb" about where to start with these?  For
example, should a database server maybe have dirty_background_bytes
set to 75% of the non-volatile write cache present on the
controller, in an attempt to make sure that there is always some
"slack" space for writes?

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



From:
James Bottomley
Date:

On Tue, 2014-01-14 at 12:39 -0500, Robert Haas wrote:
> On Tue, Jan 14, 2014 at 12:20 PM, James Bottomley
> <> wrote:
> > On Tue, 2014-01-14 at 15:15 -0200, Claudio Freire wrote:
> >> On Tue, Jan 14, 2014 at 2:12 PM, Robert Haas <> wrote:
> >> > In terms of avoiding double-buffering, here's my thought after reading
> >> > what's been written so far.  Suppose we read a page into our buffer
> >> > pool.  Until the page is clean, it would be ideal for the mapping to
> >> > be shared between the buffer cache and our pool, sort of like
> >> > copy-on-write.  That way, if we decide to evict the page, it will
> >> > still be in the OS cache if we end up needing it again (remember, the
> >> > OS cache is typically much larger than our buffer pool).  But if the
> >> > page is dirtied, then instead of copying it, just have the buffer pool
> >> > forget about it, because at that point we know we're going to write
> >> > the page back out anyway before evicting it.
> >> >
> >> > This would be pretty similar to copy-on-write, except without the
> >> > copying.  It would just be forget-from-the-buffer-pool-on-write.
> >>
> >> But... either copy-on-write or forget-on-write needs a page fault, and
> >> thus a page mapping.
> >>
> >> Is a page fault more expensive than copying 8k?
> >>
> >> (I really don't know).
> >
> > A page fault can be expensive, yes ... but perhaps you don't need one.
> >
> > What you want is a range of memory that's read from a file but treated
> > as anonymous for writeout (i.e. written to swap if we need to reclaim
> > it). Then at some time later, you want to designate it as written back
> > to the file instead so you control the writeout order.  I'm not sure we
> > can do this: the separation between file backed and anonymous pages is
> > pretty deeply ingrained into the OS, but if it were possible, is that
> > what you want?
> 
> Doesn't sound exactly like what I had in mind.  What I was suggesting
> is an analogue of read() that, if it reads full pages of data to a
> page-aligned address, shares the data with the buffer cache until it's
> first written instead of actually copying the data.

The only way to make this happen is mmap the file to the buffer and use
MADV_WILLNEED.

>   The pages are
> write-protected so that an attempt to write the address range causes a
> page fault.  In response to such a fault, the pages become anonymous
> memory and the buffer cache no longer holds a reference to the page.

OK, so here I thought of another madvise() call to switch the region to
anonymous memory.  A page fault works too, of course, it's just that one
per page in the mapping will be expensive.

Do you care about handling aliases ... what happens if someone else
reads from the file, or will that never occur?  The reason for asking is
that it's much easier if someone else mmapping the file gets your
anonymous memory than we create an alias in the page cache.

James





From:
James Bottomley
Date:

On Tue, 2014-01-14 at 15:09 -0500, Robert Haas wrote:
> On Tue, Jan 14, 2014 at 3:00 PM, James Bottomley
> <> wrote:
> >> Doesn't sound exactly like what I had in mind.  What I was suggesting
> >> is an analogue of read() that, if it reads full pages of data to a
> >> page-aligned address, shares the data with the buffer cache until it's
> >> first written instead of actually copying the data.
> >
> > The only way to make this happen is mmap the file to the buffer and use
> > MADV_WILLNEED.
> >
> >>   The pages are
> >> write-protected so that an attempt to write the address range causes a
> >> page fault.  In response to such a fault, the pages become anonymous
> >> memory and the buffer cache no longer holds a reference to the page.
> >
> > OK, so here I thought of another madvise() call to switch the region to
> > anonymous memory.  A page fault works too, of course, it's just that one
> > per page in the mapping will be expensive.
> 
> I don't think either of these ideas works for us.  We start by
> creating a chunk of shared memory that all processes (we do not use
> threads) will have mapped at a common address, and we read() and
> write() into that chunk.

Yes, that's what I was thinking: it's a cache.  About how many files
comprise this cache?  Are you thinking it's too difficult for every
process to map the files?

> > Do you care about handling aliases ... what happens if someone else
> > reads from the file, or will that never occur?  The reason for asking is
> > that it's much easier if someone else mmapping the file gets your
> > anonymous memory than we create an alias in the page cache.
> 
> All reads and writes go through the buffer pool stored in shared
> memory, but any of the processes that have that shared memory region
> mapped could be responsible for any individual I/O request.

That seems to be possible with the abstraction.  The initial mapping
gets the file backed pages: you can do madvise to read them (using
readahead), flush them (using wontneed) and flip them to anonymous
(using something TBD).  Since it's a shared mapping API based on the
file, any of the mapping processes can do any operation.  Future mappers
of the file get the mix of real and anon memory, so it's truly shared.

Given that you want to use this as a shared cache, it seems that the API
to flip back from anon to file mapped is wontneed.  That would also
trigger writeback of any dirty pages in the previously anon region ...
which you could force with msync.  As far as I can see, this is
identical to read/write on a shared region with the exception that you
don't need to copy in and out of the page cache.

From our point of view, the implementation is nice because the pages
effectively never leave the page cache.  We just use an extra per page
flag (which I'll get shot for suggesting) to alter the writeout path
(which is where the complexity which may kill the implementation is).

James





From:
Dave Chinner
Date:

On Tue, Jan 14, 2014 at 09:40:48AM -0500, Robert Haas wrote:
> On Mon, Jan 13, 2014 at 5:26 PM, Mel Gorman <> wrote:
> >> Amen to that.  Actually, I think NUMA can be (mostly?) fixed by
> >> setting zone_reclaim_mode; is there some other problem besides that?
> >
> > Really?
> >
> > zone_reclaim_mode is often a complete disaster unless the workload is
> > partitioned to fit within NUMA nodes. On older kernels enabling it would
> > sometimes cause massive stalls. I'm actually very surprised to hear it
> > fixes anything and would be interested in hearing more about what sort
> > of circumstances would convince you to enable that thing.
> 
> By "set" I mean "set to zero".  We've seen multiple instances of
> people complaining about large amounts of system memory going unused
> because this setting defaulted to 1.
> 
> >> The other thing that comes to mind is the kernel's caching behavior.
> >> We've talked a lot over the years about the difficulties of getting
> >> the kernel to write data out when we want it to and to not write data
> >> out when we don't want it to.
> >
> > Is sync_file_range() broken?
> 
> I don't know.  I think a few of us have played with it and not been
> able to achieve a clear win.

Before you go back down the sync_file_range path, keep in mind that
it is not a guaranteed data integrity operation: it does not force
device cache flushes like fsync()/fdatasync(). Hence it guarantees
neither that the metadata that points at the written data nor that
the volatile caches in the storage path have been flushed...

IOWs, using sync_file_range() does not avoid the need to fsync() a
file for data integrity purposes...

> Whether the problem is with the system
> call or the programmer is harder to determine.  I think the problem is
> in part that it's not exactly clear when we should call it.  So
> suppose we want to do a checkpoint.  What we used to do a long time
> ago is write everything, and then fsync it all, and then call it good.
>  But that produced horrible I/O storms.  So what we do now is do the
> writes over a period of time, with sleeps in between, and then fsync
> it all at the end, hoping that the kernel will write some of it before
> the fsyncs arrive so that we don't get a huge I/O spike.
> And that sorta works, and it's definitely better than doing it all at
> full speed, but it's pretty imprecise.  If the kernel doesn't write
> enough of the data out in advance, then there's still a huge I/O storm
> when we do the fsyncs and everything grinds to a halt.  If it writes
> out more data than needed in advance, it increases the total number of
> physical writes because we get less write-combining, and that hurts
> performance, too. 

Yup, the kernel defaults to maximising bulk write throughput, which
means it waits to the last possible moment to issue write IO. And
that's exactly to maximise write combining, optimise delayed
allocation, etc. There are many good reasons for doing this, and for
the majority of workloads it is the right behaviour to have.

It sounds to me like you want the kernel to start background
writeback earlier so that it doesn't build up as much dirty data
before you require a flush. There are several ways to do this by
tweaking writeback knobs. The simplest is probably just to set
/proc/sys/vm/dirty_background_bytes to an appropriate threshold (say
50MB) and dirty_expire_centiseconds to a few seconds so that
background writeback starts and walks all dirty inodes almost
immediately. This will keep a steady stream of low level background
IO going, and fsync should then not take very long.
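For concreteness, that suggestion corresponds to something like the following (values illustrative; requires root, and the byte-based knobs override their *_ratio counterparts):

```
# Start background writeback once ~50MB of dirty data accumulates,
# and consider dirty data expired after ~3 seconds.
sysctl -w vm.dirty_background_bytes=52428800
sysctl -w vm.dirty_expire_centisecs=300
```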

Fundamentally, though, we need bug reports from people seeing these
problems when they see them so we can diagnose them on their
systems. Trying to discuss/diagnose these problems without knowing
anything about the storage, the kernel version, writeback
thresholds, etc really doesn't work because we can't easily
determine a root cause.

Cheers,

Dave.
-- 
Dave Chinner




From:
Dave Chinner
Date:

On Tue, Jan 14, 2014 at 11:40:38AM -0800, Kevin Grittner wrote:
> Robert Haas <> wrote:
> > Jan Kara <> wrote:
> >
> >> Just to get some idea about the sizes - how large are the
> >> checkpoints we are talking about that cause IO stalls?
> >
> > Big.
> 
> To quantify that, in a production setting we were seeing pauses of
> up to two minutes with shared_buffers set to 8GB and default dirty
                                                       ^^^^^^^^^^^^^
> page settings for Linux, on a machine with 256GB RAM and 512MB
  ^^^^^^^^^^^^^
There's your problem.

By default, background writeback doesn't start until 10% of memory
is dirtied, and on your machine that's 25GB of RAM. That's way too
high for your workload.

It appears to me that we are seeing large memory machines much more
commonly in data centers - a couple of years ago 256GB RAM was only
seen in supercomputers. Hence machines of this size are moving from
"tweaking settings for supercomputers is OK" class to "tweaking
settings for enterprise servers is not OK"....

Perhaps what we need to do is deprecate dirty_ratio and
dirty_background_ratio as the default values, move to the byte-based
values as the defaults, and cap them appropriately - e.g. 10/20% of
RAM for small machines down to a couple of GB for large machines....

> non-volatile cache on the RAID controller.  To eliminate stalls we
> had to drop shared_buffers to 2GB (to limit how many dirty pages
> could build up out-of-sight from the OS), spread checkpoints to 90%
> of allowed time (almost no gap between finishing one checkpoint and
> starting the next) and crank up the background writer so that no
> dirty page sat unwritten in PostgreSQL shared_buffers for more than
> 4 seconds. Less aggressive pushing to the OS resulted in the
> avalanche of writes I previously described, with the corresponding
> I/O stalls.  We approached that incrementally, and that's the point
> where stalls stopped occurring.  We did not adjust the OS
> thresholds for writing dirty pages, although I know of others who
> have had to do so.

Essentially, changing dirty_background_bytes, dirty_bytes and
dirty_expire_centiseconds to be much smaller should make the kernel
start writeback much sooner and so you shouldn't have to limit the
amount of buffers the application has to prevent major fsync
triggered stalls...

Cheers,

Dave.
-- 
Dave Chinner




From:
Dave Chinner
Date:

On Wed, Jan 15, 2014 at 08:03:28AM +1300, Gavin Flower wrote:
> On 14/01/14 14:09, Dave Chinner wrote:
> >On Mon, Jan 13, 2014 at 09:29:02PM +0000, Greg Stark wrote:
> >>On Mon, Jan 13, 2014 at 9:12 PM, Andres Freund <> wrote:
> [...]
> >>The more ambitious and interesting direction is to let Postgres tell
> >>the kernel what it needs to know to manage everything. To do that we
> >>would need the ability to control when pages are flushed out. This is
> >>absolutely necessary to maintain consistency. Postgres would need to
> >>be able to mark pages as unflushable until some point in time in the
> >>future when the journal is flushed. We discussed various ways that
> >>interface could work but it would be tricky to keep it low enough
> >>overhead to be workable.
> >IMO, the concept of allowing userspace to pin dirty page cache
> >pages in memory is just asking for trouble. Apart from the obvious
> >memory reclaim and OOM issues, some filesystems won't be able to
> >move their journals forward until the data is flushed. i.e. ordered
> >mode data writeback on ext3 will have all sorts of deadlock issues
> >that result from pinning pages and then issuing fsync() on another
> >file which will block waiting for the pinned pages to be flushed.
> >
> >Indeed, what happens if you do pin_dirty_pages(fd); .... fsync(fd);?
> >If fsync() blocks because there are pinned pages, and there's no
> >other thread to unpin them, then that code just deadlocked. If
> >fsync() doesn't block and skips the pinned pages, then we haven't
> >done an fsync() at all, and so violated the expectation that users
> >have that after fsync() returns their data is safe on disk. And if
> >we return an error to fsync(), then what the hell does the user do
> >if it is some other application we don't know about that has pinned
> >the pages? And if the kernel unpins them after some time, then we
> >just violated the application's consistency guarantees....
> >
> [...]
> 
> What if Postgres could tell the kernel how strongly that it wanted
> to hold on to the pages?

That doesn't get rid of the problems, it just makes it harder to
diagnose them when they occur. :/

Cheers,

Dave.
-- 
Dave Chinner




From:
Jim Nasby
Date:

On 1/14/14, 11:30 AM, Jeff Janes wrote:
> I think the "reclaim this page if you need memory but leave it resident
> if there is no memory pressure" hint would be more useful for temporary
> working files than for what was being discussed above (shared buffers).
> When I do work that needs large temporary files, I often see physical
> write IO spike but physical read IO does not.  I interpret that to mean
> that the temporary data is being written to disk to satisfy either
> dirty_expire_centisecs or dirty_*bytes, but the data remains in the FS
> cache and so disk reads are not needed to satisfy it.  So a hint that
> says "this file will never be fsynced so please ignore dirty_*bytes and
> dirty_expire_centisecs.  I will need it again relatively soon (but not
> after a reboot), but will do so mostly sequentially, so please don't
> evict this without need, but if you do need to then it is a good
> candidate" would be good.

I also frequently see this, and it has an even larger impact if pgsql_tmp
is on the same filesystem as WAL. Which *theoretically* shouldn't matter
with a BBU controller, except that when the kernel suddenly decides your
*temporary* data needs to hit the media you're screwed.

Though, it also occurs to me... perhaps it would be better for us to
simply map temp objects to memory and let the kernel swap them out if
needed...
-- 
Jim C. Nasby, Data Architect                       
512.569.9461 (cell)                         http://jim.nasby.net



From:
Claudio Freire
Date:

On Tue, Jan 14, 2014 at 9:22 PM, Jim Nasby <> wrote:
> On 1/14/14, 11:30 AM, Jeff Janes wrote:
>>
>> I think the "reclaim this page if you need memory but leave it resident if
>> there is no memory pressure" hint would be more useful for temporary working
>> files than for what was being discussed above (shared buffers).  When I do
>> work that needs large temporary files, I often see physical write IO spike
>> but physical read IO does not.  I interpret that to mean that the temporary
>> data is being written to disk to satisfy either dirty_expire_centisecs or
>> dirty_*bytes, but the data remains in the FS cache and so disk reads are not
>> needed to satisfy it.  So a hint that says "this file will never be fsynced
>> so please ignore dirty_*bytes and dirty_expire_centisecs.  I will need it
>> again relatively soon (but not after a reboot), but will do so mostly
>> sequentially, so please don't evict this without need, but if you do need to
>> then it is a good candidate" would be good.
>
>
> I also frequently see this, and it has an even larger impact if pgsql_tmp is
> on the same filesystem as WAL. Which *theoretically* shouldn't matter with a
> BBU controller, except that when the kernel suddenly decides your
> *temporary* data needs to hit the media you're screwed.
>
> Though, it also occurs to me... perhaps it would be better for us to simply
> map temp objects to memory and let the kernel swap them out if needed...


Oum... bad idea.

Swap logic has very poor taste for I/O patterns.



From:
Jonathan Corbet
Date:

On Wed, 15 Jan 2014 09:23:52 +1100
Dave Chinner <> wrote:

> It appears to me that we are seeing large memory machines much more
> commonly in data centers - a couple of years ago 256GB RAM was only
> seen in supercomputers. Hence machines of this size are moving from
> "tweaking settings for supercomputers is OK" class to "tweaking
> settings for enterprise servers is not OK"....
> 
> Perhaps what we need to do is deprecate dirty_ratio and
> dirty_background_ratio as the default values and move to the byte
> based values as the defaults and cap them appropriately.  e.g.
> 10/20% of RAM for small machines down to a couple of GB for large
> machines....

I had thought that was already in the works...it hits people on far
smaller systems than those described here.
http://lwn.net/Articles/572911/

I wonder if anybody ever finished this work out for 3.14?

jon



From:
Dave Chinner
Date:

On Tue, Jan 14, 2014 at 03:03:39PM -0800, Kevin Grittner wrote:
> Dave Chinner <> wrote:
> 
> > Essentially, changing dirty_background_bytes, dirty_bytes and
> > dirty_expire_centiseconds to be much smaller should make the
> > kernel start writeback much sooner and so you shouldn't have to
> > limit the amount of buffers the application has to prevent major
> > fsync triggered stalls...
> 
> Is there any "rule of thumb" about where to start with these?

There's no absolute rule here, but the threshold for background
writeback needs to consider the amount of dirty data being
generated, the rate at which it can be retired and the checkpoint
period the application is configured with. i.e. it needs to be slow
enough to not cause serious read IO perturbations, but still fast
enough that it avoids peaks at synchronisation points. And most
importantly, it needs to be fast enough that it can complete
writeback of all the dirty data in a checkpoint before the next
checkpoint is triggered.

In general, I find that threshold to be somewhere around 2-5s worth
of data writeback - enough to keep a good amount of write combining
and the IO pipeline full as work is done, but no more.

e.g. if your workload results in writeback rates of 500MB/s, then
I'd be setting the dirty limit somewhere around 1-2GB as an initial
guess. It's basically a simple trade off buffering space for
writeback latency. Some applications perform well with increased
buffering space (e.g. 10-20s of writeback) while others perform
better with extremely low writeback latency (e.g. 0.5-1s). 

>   For
> example, should a database server maybe have dirty_background_bytes
> set to 75% of the non-volatile write cache present on the
> controller, in an attempt to make sure that there is always some
> "slack" space for writes?

I don't think the hardware cache size matters as it's easy to fill
them very quickly and so after a couple of seconds the controller
will fall back to disk speed anyway. IMO, what matters is that the
threshold is large enough to adequately buffer writes to smooth
peaks and troughs in the pipeline.

Cheers,

Dave.
-- 
Dave Chinner




From:
Jim Nasby
Date:

On 1/14/14, 4:21 AM, Mel Gorman wrote:
> There is an interesting side-line here. If all IO is initiated by one
> process in postgres then the memory locality will be sub-optimal.
> The consumer of the data may or may not be running on the same
> node as the process that read the data from disk. It is possible to
> migrate this from user space but the interface is clumsy and assumes the
> data is mapped.

That's really not the case in Postgres. There's essentially 3 main areas for IO requests to come from:

- Individual "backends". These are processes forked off of our startup process (postmaster) for the purpose of serving
user connections. This is always "foreground" IO and should be avoided as much as possible (but is still a large
percentage).
- autovacuum. This is a set of "clean-up" processes, meant to be low impact, background only. Similar to garbage
collection in GC languages.
- bgwriter. This process is meant to greatly reduce the need for user backends to write data out.

Generally speaking, read requests are most likely to come from user backends. autovacuum can issue them too, but it's
got a throttling mechanism so generally shouldn't be that much of the workload.
 

Ideally most write traffic would come from bgwriter (and autovacuum, though again we don't care too much about it). In
reality though, that's going to depend very highly on a user's actual workload. To start, backends normally must write
all write-ahead-log traffic before they finalize (COMMIT) a transaction for the user. COMMIT is sort of similar in idea
to fsync... "When this returns I guarantee I've permanently stored your data."
 

The amount of WAL data generated for a transaction will vary enormously, even as a percentage of raw page data written.
In some cases a very small (10s-100s of bytes) amount of WAL data will cover 1 or more base data pages (8k by default,
up to 64k). But to protect against torn page writes, by default we write a complete copy of a data page to WAL the first
time the page is dirtied after a checkpoint. So the opposite scenario is we actually write slightly MORE data to WAL
than we do to the data pages.
 

What makes WAL even trickier is that bgwriter tries to write WAL data out before backends need to. In a system with a
fairly low transaction rate that can work... but with a higher rate most WAL data will be written by a backend trying to
issue a COMMIT. Note however that COMMIT needs to write ALL WAL data up to a given point, so one backend that only needs
to write 100 bytes can easily end up flushing (and fsync'ing) megabytes of data written by some other backend.
 

Further complicating things is temporary storage, either in the form of user defined temporary tables, or temporary
storage needed by the database itself. It's hard to characterize these workloads other than to say that typically
reading and writing to them will want to move a relatively large amount of data at once.
 

BTW, because Postgres doesn't have terribly sophisticated memory management, it's very common to create temporary file
data that will never, ever, ever actually NEED to hit disk. Where I work, being able to tell the kernel to avoid flushing
those files unless the kernel thinks it's got better things to do with that memory would be EXTREMELY valuable, because
it's all temp data anyway: if the database or server crashes it's just going to get thrown away. It might be a good idea
for Postgres to look at simply putting this data into plain memory now and relying on the OS to swap it as needed.
That'd be more problematic for temp tables, but in that case mmap might work very well, because that data is currently
never shared by other processes, though if we start doing parallel query execution that will change.
 
-- 
Jim C. Nasby, Data Architect                       
512.569.9461 (cell)                         http://jim.nasby.net



From:
Jim Nasby
Date:

On 1/14/14, 3:41 PM, Dave Chinner wrote:
> On Tue, Jan 14, 2014 at 09:40:48AM -0500, Robert Haas wrote:
>> On Mon, Jan 13, 2014 at 5:26 PM, Mel Gorman <> wrote:
> IOWs, using sync_file_range() does not avoid the need to fsync() a
> file for data integrity purposes...

I believe the PG community understands that, but thanks for the heads-up.

>> Whether the problem is with the system
>> call or the programmer is harder to determine.  I think the problem is
>> in part that it's not exactly clear when we should call it.  So
>> suppose we want to do a checkpoint.  What we used to do a long time
>> ago is write everything, and then fsync it all, and then call it good.
>>   But that produced horrible I/O storms.  So what we do now is do the
>> writes over a period of time, with sleeps in between, and then fsync
>> it all at the end, hoping that the kernel will write some of it before
>> the fsyncs arrive so that we don't get a huge I/O spike.
>> And that sorta works, and it's definitely better than doing it all at
>> full speed, but it's pretty imprecise.  If the kernel doesn't write
>> enough of the data out in advance, then there's still a huge I/O storm
>> when we do the fsyncs and everything grinds to a halt.  If it writes
>> out more data than needed in advance, it increases the total number of
>> physical writes because we get less write-combining, and that hurts
>> performance, too.

I think there's a pretty important bit that Robert didn't mention: we have a specific *time* target for when we want
all the fsync's to complete. People that have problems here tend to tune checkpoints to complete every 5-15 minutes, and
they want the write traffic for the checkpoint spread out over 90% of that time interval. To put it another way, fsync's
should be done when 90% of the time to the next checkpoint hits, but preferably not a lot before then.
 

> Yup, the kernel defaults to maximising bulk write throughput, which
> means it waits to the last possible moment to issue write IO. And
> that's exactly to maximise write combining, optimise delayed
> allocation, etc. There are many good reasons for doing this, and for
> the majority of workloads it is the right behaviour to have.
>
> It sounds to me like you want the kernel to start background
> writeback earlier so that it doesn't build up as much dirty data
> before you require a flush. There are several ways to do this by
> tweaking writeback knobs. The simplest is probably just to set
> /proc/sys/vm/dirty_background_bytes to an appropriate threshold (say
> 50MB) and dirty_expire_centiseconds to a few seconds so that
> background writeback starts and walks all dirty inodes almost
> immediately. This will keep a steady stream of low level background
> IO going, and fsync should then not take very long.

Except that still won't throttle writes, right? That's the big issue here: our users often can't tolerate big spikes in
IO latency. They want user requests to always happen within a specific amount of time.
 

So while delaying writes potentially reduces the total amount of data you're writing, users that run into problems here
ultimately care more about ensuring that their foreground IO completes in a timely fashion.
 

> Fundamentally, though, we need bug reports from people seeing these
> problems when they see them so we can diagnose them on their
> systems. Trying to discuss/diagnose these problems without knowing
> anything about the storage, the kernel version, writeback
> thresholds, etc really doesn't work because we can't easily
> determine a root cause.

So what is the best way to accomplish that?

Also, along the lines of collaboration, it would also be awesome to see kernel hackers at PGCon (http://pgcon.org) for
further discussion of this stuff. That is the conference that has more Postgres internal developers than any other.
There's a variety of different ways collaboration could happen there, so it's probably best to start a separate
discussion with those from the linux community who'd be interested in attending. PGCon also directly follows BSDCan
(http://bsdcan.org) at the same venue... so we could potentially kill two OS birds with one stone, so to speak... :) If
there's enough interest we could potentially do a "mini Postgres/OS conference" in-between BSDCan and the formal PGCon.
There's also potential for the Postgres community to sponsor attendance for kernel hackers if money is a factor.
 

Like I said... best to start a separate thread if there's significant interest on meeting at PGCon. :)
-- 
Jim C. Nasby, Data Architect                       
512.569.9461 (cell)                         http://jim.nasby.net



From:
Jim Nasby
Date:

On 1/14/14, 10:08 AM, Tom Lane wrote:
> Trond Myklebust <> writes:
>> On Jan 14, 2014, at 10:39, Tom Lane <> wrote:
>>> "Don't be aggressive" isn't good enough.  The prohibition on early write
>>> has to be absolute, because writing a dirty page before we've done
>>> whatever else we need to do results in a corrupt database.  It has to
>>> be treated like a write barrier.
>
>> Then why are you dirtying the page at all? It makes no sense to tell the kernel “we’re changing this page in the
>> page cache, but we don’t want you to change it on disk”: that’s not consistent with the function of a page cache.
 
>
> As things currently stand, we dirty the page in our internal buffers,
> and we don't write it to the kernel until we've written and fsync'd the
> WAL data that needs to get to disk first.  The discussion here is about
> whether we could somehow avoid double-buffering between our internal
> buffers and the kernel page cache.
>
> I personally think there is no chance of using mmap for that; the
> semantics of mmap are pretty much dictated by POSIX and they don't work
> for this.  However, disregarding the fact that the two communities
> speaking here don't control the POSIX spec, you could maybe imagine
> making it work if *both* pending WAL file contents and data file
> contents were mmap'd, and there were kernel APIs allowing us to say
> "you can write this mmap'd page if you want, but not till you've written
> that mmap'd data over there".  That'd provide the necessary
> write-barrier semantics, and avoid the cache coherency question because
> all the data visible to the kernel could be thought of as the "current"
> filesystem contents, it just might not all have reached disk yet; which
> is the behavior of the kernel disk cache already.
>
> I'm dubious that this sketch is implementable with adequate efficiency,
> though, because in a live system the kernel would be forced to deal with
> a whole lot of active barrier restrictions.  Within Postgres we can
> reduce write-ordering tests to a very simple comparison: don't write
> this page until WAL is flushed to disk at least as far as WAL sequence
> number XYZ.  I think any kernel API would have to be a great deal more
> general and thus harder to optimize.

For the sake of completeness... it's theoretically silly that Postgres is doing all this stuff with WAL when the
filesystem is doing something very similar with its journal. And an SSD drive (and next generation spinning rust) is
doing the same thing *again* in its own journal.
 

If all 3 communities (or even just 2 of them!) could agree on the necessary interface a tremendous amount of this
duplicated technology could be eliminated.
 

That said, I rather doubt the Postgres community would go this route, not so much because of the presumably massive
changes needed, but more because our community is not a fan of restricting our users to things like "Thou shalt use a
journaled FS or risk all thy data!"
 

> Another difficulty with merging our internal buffers with the kernel
> cache is that when we're in the process of applying a change to a page,
> there are intermediate states of the page data that should under no
> circumstances reach disk (eg, we might need to shuffle records around
> within the page).  We can deal with that fairly easily right now by not
> issuing a write() while a page change is in progress.  I don't see that
> it's even theoretically possible in an mmap'd world; there are no atomic
> updates to an mmap'd page that are larger than whatever is an atomic
> update for the CPU.

Yet another problem with trying to combine database and journaled FS efforts... :(
-- 
Jim C. Nasby, Data Architect                       
512.569.9461 (cell)                         http://jim.nasby.net



From:
Jim Nasby
Date:

On 1/14/14, 6:36 PM, Claudio Freire wrote:
> On Tue, Jan 14, 2014 at 9:22 PM, Jim Nasby <> wrote:
>> On 1/14/14, 11:30 AM, Jeff Janes wrote:
>>>
>>> I think the "reclaim this page if you need memory but leave it resident if
>>> there is no memory pressure" hint would be more useful for temporary working
>>> files than for what was being discussed above (shared buffers).  When I do
>>> work that needs large temporary files, I often see physical write IO spike
>>> but physical read IO does not.  I interpret that to mean that the temporary
>>> data is being written to disk to satisfy either dirty_expire_centisecs or
>>> dirty_*bytes, but the data remains in the FS cache and so disk reads are not
>>> needed to satisfy it.  So a hint that says "this file will never be fsynced
>>> so please ignore dirty_*bytes and dirty_expire_centisecs.  I will need it
>>> again relatively soon (but not after a reboot), but will do so mostly
>>> sequentially, so please don't evict this without need, but if you do need to
>>> then it is a good candidate" would be good.
>>
>>
>> I also frequently see this, and it has an even larger impact if pgsql_tmp is
>> on the same filesystem as WAL. Which *theoretically* shouldn't matter with a
>> BBU controller, except that when the kernel suddenly decides your
>> *temporary* data needs to hit the media you're screwed.
>>
>> Though, it also occurs to me... perhaps it would be better for us to simply
>> map temp objects to memory and let the kernel swap them out if needed...
>
>
> Oum... bad idea.
>
> Swap logic has very poor taste for I/O patterns.

Well, to be honest, so do we. Practically zero in fact...

In fact, the kernel might even be in a better position than we are since you can presumably count page faults much more
cheaply than we can.
 

BTW, if you guys are looking at ARC you should absolutely read discussion about that in our archives
(http://lnk.nu/postgresql.org/2zeu/ as a starting point). We put considerable effort into it, had it in two minor
versions, and then switched to a clock-sweep algorithm that's similar to what FreeBSD used, at least in the 4.x days.
Definitely not claiming what we've got is the best (in fact, I think we're hurt by not maintaining a real free list),
but the ARC info there is probably valuable.
 
-- 
Jim C. Nasby, Data Architect                       
512.569.9461 (cell)                         http://jim.nasby.net



From:
Dave Chinner
Date:

On Tue, Jan 14, 2014 at 05:38:10PM -0700, Jonathan Corbet wrote:
> On Wed, 15 Jan 2014 09:23:52 +1100
> Dave Chinner <> wrote:
> 
> > It appears to me that we are seeing large memory machines much more
> > commonly in data centers - a couple of years ago 256GB RAM was only
> > seen in supercomputers. Hence machines of this size are moving from
> > "tweaking settings for supercomputers is OK" class to "tweaking
> > settings for enterprise servers is not OK"....
> > 
> > Perhaps what we need to do is deprecate dirty_ratio and
> > dirty_background_ratio as the default values and move to the byte
> > based values as the defaults and cap them appropriately.  e.g.
> > 10/20% of RAM for small machines down to a couple of GB for large
> > machines....
> 
> I had thought that was already in the works...it hits people on far
> smaller systems than those described here.
> 
>     http://lwn.net/Articles/572911/
> 
> I wonder if anybody ever finished this work out for 3.14?

Not that I know of.  This patch was suggested as the solution to the
slow/fast drive issue that started the whole thread:

http://thread.gmane.org/gmane.linux.kernel/1584789/focus=1587059

but I don't see it in a current kernel. It might be in Andrew's tree
for 3.14, but I haven't checked.

However, most of the discussion in that thread about dirty limits
was a side show that rehashed old territory. Rate limiting and
throttling in a generic, scalable manner is a complex problem. We've
got some of the infrastructure we need to solve the problem, but
there was no conclusion as to the correct way to connect all the
dots.  Perhaps it's another topic for the LSFMM conf?

Cheers,

Dave.
-- 
Dave Chinner




From:
Claudio Freire
Date:

On Wed, Jan 15, 2014 at 1:07 AM, Jim Nasby <> wrote:
>>>
>>> Though, it also occurs to me... perhaps it would be better for us to
>>> simply
>>> map temp objects to memory and let the kernel swap them out if needed...
>>
>>
>>
>> Oum... bad idea.
>>
>> Swap logic has very poor taste for I/O patterns.
>
>
> Well, to be honest, so do we. Practically zero in fact...

I've used mmap'd files for years, they're great for sharing mutable
memory across unrelated (as in out-of-hierarchy) processes.

And my experience is that when swapping to and from disk is expected to
be a significant percentage of the workload, explicit I/O of even the
dumbest kind far outperforms swap-based I/O.

I've read the kernel code and I'm not 100% sure why that is, but I
have a suspicion.

My completely unproven theory is that swapping is overwhelmed by
near-misses. Ie: a process touches a page, and before it's actually
swapped in, another process touches it too, blocking on the other
process' read. But the second process doesn't account for that page
when evaluating predictive models (ie: read-ahead), so the next I/O by
process 2 is unexpected to the kernel. Then the same with 1. Etc... In
essence, swap, by a fluke of its implementation, fails utterly to
predict the I/O pattern, and results in far sub-optimal reads.

Explicit I/O is free from that effect, all read calls are accountable,
and that makes a difference.

Maybe, if the kernel could be fixed in that respect, you could
consider mmap'd files as a suitable form of temporary storage. But
that would depend on the success and availability of such a fix/patch.



From:
Heikki Linnakangas
Date:

On 01/15/2014 06:01 AM, Jim Nasby wrote:
> For the sake of completeness... it's theoretically silly that Postgres
> is doing all this stuff with WAL when the filesystem is doing something
> very similar with its journal. And an SSD drive (and next generation
> spinning rust) is doing the same thing *again* in its own journal.
>
> If all 3 communities (or even just 2 of them!) could agree on the
> necessary interface a tremendous amount of this duplicated technology
> could be eliminated.
>
> That said, I rather doubt the Postgres community would go this route,
> not so much because of the presumably massive changes needed, but more
> because our community is not a fan of restricting our users to things
> like "Thou shalt use a journaled FS or risk all thy data!"

The WAL is also used for continuous archiving and replication, not just 
crash recovery. We could skip full-page-writes, though, if we knew that 
the underlying filesystem/storage is guaranteeing that a write() is atomic.

It might be useful for PostgreSQL to somehow tell the filesystem that we're
taking care of WAL-logging, so that the filesystem doesn't need to.

- Heikki



From:
Mel Gorman
Date:

(This thread is now massive and I have not read it all yet. If anything
I say has already been discussed then whoops)

On Tue, Jan 14, 2014 at 12:09:46PM +1100, Dave Chinner wrote:
> On Mon, Jan 13, 2014 at 09:29:02PM +0000, Greg Stark wrote:
> > On Mon, Jan 13, 2014 at 9:12 PM, Andres Freund <> wrote:
> > > For one, postgres doesn't use mmap for files (and can't without major
> > > new interfaces). Frequently mmap()/madvise()/munmap()ing 8kb chunks has
> > > horrible consequences for performance/scalability - very quickly you
> > > contend on locks in the kernel.
> > 
> > I may as well dump this in this thread. We've discussed this in person
> > a few times, including at least once with Ted T'so when he visited
> > Dublin last year.
> > 
> > The fundamental conflict is that the kernel understands better the
> > hardware and other software using the same resources, Postgres
> > understands better its own access patterns. We need to either add
> > interfaces so Postgres can teach the kernel what it needs about its
> > access patterns or add interfaces so Postgres can find out what it
> > needs to know about the hardware context.
> 
> In my experience applications don't need to know anything about the
> underlying storage hardware - all they need is for someone to 
> tell them the optimal IO size and alignment to use.
> 

That potentially misses details on efficient IO patterns. They might
submit many small requests, for example, each of which is of the optimal
IO size and alignment but which are sub-optimal overall. While these
still go through the underlying block layers there is no guarantee that
the requests will arrive in time for efficient merging to occur.

> > The more ambitious and interesting direction is to let Postgres tell
> > the kernel what it needs to know to manage everything. To do that we
> > would need the ability to control when pages are flushed out. This is
> > absolutely necessary to maintain consistency. Postgres would need to
> > be able to mark pages as unflushable until some point in time in the
> > future when the journal is flushed. We discussed various ways that
> > interface could work but it would be tricky to keep it low enough
> > overhead to be workable.
> 
> IMO, the concept of allowing userspace to pin dirty page cache
> pages in memory is just asking for trouble. Apart from the obvious
> memory reclaim and OOM issues, some filesystems won't be able to
> move their journals forward until the data is flushed. i.e. ordered
> mode data writeback on ext3 will have all sorts of deadlock issues
> that result from pinning pages and then issuing fsync() on another
> file which will block waiting for the pinned pages to be flushed.
> 

That applies if the dirty pages are forced to be kept dirty. You call
this pinned but pinned has special meaning so I would suggest calling it
something like dirty-sticky pages. It could be the case that such hinting
will have the pages excluded from dirty background writing but can still
be cleaned if dirty limits are hit or if fsync is called. It's a hint,
not a forced guarantee.

It's still a hand grenade if this is tracked on a per-page basis, because of what happens when the process
crashes: those pages stay dirty potentially forever. An alternative would be to track this on a per-inode
instead of per-page basis. The hint would only exist where there is an
open fd for that inode.  Treat it as a privileged call with a sysctl
controlling how many dirty-sticky pages can exist in the system with the
information presented during OOM kills and maybe it starts becoming a bit
more manageable. Dirty-sticky pages are not guaranteed to stay dirty
until userspace action, the kernel just stays away until there are no
other sensible options.

> Indeed, what happens if you do pin_dirty_pages(fd); .... fsync(fd);?
> If fsync() blocks because there are pinned pages, and there's no
> other thread to unpin them, then that code just deadlocked.

Indeed. Forcing pages with this hint to stay dirty until user space decides
to clean them is eventually going to blow up.

> <SNIP>
> Hmmmm.  What happens if the process crashes after pinning the dirty
> pages? How do we even know what process pinned the dirty pages so
> we can clean up after it? What happens if the same page is pinned by
> multiple processes? What happens on truncate/hole punch if the
> partial pages in the range that need to be zeroed and written are
> pinned? What happens if we do direct IO to a range with pinned,
> unflushable pages in the page cache?
> 

Proposal: A process with an open fd can hint that pages managed by this inode will have dirty-sticky pages. Pages will
be ignored by dirty background writing unless there is an fsync call or dirty page limits are hit. The hint is cleared
when no process has the file open.
 

If the process crashes, the hint is cleared and the pages get cleaned as
normal

Multiple processes do not matter as such, since all of them will have the file
open. There is a problem if the processes disagree on whether the pages
should be dirty sticky or not. The default would be that a sticky-dirty
hint takes priority although it does mean that a potentially unprivileged
process can cause problems. There would be security concerns here that
have to be taken into account.

fsync and truncate both override the hint. fsync will write the pages,
truncate will discard them.

If there is direct IO on the range then force the sync, invalidate the
page cache, initiate the direct IO as normal.

At least one major downside is that the performance will depend on system
parameters and be non-deterministic, particularly in comparison to direct IO.

> These are all complex corner cases that are introduced by allowing
> applications to pin dirty pages in memory. I've only spent a few
> minutes coming up with these, and I'm sure there's more of them.
> As such, I just don't see that allowing userspace to pin dirty
> page cache pages in memory being a workable solution.
> 

From what I've read so far, I'm not convinced they are looking for a
hard *pin* as such. They want better control over the how and the when
of writeback, not absolute control.  I somewhat sympathise with their
reluctance to use direct IO when the kernel should be able to get them most,
if not all, of the potential performance.

-- 
Mel Gorman
SUSE Labs



From:
Mel Gorman
Date:

On Tue, Jan 14, 2014 at 09:30:19AM -0800, Jeff Janes wrote:
> > > What's not so simple, is figuring out what policy to use. Remember,
> > > you cannot tell the kernel to put some page in its page cache without
> > > reading it or writing it. So, once you make the kernel forget a page,
> > > evicting it from shared buffers becomes quite expensive.
> >
> > posix_fadvise(POSIX_FADV_WILLNEED) is meant to cover this case by
> > forcing readahead.
> 
> 
> But telling the kernel to forget a page, then telling it to read it in
> again from disk because it might be needed again in the near future is
> itself very expensive.  We would need to hand the page to the kernel so it
> has it without needing to go to disk to get it.
> 

Yes, this is the unnecessary IO cost I was thinking of.

> 
> > If you evict it prematurely then you do get kinda
> > screwed because you pay the IO cost to read it back in again even if you
> > had enough memory to cache it. Maybe this is the type of kernel-postgres
> > interaction that is annoying you.
> >
> > If you don't evict, the kernel eventually steps in and evicts the wrong
> > thing. If you do evict and it was unnecessarily you pay an IO cost.
> >
> > That could be something we look at. There are cases buried deep in the
> > VM where pages get shuffled to the end of the LRU and get tagged for
> > reclaim as soon as possible. Maybe you need access to something like
> > that via posix_fadvise to say "reclaim this page if you need memory but
> > leave it resident if there is no memory pressure" or something similar.
> > Not exactly sure what that interface would look like or offhand how it
> > could be reliably implemented.
> >
> 
> I think the "reclaim this page if you need memory but leave it resident if
> there is no memory pressure" hint would be more useful for temporary
> working files than for what was being discussed above (shared buffers).
>  When I do work that needs large temporary files, I often see physical
> write IO spike but physical read IO does not.  I interpret that to mean
> that the temporary data is being written to disk to satisfy either
> dirty_expire_centisecs or dirty_*bytes, but the data remains in the FS
> cache and so disk reads are not needed to satisfy it.  So a hint that says
> "this file will never be fsynced so please ignore dirty_*bytes and
> dirty_expire_centisecs" would be useful.

It would be good to know if dirty_expire_centisecs or dirty ratio|bytes
were the problem here. An interface that forces a dirty page to stay dirty
regardless of the global system would be a major hazard. It potentially
allows the creator of the temporary file to stall all other processes
dirtying pages for an unbounded period of time. I proposed in another part
of the thread a hint for open inodes to have the background writer thread
ignore dirty pages belonging to that inode. Dirty limits and fsync would
still be obeyed. It might also be workable for temporary files but the
proposal could be full of holes.

Your alternative here is to create a private anonymous mapping as they
are not subject to dirty limits. This is only a sensible option if the
temporary data is guaranteed to be relatively small. If the shared
buffers, page cache and your temporary data exceed the size of RAM then
data will get discarded or your temporary data will get pushed to swap
and performance will hit the floor.

FWIW, the performance of some IO "benchmarks" used to depend on whether they
could create, write and delete files before any of the data actually hit
the disk -- pretty much exactly the type of behaviour you are looking for.

-- 
Mel Gorman
SUSE Labs



From:
Mel Gorman
Date:

On Wed, Jan 15, 2014 at 09:44:21AM +0000, Mel Gorman wrote:
> > <SNIP>
> > Hmmmm.  What happens if the process crashes after pinning the dirty
> > pages? How do we even know what process pinned the dirty pages so
> > we can clean up after it? What happens if the same page is pinned by
> > multiple processes? What happens on truncate/hole punch if the
> > partial pages in the range that need to be zeroed and written are
> > pinned? What happens if we do direct IO to a range with pinned,
> > unflushable pages in the page cache?
> > 
> 
> Proposal: A process with an open fd can hint that pages managed by this
>     inode will have dirty-sticky pages. Pages will be ignored by
>     dirty background writing unless there is an fsync call or
>     dirty page limits are hit. The hint is cleared when no process
>     has the file open.
> 

I'm still processing the rest of the thread and putting it into my head
but it's at least clear that this proposal would only cover the case where
large temporary files are created that do not necessarily need to be
persisted. They still have cases where the ordering of writes matter and
the kernel cleaning pages behind their back would lead to corruption.

-- 
Mel Gorman
SUSE Labs



From:
Hannu Krosing
Date:

On 01/14/2014 06:12 PM, Robert Haas wrote:
> This would be pretty similar to copy-on-write, except
> without the copying. It would just be
> forget-from-the-buffer-pool-on-write. 

+1

A version of this could probably already be implemented using
MADV_DONTNEED and MADV_WILLNEED.

That is, just after reading the page in, use MADV_DONTNEED on it. When
evicting a clean page, check that it is still in cache and if it is,
MADV_WILLNEED it.

Another nice thing to do would be dynamically adjusting the kernel's
dirty_background_ratio and other related knobs in real time based on
how many buffers are dirty inside PostgreSQL, maybe in the background
writer.

Question to LKML folks - will the kernel react well to frequent changes to
/proc/sys/vm/dirty_*  ?
How frequent can they be (every few seconds? every second? 100Hz ?)

Cheers

-- 
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ




From:
Hannu Krosing
Date:

On 01/15/2014 12:16 PM, Hannu Krosing wrote:
> On 01/14/2014 06:12 PM, Robert Haas wrote:
>> This would be pretty similar to copy-on-write, except
>> without the copying. It would just be
>> forget-from-the-buffer-pool-on-write. 
> +1
>
> A version of this could probably already be implement using MADV_DONTNEED
> and MADV_WILLNEED
>
> Thet is, just after reading the page in, use MADV_DONTNEED on it. When
> evicting
> a clean page, check that it is still in cache and if it is, then
> MADV_WILLNEED it.
>
> Another nice thing to do would be dynamically adjusting kernel
> dirty_background_ratio
> and other related knobs in real time based on how many buffers are dirty
> inside postgresql.
> Maybe in background writer.
>
> Question to LKM folks - will kernel react well to frequent changes to
> /proc/sys/vm/dirty_*  ?
> How frequent can they be (every few second? every second? 100Hz ?)
One obvious use case of this would be changing dirty_background_bytes
linearly to almost zero during a checkpoint to make final fsync fast.
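A minimal sketch of that ramp (function names are ours, not an existing API; writing the sysctl requires privilege, and the kernel only samples these limits periodically, so a coarse ramp is the best one could do):

```c
/* Sketch: lower dirty_background_bytes linearly as a checkpoint
 * progresses, so background writeback gets steadily more aggressive
 * before the final fsync. */
#include <stdio.h>

/* progress runs from 0.0 (checkpoint start) to 1.0 (fsync imminent). */
long ramp_dirty_background_bytes(long normal_bytes, long floor_bytes,
                                 double progress)
{
    if (progress < 0.0) progress = 0.0;
    if (progress > 1.0) progress = 1.0;
    long v = normal_bytes - (long)((normal_bytes - floor_bytes) * progress);
    return v < floor_bytes ? floor_bytes : v;
}

/* Best-effort write; fails without root, which is fine for a sketch. */
int write_dirty_background_bytes(long bytes)
{
    FILE *f = fopen("/proc/sys/vm/dirty_background_bytes", "w");
    if (!f)
        return -1;
    int ok = fprintf(f, "%ld\n", bytes) > 0;
    fclose(f);
    return ok ? 0 : -1;
}
```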

Cheers

-- 
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ




From:
Mel Gorman
Date:

On Mon, Jan 13, 2014 at 02:19:56PM -0800, James Bottomley wrote:
> On Mon, 2014-01-13 at 22:12 +0100, Andres Freund wrote:
> > On 2014-01-13 12:34:35 -0800, James Bottomley wrote:
> > > On Mon, 2014-01-13 at 14:32 -0600, Jim Nasby wrote:
> > > > Well, if we were to collaborate with the kernel community on this then
> > > > presumably we can do better than that for eviction... even to the
> > > > extent of "here's some data from this range in this file. It's (clean|
> > > > dirty). Put it in your cache. Just trust me on this."
> > > 
> > > This should be the madvise() interface (with MADV_WILLNEED and
> > > MADV_DONTNEED) is there something in that interface that is
> > > insufficient?
> > 
> > For one, postgres doesn't use mmap for files (and can't without major
> > new interfaces).
> 
> I understand, that's why you get double buffering: because we can't
> replace a page in the range you give us on read/write.  However, you
> don't have to switch entirely to mmap: you can use mmap/madvise
> exclusively for cache control and still use read/write (and still pay
> the double buffer penalty, of course).  It's only read/write with
> directio that would cause problems here (unless you're planning to
> switch to DIO?).
> 

There are hazards with using mmap/madvise that may or may not be a problem
for them. I think these are well known but just in case;

mmap/munmap intensive workloads may get hammered on taking mmap_sem for
write. The greatest costs are incurred if the application is threaded
and the parallel threads are fault-intensive. I do not think this is the
case for PostgreSQL as it is process-based but it is a concern. Even if
it's a single-threaded process, the cost of the mmap_sem cache line
bouncing can be a concern. Outside of that, the mmap/munmap paths are
just really costly and take a lot of work.

madvise has different hazards but lets take DONTNEED as an example because
it's the most likely candidate for use. A DONTNEED hint has three potential
downsides. The first is that mmap_sem taken for read can be very costly
for threaded applications as the cache line bounces. On NUMA machines it
can be a major problem for madvise-intensive workloads. The second is that
the page table teardown frees the pages with the associated costs but most
importantly, an IPI is required afterwards to flush the TLB. If that process
has been running on a lot of different CPUs then the IPI cost can be very
high. The third hazard is that a madvise(DONTNEED) region will incur page
faults on the next accesses again hammering into mmap_sem and all the faults
associated with faulting (allocating the same pages again, zeroing etc)

It may be the case that mmap/madvise is still required to handle a double
buffering problem but it's far from being a free lunch and it has costs
that read/write does not have to deal with. Maybe some of these problems
can be fixed or mitigated but a test case that demonstrates the problem
would help, even if producing one requires patching PostgreSQL.

-- 
Mel Gorman
SUSE Labs



From:
Jan Kara
Date:

On Wed 15-01-14 10:27:26, Heikki Linnakangas wrote:
> On 01/15/2014 06:01 AM, Jim Nasby wrote:
> >For the sake of completeness... it's theoretically silly that Postgres
> >is doing all this stuff with WAL when the filesystem is doing something
> >very similar with it's journal. And an SSD drive (and next generation
> >spinning rust) is doing the same thing *again* in it's own journal.
> >
> >If all 3 communities (or even just 2 of them!) could agree on the
> >necessary interface a tremendous amount of this duplicated technology
> >could be eliminated.
> >
> >That said, I rather doubt the Postgres community would go this route,
> >not so much because of the presumably massive changes needed, but more
> >because our community is not a fan of restricting our users to things
> >like "Thou shalt use a journaled FS or risk all thy data!"
> 
> The WAL is also used for continuous archiving and replication, not
> just crash recovery. We could skip full-page-writes, though, if we
> knew that the underlying filesystem/storage is guaranteeing that a
> write() is atomic.
> 
> It might be useful for PostgreSQL somehow tell the filesystem that
> we're taking care of WAL-logging, so that the filesystem doesn't
> need to.
  Well, journalling fs generally cares about its metadata consistency. We
have much weaker guarantees regarding file data because those guarantees
come at a cost most people don't want to pay.

Filesystems could in theory provide a facility like atomic write (at least up
to a certain size, say in the MB range) but it's not so easy and when there are
no strong use cases fs people are reluctant to make their code more complex
unnecessarily. OTOH without widespread atomic write support I understand
application developers have similar stance. So it's kind of chicken and egg
problem. BTW, e.g. ext3/4 has quite a bit of the infrastructure in place
due to its data=journal mode so if someone on the PostgreSQL side wanted to
research on this, knitting some experimental ext4 patches should be doable.
                            Honza
-- 
Jan Kara <>
SUSE Labs, CR



From:
Dave Chinner
Date:

On Tue, Jan 14, 2014 at 09:54:20PM -0600, Jim Nasby wrote:
> On 1/14/14, 3:41 PM, Dave Chinner wrote:
> >On Tue, Jan 14, 2014 at 09:40:48AM -0500, Robert Haas wrote:
> >>On Mon, Jan 13, 2014 at 5:26 PM, Mel Gorman <>
> >>wrote: Whether the problem is with the system call or the
> >>programmer is harder to determine.  I think the problem is in
> >>part that it's not exactly clear when we should call it.  So
> >>suppose we want to do a checkpoint.  What we used to do a long
> >>time ago is write everything, and then fsync it all, and then
> >>call it good.  But that produced horrible I/O storms.  So what
> >>we do now is do the writes over a period of time, with sleeps in
> >>between, and then fsync it all at the end, hoping that the
> >>kernel will write some of it before the fsyncs arrive so that we
> >>don't get a huge I/O spike.  And that sorta works, and it's
> >>definitely better than doing it all at full speed, but it's
> >>pretty imprecise.  If the kernel doesn't write enough of the
> >>data out in advance, then there's still a huge I/O storm when we
> >>do the fsyncs and everything grinds to a halt.  If it writes out
> >>more data than needed in advance, it increases the total number
> >>of physical writes because we get less write-combining, and that
> >>hurts performance, too.
> 
> I think there's a pretty important bit that Robert didn't mention:
> we have a specific *time* target for when we want all the fsync's
> to complete. People that have problems here tend to tune
> checkpoints to complete every 5-15 minutes, and they want the
> write traffic for the checkpoint spread out over 90% of that time
> interval. To put it another way, fsync's should be done when 90%
> of the time to the next checkpoint hits, but preferably not a lot
> before then.

I think that is pretty much understood. I don't recall anyone
mentioning a typical checkpoint period, though, so knowing the
typical timeframe of IO storms and how much data is typically
written in a checkpoint helps us understand the scale of the
problem.

> >It sounds to me like you want the kernel to start background
> >writeback earlier so that it doesn't build up as much dirty data
> >before you require a flush. There are several ways to do this by
> >tweaking writeback knobs. The simplest is probably just to set
> >/proc/sys/vm/dirty_background_bytes to an appropriate threshold
> >(say 50MB) and dirty_expire_centiseconds to a few seconds so that
> >background writeback starts and walks all dirty inodes almost
> >immediately. This will keep a steady stream of low level
> >background IO going, and fsync should then not take very long.
> 
> Except that still won't throttle writes, right? That's the big
> issue here: our users often can't tolerate big spikes in IO
> latency. They want user requests to always happen within a
> specific amount of time.

Right, but that's a different problem and one that io scheduling
tweaks can have a major effect on. e.g. the deadline scheduler
should be able to provide a maximum upper bound on read IO latency
even while writes are in progress, though how successful it is is
dependent on the nature of the write load and the architecture of
the underlying storage.

However, the first problem is dealing with the IO storm problem on
fsync. Then we can measure the effect of spreading those writes out
in time and determine what triggers read starvations (if they are
apparent). The we can look at whether IO scheduling tweaks or
whether blk-io throttling solves those problems. Or whether
something else needs to be done to make it work in environments
where problems are manifesting.

FWIW [and I know you're probably sick of hearing this by now], but
the blk-io throttling works almost perfectly with applications that
use direct IO.....

> So while delaying writes potentially reduces the total amount of
> data you're writing, users that run into problems here ultimately
> care more about ensuring that their foreground IO completes in a
> timely fashion.

Understood. Applications that crunch randomly through large data
sets are almost always read IO latency bound....

> >Fundamentally, though, we need bug reports from people seeing
> >these problems when they see them so we can diagnose them on
> >their systems. Trying to discuss/diagnose these problems without
> >knowing anything about the storage, the kernel version, writeback
> >thresholds, etc really doesn't work because we can't easily
> >determine a root cause.
> 
> So is  the best way to accomplish that?

No. That is just the list for organising the LFSMM summit. ;)

For general pagecache and writeback issues, discussions, etc,
 is the list to use. LKML simply has
too much noise to be useful these days, so I'd avoid it. Otherwise
the filesystem specific lists are a good place to get help for
specific problems (e.g.  and
). We tend to cross-post to other relevant lists as
triage moves into different areas of the storage stack.

> Also, along the lines of collaboration, it would also be awesome
> to see kernel hackers at PGCon (http://pgcon.org) for further
> discussion of this stuff.

True, but I don't think I'll be one of those hackers as Ottawa is
(roughly) a 30 hour commute from where I live and I try to limit the
number of them I do every year....

Cheers,

Dave.
-- 
Dave Chinner




From:
Jan Kara
Date:

On Wed 15-01-14 12:16:50, Hannu Krosing wrote:
> On 01/14/2014 06:12 PM, Robert Haas wrote:
> > This would be pretty similar to copy-on-write, except
> > without the copying. It would just be
> > forget-from-the-buffer-pool-on-write. 
> 
> +1
> 
> A version of this could probably already be implement using MADV_DONTNEED
> and MADV_WILLNEED
> 
> Thet is, just after reading the page in, use MADV_DONTNEED on it. When
> evicting
> a clean page, check that it is still in cache and if it is, then
> MADV_WILLNEED it.
> 
> Another nice thing to do would be dynamically adjusting kernel
> dirty_background_ratio
> and other related knobs in real time based on how many buffers are dirty
> inside postgresql.
> Maybe in background writer.
> 
> Question to LKM folks - will kernel react well to frequent changes to
> /proc/sys/vm/dirty_*  ?
> How frequent can they be (every few second? every second? 100Hz ?)
  So the question is what do you mean by 'react'. We check whether we
should start background writeback every dirty_writeback_centisecs (5s). We
will also check whether we didn't exceed the background dirty limit (and
wake writeback thread) when dirtying pages. However this check happens once
per several dirtied MB (unless we are close to dirty_bytes).

When writeback is running we check roughly once per second (the logic is
more complex there but I don't think explaining details would be useful
here) whether we are below dirty_background_bytes and stop writeback in
that case.

So changing dirty_background_bytes every few seconds should work
reasonably, once a second is pushing it and 100 Hz - no way. But I'd also
note that you have conflicting requirements on the kernel writeback. On one
hand you want checkpoint data to steadily trickle to disk (well, trickle
isn't exactly the proper word since if you need to checkpoint 16 GB every 5
minutes then you need a steady throughput of ~50 MB/s just for
checkpointing) so you want to set dirty_background_bytes low, on the other
hand you don't want temporary files to get to disk so you want to set
dirty_background_bytes high. And also that changes of
dirty_background_bytes probably will not take into account other events
happening on the system (maybe a DB backup is running...). So I'm somewhat
skeptical you will be able to tune dirty_background_bytes frequently in a
useful way.
                            Honza
-- 
Jan Kara <>
SUSE Labs, CR



From:
Hannu Krosing
Date:

On 01/15/2014 02:01 PM, Jan Kara wrote:
> On Wed 15-01-14 12:16:50, Hannu Krosing wrote:
>> On 01/14/2014 06:12 PM, Robert Haas wrote:
>>> This would be pretty similar to copy-on-write, except
>>> without the copying. It would just be
>>> forget-from-the-buffer-pool-on-write. 
>> +1
>>
>> A version of this could probably already be implement using MADV_DONTNEED
>> and MADV_WILLNEED
>>
>> Thet is, just after reading the page in, use MADV_DONTNEED on it. When
>> evicting
>> a clean page, check that it is still in cache and if it is, then
>> MADV_WILLNEED it.
>>
>> Another nice thing to do would be dynamically adjusting kernel
>> dirty_background_ratio
>> and other related knobs in real time based on how many buffers are dirty
>> inside postgresql.
>> Maybe in background writer.
>>
>> Question to LKM folks - will kernel react well to frequent changes to
>> /proc/sys/vm/dirty_*  ?
>> How frequent can they be (every few second? every second? 100Hz ?)
>   So the question is what do you mean by 'react'. We check whether we
> should start background writeback every dirty_writeback_centisecs (5s). We
> will also check whether we didn't exceed the background dirty limit (and
> wake writeback thread) when dirtying pages. However this check happens once
> per several dirtied MB (unless we are close to dirty_bytes).
>
> When writeback is running we check roughly once per second (the logic is
> more complex there but I don't think explaining details would be useful
> here) whether we are below dirty_background_bytes and stop writeback in
> that case.
>
> So changing dirty_background_bytes every few seconds should work
> reasonably, once a second is pushing it and 100 Hz - no way. But I'd also
> note that you have conflicting requirements on the kernel writeback. On one
> hand you want checkpoint data to steadily trickle to disk (well, trickle
> isn't exactly the proper word since if you need to checkpoing 16 GB every 5
> minutes than you need a steady throughput of ~50 MB/s just for
> checkpointing) so you want to set dirty_background_bytes low, on the other
> hand you don't want temporary files to get to disk so you want to set
> dirty_background_bytes high. 
Is it possible to have more fine-grained control over writeback, like
configuring dirty_background_bytes per file system / device (or even
a file or a group of files) ?

If not, then how hard would it be to provide this ?

This is a bit backwards from keeping-the-cache-clean perspective,
but would help a lot with hinting the writer that a big sync is coming.

> And also that changes of
> dirty_background_bytes probably will not take into account other events
> happening on the system (maybe a DB backup is running...). So I'm somewhat
> skeptical you will be able to tune dirty_background_bytes frequently in a
> useful way.
>


Cheers

-- 
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ




From:
Robert Haas
Date:

On Tue, Jan 14, 2014 at 4:23 PM, James Bottomley
<> wrote:
> Yes, that's what I was thinking: it's a cache.  About how many files
> comprise this cache?  Are you thinking it's too difficult for every
> process to map the files?

No, I'm thinking that would throw cache coherency out the window.
Separate mappings are all well and good until somebody decides to
modify the page, but after that point the database processes need to
see the modified version of the page (which is, further, hedged about
with locks) yet the operating system MUST NOT see the modified version
of the page until the write-ahead log entry for the page modification
has been flushed to disk.  There's really no way to do that without
having our own private cache.
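The invariant Robert describes can be captured in a toy model (every name and type below is illustrative, not an actual PostgreSQL structure): the WAL record covering a page's last modification must be durable before the page itself may be handed to the operating system.

```c
/* Toy model of the write-ahead rule: a dirty buffer may only be
 * written out once the WAL up to its last-modification LSN is
 * durable.  This ordering is why the OS must not see a modified
 * page early. */
#include <stdint.h>

typedef struct {
    uint64_t page_lsn;   /* LSN of the last WAL record touching page */
    int      dirty;
} Buffer;

static uint64_t wal_flushed_lsn;   /* highest LSN known durable */

/* Pretend to fsync the WAL up to 'lsn'. */
void wal_flush(uint64_t lsn)
{
    if (lsn > wal_flushed_lsn)
        wal_flushed_lsn = lsn;
}

/* Returns 1 if the buffer was written, flushing WAL first if needed. */
int write_buffer(Buffer *b)
{
    if (!b->dirty)
        return 0;
    if (b->page_lsn > wal_flushed_lsn)
        wal_flush(b->page_lsn);      /* WAL first, always */
    /* ... only now may the page itself go to the kernel ... */
    b->dirty = 0;
    return 1;
}
```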

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



From:
Robert Haas
Date:

On Tue, Jan 14, 2014 at 5:23 PM, Dave Chinner <> wrote:
> By default, background writeback doesn't start until 10% of memory
> is dirtied, and on your machine that's 25GB of RAM. That's way too
> high for your workload.
>
> It appears to me that we are seeing large memory machines much more
> commonly in data centers - a couple of years ago 256GB RAM was only
> seen in supercomputers. Hence machines of this size are moving from
> "tweaking settings for supercomputers is OK" class to "tweaking
> settings for enterprise servers is not OK"....
>
> Perhaps what we need to do is deprecate dirty_ratio and
> dirty_background_ratio as the defaults, move to the byte
> based values as the defaults, and cap them appropriately.  e.g.
> 10/20% of RAM for small machines down to a couple of GB for large
> machines....

I think that's right.  In our case we know we're going to call fsync()
eventually and that's going to produce a torrent of I/O.  If that
torrent fits in downstream caches or can be satisfied quickly without
disrupting the rest of the system too much, then life is good.  But
the downstream caches don't typically grow proportionately to the size
of system memory.  Maybe a machine with 16GB has 1GB of battery-backed
write cache, but it doesn't follow that 256GB machine has 16GB of
battery-backed write cache.

> Essentially, changing dirty_background_bytes, dirty_bytes and
> dirty_expire_centiseconds to be much smaller should make the kernel
> start writeback much sooner and so you shouldn't have to limit the
> amount of buffers the application has to prevent major fsync
> triggered stalls...

I think this has been tried with some success, but I don't know the
details.  I think the bytes values are clearly more useful than the
percentages, because you can set them smaller and with better
granularity.

One thought that occurs to me is that it might be useful to have
PostgreSQL tell the system when we expect to perform an fsync.
Imagine fsync_is_coming(int fd, time_t).  We know long in advance
(minutes) when we're gonna do it, so in some sense what we'd like to
tell the kernel is: we're not in a hurry to get this data on disk
right now, but when the indicated time arrives, we are going to do
fsyncs of a bunch of files in rapid succession, so please arrange to
flush the data as close to that time as possible (to maximize
write-combining) while still finishing by that time (so that the
fsyncs are fast and more importantly so that they don't cause a
system-wide stall).
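fsync_is_coming() does not exist, but something similar can be approximated from userspace today with Linux's sync_file_range(). The sketch below uses our own naming and an invented chunking policy, and invokes the syscall via syscall() to stay self-contained: it pushes dirty data out asynchronously ahead of the deadline so the final fsync finds little left to write.

```c
/* Userspace approximation of the hypothetical fsync_is_coming() hint:
 * start asynchronous writeback of the file in chunks well before the
 * deadline.  Linux only. */
#include <sys/syscall.h>
#include <unistd.h>

#define SFR_WRITE 2   /* SYNC_FILE_RANGE_WRITE from <fcntl.h> */

/* Kick off writeback of 'len' bytes in 'chunk'-sized pieces.  A real
 * scheduler would sleep between chunks so the IO completes just
 * before the promised fsync time.  Returns 0 on success, -1 on
 * error. */
int spread_writeback(int fd, long len, long chunk)
{
    for (long off = 0; off < len; off += chunk) {
        long n = (len - off < chunk) ? len - off : chunk;
        if (syscall(SYS_sync_file_range, fd, off, n, SFR_WRITE) != 0)
            return -1;
    }
    return 0;
}
```

This still lacks what Robert asks for: the kernel, not the application, is best placed to schedule the flushing for maximum write-combining.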

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



From:
Mel Gorman
Date:

> One assumption would be that PostgreSQL is perfectly happy with the current
> kernel behaviour in which case our discussion here is done.

It has been demonstrated that this statement was farcical.  The thread is
massive just from interaction with the LSF/MM program committee.  I'm hoping
that there will be Postgres representation at LSF/MM this year to bring
the issues to a wider audience. I expect that LSF/MM can only commit to
one person attending the whole summit due to limited seats but we could
be more flexible for the Postgres track itself so informal meetings
can be arranged for the evenings and at collab summit.

In case this gets forgotten, this mail describes what has already been
discussed and some of the proposals. Some stuff I do not describe because
it was superseded by later discussion. If I missed something important,
misinterpreted or simply screwed up then shout and I'll update this. I'd
rather none of this gets lost even if it takes months or years to address
it all.

On testing of modern kernels
----------------------------

Josh Berkus claims that most people are using Postgres with 2.6.19 and
consequently there may be poor awareness of recent kernel developments.
This is a disturbingly large window of opportunity for problems to have
been introduced. It raises the question of how much penetration modern
distributions shipping Postgres actually have. More information on why
older kernels dominate Postgres installations would be nice.

Postgres bug reports and LKML
-------------------------------

It is claimed that LKML does not welcome bug reports but it's less clear
what the basis of this claim is.  Is it because the reports are ignored? A
possible explanation is that they are simply getting lost in the LKML noise
and there would be better luck if the bug report was cc'd to a specific
subsystem list. Another explanation is that there is not enough data
available to debug the problem. The worst explanation is that to date
the problem has not been fixable but the details of this have been lost
and are now unknown. Is it possible that some of these bug reports can be
refreshed so at least there is a chance they get addressed?

Apparently there were changes to the reclaim algorithms that crippled
performance without any sysctls. The problem may be compounded by the
introduction of adaptive replacement cache in the shape of the thrash
detection patches currently being reviewed.  Postgres investigated the
use of ARC in the past and ultimately abandoned it. Details are in the
archives (http://www.postgresql.org/search/?m=1&q=arc&l=1&d=-1&s=r). I
have not read them, just noting they exist for future reference.

Sysctls to control VM behaviour are not popular as such tuning parameters
are often used as an excuse to not properly fix the problem. Would it be
possible to describe a test case that shows 2.6.19 performing well and a
modern kernel failing? That would give the VM people a concrete basis to
work from to either fix the problem or identify exactly what sysctls are
required to make this work.

I am confident that any bug related to VM reclaim in this area has been lost.
At least, I recall no instances of it being discussed on linux-mm and it
has not featured at LSF/MM in recent years.

IO Scheduling
-------------

Kevin Grittner has stated that it is known that the DEADLINE and NOOP
schedulers perform better than any alternatives for most database loads.
It would be desirable to quantify this for some test case and see whether
the default scheduler can cope in some way.

The deadline scheduler makes sense to a large extent though. Postgres
is sensitive to large latencies due to IO write spikes. It is at least
plausible that deadline would give more deterministic behaviour for
parallel reads in the presence of large writes assuming there were not
ordering problems between the reads/writes and the underlying filesystem.

For reference, these IO spikes can be massive. If the shared buffer is
completely dirtied in a short space of time then it could be 20-25% of
RAM being dirtied and writeback required in typical configurations. There
have been cases where it was worked around by limiting the size of the
shared buffer to a small enough size so that it can be written back
quickly. There are other tuning options available such as altering when
dirty background writing starts within the kernel but that will not help if
the dirtying happens in a very short space of time. Dave Chinner described
the considerations as follows
    There's no absolute rule here, but the threshold for background
    writeback needs to consider the amount of dirty data being generated,
    the rate at which it can be retired and the checkpoint period the
    application is configured with. i.e. it needs to be slow enough to
    not cause serious read IO perturbations, but still fast enough that
    it avoids peaks at synchronisation points. And most importantly, it
    needs to be fast enough that it can complete writeback of all the
    dirty data in a checkpoint before the next checkpoint is triggered.

    In general, I find that threshold to be somewhere around 2-5s worth
    of data writeback - enough to keep a good amount of write combining
    and the IO pipeline full as work is done, but no more.

    e.g. if your workload results in writeback rates of 500MB/s, then
    I'd be setting the dirty limit somewhere around 1-2GB as an initial
    guess. It's basically a simple trade off of buffering space for
    writeback latency. Some applications perform well with increased
    buffering space (e.g. 10-20s of writeback) while others perform
    better with extremely low writeback latency (e.g. 0.5-1s).
 

Some of this may have been addressed in recent changes with IO-less dirty
throttling. When considering stalls related to excessive IO it will be
important to check if the kernel was later than 3.2 and what the underlying
filesystem was.

Again, it really should be possible to demonstrate this with a test case,
one driven by pgbench maybe? Workload would generate a bunch of test data,
dirty a large percentage of it and try to sync. Metrics would be measuring
average read-only query latency when reading in parallel to the write,
average latencies from the underlying storage, IO queue lengths etc and
comparing default IO scheduler with deadline or noop.

NUMA Optimisations
------------------

The primary one that showed up was zone_reclaim_mode. Enabling that parameter
is a disaster for many workloads and apparently Postgres is one. It might
be time to revisit leaving that thing disabled by default and explicitly
requiring that NUMA-aware workloads that are correctly partitioned enable it.
Otherwise NUMA considerations are not that much of a concern right now.
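Until the default changes, an application-side sanity check is cheap. The helper below is our own sketch, not an existing PostgreSQL check: it reads the sysctl so a deployment can warn when zone_reclaim_mode is enabled.

```c
/* Sketch of a startup sanity check: report vm.zone_reclaim_mode so a
 * deployment can warn when it is enabled, since that setting is known
 * to hurt database workloads.  Returns the current mode, or -1 if the
 * sysctl cannot be read (non-NUMA kernels, restricted environments). */
#include <stdio.h>

int zone_reclaim_mode(void)
{
    FILE *f = fopen("/proc/sys/vm/zone_reclaim_mode", "r");
    int mode;

    if (!f)
        return -1;
    if (fscanf(f, "%d", &mode) != 1)
        mode = -1;
    fclose(f);
    return mode;
}
```

Any value other than 0 would be worth flagging on a machine running Postgres.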

Direct IO, buffered IO and double buffering
-------------------------------------------

The general position of Postgres is that the kernel knows more about
storage geometries and IO scheduling than an application can or should
know. It would be preferred to have interfaces that allow Postgres to
give hints to the kernel about how and when data should be written back.
The alternative is exposing details of the underlying storage to userspace
so Postgres can implement a full IO scheduler using direct IO. It has
been asserted on the kernel side that the optimal IO size and alignment
should be all the details that are required in the majority of cases.
While some database vendors have taken this option, the Postgres
community does not have the resources to implement something of this
magnitude.

I can understand Postgres preference for using the kernel to handle these
details for them. They are a cross-platform application and the kernel
should not be washing its hands of the problem and hiding behind direct
IO as a solution. Ted Ts'o summarises the issues as
    The high order bit is what's the right thing to do when database
    programmers come to kernel engineers saying, we want to do <FOO>
    and the performance sucks.  Do we say, "Use O_DIRECT, dummy",
    notwithstanding Linus's past comments on the issue?  Or do we have
    some general design principles that we tell database engineers that
    they should do for better performance, and then all developers for
    all of the file systems can then try to optimize for a set of new
    API's, or recommended ways of using the existing API's?
 

In an effort to avoid depending on direct IO there are some proposals
and/or wishlist items
  1. Reclaim pages only under reclaim pressure but then prioritise their
     reclaim. This avoids a problem where fadvise(DONTNEED) discards a
     page only to have a read/write or WILLNEED hint immediately read
     it back in again. The requirements are similar to the volatile
     range hinting but they do not use mmap() currently and would need
     a file-descriptor based interface. Robert Haas had some concerns
     with the general concept and described them thusly

        This is an interesting idea but it stinks of impracticality.
        Essentially when the last buffer pin on a page is dropped we'd
        have to mark it as discardable, and then the next person wanting
        to pin it would have to check whether it's still there.  But the
        system call overhead of calling vrange() every time the last pin
        on a page was dropped would probably hose us.

        Well, I guess it could be done lazily: make periodic sweeps through
        shared_buffers, looking for pages that haven't been touched in a
        while, and vrange() them.  That's quite a bit of new mechanism,
        but in theory it could work out to a win.  vrange() would have
        to scale well to millions of separate ranges, though.  Will it?
        And a lot depends on whether the kernel makes the right decision
        about whether to chunk data from our vrange() vs. any other page
        it could have reclaimed.
  2. Only writeback some pages if explicitly synced or dirty limits
     are violated. Jeff Janes states that he has problems with large
     temporary files that generate IO spikes when the data starts hitting
     the platter even though the data does not need to be preserved. Jim
     Nasby agreed and commented that he "also frequently see this, and it
     has an even larger impact if pgsql_tmp is on the same filesystem as
     WAL. Which *theoretically* shouldn't matter with a BBU controller,
     except that when the kernel suddenly decides your *temporary*
     data needs to hit the media you're screwed."

     One proposal that may address this is

        Allow a process with an open fd to hint that pages managed by
        this inode will have dirty-sticky pages. Pages will be ignored
        by dirty background writing unless there is an fsync call or
        dirty page limits are hit. The hint is cleared when no process
        has the file open.
 
  3. Only writeback pages if explicitly synced. Postgres has strict write
     ordering requirements. In the words of Tom Lane -- "As things currently
     stand, we dirty the page in our internal buffers, and we don't write
     it to the kernel until we've written and fsync'd the WAL data that
     needs to get to disk first". mmap() would avoid double buffering but
     it has no control over the write ordering which is a show-stopper.
     As Andres Freund described;

        Postgres' durability works by guaranteeing that our journal
        entries (called WAL := Write Ahead Log) are written & synced to
        disk before the corresponding entries of tables and indexes reach
        the disk. That also allows to group together many random-writes
        into a few contiguous writes fdatasync()ed at once. Only during
        a checkpointing phase the big bulk of the data is then (slowly,
        in the background) synced to disk. I don't see how that's doable
        with holding all pages in mmap()ed buffers.

     There are also concerns there would be an absurd number of mappings.

     The problem with this sort of dirty pinning interface is that it
     can deadlock the kernel if all dirty pages in the system cannot be
     written back by the kernel. James Bottomley stated

        No, I'm sorry, that's never going to be possible.  No user space
        application has all the facts.  If we give you an interface to
        force unconditional holding of dirty pages in core you'll livelock
        the system eventually because you made a wrong decision to hold
        too many dirty pages.

     However, it was very clearly stated that the write ordering is
     critical. If the kernel breaks the requirement then the database
     can get trashed in the event of a power failure.

     This led to a discussion on write barriers which the kernel uses
     internally but there are scaling concerns both with the number of
     constraints that would exist and the requirement that Postgres use
     mapped buffers.

     I did not bring it up on the list but one possibility is that the
     kernel would allow a limited number of pinned dirty pages. If a
     process tries to dirty more pages without cleaning some of them
     we could either block it or fail the write. The number of dirty
     pages would be controlled by limits and we'd require that the limit
     be lower than dirty_ratio|bytes or be at most 50% of that value.
     There are unclear semantics about what happens if the process
     crashes.
 
  4. Allow userspace process to insert data into the kernel page cache
     without marking the page dirty. This would allow the application
     to request that the OS use the application copy of data as page
     cache if it does not have a copy already. The difficulty here
     is that the application has no way of knowing if something else
     has altered the underlying file in the meantime via something like
     direct IO. Granted, such activity has probably corrupted the database
     already but initial reactions are that this is not a safe interface
     and there are coherency concerns.

     Dave Chinner asked "why, exactly, do you even need the kernel page
     cache here?" when Postgres already knows how and when data should
     be written back to disk. The answer boiled down to "To let kernel do
     the job that it is good at, namely managing the write-back of dirty
     buffers to disk and to manage (possible) read-ahead pages". Postgres
     has some ordering requirements but it does not want to be responsible
     for all cache replacement and IO scheduling. Hannu Krosing summarised
     it best as

        Again, as said above the linux file system is doing fine. What we
        want is a few ways to interact with it to let it do even better
        when working with Postgres by telling it some stuff it otherwise
        would have to second guess and by sometimes giving it back some
        cache pages which were copied away for potential modifying but
        ended up clean in the end.

        And let the linux kernel decide if and how long to keep these
        pages in its cache using its superior knowledge of the disk
        subsystem and about what else is going on in the system in
        general.
 
  5. Allow copy-on-write of page-cache pages to anonymous. This would limit
     the double ram usage to some extent. It's not as simple as having a
     MAP_PRIVATE mapping of a file-backed page because presumably they want
     this data in a shared buffer shared between Postgres processes. The
     implementation details of something like this are hairy because it's
     mmap()-like but not mmap() as it does not have the same writeback
     semantics due to the write ordering requirements Postgres has for
     database integrity.

     Completely nuts and this was not mentioned on the list, but arguably
     you could try implementing something like this as a character device
     that allows MAP_SHARED, with ioctls controlling what file and offset
     backs pages within the mapping. A new mapping would be forced resident
     and read-only. A write would COW the page. It's a crazy way of doing
     something like this but avoids a lot of overhead. Even considering
     the stupid solution might make the general solution a bit more
     obvious.

     For reference, Tom Lane comprehensively described the problems with
     mmap at http://www.Postgres.org/message-id/

     There were some variants of how something like this could be achieved
     but no finalised proposal at the time of writing.

Not all of these suggestions are viable but some are more viable than
others. Ultimately we would still need a test case showing the benefit
even if that depends on a Postgres patch taking advantage of a new
feature.

-- 
Mel Gorman
SUSE Labs



From:
Heikki Linnakangas
Date:

On 01/15/2014 07:50 AM, Dave Chinner wrote:
> However, the first problem is dealing with the IO storm problem on
> fsync. Then we can measure the effect of spreading those writes out
> in time and determine what triggers read starvations (if they are
> apparent). Then we can look at whether IO scheduling tweaks or
> whether blk-io throttling solves those problems. Or whether
> something else needs to be done to make it work in environments
> where problems are manifesting.
>
> FWIW [and I know you're probably sick of hearing this by now], but
> the blk-io throttling works almost perfectly with applications that
> use direct IO.....

For checkpoint writes, direct I/O actually would be reasonable. 
Bypassing the OS cache is a good thing in that case - we don't want the 
written pages to evict other pages from the OS cache, as we already have 
them in the PostgreSQL buffer cache.

Writing one page at a time with O_DIRECT from a single process might be 
quite slow, so we'd probably need to use writev() or asynchronous I/O to 
work around that.

We'd still need to issue an fsync() to flush any already-written pages 
from the OS cache to disk, though.

- Heikki



From:
Robert Haas
Date:

On Wed, Jan 15, 2014 at 4:35 AM, Jan Kara <> wrote:
> Filesystems could in theory provide facility like atomic write (at least up
> to a certain size say in MB range) but it's not so easy and when there are
> no strong usecases fs people are reluctant to make their code more complex
> unnecessarily. OTOH without widespread atomic write support I understand
> application developers have similar stance. So it's kind of chicken and egg
> problem. BTW, e.g. ext3/4 has quite a bit of the infrastructure in place
> due to its data=journal mode so if someone on the PostgreSQL side wanted to
> research on this, knitting some experimental ext4 patches should be doable.

Atomic 8kB writes would improve performance for us quite a lot.  Full
page writes to WAL are very expensive.  I don't remember what
percentage of write-ahead log traffic that accounts for, but it's not
small.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



From:
Tom Lane
Date:

Heikki Linnakangas <> writes:
> On 01/15/2014 07:50 AM, Dave Chinner wrote:
>> FWIW [and I know you're probably sick of hearing this by now], but
>> the blk-io throttling works almost perfectly with applications that
>> use direct IO.....

> For checkpoint writes, direct I/O actually would be reasonable. 
> Bypassing the OS cache is a good thing in that case - we don't want the 
> written pages to evict other pages from the OS cache, as we already have 
> them in the PostgreSQL buffer cache.

But in exchange for that, we'd have to deal with selecting an order to
write pages that's appropriate depending on the filesystem layout,
other things happening in the system, etc etc.  We don't want to build
an I/O scheduler, IMO, but we'd have to.

> Writing one page at a time with O_DIRECT from a single process might be 
> quite slow, so we'd probably need to use writev() or asynchronous I/O to 
> work around that.

Yeah, and if the system has multiple spindles, we'd need to be issuing
multiple O_DIRECT writes concurrently, no?

What we'd really like for checkpointing is to hand the kernel a boatload
(several GB) of dirty pages and say "how about you push all this to disk
over the next few minutes, in whatever way seems optimal given the storage
hardware and system situation.  Let us know when you're done."  Right now,
because there's no way to negotiate such behavior, we're reduced to having
to dribble out the pages (in what's very likely a non-optimal order) and
hope that the kernel is neither too lazy nor too aggressive about cleaning
dirty pages in its caches.
        regards, tom lane



From:
Robert Haas
Date:

On Wed, Jan 15, 2014 at 4:44 AM, Mel Gorman <> wrote:
> That applies if the dirty pages are forced to be kept dirty. You call
> this pinned but pinned has special meaning so I would suggest calling it
> something like dirty-sticky pages. It could be the case that such hinting
> will have the pages excluded from dirty background writing but can still
> be cleaned if dirty limits are hit or if fsync is called. It's a hint,
> not a forced guarantee.
>
> It's still a hand grenade because if this is tracked on a per-page basis
> because of what happens if the process crashes? Those pages stay dirty
> potentially forever. An alternative would be to track this on a per-inode
> instead of per-page basis. The hint would only exist where there is an
> open fd for that inode.  Treat it as a privileged call with a sysctl
> controlling how many dirty-sticky pages can exist in the system with the
> information presented during OOM kills and maybe it starts becoming a bit
> more manageable. Dirty-sticky pages are not guaranteed to stay dirty
> until userspace action, the kernel just stays away until there are no
> other sensible options.

I think this discussion is vividly illustrating why this whole line of
inquiry is a pile of fail.  If all the processes that have the file
open crash, the changes have to be *thrown away* not written to disk
whenever the kernel likes.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



From:
Mel Gorman
Date:

On Wed, Jan 15, 2014 at 10:16:27AM -0500, Robert Haas wrote:
> On Wed, Jan 15, 2014 at 4:44 AM, Mel Gorman <> wrote:
> > That applies if the dirty pages are forced to be kept dirty. You call
> > this pinned but pinned has special meaning so I would suggest calling it
> > something like dirty-sticky pages. It could be the case that such hinting
> > will have the pages excluded from dirty background writing but can still
> > be cleaned if dirty limits are hit or if fsync is called. It's a hint,
> > not a forced guarantee.
> >
> > It's still a hand grenade because if this is tracked on a per-page basis
> > because of what happens if the process crashes? Those pages stay dirty
> > potentially forever. An alternative would be to track this on a per-inode
> > instead of per-page basis. The hint would only exist where there is an
> > open fd for that inode.  Treat it as a privileged call with a sysctl
> > controlling how many dirty-sticky pages can exist in the system with the
> > information presented during OOM kills and maybe it starts becoming a bit
> > more manageable. Dirty-sticky pages are not guaranteed to stay dirty
> > until userspace action, the kernel just stays away until there are no
> > other sensible options.
> 
> I think this discussion is vividly illustrating why this whole line of
> inquiry is a pile of fail.  If all the processes that have the file
> open crash, the changes have to be *thrown away* not written to disk
> whenever the kernel likes.
> 

I realise that now and sorry for the noise.

I later read the parts of the thread that covered the strict ordering
requirements and in a summary mail I split the requirements in two. In one,
there are dirty sticky pages that the kernel should not writeback unless
it has no other option or fsync is called. This may be suitable for large
temporary files that Postgres does not necessarily want to hit the platter
but also does not have strict ordering requirements for. The second is have
pages that are strictly kept dirty until the application syncs them. An
unbounded number of these pages would blow up but maybe bounds could be
placed on it. There are no solid conclusions on that part yet.

-- 
Mel Gorman
SUSE Labs



From:
Robert Haas
Date:

On Wed, Jan 15, 2014 at 10:53 AM, Mel Gorman <> wrote:
> I realise that now and sorry for the noise.
>
> I later read the parts of the thread that covered the strict ordering
> requirements and in a summary mail I split the requirements in two. In one,
> there are dirty sticky pages that the kernel should not writeback unless
> it has no other option or fsync is called. This may be suitable for large
> temporary files that Postgres does not necessarily want to hit the platter
> but also does not have strict ordering requirements for. The second is have
> pages that are strictly kept dirty until the application syncs them. An
> unbounded number of these pages would blow up but maybe bounds could be
> placed on it. There are no solid conclusions on that part yet.

I think that the bottom line is that we're not likely to make massive
changes to the way that we do block caching now.  Even if some other
scheme could work much better on Linux (and so far I'm unconvinced
that any of the proposals made here would in fact work much better),
we aim to be portable to Windows as well as other UNIX-like systems
(BSD, Solaris, etc.).  So using completely Linux-specific technology
in an overhaul of our block cache seems to me to have no future.

On the other hand, giving the kernel hints about what we're doing that
would enable it to be smarter seems to me to have a lot of potential.
Ideas so far mentioned include:

- Hint that we're going to do an fsync on file X at time Y, so that
the kernel can schedule the write-out to complete right around that
time.
- Hint that a block is a good candidate for reclaim without actually
purging it if there's no memory pressure.
- Hint that a page we modify in our cache should be dropped from the
kernel cache.
- Hint that a page we write back to the operating system should be
dropped from the kernel cache after the I/O completes.

It's hard to say which of these ideas will work well without testing
them, and the overhead of the extra system calls might be significant
in some of those cases, but it seems a promising line of inquiry.

And the idea of being able to do an 8kB atomic write with OS support
so that we don't have to save full page images in our write-ahead log
to cover the "torn page" scenario seems very intriguing indeed.  If
that worked well, it would be a *big* deal for us.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



From:
Stephen Frost
Date:

* Claudio Freire () wrote:
> Yes, that's basically zero-copy reads.
>
> It could be done. The kernel can remap the page to the physical page
> holding the shared buffer and mark it read-only, then expire the
> buffer and transfer ownership of the page if any page fault happens.
>
> But that incurrs:
>  - Page faults, lots
>  - Hugely bloated mappings, unless KSM is somehow leveraged for this

The page faults might be a problem but might be worth it.  Bloated
mappings sounds like a real issue though.

> And there's a nice bingo. Had forgotten about KSM. KSM could help lots.
>
> I could try to see of madvising shared_buffers as mergeable helps. But
> this should be an automatic case of KSM - ie, when reading into a
> page-aligned address, the kernel should summarily apply KSM-style
> sharing without hinting. The current madvise interface puts the burden
> of figuring out what duplicates what on the kernel, but postgres
> already knows.

I'm certainly curious as to if KSM could help here, but on Ubuntu 12.04
with 3.5.0-23-generic, it's not doing anything with just PG running.
The page here: http://www.linux-kvm.org/page/KSM seems to indicate why:

----
KSM is a memory-saving de-duplication feature, that merges anonymous
(private) pages (not pagecache ones).
----

Looks like it won't merge between pagecache and private/application
memory?  Or is it just that we're not madvise()'ing the shared buffers
region?  I'd be happy to test doing that, if there's a chance it'll
actually work..
Thanks,
    Stephen

From:
Claudio Freire
Date:

On Wed, Jan 15, 2014 at 1:35 PM, Stephen Frost <> wrote:
>> And there's a nice bingo. Had forgotten about KSM. KSM could help lots.
>>
>> I could try to see of madvising shared_buffers as mergeable helps. But
>> this should be an automatic case of KSM - ie, when reading into a
>> page-aligned address, the kernel should summarily apply KSM-style
>> sharing without hinting. The current madvise interface puts the burden
>> of figuring out what duplicates what on the kernel, but postgres
>> already knows.
>
> I'm certainly curious as to if KSM could help here, but on Ubuntu 12.04
> with 3.5.0-23-generic, it's not doing anything with just PG running.
> The page here: http://www.linux-kvm.org/page/KSM seems to indicate why:
>
> ----
> KSM is a memory-saving de-duplication feature, that merges anonymous
> (private) pages (not pagecache ones).
> ----
>
> Looks like it won't merge between pagecache and private/application
> memory?  Or is it just that we're not madvise()'ing the shared buffers
> region?  I'd be happy to test doing that, if there's a chance it'll
> actually work..


Yes, it's only *intended* for merging private memory.

But, still, the implementation is very similar to what postgres needs:
sharing a physical page for two distinct logical pages, efficiently,
with efficient copy-on-write.

So it'd be just a matter of removing that limitation regarding page
cache and shared pages.

If you asked me, I'd implement it as copy-on-write on the page cache
(not the user page). That ought to be low-overhead.



From:
Tom Lane
Date:

Robert Haas <> writes:
> I think that the bottom line is that we're not likely to make massive
> changes to the way that we do block caching now.  Even if some other
> scheme could work much better on Linux (and so far I'm unconvinced
> that any of the proposals made here would in fact work much better),
> we aim to be portable to Windows as well as other UNIX-like systems
> (BSD, Solaris, etc.).  So using completely Linux-specific technology
> in an overhaul of our block cache seems to me to have no future.

Unfortunately, I have to agree with this.  Even if there were a way to
merge our internal buffers with the kernel's, it would surely be far
too invasive to coexist with buffer management that'd still work on
more traditional platforms.

But we could add hint calls, or modify the I/O calls we use, and that
ought to be a reasonably localized change.

> And the idea of being able to do an 8kB atomic write with OS support
> so that we don't have to save full page images in our write-ahead log
> to cover the "torn page" scenario seems very intriguing indeed.  If
> that worked well, it would be a *big* deal for us.

+1.  That would be a significant win, and trivial to implement, since
we already have a way to switch off full-page images for people who
trust their filesystems to do atomic writes.  It's just that safe
use of that switch isn't widely possible ...
        regards, tom lane



From:
Claudio Freire
Date:

On Wed, Jan 15, 2014 at 2:52 PM, Tom Lane <> wrote:
> Robert Haas <> writes:
>> I think that the bottom line is that we're not likely to make massive
>> changes to the way that we do block caching now.  Even if some other
>> scheme could work much better on Linux (and so far I'm unconvinced
>> that any of the proposals made here would in fact work much better),
>> we aim to be portable to Windows as well as other UNIX-like systems
>> (BSD, Solaris, etc.).  So using completely Linux-specific technology
>> in an overhaul of our block cache seems to me to have no future.
>
> Unfortunately, I have to agree with this.  Even if there were a way to
> merge our internal buffers with the kernel's, it would surely be far
> too invasive to coexist with buffer management that'd still work on
> more traditional platforms.
>
> But we could add hint calls, or modify the I/O calls we use, and that
> ought to be a reasonably localized change.


That's what's pretty nice with the zero-copy read idea. It's almost
transparent. You read to a page-aligned address, and it works. The
only code change would be enabling zero-copy reads, which I'm not sure
will be low-overhead enough to leave enabled by default.



From:
Stephen Frost
Date:

* Claudio Freire () wrote:
> But, still, the implementation is very similar to what postgres needs:
> sharing a physical page for two distinct logical pages, efficiently,
> with efficient copy-on-write.

Agreed, except that KSM seems like it'd be slow/lazy about it and I'm
guessing there's a reason the pagecache isn't included normally..

> So it'd be just a matter of removing that limitation regarding page
> cache and shared pages.

Any idea why that limitation is there?

> If you asked me, I'd implement it as copy-on-write on the page cache
> (not the user page). That ought to be low-overhead.

Not entirely sure I'm following this- if it's a shared page, it doesn't
matter who starts writing to it, as soon as that happens, it need to get
copied.  Perhaps you mean that the application should keep the
"original" and that the page-cache should get the "copy" (or, really,
perhaps just forget about the page existing at that point- we won't want
it again...).

Would that be a way to go, perhaps?  This does go back to the "make it
act like mmap, but not *be* mmap", but the idea would be:

open(..., O_ZEROCOPY_READ)
read() - Goes to PG's shared buffers, pagecache and PG share the page
page fault (PG writes to it) - pagecache forgets about the page
write() / fsync() - operate as normal

The differences here from O_DIRECT are that the pagecache will keep the
page while clean (absolutely valuable from PG's perspective- we might
have to evict the page from shared buffers sooner than the kernel does),
and the write()'s happen at the kernel's pace, allowing for
write-combining, etc, until an fsync() happens, of course.

This isn't the "big win" of dealing with I/O issues during checkpoints
that we'd like to see, but it certainly feels like it'd be an
improvement over the current double-buffering situation at least.
Thanks,
    Stephen

From:
Claudio Freire
Date:

On Wed, Jan 15, 2014 at 3:41 PM, Stephen Frost <> wrote:
> * Claudio Freire () wrote:
>> But, still, the implementation is very similar to what postgres needs:
>> sharing a physical page for two distinct logical pages, efficiently,
>> with efficient copy-on-write.
>
> Agreed, except that KSM seems like it'd be slow/lazy about it and I'm
> guessing there's a reason the pagecache isn't included normally..

KSM does an active de-duplication. That's slow. This would be
leveraging KSM structures in the kernel (page sharing) but without all
the de-duplication logic.

>
>> So it'd be just a matter of removing that limitation regarding page
>> cache and shared pages.
>
> Any idea why that limitation is there?

No, but I'm guessing it's because nobody bothered to implement the
required copy-on-write in the page cache, which would be a PITA to
write - think of all the complexities with privilege checks and
everything - even though the benefits for many kinds of applications
would be important.

>> If you asked me, I'd implement it as copy-on-write on the page cache
>> (not the user page). That ought to be low-overhead.
>
> Not entirely sure I'm following this- if it's a shared page, it doesn't
> matter who starts writing to it, as soon as that happens, it need to get
> copied.  Perhaps you mean that the application should keep the
> "original" and that the page-cache should get the "copy" (or, really,
> perhaps just forget about the page existing at that point- we won't want
> it again...).
>
> Would that be a way to go, perhaps?  This does go back to the "make it
> act like mmap, but not *be* mmap", but the idea would be:
> open(..., O_ZEROCOPY_READ)
> read() - Goes to PG's shared buffers, pagecache and PG share the page
> page fault (PG writes to it) - pagecache forgets about the page
> write() / fsync() - operate as normal

Yep.



From:
Jan Kara
Date:

On Wed 15-01-14 14:38:44, Hannu Krosing wrote:
> On 01/15/2014 02:01 PM, Jan Kara wrote:
> > On Wed 15-01-14 12:16:50, Hannu Krosing wrote:
> >> On 01/14/2014 06:12 PM, Robert Haas wrote:
> >>> This would be pretty similar to copy-on-write, except
> >>> without the copying. It would just be
> >>> forget-from-the-buffer-pool-on-write. 
> >> +1
> >>
> >> A version of this could probably already be implement using MADV_DONTNEED
> >> and MADV_WILLNEED
> >>
> >> Thet is, just after reading the page in, use MADV_DONTNEED on it. When
> >> evicting
> >> a clean page, check that it is still in cache and if it is, then
> >> MADV_WILLNEED it.
> >>
> >> Another nice thing to do would be dynamically adjusting kernel
> >> dirty_background_ratio
> >> and other related knobs in real time based on how many buffers are dirty
> >> inside postgresql.
> >> Maybe in background writer.
> >>
> >> Question to LKM folks - will kernel react well to frequent changes to
> >> /proc/sys/vm/dirty_*  ?
> >> How frequent can they be (every few second? every second? 100Hz ?)
> >   So the question is what do you mean by 'react'. We check whether we
> > should start background writeback every dirty_writeback_centisecs (5s). We
> > will also check whether we didn't exceed the background dirty limit (and
> > wake writeback thread) when dirtying pages. However this check happens once
> > per several dirtied MB (unless we are close to dirty_bytes).
> >
> > When writeback is running we check roughly once per second (the logic is
> > more complex there but I don't think explaining details would be useful
> > here) whether we are below dirty_background_bytes and stop writeback in
> > that case.
> >
> > So changing dirty_background_bytes every few seconds should work
> > reasonably, once a second is pushing it and 100 Hz - no way. But I'd also
> > note that you have conflicting requirements on the kernel writeback. On one
> > hand you want checkpoint data to steadily trickle to disk (well, trickle
> > isn't exactly the proper word since if you need to checkpoing 16 GB every 5
> > minutes than you need a steady throughput of ~50 MB/s just for
> > checkpointing) so you want to set dirty_background_bytes low, on the other
> > hand you don't want temporary files to get to disk so you want to set
> > dirty_background_bytes high. 
> Is it possible to have more fine-grained control over writeback, like
> configuring dirty_background_bytes per file system / device (or even
> a file or a group of files) ?
  Currently it isn't possible to tune dirty_background_bytes per device
directly. However see below.

> If not, then how hard would it be to provide this ?
  We do track amount of dirty pages per device and the thread doing the
flushing is also per device. The thing is that currently we compute the
per-device background limit as dirty_background_bytes * p, where p is a
proportion of writeback happening on this device to total writeback in the
system (computed as floating average with exponential time-based backoff).
BTW, similarly maximum per-device dirty limit is derived from global
dirty_bytes in the same way. And you can also set bounds on the proportion
'p' in /sys/block/sda/bdi/{min,max}_ratio so in theory you should be able
to set fixed background limit for a device by setting matching min and max
proportions.
                            Honza
-- 
Jan Kara <>
SUSE Labs, CR
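
Jan's suggestion translates to sysfs writes like the following (the
device name "sda" and the values are illustrative; the ratios are
percentages of the global dirty thresholds and require root to set):

```shell
# Pin the proportion 'p' for one device by setting matching bounds,
# giving it an effectively fixed share of the global dirty limits.
echo 20 > /sys/block/sda/bdi/min_ratio
echo 20 > /sys/block/sda/bdi/max_ratio

# Global background writeback threshold in bytes; the per-device
# background limit is derived from this and 'p'.
echo 268435456 > /proc/sys/vm/dirty_background_bytes
```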



From:
Jeff Janes
Date:

On Wed, Jan 15, 2014 at 7:12 AM, Tom Lane <> wrote:
Heikki Linnakangas <> writes:
> On 01/15/2014 07:50 AM, Dave Chinner wrote:
>> FWIW [and I know you're probably sick of hearing this by now], but
>> the blk-io throttling works almost perfectly with applications that
>> use direct IO.....

> For checkpoint writes, direct I/O actually would be reasonable.
> Bypassing the OS cache is a good thing in that case - we don't want the
> written pages to evict other pages from the OS cache, as we already have
> them in the PostgreSQL buffer cache.

But in exchange for that, we'd have to deal with selecting an order to
write pages that's appropriate depending on the filesystem layout,
other things happening in the system, etc etc.  We don't want to build
an I/O scheduler, IMO, but we'd have to.

> Writing one page at a time with O_DIRECT from a single process might be
> quite slow, so we'd probably need to use writev() or asynchronous I/O to
> work around that.

Yeah, and if the system has multiple spindles, we'd need to be issuing
multiple O_DIRECT writes concurrently, no?

writev effectively does do that, doesn't it?  But they do have to be on the
same file handle, so that could be a problem.  I think we need something
like sorted checkpoints sooner or later, anyway.



What we'd really like for checkpointing is to hand the kernel a boatload
(several GB) of dirty pages and say "how about you push all this to disk
over the next few minutes, in whatever way seems optimal given the storage
hardware and system situation.  Let us know when you're done."  

And most importantly, "Also, please don't freeze up everything else in the process"
 
Cheers,

Jeff
From:
Tom Lane
Date:

Dave Chinner <> writes:
> On Wed, Jan 15, 2014 at 10:12:38AM -0500, Tom Lane wrote:
>> What we'd really like for checkpointing is to hand the kernel a boatload
>> (several GB) of dirty pages and say "how about you push all this to disk
>> over the next few minutes, in whatever way seems optimal given the storage
>> hardware and system situation.  Let us know when you're done."

> The issue there is that the kernel has other triggers for needing to
> clean data. We have no infrastructure to handle variable writeback
> deadlines at the moment, nor do we have any infrastructure to do
> roughly metered writeback of such files to disk. I think we could
> add it to the infrastructure without too much perturbation of the
> code, but as you've pointed out that still leaves the fact there's
> no obvious interface to configure such behaviour. Would it need to
> be persistent?

No, we'd be happy to re-request it during each checkpoint cycle, as
long as that wasn't an unduly expensive call to make.  I'm not quite
sure where such requests ought to "live" though.  One idea is to tie
them to file descriptors; but the data to be written might be spread
across more files than we really want to keep open at one time.
But the only other idea that comes to mind is some kind of global sysctl,
which would probably have security and permissions issues.  (One thing
that hasn't been mentioned yet in this thread, but maybe is worth pointing
out now, is that Postgres does not run as root, and definitely doesn't
want to.  So we don't want a knob that would require root permissions
to twiddle.)  We could probably live with serially checkpointing data
in sets of however-many-files-we-can-have-open, if file descriptors are
the place to keep the requests.
        regards, tom lane



From:
Tom Lane
Date:

Dave Chinner <> writes:
> On Wed, Jan 15, 2014 at 02:29:40PM -0800, Jeff Janes wrote:
>> And most importantly, "Also, please don't freeze up everything else in the
>> process"

> If you hand writeback off to the kernel, then writeback for memory
> reclaim needs to take precedence over "metered writeback". If we are
> low on memory, then cleaning dirty memory quickly to avoid ongoing
> allocation stalls, failures and potentially OOM conditions is far more
> important than anything else.....

I think you're in violent agreement, actually.  Jeff's point is exactly
that we'd rather the checkpoint deadline slid than that the system goes
to hell in a handbasket for lack of I/O cycles.  Here "metered" really
means "do it as a low-priority task".
        regards, tom lane



From:
Jim Nasby
Date:

On 1/15/14, 12:00 AM, Claudio Freire wrote:
> My completely unproven theory is that swapping is overwhelmed by
> near-misses. Ie: a process touches a page, and before it's actually
> swapped in, another process touches it too, blocking on the other
> process' read. But the second process doesn't account for that page
> when evaluating predictive models (ie: read-ahead), so the next I/O by
> process 2 is unexpected to the kernel. Then the same with 1. Etc... In
> essence, swap, by a fluke of its implementation, fails utterly to
> predict the I/O pattern, and results in far sub-optimal reads.
>
> Explicit I/O is free from that effect, all read calls are accountable,
> and that makes a difference.
>
> Maybe, if the kernel could be fixed in that respect, you could
> consider mmap'd files as a suitable form of temporary storage. But
> that would depend on the success and availability of such a fix/patch.

Another option is to consider some of the more "radical" ideas in this thread, but only for temporary data. Our write
sequencing and other needs are far less stringent for this stuff.
 
-- 
Jim C. Nasby, Data Architect                       
512.569.9461 (cell)                         http://jim.nasby.net



From:
Robert Haas
Date:

On Wed, Jan 15, 2014 at 7:22 PM, Dave Chinner <> wrote:
> No, I meant the opposite - in low memory situations, the system is
> going to go to hell in a handbasket because we are going to cause a
> writeback IO storm cleaning memory regardless of these IO
> priorities. i.e. there is no way we'll let "low priority writeback
> to avoid IO storms" cause OOM conditions to occur. That is, in OOM
> conditions, cleaning dirty pages becomes one of the highest priority
> tasks of the system....

I don't see that as a problem.  What we're struggling with today is
that, until we fsync(), the system is too lazy about writing back
dirty pages.  And then when we fsync(), it becomes very aggressive and
system-wide throughput goes into the tank.  What we're aiming to do
here is to get the writeback started sooner than it would otherwise
start so that it is spread out over a longer period of time.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



From:
Tom Lane
Date:

Dave Chinner <> writes:
> On Wed, Jan 15, 2014 at 07:08:18PM -0500, Tom Lane wrote:
>> No, we'd be happy to re-request it during each checkpoint cycle, as
>> long as that wasn't an unduly expensive call to make.  I'm not quite
>> sure where such requests ought to "live" though.  One idea is to tie
>> them to file descriptors; but the data to be written might be spread
>> across more files than we really want to keep open at one time.

> It would be a property of the inode, as that is how writeback is
> tracked and timed. Set and queried through a file descriptor,
> though - it's basically the same context that fadvise works
> through.

Ah, got it.  That would be fine on our end, I think.

>> We could probably live with serially checkpointing data
>> in sets of however-many-files-we-can-have-open, if file descriptors are
>> the place to keep the requests.

> Inodes live longer than file descriptors, but there's no guarantee
> that they live from one fd context to another. Hence my question
> about persistence ;)

I plead ignorance about what an "fd context" is.  However, if what you're
saying is that there's a small chance of the kernel forgetting the request
during normal system operation, I think we could probably tolerate that,
if the API is designed so that we ultimately do an fsync on the file
anyway.  The point of the hint would be to try to ensure that the later
fsync had little to do.  If sometimes it didn't work, well, that's life.
We're ahead of the game as long as it usually works.
        regards, tom lane



From:
Tom Lane
Date:

Robert Haas <> writes:
> I don't see that as a problem.  What we're struggling with today is
> that, until we fsync(), the system is too lazy about writing back
> dirty pages.  And then when we fsync(), it becomes very aggressive and
> system-wide throughput goes into the tank.  What we're aiming to do
> here is to get the writeback started sooner than it would otherwise
> start so that it is spread out over a longer period of time.

Yeah.  It's sounding more and more like the right semantics are to
give the kernel a hint that we're going to fsync these files later,
so it ought to get on with writing them anytime the disk has nothing
better to do.  I'm not sure if there's value in being specific about
how much later; that would probably depend on details of the scheduler
that I don't know.
        regards, tom lane



From:
Dave Chinner
Date:

On Wed, Jan 15, 2014 at 02:29:40PM -0800, Jeff Janes wrote:
> On Wed, Jan 15, 2014 at 7:12 AM, Tom Lane <> wrote:
> 
> > Heikki Linnakangas <> writes:
> > > On 01/15/2014 07:50 AM, Dave Chinner wrote:
> > >> FWIW [and I know you're probably sick of hearing this by now], but
> > >> the blk-io throttling works almost perfectly with applications that
> > >> use direct IO.....
> >
> > > For checkpoint writes, direct I/O actually would be reasonable.
> > > Bypassing the OS cache is a good thing in that case - we don't want the
> > > written pages to evict other pages from the OS cache, as we already have
> > > them in the PostgreSQL buffer cache.
> >
> > But in exchange for that, we'd have to deal with selecting an order to
> > write pages that's appropriate depending on the filesystem layout,
> > other things happening in the system, etc etc.  We don't want to build
> > an I/O scheduler, IMO, but we'd have to.
> >
> > > Writing one page at a time with O_DIRECT from a single process might be
> > > quite slow, so we'd probably need to use writev() or asynchronous I/O to
> > > work around that.
> >
> > Yeah, and if the system has multiple spindles, we'd need to be issuing
> > multiple O_DIRECT writes concurrently, no?
> >
> 
> writev effectively does do that, doesn't it?  But they do have to be on the
> same file handle, so that could be a problem.  I think we need something
> like sorted checkpoints sooner or later, anyway.

No, it doesn't. writev() allows you to supply multiple user buffers
for a single IO at a fixed offset. If the file is contiguous, then it
will be issued as a single IO. If you want concurrent DIO, then you
need to use multiple threads or AIO.

> > What we'd really like for checkpointing is to hand the kernel a boatload
> > (several GB) of dirty pages and say "how about you push all this to disk
> > over the next few minutes, in whatever way seems optimal given the storage
> > hardware and system situation.  Let us know when you're done."
> 
> And most importantly, "Also, please don't freeze up everything else in the
> process"

If you hand writeback off to the kernel, then writeback for memory
reclaim needs to take precedence over "metered writeback". If we are
low on memory, then cleaning dirty memory quickly to avoid ongoing
allocation stalls, failures and potentially OOM conditions is far more
important than anything else.....

Cheers,

Dave.
-- 
Dave Chinner




From:
Dave Chinner
Date:

On Wed, Jan 15, 2014 at 10:12:38AM -0500, Tom Lane wrote:
> Heikki Linnakangas <> writes:
> > On 01/15/2014 07:50 AM, Dave Chinner wrote:
> >> FWIW [and I know you're probably sick of hearing this by now], but
> >> the blk-io throttling works almost perfectly with applications that
> >> use direct IO.....
> 
> > For checkpoint writes, direct I/O actually would be reasonable. 
> > Bypassing the OS cache is a good thing in that case - we don't want the 
> > written pages to evict other pages from the OS cache, as we already have 
> > them in the PostgreSQL buffer cache.
> 
> But in exchange for that, we'd have to deal with selecting an order to
> write pages that's appropriate depending on the filesystem layout,
> other things happening in the system, etc etc.  We don't want to build
> an I/O scheduler, IMO, but we'd have to.

I don't see that as necessary - nobody else needs to do this with
direct IO. Indeed, if the application does ascending offset order
writeback from within a file, then it's replicating exactly what the
kernel page cache writeback does. If what the kernel does is good
enough for you, then I can't see how doing the same thing with
a background thread doing direct IO is going to need any special
help....

> > Writing one page at a time with O_DIRECT from a single process might be 
> > quite slow, so we'd probably need to use writev() or asynchronous I/O to 
> > work around that.
> 
> Yeah, and if the system has multiple spindles, we'd need to be issuing
> multiple O_DIRECT writes concurrently, no?
> 
> What we'd really like for checkpointing is to hand the kernel a boatload
> (several GB) of dirty pages and say "how about you push all this to disk
> over the next few minutes, in whatever way seems optimal given the storage
> hardware and system situation.  Let us know when you're done."

The issue there is that the kernel has other triggers for needing to
clean data. We have no infrastructure to handle variable writeback
deadlines at the moment, nor do we have any infrastructure to do
roughly metered writeback of such files to disk. I think we could
add it to the infrastructure without too much perturbation of the
code, but as you've pointed out that still leaves the fact there's
no obvious interface to configure such behaviour. Would it need to
be persistent?

Cheers,

Dave.
-- 
Dave Chinner




From:
Dave Chinner
Date:

On Wed, Jan 15, 2014 at 10:12:38AM -0500, Robert Haas wrote:
> On Wed, Jan 15, 2014 at 4:35 AM, Jan Kara <> wrote:
> > Filesystems could in theory provide facility like atomic write (at least up
> > to a certain size say in MB range) but it's not so easy and when there are
> > no strong usecases fs people are reluctant to make their code more complex
> > unnecessarily. OTOH without widespread atomic write support I understand
> > application developers have similar stance. So it's kind of chicken and egg
> > problem. BTW, e.g. ext3/4 has quite a bit of the infrastructure in place
> > due to its data=journal mode so if someone on the PostgreSQL side wanted to
> > research on this, knitting some experimental ext4 patches should be doable.
> 
> Atomic 8kB writes would improve performance for us quite a lot.  Full
> page writes to WAL are very expensive.  I don't remember what
> percentage of write-ahead log traffic that accounts for, but it's not
> small.

Essentially, the "atomic writes" will be journalled data,
so initially there is not going to be any difference in performance
between journalling the data in userspace and journalling it in the
filesystem journal. Indeed, it could be worse because the filesystem
journal is typically much smaller than a database WAL file, and it
will flush much more frequently and without the database having any
say in when that occurs.

AFAICT, we're stuck with sucky WAL until block layer and hardware
support atomic writes.

FWIW, I've certainly considered adding per-file data journalling
capabilities to XFS in the past. If we decide that this is the way
to proceed (i.e. as a stepping stone towards hardware atomic write
support), then I can go back to my notes from a few years ago and
see what still needs to be done to support it....

Cheers,

Dave.
-- 
Dave Chinner




From:
Dave Chinner
Date:

On Wed, Jan 15, 2014 at 07:13:27PM -0500, Tom Lane wrote:
> Dave Chinner <> writes:
> > On Wed, Jan 15, 2014 at 02:29:40PM -0800, Jeff Janes wrote:
> >> And most importantly, "Also, please don't freeze up everything else in the
> >> process"
> 
> > If you hand writeback off to the kernel, then writeback for memory
> > reclaim needs to take precedence over "metered writeback". If we are
> > low on memory, then cleaning dirty memory quickly to avoid ongoing
> > allocation stalls, failures and potentially OOM conditions is far more
> > important than anything else.....
> 
> I think you're in violent agreement, actually.  Jeff's point is exactly
> that we'd rather the checkpoint deadline slid than that the system goes
> to hell in a handbasket for lack of I/O cycles.  Here "metered" really
> means "do it as a low-priority task".

No, I meant the opposite - in low memory situations, the system is
going to go to hell in a handbasket because we are going to cause a
writeback IO storm cleaning memory regardless of these IO
priorities. i.e. there is no way we'll let "low priority writeback
to avoid IO storms" cause OOM conditions to occur. That is, in OOM
conditions, cleaning dirty pages becomes one of the highest priority
tasks of the system....

Cheers,

Dave.
-- 
Dave Chinner




From:
Dave Chinner
Date:

On Wed, Jan 15, 2014 at 07:08:18PM -0500, Tom Lane wrote:
> Dave Chinner <> writes:
> > On Wed, Jan 15, 2014 at 10:12:38AM -0500, Tom Lane wrote:
> >> What we'd really like for checkpointing is to hand the kernel a boatload
> >> (several GB) of dirty pages and say "how about you push all this to disk
> >> over the next few minutes, in whatever way seems optimal given the storage
> >> hardware and system situation.  Let us know when you're done."
> 
> > The issue there is that the kernel has other triggers for needing to
> > clean data. We have no infrastructure to handle variable writeback
> > deadlines at the moment, nor do we have any infrastructure to do
> > roughly metered writeback of such files to disk. I think we could
> > add it to the infrastructure without too much perturbation of the
> > code, but as you've pointed out that still leaves the fact there's
> > no obvious interface to configure such behaviour. Would it need to
> > be persistent?
> 
> No, we'd be happy to re-request it during each checkpoint cycle, as
> long as that wasn't an unduly expensive call to make.  I'm not quite
> sure where such requests ought to "live" though.  One idea is to tie
> them to file descriptors; but the data to be written might be spread
> across more files than we really want to keep open at one time.

It would be a property of the inode, as that is how writeback is
tracked and timed. Set and queried through a file descriptor,
though - it's basically the same context that fadvise works
through.

> But the only other idea that comes to mind is some kind of global sysctl,
> which would probably have security and permissions issues.  (One thing
> that hasn't been mentioned yet in this thread, but maybe is worth pointing
> out now, is that Postgres does not run as root, and definitely doesn't
> want to.  So we don't want a knob that would require root permissions
> to twiddle.)

I have assumed all along that requiring root to do stuff would be a
bad thing. :)

> We could probably live with serially checkpointing data
> in sets of however-many-files-we-can-have-open, if file descriptors are
> the place to keep the requests.

Inodes live longer than file descriptors, but there's no guarantee
that they live from one fd context to another. Hence my question
about persistence ;)

Cheers,

Dave.
-- 
Dave Chinner




From:
Robert Haas
Date:

On Wed, Jan 15, 2014 at 8:41 PM, Jan Kara <> wrote:
> On Wed 15-01-14 10:12:38, Robert Haas wrote:
>> On Wed, Jan 15, 2014 at 4:35 AM, Jan Kara <> wrote:
>> > Filesystems could in theory provide facility like atomic write (at least up
>> > to a certain size say in MB range) but it's not so easy and when there are
>> > no strong usecases fs people are reluctant to make their code more complex
>> > unnecessarily. OTOH without widespread atomic write support I understand
>> > application developers have similar stance. So it's kind of chicken and egg
>> > problem. BTW, e.g. ext3/4 has quite a bit of the infrastructure in place
>> > due to its data=journal mode so if someone on the PostgreSQL side wanted to
>> > research on this, knitting some experimental ext4 patches should be doable.
>>
>> Atomic 8kB writes would improve performance for us quite a lot.  Full
>> page writes to WAL are very expensive.  I don't remember what
>> percentage of write-ahead log traffic that accounts for, but it's not
>> small.
>   OK, and do you need atomic writes on per-IO basis or per-file is enough?
> It basically boils down to - is all or most of IO to a file going to be
> atomic or it's a smaller fraction?

The write-ahead log wouldn't need it, but data files writes would.  So
we'd need it a lot, but not for absolutely everything.

For any given file, we'd either care about writes being atomic, or we wouldn't.

> As Dave notes, unless there is HW support (which is coming with newest
> solid state drives), ext4/xfs will have to implement this by writing data
> to a filesystem journal and after transaction commit checkpointing them to
> a final location. Which is exactly what you do with your WAL logs so
> it's not clear it will be a performance win. But it is easy enough to code
> for ext4 that I'm willing to try...

Yeah, hardware support would be great.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



From:
knizhnik
Date:

I wonder if the kernel could sometimes provide a weaker version of fsync()
which does not force all pending data to be written immediately but just
serves as a write barrier, guaranteeing
that all write operations preceding it will be completed before any
subsequent operations.

It would allow implementation of weaker transaction models which do not
satisfy all ACID requirements (results of a committed transaction can
be lost on power failure or OS crash) but still preserve database
consistency. That is acceptable for many applications and can provide much
better performance.

Right now it is possible to implement something like this at the
application level using an asynchronous writer process to which all
write/sync operations are redirected.
But such a process can become a bottleneck reducing scalability of the
system, and the communication channels with it can cause
significant memory/CPU overhead.

In most DBMSes, including PostgreSQL, the transaction log and database data
are located in separate files, so such a write barrier should be
associated not with one file but with a set of files, or maybe the whole
file system.  Are there any problems in principle with
implementing or using such a filesystem write barrier?




From:
Jeremy Harris
Date:

On 14/01/14 22:23, Dave Chinner wrote:
> On Tue, Jan 14, 2014 at 11:40:38AM -0800, Kevin Grittner wrote:
>> To quantify that, in a production setting we were seeing pauses of
>> up to two minutes with shared_buffers set to 8GB and default dirty
>                                                         ^^^^^^^^^^^^^
>> page settings for Linux, on a machine with 256GB RAM and 512MB
>    ^^^^^^^^^^^^^
> There's your problem.
>
> By default, background writeback doesn't start until 10% of memory
> is dirtied, and on your machine that's 25GB of RAM. That's way too
> high for your workload.
>
> It appears to me that we are seeing large memory machines much more
> commonly in data centers - a couple of years ago 256GB RAM was only
> seen in supercomputers. Hence machines of this size are moving from
> "tweaking settings for supercomputers is OK" class to "tweaking
> settings for enterprise servers is not OK"....
>
> Perhaps what we need to do is deprecate dirty_ratio and
> dirty_background_ratio as the default values, move to the byte
> based values as the defaults, and cap them appropriately.  e.g.
> 10/20% of RAM for small machines down to a couple of GB for large
> machines....

<whisper>  Perhaps the kernel needs a dirty-amount control measured
in time units rather than pages (it being up to the kernel to
measure the achievable write rate)...
-- 
Cheers,   Jeremy



From:
Dave Chinner
Date:

On Wed, Jan 15, 2014 at 07:31:15PM -0500, Tom Lane wrote:
> Dave Chinner <> writes:
> > On Wed, Jan 15, 2014 at 07:08:18PM -0500, Tom Lane wrote:
> >> No, we'd be happy to re-request it during each checkpoint cycle, as
> >> long as that wasn't an unduly expensive call to make.  I'm not quite
> >> sure where such requests ought to "live" though.  One idea is to tie
> >> them to file descriptors; but the data to be written might be spread
> >> across more files than we really want to keep open at one time.
> 
> > It would be a property of the inode, as that is how writeback is
> > tracked and timed. Set and queried through a file descriptor,
> > though - it's basically the same context that fadvise works
> > through.
> 
> Ah, got it.  That would be fine on our end, I think.
> 
> >> We could probably live with serially checkpointing data
> >> in sets of however-many-files-we-can-have-open, if file descriptors are
> >> the place to keep the requests.
> 
> > Inodes live longer than file descriptors, but there's no guarantee
> > that they live from one fd context to another. Hence my question
> > about persistence ;)
> 
> I plead ignorance about what an "fd context" is.

open-to-close lifetime:
    fd = open("some/file", ....); ..... close(fd);

is a single context. If multiple fd contexts of the same file
overlap in lifetime, then the inode is constantly referenced and the
inode won't get reclaimed so the value won't get lost. However, if
there is no open fd context, there are no external references to the
inode so it can get reclaimed. Hence there's no guarantee that the
inode is present and the writeback property maintained across
close-to-open timeframes.

> We're ahead of the game as long as it usually works.

*nod*

Cheers,

Dave.
-- 
Dave Chinner




From:
Jan Kara
Date:

On Wed 15-01-14 10:12:38, Robert Haas wrote:
> On Wed, Jan 15, 2014 at 4:35 AM, Jan Kara <> wrote:
> > Filesystems could in theory provide facility like atomic write (at least up
> > to a certain size say in MB range) but it's not so easy and when there are
> > no strong usecases fs people are reluctant to make their code more complex
> > unnecessarily. OTOH without widespread atomic write support I understand
> > application developers have similar stance. So it's kind of chicken and egg
> > problem. BTW, e.g. ext3/4 has quite a bit of the infrastructure in place
> > due to its data=journal mode so if someone on the PostgreSQL side wanted to
> > research on this, knitting some experimental ext4 patches should be doable.
> 
> Atomic 8kB writes would improve performance for us quite a lot.  Full
> page writes to WAL are very expensive.  I don't remember what
> percentage of write-ahead log traffic that accounts for, but it's not
> small.
  OK, and do you need atomic writes on per-IO basis or per-file is enough?
It basically boils down to - is all or most of IO to a file going to be
atomic or it's a smaller fraction?

As Dave notes, unless there is HW support (which is coming with newest
solid state drives), ext4/xfs will have to implement this by writing data
to a filesystem journal and after transaction commit checkpointing them to
a final location. Which is exactly what you do with your WAL logs so
it's not clear it will be a performance win. But it is easy enough to code
for ext4 that I'm willing to try...
                            Honza
-- 
Jan Kara <>
SUSE Labs, CR



From:
Jan Kara
Date:

On Wed 15-01-14 21:37:16, Robert Haas wrote:
> On Wed, Jan 15, 2014 at 8:41 PM, Jan Kara <> wrote:
> > On Wed 15-01-14 10:12:38, Robert Haas wrote:
> >> On Wed, Jan 15, 2014 at 4:35 AM, Jan Kara <> wrote:
> >> > Filesystems could in theory provide facility like atomic write (at least up
> >> > to a certain size say in MB range) but it's not so easy and when there are
> >> > no strong usecases fs people are reluctant to make their code more complex
> >> > unnecessarily. OTOH without widespread atomic write support I understand
> >> > application developers have similar stance. So it's kind of chicken and egg
> >> > problem. BTW, e.g. ext3/4 has quite a bit of the infrastructure in place
> >> > due to its data=journal mode so if someone on the PostgreSQL side wanted to
> >> > research on this, knitting some experimental ext4 patches should be doable.
> >>
> >> Atomic 8kB writes would improve performance for us quite a lot.  Full
> >> page writes to WAL are very expensive.  I don't remember what
> >> percentage of write-ahead log traffic that accounts for, but it's not
> >> small.
> >   OK, and do you need atomic writes on per-IO basis or per-file is enough?
> > It basically boils down to - is all or most of IO to a file going to be
> > atomic or it's a smaller fraction?
> 
> The write-ahead log wouldn't need it, but data files writes would.  So
> we'd need it a lot, but not for absolutely everything.
> 
> For any given file, we'd either care about writes being atomic, or we
> wouldn't.
  OK, when you say that either all writes to a file should be atomic or
none of them should be, then can you try the following:
chattr +j <file>
 will turn on data journalling for <file> on ext3/ext4 filesystem.
Currently it *won't* guarantee the atomicity in all the cases but the
performance will be very similar as if it would. You might also want to
increase filesystem journal size with 'tune2fs -J size=XXX /dev/yyy' where
XXX is desired journal size in MB. Default is 128 MB I think but with
intensive data journalling you might want to have that in GB range. I'd be
interested in hearing what impact does turning 'atomic write' support
in PostgreSQL and using data journalling on ext4 have.
                            Honza
-- 
Jan Kara <>
SUSE Labs, CR



From:
Jeff Layton
Date:

On Wed, 15 Jan 2014 21:37:16 -0500
Robert Haas <> wrote:

> On Wed, Jan 15, 2014 at 8:41 PM, Jan Kara <> wrote:
> > On Wed 15-01-14 10:12:38, Robert Haas wrote:
> >> On Wed, Jan 15, 2014 at 4:35 AM, Jan Kara <> wrote:
> >> > Filesystems could in theory provide facility like atomic write (at least up
> >> > to a certain size say in MB range) but it's not so easy and when there are
> >> > no strong usecases fs people are reluctant to make their code more complex
> >> > unnecessarily. OTOH without widespread atomic write support I understand
> >> > application developers have similar stance. So it's kind of chicken and egg
> >> > problem. BTW, e.g. ext3/4 has quite a bit of the infrastructure in place
> >> > due to its data=journal mode so if someone on the PostgreSQL side wanted to
> >> > research on this, knitting some experimental ext4 patches should be doable.
> >>
> >> Atomic 8kB writes would improve performance for us quite a lot.  Full
> >> page writes to WAL are very expensive.  I don't remember what
> >> percentage of write-ahead log traffic that accounts for, but it's not
> >> small.
> >   OK, and do you need atomic writes on per-IO basis or per-file is enough?
> > It basically boils down to - is all or most of IO to a file going to be
> > atomic or it's a smaller fraction?
> 
> The write-ahead log wouldn't need it, but data files writes would.  So
> we'd need it a lot, but not for absolutely everything.
> 
> For any given file, we'd either care about writes being atomic, or we wouldn't.
> 

Just getting caught up on this thread. One thing that you're just now
getting to here is that the different types of files in the DB have
different needs.

It might be good to outline each type of file (WAL, data files, tmp
files), what sort of I/O patterns are typically done to them, and what
sort of "special needs" they have (atomicity or whatever). Then we
could treat each file type as a separate problem, which may make some
of these problems easier to solve.

For instance, typically a WAL would be fairly sequential I/O, whereas
the data files are almost certainly random. It may make sense to
consider DIO for some of these use-cases, even if it's not suitable
everywhere.

For tempfiles, it may make sense to consider housing those on tmpfs.
They wouldn't go to disk at all that way, but if there is mem pressure
they could get swapped out (maybe this is standard practice already --
I don't know).

> > As Dave notes, unless there is HW support (which is coming with newest
> > solid state drives), ext4/xfs will have to implement this by writing data
> > to a filesystem journal and after transaction commit checkpointing them to
> > a final location. Which is exactly what you do with your WAL logs so
> > it's not clear it will be a performance win. But it is easy enough to code
> > for ext4 that I'm willing to try...
> 
> Yeah, hardware support would be great.
> 


-- 
Jeff Layton <>



From:
Theodore Ts'o
Date:

On Wed, Jan 15, 2014 at 10:35:44AM +0100, Jan Kara wrote:
> Filesystems could in theory provide facility like atomic write (at least up
> to a certain size say in MB range) but it's not so easy and when there are
> no strong usecases fs people are reluctant to make their code more complex
> unnecessarily. OTOH without widespread atomic write support I understand
> application developers have similar stance. So it's kind of chicken and egg
> problem. BTW, e.g. ext3/4 has quite a bit of the infrastructure in place
> due to its data=journal mode so if someone on the PostgreSQL side wanted to
> research on this, knitting some experimental ext4 patches should be doable.

For the record, a researcher (plus his PhD student) at HP Labs actually
implemented a prototype based on ext3 which created an atomic write
facility.  It was good up to about 25% of the ext3 journal size (so, a
couple of MB), and it was used to research using persistent memory by
creating a persistent heap using standard in-memory data structures as
a replacement for using a database.

Their results showed that ext3 plus atomic write plus standard Java
associative arrays beat using SQLite.

It was a research prototype, so they didn't handle OOM kill
conditions, and they also didn't try benchmarking against a real
database instead of a toy one such as SQLite, but if someone
wants to experiment with atomic write, there are patches against ext3
that we can probably get from HP Labs.
                                    - Ted



From:
Jeff Janes
Date:

On Thu, Jan 16, 2014 at 3:23 PM, Dave Chinner <> wrote:
On Wed, Jan 15, 2014 at 06:14:18PM -0600, Jim Nasby wrote:
> On 1/15/14, 12:00 AM, Claudio Freire wrote:
> >My completely unproven theory is that swapping is overwhelmed by
> >near-misses. Ie: a process touches a page, and before it's
> >actually swapped in, another process touches it too, blocking on
> >the other process' read. But the second process doesn't account
> >for that page when evaluating predictive models (ie: read-ahead),
> >so the next I/O by process 2 is unexpected to the kernel. Then
> >the same with 1. Etc... In essence, swap, by a fluke of its
> >implementation, fails utterly to predict the I/O pattern, and
> >results in far sub-optimal reads.
> >
> >Explicit I/O is free from that effect, all read calls are
> >accountable, and that makes a difference.
> >
> >Maybe, if the kernel could be fixed in that respect, you could
> >consider mmap'd files as a suitable form of temporary storage.
> >But that would depend on the success and availability of such a
> >fix/patch.
>
> Another option is to consider some of the more "radical" ideas in
> this thread, but only for temporary data. Our write sequencing and
> other needs are far less stringent for this stuff.  -- Jim C.

I suspect that a lot of the temporary data issues can be solved by
using tmpfs for temporary files....

Temp files can collectively reach hundreds of gigs.  So I would have to set up two temporary tablespaces, one on tmpfs and one on regular storage, and then remember to choose between them based on an estimate of how much temp space each connection will use (and hope I don't mess up the estimate and either get errors or render the server unresponsive).

So I just use regular storage, and pay the "insurance premium" of having some extraneous write IO.  It would be nice if the insurance premium were cheaper, though.  I think the IO storms during checkpoint syncs are definitely the more critical issue; this is just something nice to have which seemed to align with one of the comments.
 
Cheers,

Jeff
From:
Jeff Janes
Date:

On Wed, Jan 15, 2014 at 2:08 AM, Mel Gorman <> wrote:
On Tue, Jan 14, 2014 at 09:30:19AM -0800, Jeff Janes wrote:
> >
> > That could be something we look at. There are cases buried deep in the
> > VM where pages get shuffled to the end of the LRU and get tagged for
> > reclaim as soon as possible. Maybe you need access to something like
> > that via posix_fadvise to say "reclaim this page if you need memory but
> > leave it resident if there is no memory pressure" or something similar.
> > Not exactly sure what that interface would look like or offhand how it
> > could be reliably implemented.
> >
>
> I think the "reclaim this page if you need memory but leave it resident if
> there is no memory pressure" hint would be more useful for temporary
> working files than for what was being discussed above (shared buffers).
>  When I do work that needs large temporary files, I often see physical
> write IO spike but physical read IO does not.  I interpret that to mean
> that the temporary data is being written to disk to satisfy either
> dirty_expire_centisecs or dirty_*bytes, but the data remains in the FS
> cache and so disk reads are not needed to satisfy it.  So a hint that says
> "this file will never be fsynced, so please ignore dirty_*bytes and
> dirty_expire_centisecs" would be helpful.

It would be good to know if dirty_expire_centisecs or dirty ratio|bytes
were the problem here.
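For context, the closest primitive that exists today is posix_fadvise(POSIX_FADV_DONTNEED), which unconditionally drops clean cached pages rather than marking them "reclaim first, but only under pressure" -- which is exactly the gap being discussed. A minimal sketch of issuing the existing hint (Python used purely for illustration):

```python
import os
import tempfile

# Write some data, flush it, then tell the kernel the cache is expendable.
# POSIX_FADV_DONTNEED drops clean page-cache pages immediately; there is
# currently no flag meaning "keep resident unless under memory pressure".
fd, path = tempfile.mkstemp()
try:
    os.write(fd, b"x" * 65536)
    os.fsync(fd)  # DONTNEED only discards clean pages, so flush first
    os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)  # length 0 = whole file
finally:
    os.close(fd)
    os.unlink(path)
```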

Is there an easy way to tell?  I would guess it has to be at least dirty_expire_centisecs, if not both, as a very large sort operation takes a lot more than 30 seconds to complete.
 
An interface that forces a dirty page to stay dirty
regardless of the global system would be a major hazard. It potentially
allows the creator of the temporary file to stall all other processes
dirtying pages for an unbounded period of time.

Are the dirty ratio/bytes limits the mechanisms by which adequate clean memory is maintained?  I thought those were there just to put a limit on how long it would take to execute a sync call should one be issued, and there were other settings which said how much clean memory to maintain.  It should definitely write out the pages if it needs the memory for other things, just not write them out due to fear of how long it would take to sync it if a sync was called.  (And if it needs the memory, it should be able to write it out quickly as the writes would be mostly sequential, not random--although how the kernel can believe me that that will always be the case could be a problem)

 
I proposed in another part
of the thread a hint for open inodes to have the background writer thread
ignore dirty pages belonging to that inode. Dirty limits and fsync would
still be obeyed. It might also be workable for temporary files but the
proposal could be full of holes.

If calling fsync would fail with an error, would that lower the risk of DoS?
 

Your alternative here is to create a private anonymous mapping as they
are not subject to dirty limits. This is only a sensible option if the
temporary data is guaranteed to be relatively small. If the shared
buffers, page cache and your temporary data exceed the size of RAM then
data will get discarded or your temporary data will get pushed to swap
and performance will hit the floor.

PostgreSQL mainly uses temp files precisely when that guarantee is hard to make.  There is a pretty big range where the data is too big to be certain it will fit in memory, so we have to switch to a disk-friendly, mostly-sequential algorithm.  Yet it would still be nice to avoid the actual disk writes until we have observed that the data really is growing too big.
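The private anonymous mapping Mel refers to is ordinary process memory, so it is charged against swap rather than the dirty-page limits. A sketch (Python for illustration; the flags mirror C's mmap(MAP_PRIVATE | MAP_ANONYMOUS)):

```python
import mmap

# A private anonymous mapping: not file-backed, so writes to it are not
# counted against dirty_ratio/dirty_bytes.  Under memory pressure the
# pages go to swap instead of being written back through a filesystem,
# which is the "performance hits the floor" failure mode described above.
buf = mmap.mmap(-1, 16 * 1024 * 1024,
                flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
buf[:5] = b"hello"          # dirties anonymous pages, not page cache
assert buf[:5] == b"hello"
buf.close()
```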


Cheers,

Jeff
From:
Robert Haas
Date:

On Thu, Jan 16, 2014 at 7:31 PM, Dave Chinner <> wrote:
> But there's something here that I'm not getting - you're talking
> about a data set that you want to keep cache resident that is at
> least an order of magnitude larger than the cyclic 5-15 minute WAL
> dataset that ongoing operations need to manage to avoid IO storms.
> Where do these temporary files fit into this picture, how fast do
> they grow and why do they need to be so large in comparison to
> the ongoing modifications being made to the database?

I'm not sure you've got that quite right.  WAL is fsync'd very
frequently - on every commit, at the very least, and multiple times
per second even when there are no commits going on just to make sure we get
it all down to the platter as fast as possible.  The thing that causes
the I/O storm is the data file writes, which are performed either when
we need to free up space in PostgreSQL's internal buffer pool (aka
shared_buffers) or once per checkpoint interval (5-60 minutes) in any
event.  The point of this system is that if we crash, we're going to
need to replay all of the WAL to recover the data files to the proper
state; but we don't want to keep WAL around forever, so we checkpoint
periodically.  By writing all the data back to the underlying data
files, checkpoints render older WAL segments irrelevant, at which
point we can recycle those files before the disk fills up.

Temp files are something else again.  If PostgreSQL needs to sort a
small amount of data, like a kilobyte, it'll use quicksort.  But if it
needs to sort a large amount of data, like a terabyte, it'll use a
merge sort.[1]  The reason is of course that quicksort requires random
access to work well; if parts of quicksort's working memory get paged
out during the sort, your life sucks.  Merge sort (or at least our
implementation of it) is slower overall, but it only accesses the data
sequentially.  When we do a merge sort, we use files to simulate the
tapes that Knuth had in mind when he wrote down the algorithm.  If the
OS runs short of memory - because the sort is really big or just
because of other memory pressure - it can page out the parts of the
file we're not actively using without totally destroying performance.
It'll be slow, of course, because disks always are, but not like
quicksort would be if it started swapping.
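The run-building-plus-merge scheme described above can be sketched in a few lines (illustrative only -- PostgreSQL's actual tuplesort is far more sophisticated, with polyphase merging and a bounded workspace):

```python
import heapq
import os
import tempfile

def write_run(run):
    """Spill one sorted run (a "tape") to a temp file, one integer per line."""
    f = tempfile.NamedTemporaryFile("w", delete=False)
    f.writelines(f"{x}\n" for x in sorted(run))
    f.close()
    return f.name

def external_sort(items, run_size=4):
    """Sort by spilling fixed-size sorted runs, then streaming a merge.
    Each run is read strictly sequentially, which is why pages the OS
    evicted can be faulted back cheaply compared with quicksort's
    random access to its working memory."""
    runs, buf = [], []
    for item in items:
        buf.append(item)
        if len(buf) == run_size:    # workspace full: spill a sorted run
            runs.append(write_run(buf))
            buf = []
    if buf:
        runs.append(write_run(buf))
    files = [open(r) for r in runs]
    try:
        # heapq.merge consumes each run front-to-back, simulating tapes
        result = list(heapq.merge(*((int(line) for line in f) for f in files)))
    finally:
        for f in files:
            f.close()
        for r in runs:
            os.unlink(r)
    return result
```

With run_size=3, external_sort([5, 3, 8, 1, 9, 2, 7]) spills the runs [1, 3, 5], [2, 8, 9], [7] and merges them back into [1, 2, 3, 5, 7, 8, 9].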

I haven't actually experienced (or heard mentioned) the problem Jeff
Janes is mentioning where temp files get written out to disk too
aggressively; as mentioned before, the problems I've seen are usually
the other way - stuff not getting written out aggressively enough.
But it sounds plausible.  The OS only lets you set one policy, and if
you make that policy right for permanent data files that get
checkpointed it could well be wrong for temp files that get thrown
out.  Just stuffing the data on RAMFS will work for some
installations, but might not be good if you actually do want to
perform sorts whose size exceeds RAM.

BTW, I haven't heard anyone on pgsql-hackers say they'd be interested
in attending Collab on behalf of the PostgreSQL community.  Although
the prospect of a cross-country flight is a somewhat depressing
thought, it does sound pretty cool, so I'm potentially interested.  I
have no idea what the procedure is here for moving forward though,
especially since it sounds like there might be only one seat available
and I don't know who else may wish to sit in it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

[1] The threshold where we switch from quicksort to merge sort is a
configurable parameter.



From:
Greg Stark
Date:

On Wed, Jan 15, 2014 at 7:53 AM, Mel Gorman <> wrote:
> The second is have
> pages that are strictly kept dirty until the application syncs them. An
> unbounded number of these pages would blow up but maybe bounds could be
> placed on it. There are no solid conclusions on that part yet.

I think the interface would be subtler than that. The current
architecture is that if an individual process decides to evict one of
these pages it knows how much of the log needs to be flushed and
fsynced before it can do so and proceeds to do it itself. This is a
situation to be avoided as much as possible but there are workloads
where it's inevitable (the typical example is mass data loads).

There would need to be some similar interface by which the kernel could
force log pages to be written so that it can advance the epoch. Either
some way to wake Postgres up and inform it of the urgency, or better
yet, Postgres would just always be writing out pages without fsyncing
them, and would instead issue some other syscall to mark the points in
the log file that correspond to the write barriers that would unpin
these buffers.

Ted Ts'o was concerned this would all be a massive layering violation
and I have to admit that's a huge risk. It would take some clever API
engineering to come up with a clean set of primitives to express the kind
of ordering guarantees we need without being too tied to Postgres's
specific implementation. The reason I think it's more interesting
though is that Postgres's journalling and checkpointing architecture
is pretty bog-standard CS stuff and there are hundreds or thousands of
pieces of software out there that do pretty much the same work and
trying to do it efficiently with fsync or O_DIRECT is like working
with both hands tied to your feet.

-- 
greg



From:
Jeff Janes
Date:

On Thursday, January 16, 2014, Dave Chinner <> wrote:
On Thu, Jan 16, 2014 at 03:58:56PM -0800, Jeff Janes wrote:
> On Thu, Jan 16, 2014 at 3:23 PM, Dave Chinner <> wrote:
>
> > On Wed, Jan 15, 2014 at 06:14:18PM -0600, Jim Nasby wrote:
> > > On 1/15/14, 12:00 AM, Claudio Freire wrote:
> > > >My completely unproven theory is that swapping is overwhelmed by
> > > >near-misses. Ie: a process touches a page, and before it's
> > > >actually swapped in, another process touches it too, blocking on
> > > >the other process' read. But the second process doesn't account
> > > >for that page when evaluating predictive models (ie: read-ahead),
> > > >so the next I/O by process 2 is unexpected to the kernel. Then
> > > >the same with 1. Etc... In essence, swap, by a fluke of its
> > > >implementation, fails utterly to predict the I/O pattern, and
> > > >results in far sub-optimal reads.
> > > >
> > > >Explicit I/O is free from that effect, all read calls are
> > > >accountable, and that makes a difference.
> > > >
> > > >Maybe, if the kernel could be fixed in that respect, you could
> > > >consider mmap'd files as a suitable form of temporary storage.
> > > >But that would depend on the success and availability of such a
> > > >fix/patch.
> > >
> > > Another option is to consider some of the more "radical" ideas in
> > > this thread, but only for temporary data. Our write sequencing and
> > > other needs are far less stringent for this stuff.  -- Jim C.
> >
> > I suspect that a lot of the temporary data issues can be solved by
> > using tmpfs for temporary files....
> >
>
> Temp files can collectively reach hundreds of gigs.

So unless you have terabytes of RAM you're going to have to write
them back to disk.

If they turn out to be hundreds of gigs, then yes, they have to hit disk (at least on my hardware).  But if they are 10 gigs, then maybe not (depending on whether other people decide to do similar things at the same time I'm going to be doing it--something which is often hard to predict).  But now for every action I take, I have to decide: is this going to take 10 gigs or 14, and how absolutely certain am I?  And is someone else going to try something similar at the same time?  What a hassle.  It would be so much nicer to say "This is accessed sequentially, and will never be fsynced.  Maybe it will fit entirely in memory, maybe it won't; either way, you know what to do."

If I start out writing to tmpfs, I can't very easily change my mind 94% of the way through and decide to go somewhere else.  But the kernel, effectively, can.
 
But there's something here that I'm not getting - you're talking
about a data set that you want to keep cache resident that is at
least an order of magnitude larger than the cyclic 5-15 minute WAL
dataset that ongoing operations need to manage to avoid IO storms.

Those are mostly orthogonal issues.  The permanent files need to be fsynced on a regular basis, and might have gigabytes of data dirtied at random from within terabytes of underlying storage.  We had better start writing that pretty quickly, or when we do issue the fsyncs, the world will fall apart.

The temporary files will never need to be fsynced, and can be written out sequentially if they do ever need to be written out.  Better to delay this as much as feasible.


Where do these temporary files fit into this picture, how fast do
they grow and why do they need to be so large in comparison to
the ongoing modifications being made to the database?

The permanent files tend to hold things like "Jane Doe just bought a pair of green shoes from Hendrick Green Shoes Limited--record that, charge her credit card, and schedule delivery".  The temp files are more like "It is the end of the year; how many shoes have been purchased in each color from each manufacturer for each quarter over the last 6 years?"  So the temp files quickly manipulate data that has slowly been accumulating over a very long time, while the permanent files record the process of that accumulation.

If you are Amazon, of course, you have thousands of people who can keep two sets of records, one organized for fast update and one slightly delayed copy reorganized for fast analysis, and also do partial analysis on an ongoing basis and roll them up in ways that can be incrementally updated.  If you are not Amazon, it would be nice if one system did a better job of doing both things with the trade off between the two being dynamic and automatic.

Cheers,

Jeff
From:
Dave Chinner
Date:

On Wed, Jan 15, 2014 at 06:14:18PM -0600, Jim Nasby wrote:
> On 1/15/14, 12:00 AM, Claudio Freire wrote:
> >My completely unproven theory is that swapping is overwhelmed by
> >near-misses. Ie: a process touches a page, and before it's
> >actually swapped in, another process touches it too, blocking on
> >the other process' read. But the second process doesn't account
> >for that page when evaluating predictive models (ie: read-ahead),
> >so the next I/O by process 2 is unexpected to the kernel. Then
> >the same with 1. Etc... In essence, swap, by a fluke of its
> >implementation, fails utterly to predict the I/O pattern, and
> >results in far sub-optimal reads.
> >
> >Explicit I/O is free from that effect, all read calls are
> >accountable, and that makes a difference.
> >
> >Maybe, if the kernel could be fixed in that respect, you could
> >consider mmap'd files as a suitable form of temporary storage.
> >But that would depend on the success and availability of such a
> >fix/patch.
> 
> Another option is to consider some of the more "radical" ideas in
> this thread, but only for temporary data. Our write sequencing and
> other needs are far less stringent for this stuff.  -- Jim C.

I suspect that a lot of the temporary data issues can be solved by
using tmpfs for temporary files....

Cheers,

Dave.
-- 
Dave Chinner




From:
Dave Chinner
Date:

On Thu, Jan 16, 2014 at 03:58:56PM -0800, Jeff Janes wrote:
> On Thu, Jan 16, 2014 at 3:23 PM, Dave Chinner <> wrote:
> 
> > On Wed, Jan 15, 2014 at 06:14:18PM -0600, Jim Nasby wrote:
> > > On 1/15/14, 12:00 AM, Claudio Freire wrote:
> > > >My completely unproven theory is that swapping is overwhelmed by
> > > >near-misses. Ie: a process touches a page, and before it's
> > > >actually swapped in, another process touches it too, blocking on
> > > >the other process' read. But the second process doesn't account
> > > >for that page when evaluating predictive models (ie: read-ahead),
> > > >so the next I/O by process 2 is unexpected to the kernel. Then
> > > >the same with 1. Etc... In essence, swap, by a fluke of its
> > > >implementation, fails utterly to predict the I/O pattern, and
> > > >results in far sub-optimal reads.
> > > >
> > > >Explicit I/O is free from that effect, all read calls are
> > > >accountable, and that makes a difference.
> > > >
> > > >Maybe, if the kernel could be fixed in that respect, you could
> > > >consider mmap'd files as a suitable form of temporary storage.
> > > >But that would depend on the success and availability of such a
> > > >fix/patch.
> > >
> > > Another option is to consider some of the more "radical" ideas in
> > > this thread, but only for temporary data. Our write sequencing and
> > > other needs are far less stringent for this stuff.  -- Jim C.
> >
> > I suspect that a lot of the temporary data issues can be solved by
> > using tmpfs for temporary files....
> >
> 
> Temp files can collectively reach hundreds of gigs.

So unless you have terabytes of RAM you're going to have to write
them back to disk.

But there's something here that I'm not getting - you're talking
about a data set that you want to keep cache resident that is at
least an order of magnitude larger than the cyclic 5-15 minute WAL
dataset that ongoing operations need to manage to avoid IO storms.
Where do these temporary files fit into this picture, how fast do
they grow and why do they need to be so large in comparison to
the ongoing modifications being made to the database?

Cheers,

Dave.
-- 
Dave Chinner




From:
Dave Chinner
Date:

On Thu, Jan 16, 2014 at 08:48:24PM -0500, Robert Haas wrote:
> On Thu, Jan 16, 2014 at 7:31 PM, Dave Chinner <> wrote:
> > But there's something here that I'm not getting - you're talking
> > about a data set that you want to keep cache resident that is at
> > least an order of magnitude larger than the cyclic 5-15 minute WAL
> > dataset that ongoing operations need to manage to avoid IO storms.
> > Where do these temporary files fit into this picture, how fast do
> > they grow and why do they need to be so large in comparison to
> > the ongoing modifications being made to the database?

[ snip ]

> Temp files are something else again.  If PostgreSQL needs to sort a
> small amount of data, like a kilobyte, it'll use quicksort.  But if it
> needs to sort a large amount of data, like a terabyte, it'll use a
> merge sort.[1] 

IOWs the temp files contain data that requires transformation as
part of a query operation. So temp file size is bounded by the
dataset, and growth is determined by the data retrieval and
transformation rate.

IOWs, there are two very different IO and caching requirements in
play here and tuning the kernel for one actively degrades the
performance of the other. Right, got it now.

Cheers,

Dave.
-- 
Dave Chinner




From:
Jeff Layton
Date:

On Thu, 16 Jan 2014 20:48:24 -0500
Robert Haas <> wrote:

> On Thu, Jan 16, 2014 at 7:31 PM, Dave Chinner <> wrote:
> > But there's something here that I'm not getting - you're talking
> > about a data set that you want to keep cache resident that is at
> > least an order of magnitude larger than the cyclic 5-15 minute WAL
> > dataset that ongoing operations need to manage to avoid IO storms.
> > Where do these temporary files fit into this picture, how fast do
> > they grow and why do they need to be so large in comparison to
> > the ongoing modifications being made to the database?
> 
> I'm not sure you've got that quite right.  WAL is fsync'd very
> frequently - on every commit, at the very least, and multiple times
> per second even when there are no commits going on just to make sure we get
> it all down to the platter as fast as possible.  The thing that causes
> the I/O storm is the data file writes, which are performed either when
> we need to free up space in PostgreSQL's internal buffer pool (aka
> shared_buffers) or once per checkpoint interval (5-60 minutes) in any
> event.  The point of this system is that if we crash, we're going to
> need to replay all of the WAL to recover the data files to the proper
> state; but we don't want to keep WAL around forever, so we checkpoint
> periodically.  By writing all the data back to the underlying data
> files, checkpoints render older WAL segments irrelevant, at which
> point we can recycle those files before the disk fills up.
> 

So this says to me that the WAL is a place where DIO should really be
reconsidered. It's mostly sequential writes that need to hit the disk
ASAP, and you need to know that they have hit the disk before you can
proceed with other operations.

Also, is the WAL actually ever read under normal (non-recovery)
conditions or is it write-only under normal operation? If it's seldom
read, then using DIO for them also avoids some double buffering since
they wouldn't go through pagecache.

Again, I think this discussion would really benefit from an outline of
the different files used by pgsql, and what sort of data access
patterns you expect with them.

> Temp files are something else again.  If PostgreSQL needs to sort a
> small amount of data, like a kilobyte, it'll use quicksort.  But if it
> needs to sort a large amount of data, like a terabyte, it'll use a
> merge sort.[1]  The reason is of course that quicksort requires random
> access to work well; if parts of quicksort's working memory get paged
> out during the sort, your life sucks.  Merge sort (or at least our
> implementation of it) is slower overall, but it only accesses the data
> sequentially.  When we do a merge sort, we use files to simulate the
> tapes that Knuth had in mind when he wrote down the algorithm.  If the
> OS runs short of memory - because the sort is really big or just
> because of other memory pressure - it can page out the parts of the
> file we're not actively using without totally destroying performance.
> It'll be slow, of course, because disks always are, but not like
> quicksort would be if it started swapping.
> 
> I haven't actually experienced (or heard mentioned) the problem Jeff
> Janes is mentioning where temp files get written out to disk too
> aggressively; as mentioned before, the problems I've seen are usually
> the other way - stuff not getting written out aggressively enough.
> But it sounds plausible.  The OS only lets you set one policy, and if
> you make that policy right for permanent data files that get
> checkpointed it could well be wrong for temp files that get thrown
> out.  Just stuffing the data on RAMFS will work for some
> installations, but might not be good if you actually do want to
> perform sorts whose size exceeds RAM.
> 
> BTW, I haven't heard anyone on pgsql-hackers say they'd be interested
> in attending Collab on behalf of the PostgreSQL community.  Although
> the prospect of a cross-country flight is a somewhat depressing
> thought, it does sound pretty cool, so I'm potentially interested.  I
> have no idea what the procedure is here for moving forward though,
> especially since it sounds like there might be only one seat available
> and I don't know who else may wish to sit in it.
> 


-- 
Jeff Layton <>



From:
Robert Haas
Date:

On Fri, Jan 17, 2014 at 7:34 AM, Jeff Layton <> wrote:
> So this says to me that the WAL is a place where DIO should really be
> reconsidered. It's mostly sequential writes that need to hit the disk
> ASAP, and you need to know that they have hit the disk before you can
> proceed with other operations.

Ironically enough, we actually *have* an option to use O_DIRECT here.
But it doesn't work well.  See below.

> Also, is the WAL actually ever read under normal (non-recovery)
> conditions or is it write-only under normal operation? If it's seldom
> read, then using DIO for them also avoids some double buffering since
> they wouldn't go through pagecache.

This is the first problem: if replication is in use, then the WAL gets
read shortly after it gets written.  Using O_DIRECT bypasses the
kernel cache for the writes, but then the reads stink.  However, if
you configure wal_sync_method=open_sync and disable replication, then
you will in fact get O_DIRECT|O_SYNC behavior.

But that still doesn't work out very well, because now the guy who
does the write() has to wait for it to finish before he can do
anything else.  That's not always what we want, because WAL gets
written out from our internal buffers for multiple different reasons.
If we're forcing the WAL out to disk because of transaction commit or
because we need to write the buffer protected by a certain WAL record
only after the WAL hits the platter, then it's fine.  But sometimes
we're writing WAL just because we've run out of internal buffer space,
and we don't want to block waiting for the write to complete.  Opening
the file with O_SYNC deprives us of the ability to control the timing
of the sync relative to the timing of the write.
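The timing-control difference can be made concrete (a sketch, not PostgreSQL's actual code; the file and record contents are invented):

```python
import os
import tempfile

fd0, path = tempfile.mkstemp()
os.close(fd0)

# O_SYNC: every write() blocks until the data is on stable storage.
# Right for forcing out a commit record; wasteful when we are merely
# evicting WAL from internal buffers.
fd = os.open(path, os.O_WRONLY | os.O_SYNC)
os.write(fd, b"commit record\n")    # returns only once durable
os.close(fd)

# Plain write plus a later fdatasync: the write returns immediately and
# the caller chooses the moment to pay the durability cost.
fd = os.open(path, os.O_WRONLY | os.O_APPEND)
os.write(fd, b"buffer eviction\n")  # cheap; sits in the page cache
# ... other work can proceed here ...
os.fdatasync(fd)                    # explicit durability point
os.close(fd)
os.unlink(path)
```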

> Again, I think this discussion would really benefit from an outline of
> the different files used by pgsql, and what sort of data access
> patterns you expect with them.

I think I more or less did that in my previous email, but here it is
again in briefer form:

- WAL files are written (and sometimes read) sequentially and fsync'd
very frequently and it's always good to write the data out to disk as
soon as possible
- Temp files are written and read sequentially and never fsync'd.
They should only be written to disk when memory pressure demands it
(but are a good candidate when that situation comes up)
- Data files are read and written randomly.  They are fsync'd at
checkpoint time; between checkpoints, it's best not to write them
sooner than necessary, but when the checkpoint arrives, they all need
to get out to the disk without bringing the system to a standstill

We have other kinds of files, but off-hand I'm not thinking of any
that are really very interesting, apart from those.

Maybe it'll be useful to have hints that say "always write this file
to disk as quick as you can" and "always postpone writing this file to
disk for as long as you can" for WAL and temp files respectively.  But
the rule for the data files, which are the really important case, is
not so simple.  fsync() is actually a fine API except that it tends to
destroy system throughput.  Maybe what we need is just for fsync() to
be less aggressive, or a less aggressive version of it.  We wouldn't
mind waiting an almost arbitrarily long time for fsync to complete if
other processes could still get their I/O requests serviced in a
reasonable amount of time in the meanwhile.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



From:
Mel Gorman
Date:

On Thu, Jan 16, 2014 at 04:30:59PM -0800, Jeff Janes wrote:
> On Wed, Jan 15, 2014 at 2:08 AM, Mel Gorman <> wrote:
> 
> > On Tue, Jan 14, 2014 at 09:30:19AM -0800, Jeff Janes wrote:
> > > >
> > > > That could be something we look at. There are cases buried deep in the
> > > > VM where pages get shuffled to the end of the LRU and get tagged for
> > > > reclaim as soon as possible. Maybe you need access to something like
> > > > that via posix_fadvise to say "reclaim this page if you need memory but
> > > > leave it resident if there is no memory pressure" or something similar.
> > > > Not exactly sure what that interface would look like or offhand how it
> > > > could be reliably implemented.
> > > >
> > >
> > > I think the "reclaim this page if you need memory but leave it resident
> > if
> > > there is no memory pressure" hint would be more useful for temporary
> > > working files than for what was being discussed above (shared buffers).
> > >  When I do work that needs large temporary files, I often see physical
> > > write IO spike but physical read IO does not.  I interpret that to mean
> > > that the temporary data is being written to disk to satisfy either
> > > dirty_expire_centisecs or dirty_*bytes, but the data remains in the FS
> > > cache and so disk reads are not needed to satisfy it.  So a hint that
> > says
> > > "this file will never be fsynced, so please ignore dirty_*bytes and
> > > dirty_expire_centisecs" would be helpful.
> >
> > It would be good to know if dirty_expire_centisecs or dirty ratio|bytes
> > were the problem here.
> 
> 
> Is there an easy way to tell?  I would guess it has to be at least
> dirty_expire_centisecs, if not both, as a very large sort operation takes a
> lot more than 30 seconds to complete.
> 

There is not an easy way to tell. To be 100% certain, it would require an
instrumentation patch or a systemtap script to detect when a particular page
is being written back and track the context. There are approximations though.
Monitor nr_dirty pages over time. If at the time of the stall there are fewer
dirty pages than allowed by dirty_ratio then dirty_expire_centisecs
kicked in. That or monitor the process for stalls; when it stalls, check
/proc/PID/stack and see if it's stuck in balance_dirty_pages or something
similar which would indicate the process hit dirty_ratio.
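The approximation above can be automated. The sketch below, under the stated simplifications (the dirty_ratio ceiling is estimated from total memory rather than from the kernel's actual dirtyable-memory calculation), samples nr_dirty from /proc/vmstat and guesses which writeback trigger fired:

```python
# Rough sketch of the approximation described above: sample nr_dirty from
# /proc/vmstat and compare it against the dirty_ratio ceiling. If a stall
# happens while nr_dirty is well below the ceiling, dirty_expire_centisecs
# is the more likely culprit. The ceiling estimate is a simplification of
# what the kernel actually computes.

def parse_vmstat(text):
    """Parse /proc/vmstat-style 'name value' lines into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        name, _, value = line.partition(" ")
        if value.strip().isdigit():
            stats[name] = int(value)
    return stats

def likely_cause(nr_dirty_pages, dirty_ratio_pct, total_pages):
    """Guess which writeback trigger fired at the time of a stall."""
    ceiling = total_pages * dirty_ratio_pct // 100
    if nr_dirty_pages < ceiling // 2:   # well below the dirty_ratio limit
        return "dirty_expire_centisecs"
    return "dirty_ratio (check /proc/PID/stack for balance_dirty_pages)"

if __name__ == "__main__":
    sample = "nr_free_pages 100000\nnr_dirty 1200\nnr_writeback 300"
    stats = parse_vmstat(sample)
    print(likely_cause(stats["nr_dirty"], 20, 4 * 1024 * 1024))
```

On a live system the sample string would be replaced by reading /proc/vmstat directly.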

> > An interface that forces a dirty page to stay dirty
> > regardless of the global system would be a major hazard. It potentially
> > allows the creator of the temporary file to stall all other processes
> > dirtying pages for an unbounded period of time.
> 
> Are the dirty ratio/bytes limits the mechanisms by which adequate clean
> memory is maintained? 

Yes, for file-backed pages.

> I thought those were there just to put a limit on how
> long it would take to execute a sync call should one be issued, and there
> were other settings which said how much clean memory to maintain.  It should
> definitely write out the pages if it needs the memory for other things,
> just not write them out due to fear of how long it would take to sync it if
> a sync was called.  (And if it needs the memory, it should be able to write
> it out quickly as the writes would be mostly sequential, not
> random--although how the kernel can believe me that that will always be the
> case could be a problem)
> 

It has been suggested on more than one occasion that a more sensible
interface would be to "do not allow more dirty data than it takes N seconds
to writeback". The details of how to implement this are tricky and no one
has taken up the challenge yet.
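On the tuning side, the "no more dirty data than it takes N seconds to write back" idea reduces to sizing the dirty limit from the device's sustained writeback bandwidth. A back-of-envelope sketch (not a kernel interface, just the arithmetic):

```python
# Size dirty_bytes so that the accumulated dirty data drains in roughly
# target_seconds at the device's sustained writeback rate. This is the
# arithmetic behind the proposed "N seconds of writeback" limit.

def dirty_limit_bytes(writeback_bytes_per_sec, target_seconds):
    """Dirty data ceiling that drains in ~target_seconds at the given rate."""
    return writeback_bytes_per_sec * target_seconds

if __name__ == "__main__":
    rate = 500 * 1024 * 1024          # e.g. 500MB/s sustained writeback
    for secs in (2, 5):
        limit = dirty_limit_bytes(rate, secs)
        print(f"{secs}s of writeback -> dirty_bytes ~= {limit // (1024*1024)} MiB")
```

The hard part the kernel would have to solve is estimating the writeback rate online; this sketch assumes it is known.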

> > I proposed in another part
> > of the thread a hint for open inodes to have the background writer thread
> > ignore dirty pages belonging to that inode. Dirty limits and fsync would
> > still be obeyed. It might also be workable for temporary files but the
> > proposal could be full of holes.
> >
> 
> If calling fsync would fail with an error, would that lower the risk of DoS?
> 

I do not understand the proposal. If there are pages that must remain
dirty and the kernel cannot touch then there will be the risk that
dirty_ratio number of pages are all untouchable and the system livelocks
until userspace takes an action.

That still leaves the possibility of flagging temp pages that should
only be written to disk if the kernel really needs to.

-- 
Mel Gorman
SUSE Labs



From:
Hannu Krosing
Date:

On 01/17/2014 06:40 AM, Dave Chinner wrote:
> On Thu, Jan 16, 2014 at 08:48:24PM -0500, Robert Haas wrote:
>> On Thu, Jan 16, 2014 at 7:31 PM, Dave Chinner <> wrote:
>>> But there's something here that I'm not getting - you're talking
>>> about a data set that you want to keep cache resident that is at
>>> least an order of magnitude larger than the cyclic 5-15 minute WAL
>>> dataset that ongoing operations need to manage to avoid IO storms.
>>> Where do these temporary files fit into this picture, how fast do
>>> they grow and why do they need to be so large in comparison to
>>> the ongoing modifications being made to the database?
> [ snip ]
>
>> Temp files are something else again.  If PostgreSQL needs to sort a
>> small amount of data, like a kilobyte, it'll use quicksort.  But if it
>> needs to sort a large amount of data, like a terabyte, it'll use a
>> merge sort.[1] 
> IOWs the temp files contain data that requires transformation as
> part of a query operation. So, temp file size is bound by the
> dataset, 
Basically yes, though the size of the "dataset" can be orders of
magnitude bigger than the database in case of some queries.
> growth determined by data retrieval and transformation
> rate.
>
> IOWs, there are two very different IO and caching requirements in
> play here and tuning the kernel for one actively degrades the
> performance of the other. Right, got it now.
Yes. A step towards the right solution would be some way to tune this
on a per-device basis, but as a large part of this in Linux seems
to be driven from the keeping-the-VM-clean side, I guess it will
be far from simple.
>
> Cheers,
>
> Dave.


-- 
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ




From:
Mel Gorman
Date:

On Wed, Jan 15, 2014 at 02:14:08PM +0000, Mel Gorman wrote:
> > One assumption would be that Postgres is perfectly happy with the current
> > kernel behaviour in which case our discussion here is done.
> 
> It has been demonstrated that this statement was farcical.  The thread is
> massive just from interaction with the LSF/MM program committee. I'm hoping
> that there will be Postgres representation at LSF/MM this year to bring
> the issues to a wider audience. I expect that LSF/MM can only commit to
> one person attending the whole summit due to limited seats but we could
> be more flexible for the Postgres track itself so informal meetings
> can be arranged for the evenings and at collab summit.
> 

We still have not decided on a person that can definitely attend but we'll
get back to that shortly. I wanted to revise the summary mail so that
there is a record that can be easily digested without trawling through
archives. As before if I missed something important, prioritised poorly
or emphasised incorrectly then shout at me.

On testing of modern kernels
----------------------------

Josh Berkus claims that most people are using Postgres with 2.6.19 and
consequently there may be poor awareness of recent kernel developments.
This is a disturbingly large window of opportunity for problems to have
been introduced.

Minimally, Postgres has concerns about IO-related stalls which may or may
not exist in current kernels. There were indications that large writes
starve reads. There have been variants of this style of bug in the past but
it's unclear what the exact shape of this problem is and if IO-less dirty
throttling affected it. It is possible that Postgres was burned in the past
by data being written back from reclaim context in low memory situations.
That would have looked like massive stalls with drops in IO throughput
but it was fixed in relatively recent kernels. Any data on historical
tests would be helpful. Alternatively, a pgbench-based reproduction test
could potentially be used by people in the kernel community that track
performance over time and have access to a suitable testing rig.

Postgres bug reports and LKML
-----------------------------

It is claimed that LKML does not welcome bug reports but it's less clear
what the basis of this claim is.  Is it because the reports are ignored? A
possible explanation is that they are simply getting lost in the LKML noise
and there would be better luck if the bug report was cc'd to a specific
subsystem list. A second possibility is the bug report is against an old
kernel and unless it is reproduced on a recent kernel the bug report will
be ignored. Finally it is possible that there is not enough data available
to debug the problem. The worst explanation is that to date the problem
has not been fixable but the details of this have been lost and are now
unknown. Is it possible that some of these bug reports can be refreshed
so at least there is a chance they get addressed?

Apparently there were changes to the reclaim algorithms that crippled
performance without any sysctls. The problem may be compounded by the
introduction of adaptive replacement cache in the shape of the thrash
detection patches currently being reviewed.  Postgres investigated the
use of ARC in the past and ultimately abandoned it. Details are in the
archives (http://www.Postgres.org/search/?m=1&q=arc&l=1&d=-1&s=r). I
have not read them, just noting they exist for future reference.

Sysctls to control VM behaviour are not popular as such tuning parameters
are often used as an excuse to not properly fix the problem. Would it be
possible to describe a test case that shows 2.6.19 performing well and a
modern kernel failing? That would give the VM people a concrete basis to
work from to either fix the problem or identify exactly what sysctls are
required to make this work.

I am confident that any bug report related to VM reclaim in this area has
been lost. At least, I recall no instances of it being discussed on linux-mm
and it has not featured at LSF/MM in recent years.

IO Scheduling
-------------

Kevin Grittner has stated that it is known that the DEADLINE and NOOP
schedulers perform better than any alternatives for most database loads.
It would be desirable to quantify this for some test case and see whether
the default scheduler can cope in some way.

The deadline scheduler makes sense to a large extent though. Postgres
is sensitive to large latencies due to IO write spikes. It is at least
plausible that deadline would give more deterministic behaviour for
parallel reads in the presence of large writes assuming there were not
ordering problems between the reads/writes and the underlying filesystem.

For reference, these IO spikes can be massive. If the shared buffer is
completely dirtied in a short space of time then it could be 20-25% of
RAM being dirtied and writeback required in typical configurations. There
have been cases where it was worked around by limiting the size of the
shared buffer to a small enough size so that it can be written back
quickly. There are other tuning options available such as altering when
dirty background writing starts within the kernel but that will not help if
the dirtying happens in a very short space of time. Dave Chinner described
the considerations as follows
    There's no absolute rule here, but the threshold for background
    writeback needs to consider the amount of dirty data being generated,
    the rate at which it can be retired and the checkpoint period the
    application is configured with. i.e. it needs to be slow enough to
    not cause serious read IO perturbations, but still fast enough that
    it avoids peaks at synchronisation points. And most importantly, it
    needs to be fast enough that it can complete writeback of all the
    dirty data in a checkpoint before the next checkpoint is triggered.
 
    In general, I find that threshold to be somewhere around 2-5s worth
    of data writeback - enough to keep a good amount of write combining
    and the IO pipeline full as work is done, but no more.
 
    e.g. if your workload results in writeback rates of 500MB/s, then
    I'd be setting the dirty limit somewhere around 1-2GB as an initial
    guess. It's basically a simple trade off buffering space for
    writeback latency. Some applications perform well with increased
    buffering space (e.g. 10-20s of writeback) while others perform
    better with extremely low writeback latency (e.g. 0.5-1s).
 

Some of this may have been addressed in recent changes with IO-less dirty
throttling. When considering stalls related to excessive IO it will be
important to check if the kernel was later than 3.2 and what the underlying
filesystem was.

Again, it really should be possible to demonstrate this with a test case,
one driven by pgbench maybe? Workload would generate a bunch of test data,
dirty a large percentage of it and try to sync. Metrics would be measuring
average read-only query latency when reading in parallel to the write,
average latencies from the underlying storage, IO queue lengths etc and
comparing default IO scheduler with deadline or noop.

NUMA Optimisations
------------------

The primary one that showed up was zone_reclaim_mode. Enabling that parameter
is a disaster for many workloads and apparently Postgres is one. It might
be time to revisit leaving that thing disabled by default and explicitly
requiring that NUMA-aware workloads that are correctly partitioned enable it.
Otherwise NUMA considerations are not that much of a concern right now.

Direct IO, buffered IO, double buffering and wishlists
------------------------------------------------------

The general position of Postgres is that the kernel knows more about
storage geometries and IO scheduling than an application can or should
know. It would be preferred to have interfaces that allow Postgres to
give hints to the kernel about how and when data should be written back.
The alternative is exposing details of the underlying storage to userspace
so Postgres can implement a full IO scheduler using direct IO. It has
been asserted on the kernel side that the optimal IO size and alignment
should be all the details that are required
in the majority of cases. While some database vendors have this option,
the Postgres community do not have the resources to implement something
of this magnitude. They also have tried direct IO in the past in the areas
where it should have mattered and had mixed results.

I can understand Postgres preference for using the kernel to handle these
details for them. They are a cross-platform application and the kernel
should not be washing its hands of the problem and hiding behind direct
IO as a solution. Ted Ts'o summarises the issues as
    The high order bit is what's the right thing to do when database
    programmers come to kernel engineers saying, we want to do <FOO>
    and the performance sucks.  Do we say, "Use O_DIRECT, dummy",
    notwithstanding Linus's past comments on the issue?  Or do we have
    some general design principles that we tell database engineers that
    they should do for better performance, and then all developers for
    all of the file systems can then try to optimize for a set of new
    API's, or recommended ways of using the existing API's?
 

In an effort to avoid depending on direct IO there were some proposals
and/or wishlist items. These are listed in order of likelihood to be
implemented and usefulness to Postgres.
  1. Hint to asynchronously queue writeback now in preparation for a
     fsync in the near future. Postgres dirties a large amount of data and
     asks the kernel to push it to disk over the next few minutes. Postgres
     is still required to fsync later but the fsync time should be
     minimised. vm.dirty_writeback_centisecs is unreliable for this.

     One possibility would be an fadvise call that queues the data for
     writeback by a flusher thread now and returns immediately.
 
  2. Hint that a page is a prime candidate for reclaim but only if there
     is reclaim pressure. This avoids a problem where fadvise(DONTNEED)
     discards a page only to have a read/write or WILLNEED hint immediately
     read it back in again. The requirements are similar to the volatile
     range hinting but they do not use mmap() currently and would need a
     file-descriptor based interface. Robert Haas had some concerns with
     the general concept and described them thusly
 
     This is an interesting idea but it stinks of impracticality.
     Essentially when the last buffer pin on a page is dropped we'd have
     to mark it as discardable, and then the next person wanting to pin
     it would have to check whether it's still there.  But the system
     call overhead of calling vrange() every time the last pin on a page
     was dropped would probably hose us.

     Well, I guess it could be done lazily: make periodic sweeps through
     shared_buffers, looking for pages that haven't been touched in a
     while, and vrange() them.  That's quite a bit of new mechanism, but
     in theory it could work out to a win.  vrange() would have to scale
     well to millions of separate ranges, though.  Will it?  And a lot
     depends on whether the kernel makes the right decision about whether
     to chunk data from our vrange() vs. any other page it could have
     reclaimed.
 
  3. Hint that a page should be dropped immediately when IO completes.
     There is already something like this buried in the kernel internals
     and sometimes called "immediate reclaim" which comes into play when
     pages are bgin invalidated. It should just be a case of investigating
     if that is visible to userspace, if not why not and do it in a
     semi-sensible fashion.
 
  4. 8kB atomic write with OS support to avoid writing full page images
     in the WAL. This is a feature that is likely to be delivered anyway
     and one that Postgres is interested in.
 
  5. Only writeback some pages if explicitly synced or dirty limits
     are violated. Jeff Janes states that he has problems with large
     temporary files that generate IO spikes when the data starts hitting
     the platter even though the data does not need to be preserved. Jim
     Nasby agreed and commented that he "also frequently see this, and it
     has an even larger impact if pgsql_tmp is on the same filesystem as
     WAL. Which *theoretically* shouldn't matter with a BBU controller,
     except that when the kernel suddenly decides your *temporary*
     data needs to hit the media you're screwed."

     One proposal that may address this is

     Allow a process with an open fd to hint that pages managed by this
     inode will have dirty-sticky pages. Pages will be ignored by dirty
     background writing unless there is an fsync call or dirty page limits
     are hit. The hint is cleared when no process has the file open.
 
  6. Only writeback pages if explicitly synced. Postgres has strict write
     ordering requirements. In the words of Tom Lane -- "As things currently
     stand, we dirty the page in our internal buffers, and we don't write
     it to the kernel until we've written and fsync'd the WAL data that
     needs to get to disk first". mmap() would avoid double buffering but
     it has no control over the write ordering which is a show-stopper.
     As Andres Freund described;

     Postgres' durability works by guaranteeing that our journal entries
     (called WAL := Write Ahead Log) are written & synced to disk before
     the corresponding entries of tables and indexes reach the disk. That
     also allows to group together many random-writes into a few
     contiguous writes fdatasync()ed at once. Only during a checkpointing
     phase the big bulk of the data is then (slowly, in the background)
     synced to disk. I don't see how that's doable with holding all pages
     in mmap()ed buffers.
     There are also concerns there would be an absurd number of mappings.

     The problem with this sort of dirty pinning interface is that it
     can deadlock the kernel if all dirty pages in the system cannot be
     written back by the kernel. James Bottomley stated

     No, I'm sorry, that's never going to be possible.  No user space
     application has all the facts.  If we give you an interface to
     force unconditional holding of dirty pages in core you'll livelock
     the system eventually because you made a wrong decision to hold
     too many dirty pages.
 
     However, it was very clearly stated that the write ordering is
     critical. If the kernel breaks the requirement then the database
     can get trashed in the event of a power failure.

     This led to a discussion on write barriers which the kernel uses
     internally but there are scaling concerns both with the number of
     constraints that would exist and the requirement that Postgres use
     mapped buffers.
 
     There were few solid conclusions on this. It would need major
     reworking on all sides and it would hand control of system safety
     to userspace which is going to cause layering violations. This
     whole idea may be a bust but it is still worth recording. Greg Stark
     outlined the motivation best as follows;

     Ted T'so was concerned this would all be a massive layering violation
     and I have to admit that's a huge risk. It would take some clever
     API engineering to come with a clean set of primitives to express
     the kind of ordering guarantees we need without being too tied to
     Postgres's specific implementation. The reason I think it's more
     interesting though is that Postgres's journalling and checkpointing
     architecture is pretty bog-standard CS stuff and there are hundreds
     or thousands of pieces of software out there that do pretty much
     the same work and trying to do it efficiently with fsync or O_DIRECT
     is like working with both hands tied to your feet.
 
  7. Allow userspace process to insert data into the kernel page cache
     without marking the page dirty. This would allow the application
     to request that the OS use the application copy of data as page
     cache if it does not have a copy already. The difficulty here
     is that the application has no way of knowing if something else
     has altered the underlying file in the meantime via something like
     direct IO. Granted, such activity has probably corrupted the database
     already but initial reactions are that this is not a safe interface
     and there are coherency concerns.
 
     Dave Chinner asked "why, exactly, do you even need the kernel page
     cache here?" when Postgres already knows how and when data should
     be written back to disk. The answer boiled down to "To let kernel do
     the job that it is good at, namely managing the write-back of dirty
     buffers to disk and to manage (possible) read-ahead pages". Postgres
     has some ordering requirements but it does not want to be responsible
     for all cache replacement and IO scheduling. Hannu Krosing summarised
     it best as

     Again, as said above the linux file system is doing fine. What we
     want is a few ways to interact with it to let it do even better
     when working with Postgres by telling it some stuff it otherwise
     would have to second guess and by sometimes giving it back some
     cache pages which were copied away for potential modifying but
     ended up clean in the end.

     And let the linux kernel decide if and how long to keep these pages
     in its cache using its superior knowledge of disk subsystem and
     about what else is going on in the system in general.
 
  8. Allow copy-on-write of page-cache pages to anonymous. This would limit
     the double ram usage to some extent. It's not as simple as having a
     MAP_PRIVATE mapping of a file-backed page because presumably they want
     this data in a shared buffer shared between Postgres processes. The
     implementation details of something like this are hairy because it's
     mmap()-like but not mmap() as it does not have the same writeback
     semantics due to the write ordering requirements Postgres has for
     database integrity.

     Completely nuts and this was not mentioned on the list, but arguably
     you could try implementing something like this as a character device
     that allows MAP_SHARED with ioctls controlling what file and offset
     backs pages within the mapping.  A new mapping would be forced
     resident and read-only. A write would COW the page. It's a crazy way
     of doing something like this but avoids a lot of overhead.  Even
     considering the stupid solution might make the general solution
     a bit more obvious.
 
     For reference, Tom Lane comprehensively described the problems with
     mmap at http://www.Postgres.org/message-id/

     There were some variants of how something like this could be achieved
     but no finalised proposal at the time of writing.
  9. Hint that a page in an anonymous buffer is a copy of a page cache
     page and invalidate the page cache page on COW. This limits the
     amount of double buffering. It's in as a low priority item as it's
     unclear if it's really necessary and also I suspect the implementation
     would be very heavy because of the amount of information we'd have
     to track in the kernel.
 
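Wishlist item 1 above maps closely onto an interface Linux already has: sync_file_range(2) with SYNC_FILE_RANGE_WRITE starts writeback of the dirty pages in a range and returns without waiting, so a later fsync has little left to do. Python's standard library does not wrap it, so this sketch calls it through ctypes; the flag value is taken from the Linux headers.

```python
# Start asynchronous writeback now so a later fsync is cheap, using
# sync_file_range(2) with SYNC_FILE_RANGE_WRITE (flag value 2 from
# <linux/fs.h>). Linux-specific.
import ctypes, ctypes.util, os, tempfile

SYNC_FILE_RANGE_WRITE = 2

libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)

def queue_writeback(fd, offset=0, nbytes=0):
    """Ask the kernel to start async writeback of [offset, offset+nbytes).
    nbytes=0 means 'to end of file'. Returns 0 on success."""
    return libc.sync_file_range(fd, ctypes.c_uint64(offset),
                                ctypes.c_uint64(nbytes),
                                SYNC_FILE_RANGE_WRITE)

if __name__ == "__main__":
    with tempfile.NamedTemporaryFile() as f:
        f.write(b"x" * (1 << 20))      # dirty 1MiB of page cache
        f.flush()
        rc = queue_writeback(f.fileno())  # writeback is queued now...
        os.fsync(f.fileno())              # ...so this fsync has less to do
        print("sync_file_range rc =", rc)
```

The semantic difference from the wishlist item is that sync_file_range is per-range and per-call, whereas the proposal is a standing fadvise-style hint; but for the "fsync in the near future" pattern the effect is similar.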
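For wishlist item 3, the nearest userspace-visible relative today is posix_fadvise(POSIX_FADV_DONTNEED), which Python exposes directly. The difference is that DONTNEED drops clean pages immediately and starts writeback for dirty ones, whereas the "immediate reclaim" behaviour described in the item would drop the page only once its IO completes. A minimal sketch of the drop-after-read pattern:

```python
# Read a file through the page cache, then tell the kernel the cached
# pages will not be needed again (posix_fadvise POSIX_FADV_DONTNEED).
import os, tempfile

def read_and_drop(path, length):
    """Read from a file, then hint that its page cache can be dropped."""
    fd = os.open(path, os.O_RDONLY)
    try:
        data = os.read(fd, length)
        # offset 0, len 0 means "the whole file".
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
        return data
    finally:
        os.close(fd)

if __name__ == "__main__":
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"scan me once")
        path = f.name
    print(read_and_drop(path, 64))
    os.unlink(path)
```

This is the pattern a sequential-scan path might use when it knows the data will not be re-read soon.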
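The write ordering described under wishlist item 6 can be shown in miniature: the WAL record must be written and fdatasync()ed before the corresponding data page is handed to the kernel at all. File names and the record format below are illustrative, not Postgres's actual layout.

```python
# Sketch of write-ahead ordering: log first, sync the log, only then
# dirty the data file. The data file itself is fsync'd later, at
# checkpoint time, which is exactly the window the kernel's background
# writeback can interfere with.
import os

def commit(wal_fd, data_fd, wal_record, page_offset, page):
    """Durably log a change, then (and only then) write the data page."""
    os.write(wal_fd, wal_record)
    os.fdatasync(wal_fd)                   # WAL reaches stable storage first
    os.pwrite(data_fd, page, page_offset)  # data page may now be dirtied

if __name__ == "__main__":
    wal = os.open("wal.log", os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    data = os.open("table.dat", os.O_RDWR | os.O_CREAT, 0o600)
    commit(wal, data, b"UPDATE page 0\n", 0, b"\x00" * 8192)
    os.close(wal)
    os.close(data)
    os.unlink("wal.log")
    os.unlink("table.dat")
```

mmap()ed data buffers break this scheme precisely because step three would no longer be under the application's control.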

It is important to note in general that Postgres has a problem with some
files being written back too aggressively and other files not written back
aggressively enough. Temp files for purposes such as sorting should have
writeback deferred as long as possible. Data file writes that must complete
before portions of the WAL can be discarded should begin writeback early
so the final fsync does not stall for too long.  As Dave Chinner says
    IOWs, there are two very different IO and caching requirements in
    play here and tuning the kernel for one actively degrades the
    performance of the other.
 

Robert Haas categorised the IO patterns as follows

    - WAL files are written (and sometimes read) sequentially and
      fsync'd very frequently and it's always good to write the data
      out to disk as soon as possible

    - Temp files are written and read sequentially and never fsync'd.
      They should only be written to disk when memory pressure demands
      it (but are a good candidate when that situation comes up)

    - Data files are read and written randomly.  They are fsync'd at
      checkpoint time; between checkpoints, it's best not to write
      them sooner than necessary, but when the checkpoint arrives,
      they all need to get out to the disk without bringing the system
      to a standstill
 
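The three file classes above each map naturally onto a different hint strategy. The table below is a sketch of how a Postgres-like application might encode that classification against the interfaces discussed in this thread; sync_file_range and fadvise exist today, while the dirty-sticky hint for temp files is wishlist item 5 and does not. The path conventions are illustrative only.

```python
# Per-file-class writeback policy, as a sketch. Eager writeback and
# frequent fsync for WAL; defer everything for temp files; defer between
# checkpoints for data files.

WRITEBACK_POLICY = {
    # class:   (eager writeback?, fsync when?,       cache hint)
    "wal":     (True,  "every commit",    "drop after sync"),
    "temp":    (False, "never",           "dirty-sticky (wishlist item 5)"),
    "data":    (False, "checkpoint only", "keep resident between checkpoints"),
}

def policy_for(path):
    """Classify a file by a naive path convention (illustrative only)."""
    if "pg_xlog" in path or path.endswith(".wal"):
        return WRITEBACK_POLICY["wal"]
    if "pgsql_tmp" in path:
        return WRITEBACK_POLICY["temp"]
    return WRITEBACK_POLICY["data"]

if __name__ == "__main__":
    print(policy_for("base/pgsql_tmp/sort0"))
```

The point of the table is that no single vm.dirty_* setting can express all three rows at once, which is the core of the complaint in this section.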

At LSF/MM last year there was a discussion on whether userspace should
hint that files are "hot" or "cold" so the underlying layers could decide
to relocate some data to faster storage. I tuned out a bit during the
discussion and did not track what happened with it since but I guess that
any developments of that sort would be of interest to the Postgres community.

Some of these wish lists still need polish but could potentially be
discussed further at LSF/MM with a wider audience as well as on the
lists. Then in a world of unicorns and ponies it's a case of picking some
of these hinting wishlist items, seeing what it takes to implement them in
the kernel and testing them with a suitably patched version of postgres
running a test case driven by something (pgbench presumably).

-- 
Mel Gorman
SUSE Labs



From:
Andres Freund
Date:

Hi Mel,

On 2014-01-17 16:31:48 +0000, Mel Gorman wrote:
> Direct IO, buffered IO, double buffering and wishlists
> ------------------------------------------------------
>    3. Hint that a page should be dropped immediately when IO completes.
>       There is already something like this buried in the kernel internals
>       and sometimes called "immediate reclaim" which comes into play when
>       pages are bgin invalidated. It should just be a case of investigating
>       if that is visible to userspace, if not why not and do it in a
>       semi-sensible fashion.

"bgin invalidated"?

Generally, +1 on the capability to achieve such a behaviour from
userspace.

>    7. Allow userspace process to insert data into the kernel page cache
>       without marking the page dirty. This would allow the application
>       to request that the OS use the application copy of data as page
>       cache if it does not have a copy already. The difficulty here
>       is that the application has no way of knowing if something else
>       has altered the underlying file in the meantime via something like
>       direct IO. Granted, such activity has probably corrupted the database
>       already but initial reactions are that this is not a safe interface
>       and there are coherency concerns.

I was one of the people suggesting that capability in this thread (after
pondering about it in the back of my mind for quite some time), and I
first thought it would never be acceptable for pretty much those
reasons.
But on second thought I don't think that line of argument makes too much
sense. If such an API would require write permissions on the file -
which it surely would - it wouldn't allow an application to do anything
it previously wasn't able to.
And I don't see the dangers of concurrent direct IO as anything
new. Right now the page's contents reside in userspace memory and aren't
synced in any way with either the page cache or the actual on disk
state. And afaik there are already several data races if a file is
modified and read both via the page cache and direct io.

The scheme that'd allow us is the following:
When postgres reads a data page, it will continue to first look up the
page in its shared buffers, if it's not there, it will perform a page
cache backed read, but instruct that read to immediately remove from the
page cache afterwards (new API or, posix_fadvise() or whatever). As long
as it's in shared_buffers, postgres will not need to issue new reads, so
there's no benefit keeping it in the page cache.
If the page is dirtied, it will be written out normally, telling the
kernel to forget about caching the page (using 3) or possibly direct
IO).
When a page in postgres's buffers (which wouldn't be set to very large
values) isn't needed anymore and *not* dirty, it will seed the kernel
page cache with the current data.
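The read side of the scheme Andres describes can be sketched with interfaces that exist today: on a shared-buffer miss, read through the page cache and immediately drop the cached copy with posix_fadvise(DONTNEED). The "seed the page cache on clean eviction" step has no kernel interface today and appears only as a stub. shared_buffers here is a plain dict standing in for Postgres's buffer pool; BLCKSZ matches Postgres's default block size.

```python
# Sketch of Andres's read path: shared-buffer miss -> page-cache read ->
# immediate DONTNEED, since shared_buffers now holds the only needed copy.
import os

BLCKSZ = 8192
shared_buffers = {}   # (path, blockno) -> bytes

def read_page(path, blockno):
    key = (path, blockno)
    if key in shared_buffers:          # hit: never touch the kernel
        return shared_buffers[key]
    fd = os.open(path, os.O_RDONLY)
    try:
        page = os.pread(fd, BLCKSZ, blockno * BLCKSZ)
        # The page now lives in shared_buffers; drop the page cache copy.
        os.posix_fadvise(fd, blockno * BLCKSZ, BLCKSZ,
                         os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
    shared_buffers[key] = page
    return page

def evict_clean(path, blockno):
    """On clean eviction the scheme would hand the copy back to the
    kernel page cache; no such interface exists yet, so just drop it."""
    shared_buffers.pop((path, blockno), None)
```

The missing piece, the clean-eviction seeding, is exactly wishlist item 7.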

Now, such a scheme wouldn't likely be zero-copy, but it would avoid
double buffering. I think the cost of buffer copying has been overstated
in this thread... The major advantage is that all that could easily be
implemented in a very localized manner, without hurting other OSs, and it
could easily degrade on kernels not providing that capability, which
would surely be the majority of installations for the next couple of
years.

So, I think such an interface would be hugely beneficial - and I'd be
surprised if other applications couldn't reuse it. And I don't think
it'd be all that hard to implement on the kernel side?

>       Dave Chinner asked "why, exactly, do you even need the kernel page
>       cache here?"  when Postgres already knows how and when data should
>       be written back to disk. The answer boiled down to "To let kernel do
>       the job that it is good at, namely managing the write-back of dirty
>       buffers to disk and to manage (possible) read-ahead pages". Postgres
>       has some ordering requirements but it does not want to be responsible
>       for all cache replacement and IO scheduling. Hannu Krosing summarised
>       it best as

The other part is that using the page cache for the majority of warm,
but not burning hot pages, allows the kernel to much more sensibly adapt
to concurrent workloads requiring memory in some form or other (possibly
giving it to other VMs when mostly idle and such).

>    8. Allow copy-on-write of page-cache pages to anonymous. This would limit
>       the double ram usage to some extent. It's not as simple as having a
>       MAP_PRIVATE mapping of a file-backed page because presumably they want
>       this data in a shared buffer shared between Postgres processes. The
>       implementation details of something like this are hairy because it's
>       mmap()-like but not mmap() as it does not have the same writeback
>       semantics due to the write ordering requirements Postgres has for
>       database integrity.

>    9. Hint that a page in an anonymous buffer is a copy of a page cache
>        page and invalidate the page cache page on COW. This limits the
>        amount of double buffering. It's in as a low priority item as it's
>        unclear if it's really necessary and also I suspect the implementation
>        would be very heavy because of the amount of information we'd have
>        to track in the kernel.
> 

I don't see this kind of proposals going anywhere. The amounts of
changes to postgres and the kernel sound prohibitive to me, besides the
utter crumminess.

Greetings,

Andres Freund



From:
Mel Gorman
Date:

On Fri, Jan 17, 2014 at 06:14:37PM +0100, Andres Freund wrote:
> Hi Mel,
> 
> On 2014-01-17 16:31:48 +0000, Mel Gorman wrote:
> > Direct IO, buffered IO, double buffering and wishlists
> > ------------------------------------------------------
> >    3. Hint that a page should be dropped immediately when IO completes.
> >       There is already something like this buried in the kernel internals
> >       and sometimes called "immediate reclaim" which comes into play when
> >       pages are bgin invalidated. It should just be a case of investigating
> >       if that is visible to userspace, if not why not and do it in a
> >       semi-sensible fashion.
> 
> "bgin invalidated"?
> 

s/bgin/being/

I admit that "invalidated" in this context is very vague and I did
not explain myself. This paragraph should remind anyone familiar with
VM internals about what happens when invalidate_mapping_pages calls
deactivate_page and how PageReclaim pages are treated by both page reclaim
and end_page_writeback handler. It's similar but not identical to what
Postgres wants and is a reasonable starting position for an implementation.

> Generally, +1 on the capability to achieve such a behaviour from
> userspace.
> 
> >    7. Allow userspace process to insert data into the kernel page cache
> >       without marking the page dirty. This would allow the application
> >       to request that the OS use the application copy of data as page
> >       cache if it does not have a copy already. The difficulty here
> >       is that the application has no way of knowing if something else
> >       has altered the underlying file in the meantime via something like
> >       direct IO. Granted, such activity has probably corrupted the database
> >       already but initial reactions are that this is not a safe interface
> >       and there are coherency concerns.
> 
> I was one of the people suggesting that capability in this thread (after
> pondering about it on the back on my mind for quite some time), and I
> first though it would never be acceptable for pretty much those
> reasons.
> But on second thought I don't think that line of argument makes too much
> sense. If such an API would require write permissions on the file -
> which it surely would - it wouldn't allow an application to do anything
> it previously wasn't able to.
> And I don't see the dangers of concurrent direct IO as anything
> new. Right now the page's contents reside in userspace memory and aren't
> synced in any way with either the page cache or the actual on disk
> state. And afaik there are already several data races if a file is
> modified and read both via the page cache and direct io.
> 

All of this is true.  The objections may not hold up over time and it may
seem much more reasonable when/if the easier stuff is addressed.

> The scheme that'd allow us is the following:
> When postgres reads a data page, it will continue to first look up the
> page in its shared buffers, if it's not there, it will perform a page
> cache backed read, but instruct that read to immediately remove from the
> page cache afterwards (new API or, posix_fadvise() or whatever).
> As long
> as it's in shared_buffers, postgres will not need to issue new reads, so
> there's no benefit keeping it in the page cache.
> If the page is dirtied, it will be written out normally telling the
> kernel to forget about caching the page (using 3) or possibly direct
> io).
> When a page in postgres's buffers (which wouldn't be set to very large
> values) isn't needed anymore and *not* dirty, it will seed the kernel
> page cache with the current data.
> 

Ordinarily the initial read page could be discarded with fadvise but
the later write would cause the data to be read back in again which is a
waste. The details of avoiding that re-read are tricky from a core kernel
perspective because ordinarily the kernel at that point does not know if
the write is a full complete aligned write of an underlying filesystem
structure or not.  It may need a different write path which potentially
leads into needing changes to the address_space operations on a filesystem
basis -- that would get messy and be a Linux-specific extension. I have
not researched this properly at all, I could be way off but I have a
feeling the details get messy.
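
For concreteness, the parts of Andres's scheme that are expressible today can
be sketched as follows (Python as pseudocode for the syscall sequence; the
shared_buffers dict and the function shapes are illustrative, not Postgres's
code, and the final "seed the page cache without dirtying" step is exactly the
missing API, so it appears only as a comment):

```python
import os

PAGE = 8192
shared_buffers = {}  # (path, pageno) -> bytes; stand-in for Postgres s_b

def read_page(path, pageno):
    key = (path, pageno)
    if key in shared_buffers:
        return shared_buffers[key]          # hit in the app-level cache
    fd = os.open(path, os.O_RDONLY)
    try:
        data = os.pread(fd, PAGE, pageno * PAGE)   # page-cache backed read
        # drop the kernel copy right away: while the page sits in
        # shared_buffers the page-cache copy is pure double buffering
        if hasattr(os, "posix_fadvise"):
            os.posix_fadvise(fd, pageno * PAGE, PAGE, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
    shared_buffers[key] = data
    return data

def evict_page(path, pageno, dirty):
    data = shared_buffers.pop((path, pageno))
    if dirty:
        fd = os.open(path, os.O_WRONLY)
        try:
            os.pwrite(fd, data, pageno * PAGE)
            os.fsync(fd)
            # again ask the kernel not to keep the just-written page;
            # this write path is where the wasteful read-back described
            # above can occur
            if hasattr(os, "posix_fadvise"):
                os.posix_fadvise(fd, pageno * PAGE, PAGE,
                                 os.POSIX_FADV_DONTNEED)
        finally:
            os.close(fd)
    # else: clean eviction should "seed the kernel page cache with the
    # current data" without dirtying it -- no such interface exists
    # today (wishlist item 7)
```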

> Now, such a scheme wouldn't likely be zero-copy, but it would avoid
> double buffering.

It wouldn't be zero copy because minimally the data needs to be handed
over to the filesystem for writing to the disk, and the interface for that is
offset,length based, not page based. Maybe sometimes it will be zero copy
but it would be a filesystem-specific thing.

> I think the cost of buffer copying has been overstated
> in this thread. The major advantage is that all that could easily be
> implemented in a very localized manner, without hurting other OSs, and it
> could easily degrade on kernels not providing that capability, which
> would surely be the majority of installations for the next couple of
> years.
> 
> So, I think such an interface would be hugely beneficial - and I'd be
> surprised if other applications couldn't reuse it. And I don't think
> it'd be all that hard to implement on the kernel side?
> 

Unfortunately I think this does get messy from a kernel perspective because
we are not guaranteed in the *general* case that we're dealing with a full
page write. As before, I have not researched this properly so I'll
update the summary at some stage in case someone can put in the proper
research and see a decent solution.

> >       Dave Chinner asked "why, exactly, do you even need the kernel page
> >       cache here?"  when Postgres already knows how and when data should
> >       be written back to disk. The answer boiled down to "To let kernel do
> >       the job that it is good at, namely managing the write-back of dirty
> >       buffers to disk and to manage (possible) read-ahead pages". Postgres
> >       has some ordering requirements but it does not want to be responsible
> >       for all cache replacement and IO scheduling. Hannu Krosing summarised
> >       it best as
> 
> The other part is that using the page cache for the majority of warm,
> but not burning hot pages, allows the kernel to much more sensibly adapt
> to concurrent workloads requiring memory in some form or other (possibly
> giving it to other VMs when mostly idle and such).
> 
> >    8. Allow copy-on-write of page-cache pages to anonymous. This would limit
> >       the double ram usage to some extent. It's not as simple as having a
> >       MAP_PRIVATE mapping of a file-backed page because presumably they want
> >       this data in a shared buffer shared between Postgres processes. The
> >       implementation details of something like this are hairy because it's
> >       mmap()-like but not mmap() as it does not have the same writeback
> >       semantics due to the write ordering requirements Postgres has for
> >       database integrity.
> 
> >    9. Hint that a page in an anonymous buffer is a copy of a page cache
> >        page and invalidate the page cache page on COW. This limits the
> >        amount of double buffering. It's in as a low priority item as it's
> >        unclear if it's really necessary and also I suspect the implementation
> >        would be very heavy because of the amount of information we'd have
> >        to track in the kernel.
> > 
> 
> I don't see these kinds of proposals going anywhere. The amount of
> changes to postgres and the kernel sounds prohibitive to me, besides the
> utter crumminess.
> 

Agreed. I'm including them just because they were discussed. Someone
else might read it and think "that is a terrible idea but what might work
instead is ...."

-- 
Mel Gorman
SUSE Labs



From:
Josh Berkus
Date:

Mel,

So we have a few interested parties.  What do we need to do to set up
the Collab session?


-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



From:
Gregory Smith
Date:

On 1/17/14 10:37 AM, Mel Gorman wrote:
> There is not an easy way to tell. To be 100% certain, it would require an 
> instrumentation patch or a systemtap script to detect when a 
> particular page is being written back and track the context. There are 
> approximations though. Monitor nr_dirty pages over time.

I have a benchmarking wrapper for the pgbench testing program called 
pgbench-tools:  https://github.com/gregs1104/pgbench-tools  As of 
October, on Linux it now plots the "Dirty" value from /proc/meminfo over 
time.  You get that on the same time axis as the transaction latency 
data.  The report at the end includes things like the maximum amount of 
dirty memory observed during the test sampling. That doesn't tell you 
exactly what's happening to the level someone reworking the kernel logic 
might want, but you can easily see things like the database's checkpoint 
cycle reflected by watching the dirty memory total.  This works really 
well for monitoring production servers too.  I have a lot of data from a 
plugin for the Munin monitoring system that plots the same way.  Once 
you have some history about what's normal, it's easy to see when systems 
fall behind in a way that's ruining writes, and the high water mark 
often correlates with bad responsiveness periods.
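
The same /proc/meminfo sampling that pgbench-tools does is easy to reproduce;
a sketch (Linux-only, field name as reported by current kernels):

```python
import time

def dirty_kb():
    """Return the Dirty: value from /proc/meminfo in kilobytes."""
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("Dirty:"):
                return int(line.split()[1])
    raise RuntimeError("no Dirty: field in /proc/meminfo")

def sample_dirty(seconds, interval=1.0):
    """Sample dirty memory over time; the maximum is the high-water
    mark that tends to correlate with bad responsiveness periods."""
    samples = []
    end = time.monotonic() + seconds
    while time.monotonic() < end:
        samples.append(dirty_kb())
        time.sleep(interval)
    return samples
```

Plotting these samples on the same time axis as transaction latency is what
makes the database's checkpoint cycle visible.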

Another recent change is that pgbench for the upcoming PostgreSQL 9.4 
now allows you to specify a target transaction rate.  Seeing the write 
latency behavior with that in place is far more interesting than 
anything we were able to watch with pgbench before.  The pgbench write 
tests we've been doing for years mainly told you the throughput rate 
when all of the caches were always as full as the database could make 
them, and tuning for that is not very useful. Turns out it's far more 
interesting to run at 50% of what the storage is capable of, then watch 
what happens to latency when you adjust things like the dirty_* parameters.

I've been working on the problem of how we can make a benchmark test 
case that acts enough like real busy PostgreSQL servers that we can 
share it with kernel developers, and then everyone has an objective way 
to measure changes.  These rate limited tests are working much better 
for that than anything I came up with before.

I am skeptical that the database will take over very much of this work 
and perform better than the Linux kernel does.  My take is that our most 
useful role would be providing test cases kernel developers can add to a 
performance regression suite.  Ugly "we never thought that would happen" 
situations seem to be at the root of many of the kernel performance 
regressions people here get nailed by.

Effective I/O scheduling is very hard, and we are unlikely to ever 
out-innovate the kernel hacking community by pulling more of that into the 
database.  It's already possible to experiment with moving in that 
direction with tuning changes.  Use a larger database shared_buffers 
value, tweak checkpoints to spread I/O out, and reduce things like 
dirty_ratio.  I do some of that, but I've learned it's dangerous to 
wander too far that way.

If instead you let Linux do even more work--give it a lot of memory to 
manage and room to re-order I/O--that can work out quite well. For 
example, I've seen a lot of people try to keep latency down by using the 
deadline scheduler and very low settings for the expire times.  Theory 
is great, but it never works out in the real world for me.  
Here's the sort of deadline I deploy instead now:
    echo 500      > ${DEV}/queue/iosched/read_expire
    echo 300000   > ${DEV}/queue/iosched/write_expire
    echo 1048576  > ${DEV}/queue/iosched/writes_starved

These numbers look insane compared to the defaults, but I assure you 
they're from a server that's happily chugging through 5 to 10K 
transactions/second around the clock.  PostgreSQL forces writes out with 
fsync when they must go out, but this sort of tuning is basically giving 
up on it managing writes beyond that.  We really have no idea what order 
they should go out in.  I just let the kernel have a large pile of work 
queued up, and trust things like the kernel's block elevator and 
congestion code are smarter than the database can possibly be.

-- 
Greg Smith 
Chief PostgreSQL Evangelist - http://crunchydatasolutions.com/



From:
Greg Stark
Date:

On Fri, Jan 17, 2014 at 9:14 AM, Andres Freund <> wrote:
> The scheme that'd allow us is the following:
> When postgres reads a data page, it will continue to first look up the
> page in its shared buffers, if it's not there, it will perform a page
> cache backed read, but instruct that read to immediately remove from the
> page cache afterwards (new API or, posix_fadvise() or whatever). As long
> as it's in shared_buffers, postgres will not need to issue new reads, so
> there's no benefit keeping it in the page cache.
> If the page is dirtied, it will be written out normally telling the
> kernel to forget about caching the page (using 3) or possibly direct
> io).
> When a page in postgres's buffers (which wouldn't be set to very large
> values) isn't needed anymore and *not* dirty, it will seed the kernel
> page cache with the current data.

This seems like mostly an exact reimplementation of DIRECT_IO
semantics using posix_fadvise.

-- 
greg



From:
Andres Freund
Date:

On 2014-01-17 16:18:49 -0800, Greg Stark wrote:
> On Fri, Jan 17, 2014 at 9:14 AM, Andres Freund <> wrote:
> > The scheme that'd allow us is the following:
> > When postgres reads a data page, it will continue to first look up the
> > page in its shared buffers, if it's not there, it will perform a page
> > cache backed read, but instruct that read to immediately remove from the
> > page cache afterwards (new API or, posix_fadvise() or whatever). As long
> > as it's in shared_buffers, postgres will not need to issue new reads, so
> > there's no benefit keeping it in the page cache.
> > If the page is dirtied, it will be written out normally telling the
> > kernel to forget about caching the page (using 3) or possibly direct
> > io).
> > When a page in postgres's buffers (which wouldn't be set to very large
> > values) isn't needed anymore and *not* dirty, it will seed the kernel
> > page cache with the current data.
> 
> This seems like mostly an exact reimplementation of DIRECT_IO
> semantics using posix_fadvise.

Not at all. The significant bits are a) the kernel uses the pagecache
for writeback of dirty data and more importantly b) uses the pagecache
to cache data not in postgres's s_b.

Greetings,

Andres Freund

-- 
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



From:
Marti Raudsepp
Date:

On Wed, Jan 15, 2014 at 5:34 AM, Jim Nasby <> wrote:
> it's very common to create temporary file data that will never, ever, ever
> actually NEED to hit disk. Where I work being able to tell the kernel to
> avoid flushing those files unless the kernel thinks it's got better things
> to do with that memory would be EXTREMELY valuable

Windows has the FILE_ATTRIBUTE_TEMPORARY flag for this purpose.

ISTR that there was discussion about implementing something analogous
in Linux when ext4 got delayed allocation support, but I don't think
it got anywhere and I can't find the discussion now. I think the
proposed interface was to create and then unlink the file immediately,
which serves as a hint that the application doesn't care about
persistence.
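
The create-then-unlink pattern itself is trivial to use today; the point above
is that it currently conveys nothing about writeback. A sketch:

```python
import os
import tempfile

def anonymous_temp_fd(directory=None):
    """Create-then-unlink: the fd stays usable, the name is gone, and
    the blocks are freed automatically on close or crash.  Note this
    only removes the *name* from the filesystem; today it does not
    stop the kernel from writing the dirty pages back, which is the
    missing piece discussed above."""
    fd, path = tempfile.mkstemp(dir=directory)
    os.unlink(path)  # file now exists only through the open descriptor
    return fd
```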

Postgres is far from being the only application that wants this; many
people resort to tmpfs because of this:
https://lwn.net/Articles/499410/

Regards,
Marti



From:
Marti Raudsepp
Date:

On Mon, Jan 20, 2014 at 1:51 AM, Dave Chinner <> wrote:
>> Postgres is far from being the only application that wants this; many
>> people resort to tmpfs because of this:
>> https://lwn.net/Articles/499410/
>
> Yes, we covered the possibility of using tmpfs much earlier in the
> thread, and came to the conclusion that temp files can be larger
> than memory so tmpfs isn't the solution here. :)

What I meant is: lots of applications want this behavior. If Linux
filesystems had support for delaying writeback for temporary files,
then there would be no point in mounting tmpfs on /tmp at all and we'd
get the best of both worlds.

Right now people resort to tmpfs because of this missing feature. And
then have their machines end up in swap hell if they overuse it.

Regards,
Marti



From:
Mel Gorman
Date:

On Fri, Jan 17, 2014 at 11:01:15AM -0800, Josh Berkus wrote:
> Mel,
> 

Hi,

> So we have a few interested parties.  What do we need to do to set up
> the Collab session?
> 

This is great and thanks!

There are two summits of interest here -- LSF/MM which will have all the
filesystem, storage and memory management people at it on March 24-25th
and Collaboration Summit which is on March 26-28th. We're interested in
both.

The LSF/MM committee is going through the first round of topic proposals at
the moment and we're aiming to send out the first set of invites soon. We're
hoping to invite two PostgreSQL people to LSF/MM itself for the dedicated
topic and your feedback on other topics and how they may help or hinder
PostgreSQL would be welcomed.

As LSF/MM is a relatively closed forum I'll be looking into having a
follow-up discussion at Collaboration Summit that is open to a wider and
more dedicated group. That hopefully will result in a small number of
concrete proposals that can be turned into patches over time.

-- 
Mel Gorman
SUSE Labs



From:
Andres Freund
Date:

On 2014-01-17 18:34:25 +0000, Mel Gorman wrote:
> > The scheme that'd allow us is the following:
> > When postgres reads a data page, it will continue to first look up the
> > page in its shared buffers, if it's not there, it will perform a page
> > cache backed read, but instruct that read to immediately remove from the
> > page cache afterwards (new API or, posix_fadvise() or whatever).
> > As long
> > as it's in shared_buffers, postgres will not need to issue new reads, so
> > there's no benefit keeping it in the page cache.
> > If the page is dirtied, it will be written out normally telling the
> > kernel to forget about caching the page (using 3) or possibly direct
> > io).
> > When a page in postgres's buffers (which wouldn't be set to very large
> > values) isn't needed anymore and *not* dirty, it will seed the kernel
> > page cache with the current data.
> > 
> 
> Ordinarily the initial read page could be discarded with fadvise but
> the later write would cause the data to be read back in again which is a
> waste. The details of avoiding that re-read are tricky from a core kernel
> perspective because ordinarily the kernel at that point does not know if
> the write is a full complete aligned write of an underlying filesystem
> structure or not.  It may need a different write path which potentially
> leads into needing changes to the address_space operations on a filesystem
> basis -- that would get messy and be a Linux-specific extension. I have
> not researched this properly at all, I could be way off but I have a
> feeling the details get messy.

Hm. This is surprising me a bit - and I bet it does hurt postgres
noticeably if that's the case, since the most frequently modified buffers
will only be written out to the OS once every checkpoint but never be
read in. So they are likely not hot enough to stay cached under
cache pressure.
So this would be a generally beneficial feature - and I doubt it's only
postgres that'd benefit.

> > Now, such a scheme wouldn't likely be zero-copy, but it would avoid
> > double buffering.
> 
> It wouldn't be zero copy because minimally the data needs to be handed
> over to the filesystem for writing to the disk, and the interface for that is
> offset,length based, not page based. Maybe sometimes it will be zero copy
> but it would be a filesystem-specific thing.

Exactly.

Greetings,

Andres Freund

-- 
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



From:
Dave Chinner
Date:

On Sun, Jan 19, 2014 at 03:37:37AM +0200, Marti Raudsepp wrote:
> On Wed, Jan 15, 2014 at 5:34 AM, Jim Nasby <> wrote:
> > it's very common to create temporary file data that will never, ever, ever
> > actually NEED to hit disk. Where I work being able to tell the kernel to
> > avoid flushing those files unless the kernel thinks it's got better things
> > to do with that memory would be EXTREMELY valuable
> 
> Windows has the FILE_ATTRIBUTE_TEMPORARY flag for this purpose.
> 
> ISTR that there was discussion about implementing something analogous
> in Linux when ext4 got delayed allocation support, but I don't think
> it got anywhere and I can't find the discussion now. I think the
> proposed interface was to create and then unlink the file immediately,
> which serves as a hint that the application doesn't care about
> persistence.

You're thinking about O_TMPFILE, which is for making temp files that
can't be seen in the filesystem namespace, not for preventing them
from being written to disk.

I don't really like the idea of overloading a namespace directive to
have special writeback connotations. What we are getting into the
realm of here is generic user controlled allocation and writeback
policy...

> Postgres is far from being the only application that wants this; many
> people resort to tmpfs because of this:
> https://lwn.net/Articles/499410/

Yes, we covered the possibility of using tmpfs much earlier in the
thread, and came to the conclusion that temp files can be larger
than memory so tmpfs isn't the solution here. :)

Cheers,

Dave.
-- 
Dave Chinner




From:
Mel Gorman
Date:

On Mon, Jan 20, 2014 at 10:51:41AM +1100, Dave Chinner wrote:
> On Sun, Jan 19, 2014 at 03:37:37AM +0200, Marti Raudsepp wrote:
> > On Wed, Jan 15, 2014 at 5:34 AM, Jim Nasby <> wrote:
> > > it's very common to create temporary file data that will never, ever, ever
> > > actually NEED to hit disk. Where I work being able to tell the kernel to
> > > avoid flushing those files unless the kernel thinks it's got better things
> > > to do with that memory would be EXTREMELY valuable
> > 
> > Windows has the FILE_ATTRIBUTE_TEMPORARY flag for this purpose.
> > 
> > ISTR that there was discussion about implementing something analogous
> > in Linux when ext4 got delayed allocation support, but I don't think
> > it got anywhere and I can't find the discussion now. I think the
> > proposed interface was to create and then unlink the file immediately,
> > which serves as a hint that the application doesn't care about
> > persistence.
> 
> You're thinking about O_TMPFILE, which is for making temp files that
> can't be seen in the filesystem namespace, not for preventing them
> from being written to disk.
> 
> I don't really like the idea of overloading a namespace directive to
> have special writeback connotations. What we are getting into the
> realm of here is generic user controlled allocation and writeback
> policy...
> 

Such overloading would be unwelcome. FWIW, I assumed this would be an
fadvise thing: initially something that controlled writeback policy on
the inode rather than the fd context, ignoring the offset and length
parameters. Granted, someone will probably throw a fit about adding a
Linux-specific flag to the fadvise64 syscall. POSIX_FADV_NOREUSE is
currently unimplemented and it could be argued that it could be used to
flag temporary files that have a different writeback policy, but it's
not clear if that matches the original intent of the posix flag.
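
For reference, POSIX_FADV_NOREUSE is already callable from userspace; the
kernel accepts it but, as noted above, Linux kernels of this era implement it
as a no-op, which is exactly what makes it a candidate to carry new semantics.
A hedged sketch:

```python
import os

def mark_single_use(fd):
    """Declare that the file's data will be accessed only once.
    On Linux kernels of this era POSIX_FADV_NOREUSE is accepted but
    unimplemented, so this changes nothing today; it is merely the
    existing hook that *could* carry a temp-file writeback policy."""
    if hasattr(os, "POSIX_FADV_NOREUSE"):  # not exposed on all platforms
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_NOREUSE)
```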

> > Postgres is far from being the only application that wants this; many
> > people resort to tmpfs because of this:
> > https://lwn.net/Articles/499410/
> 
> Yes, we covered the possibility of using tmpfs much earlier in the
> thread, and came to the conclusion that temp files can be larger
> than memory so tmpfs isn't the solution here. :)
> 

And swap IO patterns blow chunks because people rarely want to touch
that area of the code with a 50 foot pole. It gets filed under "if you're
swapping, you already lost".

-- 
Mel Gorman
SUSE Labs



From:
Mel Gorman
Date:

On Fri, Jan 17, 2014 at 03:24:01PM -0500, Gregory Smith wrote:
> On 1/17/14 10:37 AM, Mel Gorman wrote:
> >There is not an easy way to tell. To be 100% certain, it would require an
> >instrumentation patch or a systemtap script to detect when a
> >particular page is being written back and track the context. There
> >are approximations though. Monitor nr_dirty pages over time.
> 
> I have a benchmarking wrapper for the pgbench testing program called
> pgbench-tools:  https://github.com/gregs1104/pgbench-tools  As of
> October, on Linux it now plots the "Dirty" value from /proc/meminfo
> over time.
> <SNIP>

Cheers for pointing that out, I was not previously aware of its
existence. While I have some support for running pgbench via another kernel
testing framework (mmtests), the postgres-based tests are miserable. Right
now for me, pgbench is only set up to reproduce a workload that detected a
scheduler regression in the past so that it does not get reintroduced. I'd
like to have it running IO-based tests even though I typically do not
do proper regression testing for IO. I have used sysbench as a workload
generator before but it's not great for a number of reasons.

> I've been working on the problem of how we can make a benchmark test
> case that acts enough like real busy PostgreSQL servers that we can
> share it with kernel developers, and then everyone has an objective
> way to measure changes.  These rate limited tests are working much
> better for that than anything I came up with before.
> 

This would be very welcome, and thanks for the other observations on IO
scheduler parameter tuning. They could potentially be used to evaluate any
IO scheduler changes. For example -- deadline scheduler with these parameters
has X transactions/sec throughput with average latency of Y milliseconds
and a maximum fsync latency of Z seconds. Evaluate how well the out-of-box
behaviour compares against it with and without some set of patches.  At the
very least it would be useful for tracking historical kernel performance
over time and bisecting any regressions that got introduced. Once we have
a test case, I think many kernel developers (me at least) can run automated
bisections.

-- 
Mel Gorman
SUSE Labs



From:
Jeff Layton
Date:

On Mon, 20 Jan 2014 10:51:41 +1100
Dave Chinner <> wrote:

> On Sun, Jan 19, 2014 at 03:37:37AM +0200, Marti Raudsepp wrote:
> > On Wed, Jan 15, 2014 at 5:34 AM, Jim Nasby <> wrote:
> > > it's very common to create temporary file data that will never, ever, ever
> > > actually NEED to hit disk. Where I work being able to tell the kernel to
> > > avoid flushing those files unless the kernel thinks it's got better things
> > > to do with that memory would be EXTREMELY valuable
> > 
> > Windows has the FILE_ATTRIBUTE_TEMPORARY flag for this purpose.
> > 
> > ISTR that there was discussion about implementing something analogous
> > in Linux when ext4 got delayed allocation support, but I don't think
> > it got anywhere and I can't find the discussion now. I think the
> > proposed interface was to create and then unlink the file immediately,
> > which serves as a hint that the application doesn't care about
> > persistence.
> 
> You're thinking about O_TMPFILE, which is for making temp files that
> can't be seen in the filesystem namespace, not for preventing them
> from being written to disk.
> 
> I don't really like the idea of overloading a namespace directive to
> have special writeback connotations. What we are getting into the
> realm of here is generic user controlled allocation and writeback
> policy...
> 

Agreed -- O_TMPFILE semantics are a different beast entirely.

Perhaps what might be reasonable though is a fadvise POSIX_FADV_TMPFILE
hint that tells the kernel: "Don't write out this data unless it's
necessary due to memory pressure".

If the inode is only open via file descriptors that have that hint set
on them, then we could exempt it from dirty_expire_interval and
dirty_writeback_interval?

Tracking that desire on an inode open multiple times might be
"interesting" though. We'd have to be quite careful not to allow that
to open an attack vector.

> > Postgres is far from being the only application that wants this; many
> > people resort to tmpfs because of this:
> > https://lwn.net/Articles/499410/
> 
> Yes, we covered the possibility of using tmpfs much earlier in the
> thread, and came to the conclusion that temp files can be larger
> than memory so tmpfs isn't the solution here. :)
> 

-- 
Jeff Layton <>



From:
Bruce Momjian
Date:

On Wed, Jan 15, 2014 at 11:49:09AM +0000, Mel Gorman wrote:
> It may be the case that mmap/madvise is still required to handle a double
> buffering problem but it's far from being a free lunch and it has costs
> that read/write does not have to deal with. Maybe some of these problems
> can be fixed or mitigated but it is a case where a test case demonstrates
> the problem even if that requires patching PostgreSQL.

We suspected trying to use mmap would have costs, but it is nice to hear
actual details about it.

-- 
  Bruce Momjian  <>        http://momjian.us
  EnterpriseDB             http://enterprisedb.com

  + Everyone has their own god. +



From:
Bruce Momjian
Date:

On Fri, Jan 17, 2014 at 04:31:48PM +0000, Mel Gorman wrote:
> NUMA Optimisations
> ------------------
> 
> The primary one that showed up was zone_reclaim_mode. Enabling that parameter
> is a disaster for many workloads and apparently Postgres is one. It might
> be time to revisit leaving that thing disabled by default and explicitly
> requiring that NUMA-aware workloads that are correctly partitioned enable it.
> Otherwise NUMA considerations are not that much of a concern right now.

Here is a blog post about our zone_reclaim_mode-disable recommendations:
http://frosty-postgres.blogspot.com/2012/08/postgresql-numa-and-zone-reclaim-mode.html

> Direct IO, buffered IO, double buffering and wishlists
> ------------------------------------------------------
>    6. Only writeback pages if explicitly synced. Postgres has strict write
>       ordering requirements. In the words of Tom Lane -- "As things currently
>       stand, we dirty the page in our internal buffers, and we don't write
>       it to the kernel until we've written and fsync'd the WAL data that
>       needs to get to disk first". mmap() would avoid double buffering but
>       it has no control about the write ordering which is a show-stopper.
>       As Andres Freund described;

What was not explicitly stated is that the Postgres design is taking
advantage of the double-buffering "feature" here, writing to a
memory copy of the page while there is still an unmodified copy in the
kernel cache or on disk.  In the case of a crash, we rely on the fact
that the disk page is unchanged.  Certainly any design that requires the
kernel to manage two different copies of the same page is going to be
confusing.
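
The write-ordering rule Tom Lane describes can be made concrete; a simplified
sketch in which the fd handling and record format are illustrative, not
Postgres's actual code:

```python
import os

def write_data_page(wal_fd, data_fd, wal_record, page, page_offset):
    """WAL-before-data: the WAL covering a page change must be durable
    before the dirty page is handed to the kernel, because crash
    recovery relies on the on-disk page still being the old,
    unmodified copy."""
    os.write(wal_fd, wal_record)           # append the WAL record
    os.fsync(wal_fd)                       # WAL durable first: the hard rule
    os.pwrite(data_fd, page, page_offset)  # only now expose the new page
    # the data file itself is fsync'd much later, at checkpoint time
```

An mmap()'d buffer breaks exactly this ordering: the kernel may write a
modified mapped page back at any moment, before the WAL fsync, which is why
the text above calls it a show-stopper.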

One larger question is how many of these things that Postgres needs are
needed by other applications?  I doubt Postgres is large enough to
warrant changes on its own.

-- 
  Bruce Momjian  <>        http://momjian.us
  EnterpriseDB             http://enterprisedb.com

  + Everyone has their own god. +



From:
Миша Тюрин
Date:

Hi
But maybe postgres should provide its own subsystem, like Linux's
active/inactive memory lists, over and/or near shared buffers? There
could be some Postgres-specific heuristics in its own approach.

And does anyone know how mysql-innodb guys are getting with similar issues?

Thank you!

From:
Claudio Freire
Date:

On Tue, Jan 21, 2014 at 5:01 PM, Миша Тюрин <> wrote:
> And does anyone know how mysql-innodb guys are getting with similar issues?


I'm no innodb dev, but from managing mysql databases, I can say that
mysql simply eats all the RAM the admin is willing to allocate for the
DB, and is content with the page cache almost not working.

IOW: mysql manages its own cache and doesn't need or want the page
cache. That *does* result in terrible performance when I/O is needed.
Some workloads are nigh-impossible to optimize with this scheme.



From:
Robert Haas
Date:

On Tue, Jan 21, 2014 at 3:20 PM, Jan Kara <> wrote:
>> But that still doesn't work out very well, because now the guy who
>> does the write() has to wait for it to finish before he can do
>> anything else.  That's not always what we want, because WAL gets
>> written out from our internal buffers for multiple different reasons.
>   Well, you can always use AIO (io_submit) to submit direct IO without
> waiting for it to finish. But then you might need to track the outstanding
> IO so that you can watch with io_getevents() when it is finished.

Yeah.  That wouldn't work well for us; the process that did the
io_submit() would want to move on to other things, and how would it,
or any other process, know that the I/O had completed?

>   As I wrote in some other email in this thread, using IO priorities for
> data file checkpoint might be actually the right answer. They will work for
> IO submitted by fsync(). The downside is that currently IO priorities / IO
> scheduling classes work only with CFQ IO scheduler.

IMHO, the problem is simpler than that: no single process should be
allowed to completely screw over every other process on the system.
When the checkpointer process starts calling fsync(), the system
begins writing out the data that needs to be fsync()'d so aggressively
that service times for I/O requests from other process go through the
roof.  It's difficult for me to imagine that any application on any
I/O scheduler is ever happy with that behavior.  We shouldn't need to
sprinkle our fsync() calls with special magic juju sauce that says
"hey, when you do this, could you try to avoid causing the rest of the
system to COMPLETELY GRIND TO A HALT?".  That should be the *default*
behavior, if not the *only* behavior.

Now, that is not to say that we're unwilling to sprinkle magic juju
sauce if that's what it takes to solve this problem.  If calling
fadvise() or sync_file_range() or some new API that you invent at some
point prior to calling fsync() helps the kernel do the right thing,
we're willing to do that.  Or if you/the Linux community wants to
invent a new API fsync_but_do_not_crush_system() and have us call that
instead of the regular fsync(), we're willing to do that, too.  But I
think there's an excellent case to be made, at least as far as
checkpoint I/O spikes are concerned, that the API is just fine as it
is and Linux's implementation is simply naive.  We'd be perfectly
happy to wait longer for fsync() to complete in exchange for not
starving the rest of the system - and really, who wouldn't?  Linux is
a multi-user system, and apportioning resources among multiple tasks
is a basic function of a multi-user kernel.

</rant>
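As a concrete illustration of the kind of "juju sauce" that already exists: sync_file_range() lets a writer start writeback of ranges as it goes, so the closing fsync() finds little left to flush. A rough sketch, with invented names and sizes, Linux-specific and using buffered I/O:

```c
/* Hypothetical checkpoint-style writer: push dirty pages to storage a
 * chunk at a time with sync_file_range() so the closing fsync() finds
 * little left to flush.  Linux-specific; names and sizes are made up. */
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define CHUNK (128 * 1024)

/* Write nchunks * CHUNK bytes, nudging each chunk toward disk as we go. */
int paced_checkpoint_write(const char *path, int nchunks)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    static char chunk[CHUNK];
    memset(chunk, 'd', sizeof(chunk));

    for (int i = 0; i < nchunks; i++) {
        if (write(fd, chunk, sizeof(chunk)) != (ssize_t)sizeof(chunk)) {
            close(fd);
            return -1;
        }
        /* Start writeback of this chunk without waiting for it, instead
         * of letting all dirty pages pile up for the final fsync(). */
        sync_file_range(fd, (off_t)i * CHUNK, CHUNK,
                        SYNC_FILE_RANGE_WRITE);
    }

    int rc = fsync(fd);       /* durability point; should be cheap now */
    close(fd);
    return rc;
}
```

Note that SYNC_FILE_RANGE_WRITE alone gives no durability guarantee; the fsync() at the end is still what makes the data safe.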

Anyway, if CFQ or any other Linux I/O scheduler gets an option to
lower the priority of the fsyncs, I'm sure somebody here will test it
out and see whether it solves this problem.  AFAICT, experiments to
date have pretty much universally shown CFQ to be worse than not-CFQ
and everything else to be more or less equivalent - but if that
changes, I'm sure many PostgreSQL DBAs will be more than happy to flip
CFQ back on.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



From:
Dave Chinner
Date:

On Tue, Jan 21, 2014 at 09:20:52PM +0100, Jan Kara wrote:
> On Fri 17-01-14 08:57:25, Robert Haas wrote:
> > On Fri, Jan 17, 2014 at 7:34 AM, Jeff Layton <> wrote:
> > > So this says to me that the WAL is a place where DIO should really be
> > > reconsidered. It's mostly sequential writes that need to hit the disk
> > > ASAP, and you need to know that they have hit the disk before you can
> > > proceed with other operations.
> > 
> > Ironically enough, we actually *have* an option to use O_DIRECT here.
> > But it doesn't work well.  See below.
> > 
> > > Also, is the WAL actually ever read under normal (non-recovery)
> > > conditions or is it write-only under normal operation? If it's seldom
> > > read, then using DIO for them also avoids some double buffering since
> > > they wouldn't go through pagecache.
> > 
> > This is the first problem: if replication is in use, then the WAL gets
> > read shortly after it gets written.  Using O_DIRECT bypasses the
> > kernel cache for the writes, but then the reads stink.
>   OK, yes, this is hard to fix with direct IO.

Actually, it's not. Block level caching is the time-honoured answer
to this problem, and it's been used very successfully on a large
scale by many organisations. e.g. facebook with MySQL, O_DIRECT, XFS
and flashcache sitting on an SSD in front of rotating storage.
There's multiple choices for this now - bcache, dm-cache,
flashcache, etc, and they all solve this same problem. And in many
cases do it better than using the page cache because you can
independently scale the size of the block level cache...

And given the size of SSDs these days, being able to put half a TB
of flash cache in front of spinning disks is a pretty inexpensive
way of solving such IO problems....

> > If we're forcing the WAL out to disk because of transaction commit or
> > because we need to write the buffer protected by a certain WAL record
> > only after the WAL hits the platter, then it's fine.  But sometimes
> > we're writing WAL just because we've run out of internal buffer space,
> > and we don't want to block waiting for the write to complete.  Opening
> > the file with O_SYNC deprives us of the ability to control the timing
> > of the sync relative to the timing of the write.
>   O_SYNC has a heavy performance penalty. For ext4 it means an extra fs
> transaction commit whenever there's any metadata changed on the filesystem.
> Since mtime & ctime of files will be changed often, that will be the case very
> often.

Therefore: O_DATASYNC.

> > Maybe it'll be useful to have hints that say "always write this file
> > to disk as quick as you can" and "always postpone writing this file to
> > disk for as long as you can" for WAL and temp files respectively.  But
> > the rule for the data files, which are the really important case, is
> > not so simple.  fsync() is actually a fine API except that it tends to
> > destroy system throughput.  Maybe what we need is just for fsync() to
> > be less aggressive, or a less aggressive version of it.  We wouldn't
> > mind waiting an almost arbitrarily long time for fsync to complete if
> > other processes could still get their I/O requests serviced in a
> > reasonable amount of time in the meanwhile.
>   As I wrote in some other email in this thread, using IO priorities for
> data file checkpoint might be actually the right answer. They will work for
> IO submitted by fsync(). The downside is that currently IO priorities / IO
> scheduling classes work only with CFQ IO scheduler.

And I don't see it being implemented anywhere else because it's the
priority aware scheduling infrastructure in CFQ that causes all the
problems with IO concurrency and scalability...

Cheers,

Dave.
-- 
Dave Chinner




From:
Jan Kara
Date:

On Wed 22-01-14 09:07:19, Dave Chinner wrote:
> On Tue, Jan 21, 2014 at 09:20:52PM +0100, Jan Kara wrote:
> > > If we're forcing the WAL out to disk because of transaction commit or
> > > because we need to write the buffer protected by a certain WAL record
> > > only after the WAL hits the platter, then it's fine.  But sometimes
> > > we're writing WAL just because we've run out of internal buffer space,
> > > and we don't want to block waiting for the write to complete.  Opening
> > > the file with O_SYNC deprives us of the ability to control the timing
> > > of the sync relative to the timing of the write.
> >   O_SYNC has a heavy performance penalty. For ext4 it means an extra fs
> > transaction commit whenever there's any metadata changed on the filesystem.
> > Since mtime & ctime of files will be changed often, that will be the case very
> > often.
> 
> Therefore: O_DATASYNC.
  O_DSYNC to be exact.

> > > Maybe it'll be useful to have hints that say "always write this file
> > > to disk as quick as you can" and "always postpone writing this file to
> > > disk for as long as you can" for WAL and temp files respectively.  But
> > > the rule for the data files, which are the really important case, is
> > > not so simple.  fsync() is actually a fine API except that it tends to
> > > destroy system throughput.  Maybe what we need is just for fsync() to
> > > be less aggressive, or a less aggressive version of it.  We wouldn't
> > > mind waiting an almost arbitrarily long time for fsync to complete if
> > > other processes could still get their I/O requests serviced in a
> > > reasonable amount of time in the meanwhile.
> >   As I wrote in some other email in this thread, using IO priorities for
> > data file checkpoint might be actually the right answer. They will work for
> > IO submitted by fsync(). The downside is that currently IO priorities / IO
> > scheduling classes work only with CFQ IO scheduler.
> 
> And I don't see it being implemented anywhere else because it's the
> priority aware scheduling infrastructure in CFQ that causes all the
> problems with IO concurrency and scalability...
  So CFQ has all sorts of problems but I never had the impression that
priority aware scheduling is the culprit. It is all just complex - sync
idling, seeky writer detection, cooperating threads detection, sometimes
even sync vs async distinction isn't exactly what one would want. And I'm
not speaking about the cgroup stuff... So it doesn't seem to me that some
other IO scheduler couldn't reasonably efficiently implement stuff like IO
scheduling classes.
                            Honza
-- 
Jan Kara <>
SUSE Labs, CR



From:
Jan Kara
Date:

On Fri 17-01-14 08:57:25, Robert Haas wrote:
> On Fri, Jan 17, 2014 at 7:34 AM, Jeff Layton <> wrote:
> > So this says to me that the WAL is a place where DIO should really be
> > reconsidered. It's mostly sequential writes that need to hit the disk
> > ASAP, and you need to know that they have hit the disk before you can
> > proceed with other operations.
> 
> Ironically enough, we actually *have* an option to use O_DIRECT here.
> But it doesn't work well.  See below.
> 
> > Also, is the WAL actually ever read under normal (non-recovery)
> > conditions or is it write-only under normal operation? If it's seldom
> > read, then using DIO for them also avoids some double buffering since
> > they wouldn't go through pagecache.
> 
> This is the first problem: if replication is in use, then the WAL gets
> read shortly after it gets written.  Using O_DIRECT bypasses the
> kernel cache for the writes, but then the reads stink.
  OK, yes, this is hard to fix with direct IO.

> However, if you configure wal_sync_method=open_sync and disable
> replication, then you will in fact get O_DIRECT|O_SYNC behavior.
> 
> But that still doesn't work out very well, because now the guy who
> does the write() has to wait for it to finish before he can do
> anything else.  That's not always what we want, because WAL gets
> written out from our internal buffers for multiple different reasons.
  Well, you can always use AIO (io_submit) to submit direct IO without
waiting for it to finish. But then you might need to track the outstanding
IO so that you can watch with io_getevents() when it is finished.

> If we're forcing the WAL out to disk because of transaction commit or
> because we need to write the buffer protected by a certain WAL record
> only after the WAL hits the platter, then it's fine.  But sometimes
> we're writing WAL just because we've run out of internal buffer space,
> and we don't want to block waiting for the write to complete.  Opening
> the file with O_SYNC deprives us of the ability to control the timing
> of the sync relative to the timing of the write.
  O_SYNC has a heavy performance penalty. For ext4 it means an extra fs
transaction commit whenever there's any metadata changed on the filesystem.
Since mtime & ctime of files will be changed often, that will be the case
very often.

> > Again, I think this discussion would really benefit from an outline of
> > the different files used by pgsql, and what sort of data access
> > patterns you expect with them.
> 
> I think I more or less did that in my previous email, but here it is
> again in briefer form:
> 
> - WAL files are written (and sometimes read) sequentially and fsync'd
> very frequently and it's always good to write the data out to disk as
> soon as possible
> - Temp files are written and read sequentially and never fsync'd.
> They should only be written to disk when memory pressure demands it
> (but are a good candidate when that situation comes up)
> - Data files are read and written randomly.  They are fsync'd at
> checkpoint time; between checkpoints, it's best not to write them
> sooner than necessary, but when the checkpoint arrives, they all need
> to get out to the disk without bringing the system to a standstill
> 
> We have other kinds of files, but off-hand I'm not thinking of any
> that are really very interesting, apart from those.
> 
> Maybe it'll be useful to have hints that say "always write this file
> to disk as quick as you can" and "always postpone writing this file to
> disk for as long as you can" for WAL and temp files respectively.  But
> the rule for the data files, which are the really important case, is
> not so simple.  fsync() is actually a fine API except that it tends to
> destroy system throughput.  Maybe what we need is just for fsync() to
> be less aggressive, or a less aggressive version of it.  We wouldn't
> mind waiting an almost arbitrarily long time for fsync to complete if
> other processes could still get their I/O requests serviced in a
> reasonable amount of time in the meanwhile.
  As I wrote in some other email in this thread, using IO priorities for
data file checkpoint might be actually the right answer. They will work for
IO submitted by fsync(). The downside is that currently IO priorities / IO
scheduling classes work only with CFQ IO scheduler.
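Jan's IO-priority idea maps to the ioprio_set() syscall. A minimal sketch, using the raw syscall with constants mirroring the kernel's ioprio definitions; per the caveat above, whether the priority has any effect depends on the I/O scheduler in use:

```c
/* Sketch of the IO-priority idea: drop a checkpoint-style process into
 * a lower I/O scheduling class before issuing its writes and fsync()s.
 * Raw syscall; the constants mirror linux/ioprio.h.  The priority only
 * influences schedulers that honour it (CFQ at the time of writing). */
#include <sys/syscall.h>
#include <unistd.h>

#define IOPRIO_WHO_PROCESS  1
#define IOPRIO_CLASS_BE     2   /* best-effort (the default class) */
#define IOPRIO_CLASS_IDLE   3   /* only runs when the disk is otherwise idle */
#define IOPRIO_CLASS_SHIFT  13
#define IOPRIO_PRIO_VALUE(cls, data)  (((cls) << IOPRIO_CLASS_SHIFT) | (data))

/* Move the calling process into the given I/O class; 0 on success. */
int set_self_ioprio(int io_class, int level)
{
    return syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0,
                   IOPRIO_PRIO_VALUE(io_class, level));
}
```

A checkpointer could call set_self_ioprio(IOPRIO_CLASS_IDLE, 0) before its write/fsync pass, at the cost of checkpoints stalling indefinitely under sustained foreground I/O, which is why a low best-effort level may be the safer choice.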
                            Honza
-- 
Jan Kara <>
SUSE Labs, CR



From:
Bruce Momjian
Date:

On Tue, Jan 21, 2014 at 09:20:52PM +0100, Jan Kara wrote:
> > If we're forcing the WAL out to disk because of transaction commit or
> > because we need to write the buffer protected by a certain WAL record
> > only after the WAL hits the platter, then it's fine.  But sometimes
> > we're writing WAL just because we've run out of internal buffer space,
> > and we don't want to block waiting for the write to complete.  Opening
> > the file with O_SYNC deprives us of the ability to control the timing
> > of the sync relative to the timing of the write.
>   O_SYNC has a heavy performance penalty. For ext4 it means an extra fs
> transaction commit whenever there's any metadata changed on the filesystem.
> Since mtime & ctime of files will be changed often, that will be the case very
> often.

Also, there is the issue of writes that don't need syncing being synced
because sync is set on the file descriptor.  Here is output from our
pg_test_fsync tool when run on an SSD with a BBU:
$ pg_test_fsync
5 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync is Linux's default)
        open_datasync                                  n/a
        fdatasync                          8424.785 ops/sec     119 usecs/op
        fsync                              7127.072 ops/sec     140 usecs/op
        fsync_writethrough                             n/a
        open_sync                         10548.469 ops/sec      95 usecs/op

Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync is Linux's default)
        open_datasync                                  n/a
        fdatasync                          4367.375 ops/sec     229 usecs/op
        fsync                              4427.761 ops/sec     226 usecs/op
        fsync_writethrough                             n/a
        open_sync                          4303.564 ops/sec     232 usecs/op

Compare open_sync with different write sizes:
(This is designed to compare the cost of writing 16kB in different write
open_sync sizes.)
         1 * 16kB open_sync write          4938.711 ops/sec     202 usecs/op
         2 *  8kB open_sync writes         4233.897 ops/sec     236 usecs/op
         4 *  4kB open_sync writes         2904.710 ops/sec     344 usecs/op
         8 *  2kB open_sync writes         1736.720 ops/sec     576 usecs/op
        16 *  1kB open_sync writes          935.917 ops/sec    1068 usecs/op

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written on a different
descriptor.)
        write, fsync, close                7626.783 ops/sec     131 usecs/op
        write, close, fsync                6492.697 ops/sec     154 usecs/op

Non-Sync'ed 8kB writes:
        write                            351517.178 ops/sec       3 usecs/op

-- 
  Bruce Momjian  <>        http://momjian.us
  EnterpriseDB             http://enterprisedb.com

  + Everyone has their own god. +



From:
Jim Nasby
Date:

On 1/17/14, 7:57 AM, Robert Haas wrote:
> - WAL files are written (and sometimes read) sequentially and fsync'd
> very frequently and it's always good to write the data out to disk as
> soon as possible
> - Temp files are written and read sequentially and never fsync'd.
> They should only be written to disk when memory pressure demands it
> (but are a good candidate when that situation comes up)
> - Data files are read and written randomly.  They are fsync'd at
> checkpoint time; between checkpoints, it's best not to write them
> sooner than necessary, but when the checkpoint arrives, they all need
> to get out to the disk without bringing the system to a standstill

For sake of completeness... there are also data files that are temporary and don't need to be written to disk unless
the kernel thinks there's better things to use that memory for. AFAIK those files are never fsync'd.

In other words, these are the same as the temp files Robert describes except they also have random access. Dunno if
that matters.
 
-- 
Jim C. Nasby, Data Architect                       
512.569.9461 (cell)                         http://jim.nasby.net



From:
Jim Nasby
Date:

On 1/17/14, 2:24 PM, Gregory Smith wrote:
> I am skeptical that the database will take over very much of this work and perform better than the Linux kernel does.
> My take is that our most useful role would be providing test cases kernel developers can add to a performance
> regression suite.  Ugly "we never thought that would happen" situations seem at the root of many of the kernel
> performance regressions people here get nailed by.

FWIW, there are some scenarios where we could potentially provide additional info to the kernel scheduler; stuff that
we know but that it never will.
 

For example, if we have a limit clause we can (sometimes) provide a rough estimate of how many pages we'll need to read
from a relation.
 

Probably more useful is the case of index scans; if we pre-read more data from the index we could hand the kernel a
list of base relation blocks that we know we'll need.
 

There's some other things that have been mentioned, such as cases where files will only be accessed sequentially.

Outside of that though, the kernel is going to be in a way better position to schedule IO than we will ever be. Not
only does it understand the underlying hardware, it can also see everything else that's going on.
 
-- 
Jim C. Nasby, Data Architect                       
512.569.9461 (cell)                         http://jim.nasby.net



From:
Jim Nasby
Date:

On 1/19/14, 5:51 PM, Dave Chinner wrote:
>> Postgres is far from being the only application that wants this; many
>> >people resort to tmpfs because of this:
>> >https://lwn.net/Articles/499410/
> Yes, we covered the possibility of using tmpfs much earlier in the
> thread, and came to the conclusion that temp files can be larger
> than memory so tmpfs isn't the solution here. :)

Although... instead of inventing new APIs and foisting this work onto applications, perhaps it would be better to
modify tmpfs such that it can handle a temp space that's larger than memory... possibly backing it with X amount of
real disk and allowing it/the kernel to decide when to passively move files out of the in-memory tmpfs and onto disk.
 

Of course that's theoretically what swapping is supposed to do, but if that's not up to the job...
-- 
Jim C. Nasby, Data Architect                       
512.569.9461 (cell)                         http://jim.nasby.net



From:
Claudio Freire
Date:

On Wed, Jan 22, 2014 at 10:08 PM, Jim Nasby <> wrote:
>
> Probably more useful is the case of index scans; if we pre-read more data
> from the index we could hand the kernel a list of base relation blocks that
> we know we'll need.


Actually, I've already tried this. The most important part is fetching
heap pages, not index. Tried that too.

Currently, fadvising those pages works *in detriment* of physically
correlated scans. That's a kernel bug I've reported to LKML, and I
could probably come up with a patch. I've just never had time to set
up the testing machinery to test the patch myself.



From:
Gregory Smith
Date:

On 1/20/14 9:46 AM, Mel Gorman wrote:
> They could potentially be used to evaluate any IO scheduler changes. 
> For example -- deadline scheduler with these parameters has X 
> transactions/sec throughput with average latency of Y milliseconds 
> and a maximum fsync latency of Z seconds. Evaluate how well the 
> out-of-box behaviour compares against it with and without some set of 
> patches. At the very least it would be useful for tracking historical 
> kernel performance over time and bisecting any regressions that got 
> introduced. Once we have a test I think many kernel developers (me at 
> least) can run automated bisections once a test case exists. 

That's the long term goal.  What we used to get out of pgbench were 
things like >60 second latencies when a checkpoint hit with GBs of dirty 
memory.  That does happen in the real world, but that's not a realistic 
case you can tune for very well.  In fact, tuning for it can easily 
degrade performance on more realistic workloads.

The main complexity I don't have a clear view of yet is how much 
unavoidable storage level latency there is in all of the common 
deployment types.  For example, I can take a server with a 256MB 
battery-backed write cache and set dirty_background_bytes to be smaller 
than that.  So checkpoint spikes go away, right?  No. Eventually you 
will see dirty_background_bytes of data going into an already full 256MB 
cache.  And when that happens, the latency will be based on how long it 
takes to write the cached 256MB out to the disks.  If you have a single 
disk or RAID-1 pair, that random I/O could easily happen at 5MB/s or 
less, and that makes for a 51 second cache clearing time.  This is a lot 
better now than it used to be because fsync hasn't flushed the whole 
cache in many years now. (Only RHEL5 systems still in the field suffer 
much from that era of code)  But you do need to look at the distribution 
of latency a bit because of how the cache impacts things; you can't just 
consider min/max values.

Take the BBWC out of the equation, and you'll see latency proportional 
to how long it takes to clear the disk's cache out. It's fun "upgrading" 
from a disk with 32MB of cache to 64MB only to watch worst case latency 
double.  At least the kernel does the right thing now, using that cache 
when it can while forcing data out when fsync calls arrive.  (That's 
another important kernel optimization we'll never be able to teach the 
database)

-- 
Greg Smith 
Chief PostgreSQL Evangelist - http://crunchydatasolutions.com/



From:
Andreas Pflug
Date:

Am 23.01.14 02:14, schrieb Jim Nasby:
> On 1/19/14, 5:51 PM, Dave Chinner wrote:
>>> Postgres is far from being the only application that wants this; many
>>> >people resort to tmpfs because of this:
>>> >https://lwn.net/Articles/499410/
>> Yes, we covered the possibility of using tmpfs much earlier in the
>> thread, and came to the conclusion that temp files can be larger
>> than memory so tmpfs isn't the solution here. :)
>
> Although... instead of inventing new APIs and foisting this work onto
> applications, perhaps it would be better to modify tmpfs such that it
> can handle a temp space that's larger than memory... possibly backing
> it with X amount of real disk and allowing it/the kernel to decide
> when to passively move files out of the in-memory tmpfs and onto disk.

This is exactly what I'd expect from a file system that's suitable for
tmp purposes. The current tmpfs should rather have been named memfs or
so, since it lacks dedicated disk backing storage.

Regards,
Andreas