Thread: Page replacement algorithm in buffer cache

Page replacement algorithm in buffer cache

From
Atri Sharma
Date:
Hello all,

Sorry if this is a naive question.

I was going through Greg Smith's slides on buffer
cache(http://www.westnet.com/~gsmith/content/postgresql/InsideBufferCache.pdf).
When going through the page replacement algorithm that we use i.e.
clocksweep algorithm, I felt a potential problem in our current
system.

Specifically, when a new entry is allocated in the buffer, its
USAGE_COUNT is set to 1. On each sweep of the algorithm, the
USAGE_COUNT is decremented and an entry whose USAGE_COUNT becomes
zero is replaced.

I feel that this could lead to a bias towards replacement of
relatively younger pages in the cache over older pages. An entry
which has just entered the cache with USAGE_COUNT=1 could be replaced
soon, but it may be needed frequently in the near future, which would
result in it being repeatedly brought into the cache, leading to
replacement overheads.

I think this is the well-known face-off between LRU and MRU algorithms.

How do we work around this problem?

Regards,

Atri

--
Regards,

Atri
l'apprenant



Re: Page replacement algorithm in buffer cache

From
Amit Kapila
Date:
On Friday, March 22, 2013 10:22 AM Atri Sharma wrote:
> Hello all,
> 
> Sorry if this is a naive question.
> 
> I was going through Greg Smith's slides on buffer
> cache(http://www.westnet.com/~gsmith/content/postgresql/InsideBufferCac
> he.pdf).
> When going through the page replacement algorithm that we use i.e.
> clocksweep algorithm, I felt a potential problem in our current
> system.
> 
> Specifically, when a new entry is allocated in the buffer, it's
> USAGE_COUNT is set to 1. On each sweep of the algorithm, the
> USAGE_COUNT is decremented and an entry whose  USAGE_COUNT becomes
> zero is replaced.

Yes, it is replaced, but in the next clock sweep pass, not immediately after
its count reaches 0.
So until the next pass, if nobody accesses the buffer and all other
buffers have a higher count, it can be replaced.
Also, the buffer that has just been returned, whose usage count becomes 1,
will only have its usage count reduced in the next pass.
So on the whole, I think it takes 2 passes for a freshly returned buffer to be
re-used, in case no one uses it again.
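
To make the two-pass behaviour concrete, here is a very rough sketch of the
sweep in pseudo-C (my own simplification, not the actual StrategyGetBuffer
code; pinning, header locking and the freelist are all omitted, and the
names are illustrative):

#include <stdbool.h>

typedef struct
{
    int     usage_count;    /* 0 .. BM_MAX_USAGE_COUNT */
    bool    pinned;         /* page is currently in use by some backend */
} BufferSketch;

static int
sweep_for_victim(BufferSketch *buffers, int nbuffers, int *hand)
{
    for (;;)
    {
        int           idx = *hand;
        BufferSketch *buf = &buffers[idx];

        *hand = (idx + 1) % nbuffers;   /* advance the clock hand */

        if (buf->pinned)
            continue;                   /* skip buffers in active use */

        if (buf->usage_count == 0)
            return idx;                 /* already at 0: this is the victim */

        buf->usage_count--;             /* otherwise decrement and move on */
    }
}

A buffer that entered with usage_count = 1 is decremented to 0 on the first
visit and only evicted when the hand comes back around and still finds it at
0, which is why it takes two passes if nobody touches it in between.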

With Regards,
Amit Kapila.




Re: Page replacement algorithm in buffer cache

From
Atri Sharma
Date:

Sent from my iPad

On 22-Mar-2013, at 11:28, Amit Kapila <amit.kapila@huawei.com> wrote:

> On Friday, March 22, 2013 10:22 AM Atri Sharma wrote:
>> Hello all,
>>
>> Sorry if this is a naive question.
>>
>> I was going through Greg Smith's slides on buffer
>> cache(http://www.westnet.com/~gsmith/content/postgresql/InsideBufferCac
>> he.pdf).
>> When going through the page replacement algorithm that we use i.e.
>> clocksweep algorithm, I felt a potential problem in our current
>> system.
>>
>> Specifically, when a new entry is allocated in the buffer, it's
>> USAGE_COUNT is set to 1. On each sweep of the algorithm, the
>> USAGE_COUNT is decremented and an entry whose  USAGE_COUNT becomes
>> zero is replaced.
>
> Yes, it is replaced but in the next clock sweep pass, not immediately after
> making 0.
> So till the time of next pass if nobody accesses the buffer and all other
> buffers have higher count, it can be replaced.
> Also the buffer, it has returned for which the usage count becomes 1, it
> will come to reduce the usage count only in next pass.
> So in whole, I think it needs 2 passes for a freshly returned buffer to be
> re-used incase no one uses it again.
>
> With Regards,
> Amit Kapila.
>

Hmm, so in the second pass, it gets replaced, right?

I think that if the initialization of USAGE_COUNT starts at the maximum
allowed value instead of one, we can have a better solution to this
problem.

Another, more complex solution could be to introduce an ageing factor as well.

Regards,

Atri


Re: Page replacement algorithm in buffer cache

From
Amit Kapila
Date:
On Friday, March 22, 2013 12:00 PM Atri Sharma wrote:
> 
> 
> Sent from my iPad
> 
> On 22-Mar-2013, at 11:28, Amit Kapila <amit.kapila@huawei.com> wrote:
> 
> > On Friday, March 22, 2013 10:22 AM Atri Sharma wrote:
> >> Hello all,
> >>
> >> Sorry if this is a naive question.
> >>
> >> I was going through Greg Smith's slides on buffer
> >>
> cache(http://www.westnet.com/~gsmith/content/postgresql/InsideBufferCac
> >> he.pdf).
> >> When going through the page replacement algorithm that we use i.e.
> >> clocksweep algorithm, I felt a potential problem in our current
> >> system.
> >>
> >> Specifically, when a new entry is allocated in the buffer, it's
> >> USAGE_COUNT is set to 1. On each sweep of the algorithm, the
> >> USAGE_COUNT is decremented and an entry whose  USAGE_COUNT becomes
> >> zero is replaced.
> >
> > Yes, it is replaced but in the next clock sweep pass, not immediately
> after
> > making 0.
> > So till the time of next pass if nobody accesses the buffer and all
> other
> > buffers have higher count, it can be replaced.
> > Also the buffer, it has returned for which the usage count becomes 1,
> it
> > will come to reduce the usage count only in next pass.
> > So in whole, I think it needs 2 passes for a freshly returned buffer
> to be
> > re-used incase no one uses it again.
> >
> > With Regards,
> > Amit Kapila.
> >
> 
> Hmm, so in the second pass, it gets replaced, right?

Yes.


> I think that if the initialization of USAGE_COUNT starts at the maximum
> allowed value instead of one, we can have a better solution to this
> problem.

So what is your idea? If you start at maximum, what will we do for further
accesses to it?
Why do you want to give more priority to a just-loaded page?


> Another,more complex solution could be to introduce an ageing factor as
> well.

With Regards,
Amit Kapila.




Re: Page replacement algorithm in buffer cache

From
Atri Sharma
Date:
>
>> I think that if the initialization of USAGE_COUNT starts at the maximum
>> allowed value instead of one, we can have a better solution to this
>> problem.
>
> So what is your idea, if you start at maximum, what we will do for further
> accesses to it?

I haven't chalked out a detailed plan yet, but I think the idea of
initializing USAGE_COUNT to maximum value is not at all good. I was
just thinking off the top of my head.

> Why do you want to give more priority to just loaded page?

I just want it to have more chances to stay, rather than being
replaced pretty early. This is because, as I said earlier, a new page
could be in high demand in the near future, which would lead to the
page being repeatedly evicted and brought back in, and hence cause
overhead.




>> Another,more complex solution could be to introduce an aging factor

This is the one I think would work out best: add an age factor based
on the time an entry has spent in the cache, along with its usage
count.

So, what I am proposing here is to add another factor in the
clocksweep algorithm when it selects victim pages for replacement.
Specifically, the selection of victim pages should be done with the
usage_count AND the time spent by the entry in the cache. This would
give priority to heavily accessed pages while not ignoring relatively
young pages either. If a page is not accessed for a long time after
it was allocated, it would be the ideal victim for replacement both in
terms of USAGE_COUNT as well as age.
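
Just to illustrate what I mean, something along these lines (the struct,
field names and the weighting are all made up for illustration, not
proposed code):

#include <time.h>

typedef struct
{
    int     usage_count;
    time_t  loaded_at;          /* when the page entered the cache */
} AgedBufferSketch;

/* lower score = better eviction candidate */
static double
victim_score(const AgedBufferSketch *buf, time_t now)
{
    double  age_in_minutes = difftime(now, buf->loaded_at) / 60.0;

    /*
     * An old page with a low usage count scores lowest and gets evicted
     * first; a freshly loaded page is protected a little by its small age.
     * The 1:1 weighting here is arbitrary and would need tuning.
     */
    return (double) buf->usage_count - age_in_minutes;
}

So a page that has sat in the cache for a long time without pushing its
usage count up would be the first to go, while a page that just came in
gets a grace period.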

Regards,

Atri




--
Regards,

Atri
l'apprenant



Re: Page replacement algorithm in buffer cache

From
Amit Kapila
Date:
On Friday, March 22, 2013 4:16 PM Atri Sharma wrote:
> >
> >> I think that if the initialization of USAGE_COUNT starts at the
> maximum
> >> allowed value instead of one, we can have a better solution to this
> >> problem.
> >
> > So what is your idea, if you start at maximum, what we will do for
> further
> > accesses to it?
> 
> I havent chalked out a detailed plan yet, but I think the idea of
> initializing USAGE_COUNT to maximum value is not at all good. I was
> just thinking off the top of my head.
> 
> > Why do you want to give more priority to just loaded page?
> 
> I just want it to have more chances to stay, rather than being
> replaced pretty early. This is because,  as I said earlier, a new page
> could be in high demand in near future, which would lead to repeated
> replacement-bringing in of page and hence cause overheads.
> 
> 
> 
> 
> >> Another,more complex solution could be to introduce an aging factor
> 
> This is the one I think would work out best, add an age factor as to
> the time duration which an entry has spent in the cache along with its
> usage count.
> 
> So, what I am proposing here is to add another factor in the
> clocksweep algorithm when it selects victim pages for replacement.
> Specifically, the selection of victim pages should be done with the
> usage_count AND the time spent by the entry in the cache. This would
> give priority to pages with high accesses and not ignore relatively
> young pages as well. If a page is not accessed for a long time after
> it was allocated, it would be the ideal victim for replacement both in
> terms of USAGE_COUNT as well as age.

What would you do if the only young page has usage count zero during the
second sweep?
I don't think introducing another factor along with usage count would be of
much help.
Have you encountered any such workload where relatively young pages are
getting victimized and that is causing performance issues?

With Regards,
Amit Kapila.





Re: Page replacement algorithm in buffer cache

From
Atri Sharma
Date:
>
> What would you do if the only young page has usage count zero during second
> sweep.

Umm....The same approach we take when there is no page with usage
count zero in a sweep in the current algorithm?


> I don't think introducing another factor along with usage count would do any
> much help.

What other approach can we take here?


> Have you encountered any such workload very relatively young pages are
> getting victimized
> and the same is causing performance issues?

Not yet, I figured this might be a problem and am designing test cases
for the same. I would be glad for some help there please.

Regards,

Atri




--
Regards,

Atri
l'apprenant



Re: Page replacement algorithm in buffer cache

From
Amit Kapila
Date:
On Friday, March 22, 2013 4:36 PM Atri Sharma wrote:
> >
> > What would you do if the only young page has usage count zero during
> second
> > sweep.
> 
> Umm....The same approach we take when there is no page with usage
> count zero in a sweep in the current algorithm?

It would give more priority to a young page as compared to a more used page.
I don't know if that would be the correct thing to do.

With Regards,
Amit Kapila.




Re: Page replacement algorithm in buffer cache

From
Atri Sharma
Date:
On Fri, Mar 22, 2013 at 4:53 PM, Amit Kapila <amit.kapila@huawei.com> wrote:
> On Friday, March 22, 2013 4:36 PM Atri Sharma wrote:
>> >
>> > What would you do if the only young page has usage count zero during
>> second
>> > sweep.
>>
>> Umm....The same approach we take when there is no page with usage
>> count zero in a sweep in the current algorithm?
>
> It would give more priority to young page as compare to more used page.
> I don't know if that would be correct thing to do.

This is my idea: give equal priority to new pages when they enter the
cache, so that they all have an equal chance of being replaced initially.
With time, usage_count shall become the deciding factor in victim
selection.

Regards,

Atri



--
Regards,

Atri
l'apprenant



Re: Page replacement algorithm in buffer cache

From
Ants Aasma
Date:
On Mar 22, 2013 12:46 PM, "Atri Sharma" <atri.jiit@gmail.com> wrote:

> This is the one I think would work out best, add an age factor as to
> the time duration which an entry has spent in the cache along with its
> usage count.

You might want to check out the LIRS cache replacement algorithm [1].
That algorithm tries to estimate least frequently used instead of
least recently used. MySQL uses it for its buffer replacement
policy. There is also a clock sweep based approximation called
CLOCK-Pro. Papers describing and evaluating both are available on the
net. The evaluations in the papers showed significantly better
performance for both of those compared to regular clock sweep or even
ARC.

However, I think the main issue isn't finding new algorithms that are
better in some specific circumstances. The hard part is figuring out
whether their performance is better in general. My idea was to create
a patch to capture page pinning traffic from PostgreSQL (maybe stream
out into a per backend file), run it with some production workloads
and use that to generate testing workloads for the cache replacement
policies. Haven't gotten round to actually doing that though.

[1] http://en.wikipedia.org/wiki/LIRS_caching_algorithm

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de



Re: Page replacement algorithm in buffer cache

From
Atri Sharma
Date:
>
> However, I think the main issue isn't finding new algorithms that are
> better in some specific circumstances. The hard part is figuring out
> whether their performance is better in general. My idea was to create
> a patch to capture page pinning traffic from PostgreSQL (maybe stream
> out into a per backend file), run it with some production workloads
> and use that to generate testing workloads for the cache replacement
> policies. Haven't gotten round to actually doing that though.
>
> [1] http://en.wikipedia.org/wiki/LIRS_caching_algorithm


Thanks for the link. I think LIRS can indeed be helpful in our case.

We should indeed build some test cases for testing this theory. I am
all for capturing page replacement and usage data and analyzing it.

Atri

--
Regards,

Atri
l'apprenant



Re: Page replacement algorithm in buffer cache

From
Tom Lane
Date:
Ants Aasma <ants@cybertec.at> writes:
> You might want to check out the LIRS cache replacement algorithm [1].
> That algorithm tries to estimate least frequently used instead of
> least recently used. Mysql uses it for their buffer replacement
> policy. There is also a clock sweep based approximation called
> CLOCK-Pro. Papers describing and evaluating both are available on the
> net. The evaluations in the papers showed significantly better
> performance for both of those compared to regular clock sweep or even
> ARC.

I seem to recall that CLOCK-Pro, or something named similarly to that,
was one of the alternatives discussed when we went over to the current
clock-sweep approach.  And we definitely looked at ARC.  It might be
worth checking the archives from back then to see what's already been
considered.

> However, I think the main issue isn't finding new algorithms that are
> better in some specific circumstances. The hard part is figuring out
> whether their performance is better in general.

Yeah. You can prove almost anything with the right set of test cases :-(
        regards, tom lane



Re: Page replacement algorithm in buffer cache

From
Greg Stark
Date:
On Fri, Mar 22, 2013 at 2:02 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> And we definitely looked at ARC

We didn't just look at it. At least one release used it. Then patent
issues were raised (and I think the implementation had some contention
problems).


-- 
greg



Re: Page replacement algorithm in buffer cache

From
Atri Sharma
Date:
On Fri, Mar 22, 2013 at 11:36 PM, Greg Stark <stark@mit.edu> wrote:
> On Fri, Mar 22, 2013 at 2:02 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> And we definitely looked at ARC
>
> We didn't just look at it. At least one release used it. Then patent
> issues were raised (and I think the implementation had some contention
> problems).
>
>
> --
> greg

What is the general thinking? Is it time to start testing again and
thinking about improvements to the current algorithm?

Regards,

Atri

--
Regards,

Atri
l'apprenant



Re: Page replacement algorithm in buffer cache

From
Merlin Moncure
Date:
On Fri, Mar 22, 2013 at 1:13 PM, Atri Sharma <atri.jiit@gmail.com> wrote:
> On Fri, Mar 22, 2013 at 11:36 PM, Greg Stark <stark@mit.edu> wrote:
>> On Fri, Mar 22, 2013 at 2:02 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> And we definitely looked at ARC
>>
>> We didn't just look at it. At least one release used it. Then patent
>> issues were raised (and I think the implementation had some contention
>> problems).
>>
>>
>> --
>> greg
>
> What is the general thinking? Is it time to start testing again and
> thinking about improvements to the current algorithm?

well, what problem are you trying to solve exactly?  the main problems
I see today are not so much in terms of page replacement but spinlock
and lwlock contention.

merlin



Re: Page replacement algorithm in buffer cache

From
Tom Lane
Date:
Merlin Moncure <mmoncure@gmail.com> writes:
> On Fri, Mar 22, 2013 at 1:13 PM, Atri Sharma <atri.jiit@gmail.com> wrote:
>> What is the general thinking? Is it time to start testing again and
>> thinking about improvements to the current algorithm?

> well, what problem are you trying to solve exactly?  the main problems
> I see today are not so much in terms of page replacement but spinlock
> and lwlock contention.

Even back when we last hacked on that algorithm, the concerns were not
so much about which pages it replaced as how much overhead and
contention was created by the management algorithm.  I haven't seen any
reason to think we have a problem with the quality of the replacement
choices.  The proposal to increase the initial usage count would
definitely lead to more overhead/contention, though, because it would
result in having to circle around all the buffers more times (on
average) to get a free buffer.
        regards, tom lane



Re: Page replacement algorithm in buffer cache

From
Merlin Moncure
Date:
On Fri, Mar 22, 2013 at 2:52 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Merlin Moncure <mmoncure@gmail.com> writes:
>> On Fri, Mar 22, 2013 at 1:13 PM, Atri Sharma <atri.jiit@gmail.com> wrote:
>>> What is the general thinking? Is it time to start testing again and
>>> thinking about improvements to the current algorithm?
>
>> well, what problem are you trying to solve exactly?  the main problems
>> I see today are not so much in terms of page replacement but spinlock
>> and lwlock contention.
>
> Even back when we last hacked on that algorithm, the concerns were not
> so much about which pages it replaced as how much overhead and
> contention was created by the management algorithm.  I haven't seen any
> reason to think we have a problem with the quality of the replacement
> choices.  The proposal to increase the initial usage count would
> definitely lead to more overhead/contention, though, because it would
> result in having to circle around all the buffers more times (on
> average) to get a free buffer.


yup...absolutely.  I have a hunch that the occasional gripes we see
about server stalls under high load with read only (or mostly read
only) loads are coming from spinlock contention under the lwlock
hitting a critical point and shutting the server down effectively
until by chance the backend with the lwlock gets lucky and lands the
spinlock.

I think there is some very low hanging optimization fruit in the clock
sweep loop.   first and foremost, I see no good reason why when
scanning pages we have to spin and wait on a buffer in order to
pedantically adjust usage_count.  some simple refactoring there could
set it up so that a simple TAS (or even a TTAS with the first test in
front of the cache line lock as is done automatically on x86 IIRC)
could guard the buffer and, in the event of any lock detected, simply
move on to the next candidate without messing around with that buffer
at all.   This could be construed as a 'trylock' variant of a spinlock
and might help out with cases where an especially hot buffer is
locking up the sweep.  This is exploiting the fact that from
StrategyGetBuffer we don't need a *particular* buffer, just *a*
buffer.
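
as a rough sketch of what I mean (using C11 atomics here rather than the
actual s_lock/buffer header API, so all of the names are illustrative):

#include <stdatomic.h>
#include <stdbool.h>

typedef struct
{
    atomic_flag locked;         /* stand-in for the buffer header spinlock */
    int         usage_count;
} BufferHeaderSketch;

/* one sweep step; returns true if this buffer looks like a victim */
static bool
sweep_step_trylock(BufferHeaderSketch *buf)
{
    /* conditional acquire: never wait on a contended header */
    if (atomic_flag_test_and_set(&buf->locked))
        return false;           /* busy -- skip it, it's probably hot anyway */

    bool victim = (buf->usage_count == 0);

    if (!victim)
        buf->usage_count--;

    atomic_flag_clear(&buf->locked);
    return victim;
}

the point being that a contended header is simply passed over instead of
stalling the whole allocation.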

I also wonder if we shouldn't (perhaps in addition to the above)
resuscitate Jeff Jane's idea to get rid of the lwlock completely and
manage everything with spinlocks..

Naturally, all of this would have to be confirmed with some very robust testing.

merlin



Re: Page replacement algorithm in buffer cache

From
Tom Lane
Date:
Merlin Moncure <mmoncure@gmail.com> writes:
> I think there is some very low hanging optimization fruit in the clock
> sweep loop.   first and foremost, I see no good reason why when
> scanning pages we have to spin and wait on a buffer in order to
> pedantically adjust usage_count.  some simple refactoring there could
> set it up so that a simple TAS (or even a TTAS with the first test in
> front of the cache line lock as we done automatically in x86 IIRC)
> could guard the buffer and, in the event of any lock detected, simply
> move on to the next candidate without messing around with that buffer
> at all.   This could construed as a 'trylock' variant of a spinlock
> and might help out with cases where an especially hot buffer is
> locking up the sweep.  This is exploiting the fact that from
> StrategyGetBuffer we don't need a *particular* buffer, just *a*
> buffer.

Hm.  You could argue in fact that if there's contention for the buffer
header, that's proof that it's busy and shouldn't have its usage count
decremented.  So this seems okay from a logical standpoint.

However, I'm not real sure that it's possible to do a conditional
spinlock acquire that doesn't create just as much hardware-level
contention as a full acquire (ie, TAS is about as bad whether it
gets the lock or not).  So the actual benefit is a bit less clear.
        regards, tom lane



Re: Page replacement algorithm in buffer cache

From
Merlin Moncure
Date:
On Fri, Mar 22, 2013 at 3:16 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Merlin Moncure <mmoncure@gmail.com> writes:
>> I think there is some very low hanging optimization fruit in the clock
>> sweep loop.   first and foremost, I see no good reason why when
>> scanning pages we have to spin and wait on a buffer in order to
>> pedantically adjust usage_count.  some simple refactoring there could
>> set it up so that a simple TAS (or even a TTAS with the first test in
>> front of the cache line lock as we done automatically in x86 IIRC)
>> could guard the buffer and, in the event of any lock detected, simply
>> move on to the next candidate without messing around with that buffer
>> at all.   This could construed as a 'trylock' variant of a spinlock
>> and might help out with cases where an especially hot buffer is
>> locking up the sweep.  This is exploiting the fact that from
>> StrategyGetBuffer we don't need a *particular* buffer, just *a*
>> buffer.
>
> Hm.  You could argue in fact that if there's contention for the buffer
> header, that's proof that it's busy and shouldn't have its usage count
> decremented.  So this seems okay from a logical standpoint.
>
> However, I'm not real sure that it's possible to do a conditional
> spinlock acquire that doesn't create just as much hardware-level
> contention as a full acquire (ie, TAS is about as bad whether it
> gets the lock or not).  So the actual benefit is a bit less clear.

well if you do a non-locking test first you could at least avoid some
cases (and, if you get the answer wrong, so what?) by jumping to the
next buffer immediately.  if the non locking test comes good, only
then do you do a hardware TAS.

you could in fact go further and dispense with all locking in front of
usage_count, on the premise that it's only advisory and not a real
refcount.  so you only then lock if/when it's time to select a
candidate buffer, and only then when you did a non locking test first.
this would of course require some amusing adjustments to various
logical checks (usage_count <= 0, heh).

merlin



Re: Page replacement algorithm in buffer cache

From
Ants Aasma
Date:
On Fri, Mar 22, 2013 at 10:22 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> well if you do a non-locking test first you could at least avoid some
> cases (and, if you get the answer wrong, so what?) by jumping to the
> next buffer immediately.  if the non locking test comes good, only
> then do you do a hardware TAS.
>
> you could in fact go further and dispense with all locking in front of
> usage_count, on the premise that it's only advisory and not a real
> refcount.  so you only then lock if/when it's time to select a
> candidate buffer, and only then when you did a non locking test first.
>  this would of course require some amusing adjustments to various
> logical checks (usage_count <= 0, heh).

Moreover, if the buffer happens to miss a decrement due to a data
race, there's a good chance that the buffer is heavily used and
wouldn't need to be evicted soon anyway. (if you arrange it to be a
read-test-inc/dec-store operation then you will never go out of
bounds) However, clocksweep and usage_count maintenance is not what is
causing contention because that workload is distributed. The issue is
pinning and unpinning. There we need an accurate count and there are
some pages like index roots that get hit very heavily. Things to do
there would be, in my opinion, to convert to a futex based spinlock so
that when there is contention it doesn't completely kill performance,
and then try to get rid of the contention. Converting to lock-free pinning
won't help much here as what is killing us here is the cacheline
bouncing.

One way to get rid of contention is the buffer nailing idea that
Robert came up with. If some buffer gets so hot that maintaining
refcount on the buffer header leads to contention, promote that buffer
to a nailed status, let everyone keep their pin counts locally and
sometime later revisit the nailing decision and if necessary convert
pins back to the buffer header.

One other interesting idea I have seen is closeable scalable nonzero
indication (C-SNZI) from scalable rw-locks [1]. The idea there is to
use a tree structure to dynamically stripe access to the shared lock
counter when contention is detected. Downside is that considerable
amount of shared memory is needed so there needs to be some way to
limit the resource usage. This is actually somewhat isomorphic to the
nailing idea.

The issue with the current buffer management algorithm is that it
seems to scale badly with increasing shared_buffers. I think the
improvements should concentrate on finding out what is the problem
there and figuring out how to fix it. A simple idea to test would be
to just partition shared buffers along with the whole clock sweep
machinery into smaller ones, like the buffer mapping hash tables
already are. This should at the very least reduce contention for the
clock sweep even if it doesn't reduce work done per page miss.

[1] http://people.csail.mit.edu/mareko/spaa09-scalablerwlocks.pdf
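
As a very rough sketch of the partitioning idea above (the constant and
the layout are made up; nothing here corresponds to real PostgreSQL
structures):

#define NUM_SWEEP_PARTITIONS 16

typedef struct
{
    int     next_victim;        /* per-partition clock hand */
    int    *usage_counts;       /* usage counts of this partition's buffers */
    int     nbuffers;           /* number of buffers in this partition */
    /* a per-partition lock would also live here */
} SweepPartitionSketch;

static SweepPartitionSketch sweep_partitions[NUM_SWEEP_PARTITIONS];

/* pick which sweep partition a buffer belongs to */
static int
sweep_partition_for(int buf_id)
{
    return buf_id % NUM_SWEEP_PARTITIONS;
}

/* run the clock sweep over one partition only */
static int
partitioned_sweep(SweepPartitionSketch *p)
{
    for (;;)
    {
        int idx = p->next_victim;

        p->next_victim = (p->next_victim + 1) % p->nbuffers;

        if (p->usage_counts[idx] == 0)
            return idx;             /* victim within this partition */

        p->usage_counts[idx]--;
    }
}

Each backend would then only contend with backends allocating from the same
partition, much like the buffer mapping partitions already work.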

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de



Re: Page replacement algorithm in buffer cache

From
Atri Sharma
Date:
>
> Moreover, if the buffer happens to miss a decrement due to a data
> race, there's a good chance that the buffer is heavily used and
> wouldn't need to be evicted soon anyway. (if you arrange it to be a
> read-test-inc/dec-store operation then you will never go out of
> bounds) However, clocksweep and usage_count maintenance is not what is
> causing contention because that workload is distributed. The issue is
> pinning and unpinning. There we need an accurate count and there are
> some pages like index roots that get hit very heavily. Things to do
> there would be in my opinion convert to a futex based spinlock so when
> there is contention it doesn't completely kill performance and then
> try to get rid of the contention. Converting to lock-free pinning
> won't help much here as what is killing us here is the cacheline
> bouncing.
>
> One way to get rid of contention is the buffer nailing idea that
> Robert came up with. If some buffer gets so hot that maintaining
> refcount on the buffer header leads to contention, promote that buffer
> to a nailed status, let everyone keep their pin counts locally and
> sometime later revisit the nailing decision and if necessary convert
> pins back to the buffer header.
>
> One other interesting idea I have seen is closeable scalable nonzero
> indication (C-SNZI) from scalable rw-locks [1]. The idea there is to
> use a tree structure to dynamically stripe access to the shared lock
> counter when contention is detected. Downside is that considerable
> amount of shared memory is needed so there needs to be some way to
> limit the resource usage. This is actually somewhat isomorphic to the
> nailing idea.
>
> The issue with the current buffer management algorithm is that it
> seems to scale badly with increasing shared_buffers. I think the
> improvements should concentrate on finding out what is the problem
> there and figuring out how to fix it. A simple idea to test would be
> to just partition shared buffers along with the whole clock sweep
> machinery into smaller ones, like the buffer mapping hash tables
> already are. This should at the very least reduce contention for the
> clock sweep even if it doesn't reduce work done per page miss.
>

One way to distribute memory contention in the case of spinlocks could be
to utilize the fundamentals of the NUMA architecture. Specifically, we can
let the contending backends spin on local flags instead of on the buffer
header flags directly. As access to local cache lines is much cheaper
and faster than access to memory locations which are far away in NUMA, we
could potentially reduce the traffic on a specific cache line and reduce
the overall overhead as well.
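
The textbook way to get that effect is a queue lock in the style of MCS,
where each waiter spins only on a flag in its own node. A minimal sketch
using C11 atomics (purely illustrative, not tied to any PostgreSQL
structures):

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct mcs_node
{
    _Atomic(struct mcs_node *) next;
    atomic_bool locked;
} mcs_node;

typedef struct
{
    _Atomic(mcs_node *) tail;
} mcs_lock;

static void
mcs_acquire(mcs_lock *lock, mcs_node *me)
{
    atomic_store(&me->next, (mcs_node *) NULL);
    atomic_store(&me->locked, true);

    mcs_node *prev = atomic_exchange(&lock->tail, me);

    if (prev != NULL)
    {
        atomic_store(&prev->next, me);
        while (atomic_load(&me->locked))
            ;                       /* spin only on our own cache line */
    }
}

static void
mcs_release(mcs_lock *lock, mcs_node *me)
{
    mcs_node *succ = atomic_load(&me->next);

    if (succ == NULL)
    {
        mcs_node *expected = me;

        /* no known successor: try to swing the tail back to empty */
        if (atomic_compare_exchange_strong(&lock->tail, &expected,
                                           (mcs_node *) NULL))
            return;

        /* a successor is in the middle of linking itself; wait for it */
        while ((succ = atomic_load(&me->next)) == NULL)
            ;
    }
    atomic_store(&succ->locked, false);     /* hand the lock over */
}

Whether the extra handoff cost is worth it for something as short as a
buffer header spinlock is exactly the kind of thing that would need
measuring.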

Regards.

Atri


--
Regards,

Atri
l'apprenant



Re: Page replacement algorithm in buffer cache

From
Jim Nasby
Date:
On 3/22/13 7:27 PM, Ants Aasma wrote:
> On Fri, Mar 22, 2013 at 10:22 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>> well if you do a non-locking test first you could at least avoid some
>> cases (and, if you get the answer wrong, so what?) by jumping to the
>> next buffer immediately.  if the non locking test comes good, only
>> then do you do a hardware TAS.
>>
>> you could in fact go further and dispense with all locking in front of
>> usage_count, on the premise that it's only advisory and not a real
>> refcount.  so you only then lock if/when it's time to select a
>> candidate buffer, and only then when you did a non locking test first.
>>   this would of course require some amusing adjustments to various
>> logical checks (usage_count <= 0, heh).
>
> Moreover, if the buffer happens to miss a decrement due to a data
> race, there's a good chance that the buffer is heavily used and
> wouldn't need to be evicted soon anyway. (if you arrange it to be a
> read-test-inc/dec-store operation then you will never go out of
> bounds) However, clocksweep and usage_count maintenance is not what is
> causing contention because that workload is distributed. The issue is
> pinning and unpinning. There we need an accurate count and there are
> some pages like index roots that get hit very heavily. Things to do
> there would be in my opinion convert to a futex based spinlock so when
> there is contention it doesn't completely kill performance and then
> try to get rid of the contention. Converting to lock-free pinning
> won't help much here as what is killing us here is the cacheline
> bouncing.
>
> One way to get rid of contention is the buffer nailing idea that
> Robert came up with. If some buffer gets so hot that maintaining
> refcount on the buffer header leads to contention, promote that buffer
> to a nailed status, let everyone keep their pin counts locally and
> sometime later revisit the nailing decision and if necessary convert
> pins back to the buffer header.
>
> One other interesting idea I have seen is closeable scalable nonzero
> indication (C-SNZI) from scalable rw-locks [1]. The idea there is to
> use a tree structure to dynamically stripe access to the shared lock
> counter when contention is detected. Downside is that considerable
> amount of shared memory is needed so there needs to be some way to
> limit the resource usage. This is actually somewhat isomorphic to the
> nailing idea.
>
> The issue with the current buffer management algorithm is that it
> seems to scale badly with increasing shared_buffers. I think the
> improvements should concentrate on finding out what is the problem
> there and figuring out how to fix it. A simple idea to test would be
> to just partition shared buffers along with the whole clock sweep
> machinery into smaller ones, like the buffer mapping hash tables
> already are. This should at the very least reduce contention for the
> clock sweep even if it doesn't reduce work done per page miss.
>
> [1] http://people.csail.mit.edu/mareko/spaa09-scalablerwlocks.pdf

Partitioned clock sweep strikes me as a bad idea... you could certainly get
unlucky and end up with a lot of hot stuff in one partition.

Another idea that's been brought up in the past is to have something in the
background keep a minimum number of buffers on the free list. That's how OS
VM systems I'm familiar with work, so there's precedent for it.

I recall there were at least some theoretical concerns about this, but I
don't remember if anyone actually tested the idea.



Re: Page replacement algorithm in buffer cache

From
Amit Kapila
Date:
On Saturday, March 23, 2013 9:34 AM Jim Nasby wrote:
> On 3/22/13 7:27 PM, Ants Aasma wrote:
> > On Fri, Mar 22, 2013 at 10:22 PM, Merlin Moncure <mmoncure@gmail.com>
> wrote:
> 
> > One other interesting idea I have seen is closeable scalable nonzero
> > indication (C-SNZI) from scalable rw-locks [1]. The idea there is to
> > use a tree structure to dynamically stripe access to the shared lock
> > counter when contention is detected. Downside is that considerable
> > amount of shared memory is needed so there needs to be some way to
> > limit the resource usage. This is actually somewhat isomorphic to the
> > nailing idea.
> >
> > The issue with the current buffer management algorithm is that it
> > seems to scale badly with increasing shared_buffers. I think the
> > improvements should concentrate on finding out what is the problem
> > there and figuring out how to fix it. A simple idea to test would be
> > to just partition shared buffers along with the whole clock sweep
> > machinery into smaller ones, like the buffer mapping hash tables
> > already are. This should at the very least reduce contention for the
> > clock sweep even if it doesn't reduce work done per page miss.
> >
> > [1] http://people.csail.mit.edu/mareko/spaa09-scalablerwlocks.pdf
> 
> Partitioned clock sweep strikes me as a bad idea... you could certainly
> get unlucky and end up with a lot of hot stuff in one partition.
> 
> Another idea that's been brought up in the past is to have something in
> the background keep a minimum number of buffers on the free list.
> That's how OS VM systems I'm familiar with work, so there's precedent
> for it.
> 
> I recall there were at least some theoretical concerns about this, but
> I don't remember if anyone actually tested the idea.

I have tried one of the ideas: adding the buffers the background writer
finds reusable to the freelist.
http://www.postgresql.org/message-id/6C0B27F7206C9E4CA54AE035729E9C382852FF9
7@szxeml509-mbs
This can reduce the clock sweep, as buffers can be found on the freelist.

It shows a performance improvement for read loads when the data can be
contained in shared buffers, but when the data becomes large and I/O is
involved, it shows some dip as well.


With Regards,
Amit Kapila.




Re: Page replacement algorithm in buffer cache

From
Ants Aasma
Date:
On Sat, Mar 23, 2013 at 6:29 AM, Atri Sharma <atri.jiit@gmail.com> wrote:
> One way to distribute memory contention in case of spinlocks could be
> to utilize the fundamentals of NUMA architecture. Specifically, we can
> let the contending backends spin on local flags instead on the buffer
> header flags directly. As access to local cache lines is much cheaper
> and faster than memory locations which are far away in NUMA, we could
> potentially reduce the memory overhead for a specific line and reduce
> the overall overheads as well.

This is not even something specific to NUMA architectures (which by now
means all multiprocessor machines); even multicore machines have overheads
for bouncing cache lines. The locks don't even have to be local, it's good
enough to just have a better probability of each contending backend
hitting a different lock, if we take care of not having the locks
share cache lines. IMO that's the whole point of striping locks, the
critical section is usually cheaper than the cost of getting the cache
line in an exclusive state.

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de



Re: Page replacement algorithm in buffer cache

From
Ants Aasma
Date:
On Sat, Mar 23, 2013 at 6:04 AM, Jim Nasby <jim@nasby.net> wrote:
> Partitioned clock sweep strikes me as a bad idea... you could certainly get
> unlucky and end up with a lot of hot stuff in one partition.

Surely that is not worse than having everything in a single partition.
Given a decent partitioning function it's very highly unlikely to have
more than a few of the hottest buffers end up in a single partition.

> Another idea that's been brought up in the past is to have something in the
> background keep a minimum number of buffers on the free list. That's how OS
> VM systems I'm familiar with work, so there's precedent for it.
>
> I recall there were at least some theoretical concerns about this, but I
> don't remember if anyone actually tested the idea.

Yes, having bgwriter do the actual cleaning up seems like a good idea.
The whole bgwriter infrastructure will need some serious tuning. There
are many things that could be shifted to background if we knew it
could keep up, like hint bit setting on dirty buffers being flushed
out. But again, we have the issue of having good tests to see where
the changes hurt.

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de



Re: Page replacement algorithm in buffer cache

From
Atri Sharma
Date:
>
>
> Partitioned clock sweep strikes me as a bad idea... you could certainly get
> unlucky and end up with a lot of hot stuff in one partition.
>
> Another idea that's been brought up in the past is to have something in the
> background keep a minimum number of buffers on the free list. That's how OS
> VM systems I'm familiar with work, so there's precedent for it.
>
> I recall there were at least some theoretical concerns about this, but I
> don't remember if anyone actually tested the idea.
One way to handle this could be to have dynamic membership of pages
in the partitions. Based on activity for a page, it could be moved to
another partition. In this manner, we *could* distribute the hot and
not so hot buffer pages and hence it could help.

Regards,

Atri

--
Regards,

Atri
l'apprenant



Re: Page replacement algorithm in buffer cache

From
Jeff Janes
Date:
On Thu, Mar 21, 2013 at 9:51 PM, Atri Sharma <atri.jiit@gmail.com> wrote:
> Hello all,
>
> Sorry if this is a naive question.
>
> I was going through Greg Smith's slides on buffer
> cache(http://www.westnet.com/~gsmith/content/postgresql/InsideBufferCache.pdf).
> When going through the page replacement algorithm that we use i.e.
> clocksweep algorithm, I felt a potential problem in our current
> system.
>
> Specifically, when a new entry is allocated in the buffer, it's
> USAGE_COUNT is set to 1. On each sweep of the algorithm, the
> USAGE_COUNT is decremented and an entry whose  USAGE_COUNT becomes
> zero is replaced.

It is replaced when the usage_count is already found to be zero, not when it is made zero.
 
> I feel that this could lead to a bias towards replacement of
> relatively younger pages in the  cache over older pages. An entry
> which has just entered the cache with USAGE_COUNT=1 could be replaced
> soon, but it may be needed frequently in the near future,

Well, it may be needed.  But then again, it may not be needed.  And that old page, that also may be needed frequently in the future (which is far more likely than a new page--after all an old page likely got old for a reason).  The best evidence that it will be needed again is that it actually has been needed again.

I'm more convinced in the other direction, new pages should enter with 0 rather than with 1.  I think that the argument that a new buffer needs to be given more of an opportunity to get used again is mostly bogus.  You cannot bully the shared_buffers into being larger than it is.  If all the incoming buffers get "more opportunity", that just means the buffer-clock ticks twice as fast, and really none of them has more opportunity when you measure that opportunity against an outside standard (wall time, or work-load accomplished).  All of our children cannot be above average.

Cheers,

Jeff

Re: Page replacement algorithm in buffer cache

From
Merlin Moncure
Date:
On Fri, Mar 22, 2013 at 7:27 PM, Ants Aasma <ants@cybertec.at> wrote:
> On Fri, Mar 22, 2013 at 10:22 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>> well if you do a non-locking test first you could at least avoid some
>> cases (and, if you get the answer wrong, so what?) by jumping to the
>> next buffer immediately.  if the non locking test comes good, only
>> then do you do a hardware TAS.
>>
>> you could in fact go further and dispense with all locking in front of
>> usage_count, on the premise that it's only advisory and not a real
>> refcount.  so you only then lock if/when it's time to select a
>> candidate buffer, and only then when you did a non locking test first.
>>  this would of course require some amusing adjustments to various
>> logical checks (usage_count <= 0, heh).
>
> Moreover, if the buffer happens to miss a decrement due to a data
> race, there's a good chance that the buffer is heavily used and
> wouldn't need to be evicted soon anyway. (if you arrange it to be a
> read-test-inc/dec-store operation then you will never go out of
> bounds)

yeah. There's something to be said for having an upper bound on the
length of time to get a page out (except in the special case when most
of them are pinned).  Right now, any page contention on a buffer
header for any reason can shut down buffer allocation, and that's just
not good.  It's obviously not very likely to happen but I think it can
and does happen.  The more I think about it the more I think it's a bad
idea to spin during buffer allocation for any reason, ever.

> However, clocksweep and usage_count maintenance is not what is
> causing contention because that workload is distributed. The issue is
> pinning and unpinning. There we need an accurate count and there are
> some pages like index roots that get hit very heavily. Things to do
> there would be in my opinion convert to a futex based spinlock so when
> there is contention it doesn't completely kill performance and then
> try to get rid of the contention. Converting to lock-free pinning
> won't help much here as what is killing us here is the cacheline
> bouncing.

Yup -- futexes are another way to go.  They are linux specific though.

> One way to get rid of contention is the buffer nailing idea that
> Robert came up with. If some buffer gets so hot that maintaining
> refcount on the buffer header leads to contention, promote that buffer
> to a nailed status, let everyone keep their pin counts locally and
> sometime later revisit the nailing decision and if necessary convert
> pins back to the buffer header.

Yeah this is a more general (albeit more complicated) solution and
would likely be fantastic.  Is it safe to assume that refcounting is
the only likely cause of contention?

> One other interesting idea I have seen is closeable scalable nonzero
> indication (C-SNZI) from scalable rw-locks [1]. The idea there is to
> use a tree structure to dynamically stripe access to the shared lock
> counter when contention is detected. Downside is that considerable
> amount of shared memory is needed so there needs to be some way to
> limit the resource usage. This is actually somewhat isomorphic to the
> nailing idea.
>
> The issue with the current buffer management algorithm is that it
> seems to scale badly with increasing shared_buffers. I think the
> improvements should concentrate on finding out what is the problem
> there and figuring out how to fix it. A simple idea to test would be
> to just partition shared buffers along with the whole clock sweep
> machinery into smaller ones, like the buffer mapping hash tables
> already are. This should at the very least reduce contention for the
> clock sweep even if it doesn't reduce work done per page miss.
>
> [1] http://people.csail.mit.edu/mareko/spaa09-scalablerwlocks.pdf

I'll have to take a look.  Removing *all spinning* from page
allocation though feels like it might be worthwhile to test (got to
give some bonus points for being a very local change and simple to
implement).  I wonder if with more shared buffers you tend to sweep
more buffers per allocation.  (IIRC Jeff J was skeptical of that).

merlin



Re: Page replacement algorithm in buffer cache

From
Tom Lane
Date:
Jeff Janes <jeff.janes@gmail.com> writes:
> I'm more convinced in the other direction, new pages should enter with 0
> rather than with 1.  I think that the argument that a new buffer needs to
> be given more of an opportunity to get used again is mostly bogus.

IIRC, the argument for starting at 1 not 0 is that otherwise a new page
might have an infinitesimally small lifespan, if the clock sweep should
reach it just after it gets entered into the buffers.  By starting at
1, the uncertainty in a new page's lifespan runs from 1 to 2 sweep times
not 0 to 1 sweep time.

I think though that this argument only holds water if the buffer didn't
get found via the clock sweep to start with --- otherwise, it ought to
have just about one clock sweep of time before the sweep comes back to
it.  It does apply to buffers coming off the freelist, though.

Thus, if we were to get rid of the freelist then maybe we could change
the starting usage_count ... but whether that's a good idea in itself
is pretty uncertain.
        regards, tom lane



Re: Page replacement algorithm in buffer cache

From
Jeff Janes
Date:
On Fri, Mar 22, 2013 at 4:06 AM, Atri Sharma <atri.jiit@gmail.com> wrote:
 

Not yet, I figured this might be a problem and am designing test cases
for the same. I would be glad for some help there please.

Perhaps this isn't the help you were looking for, but I spent a long time looking into this a few years ago.  Then I stopped and decided to work on other things.  I would recommend you do so too.

If I have to struggle to come up with an artificial test case that shows that there is a problem, then why should I believe that there actually is a problem?  If you take a well-known problem (like, say, bad performance at shared_buffers > 8GB (or even lower, on Windows)) and create an artificial test case to exercise and investigate that, that is one thing.  But why invent pathological test cases with no known correspondence to reality?  There are plenty of real problems to work on, and some of them are just as intellectually interesting as the artificial problems are.

My conclusions were:

1) If everything fits in shared_buffers, then the replacement policy doesn't matter.

2) If shared_buffers is much smaller than RAM (the most common case, I believe), then what mostly matters is your OS's replacement policy, not pgsql's.  Not much a pgsql hacker can do about this, other than turn into a kernel hacker.

3) If little of the highly-used data fits in RAM. then any non-absurd replacement policy is about as good as any other non-absurd one.

4) If most, but not quite all, of the highly-used data fits shared_buffers and shared_buffers takes most of RAM (or at least, most of RAM not already needed for other things like work_mem and executables), then the replacement policy matters a lot.  But different policies suit different work-loads, and there is little reason to think we can come up with a way to choose between them.  (Also, in these conditions, performance is very chaotic.  You can run the same algorithm for a long time, and it can suddenly switch from good to bad or the other way around, and then stay in that new mode for a long time).  Also, even if you come up with a good algorithm, if you make the data set 20% smaller or 20% larger, it is no longer a good algorithm.

5) Having buffers enter with usage_count=0 rather than 1 would probably be slightly better most of the time under conditions described in 4, but there is no way to get enough evidence of this over enough conditions to justify making a change.  And besides, how often do people run with shared_buffers being most of RAM, and the hot data just barely not fitting in it?


If you want some known problems that are in this general area, we have:

1) If all data fits in RAM but not shared_buffers, and you have a very large number of CPUs and a read-only or read-mostly workload, then BufFreelistLock can be a major bottle neck.  (But, on a Amazon high-CPU instance, I did not see this very much.  I suspect the degree of problem depends a lot on whether you have a lot of sockets with a few CPUs each, versus one chip with many CPUs).  This is very easy to come up with model cases for, pgbench -S -c8 -j8, for example, can often show it.

2) A major reason that people run with shared_buffers much lower than RAM is that performance seems to suffer with shared_buffers > 8GB under write-heavy workloads, even with spread-out checkpoints.  This is frequently reported as a real world problem, but as far as I know has never been reduced to a simple reproducible test case. (Although there was a recent thread, maybe "High CPU usage / load average after upgrading to Ubuntu 12.04", that I thought might be relevant to this.  I haven't had the time to seriously study the thread, or the hardware to investigate it myself)

Cheers,

Jeff

Re: Page replacement algorithm in buffer cache

From
Atri Sharma
Date:
> Perhaps this isn't the help you were looking for, but I spent a long time
> looking into this a few years ago.  Then I stopped and decided to work on
> other things.  I would recommend you do so too.

Agreed. It seems that my concerns were not valid, and since you have
already done some testing here, it further closes the matter.


> 4) If most, but not quite all, of the highly-used data fits shared_buffers
> and shared_buffers takes most of RAM (or at least, most of RAM not already
> needed for other things like work_mem and executables), then the replacement
> policy matters a lot.  But different policies suit different work-loads, and
> there is little reason to think we can come up with a way to choose between
> them.  (Also, in these conditions, performance is very chaotic.  You can run
> the same algorithm for a long time, and it can suddenly switch from good to
> bad or the other way around, and then stay in that new mode for a long
> time).  Also, even if you come up with a good algorithm, if you make the
> data set 20% smaller or 20% larger, it is no longer a good algorithm.

Does that mean that an ideal, high performance postgres setup should
*never* set the shared_buffers to a large percentage of the system's
RAM? If so, have we ever encountered issues on low-spec systems?


> 5) Having buffers enter with usage_count=0 rather than 1 would probably be
> slightly better most of the time under conditions described in 4, but there
> is no way get enough evidence of this over enough conditions to justify
> making a change.  And besides, how often do people run with shared_buffers
> being most of RAM, and the hot data just barely not fitting in it?

Agreed.

> 1) If all data fits in RAM but not shared_buffers, and you have a very large
> number of CPUs and a read-only or read-mostly workload, then BufFreelistLock
> can be a major bottle neck.  (But, on a Amazon high-CPU instance, I did not
> see this very much.  I suspect the degree of problem depends a lot on
> whether you have a lot of sockets with a few CPUs each, versus one chip with
> many CPUs).  This is very easy to come up with model cases for, pgbench -S
> -c8 -j8, for example, can often show it.

I will try that, thanks.

> 2) A major reason that people run with shared_buffers much lower than RAM is
> that performance seems to suffer with shared_buffers > 8GB under write-heavy
> workloads, even with spread-out checkpoints.  This is frequently reported as
> a real world problem, but as far as I know has never been reduced to a
> simple reproducible test case. (Although there was a recent thread, maybe
> "High CPU usage / load average after upgrading to Ubuntu 12.04", that I
> thought might be relevant to this.  I haven't had the time to seriously
> study the thread, or the hardware to investigate it myself)
>

This seems interesting. Do we have some indication as to what the
problems could be?

Regards,

Atri



--
Regards,

Atri
l'apprenant



Re: Page replacement algorithm in buffer cache

From
Atri Sharma
Date:
>
> I'll have to take a look.  Removing *all spinning* from from page
> allocation though feels like it might be worthwhile to test (got to
> give some bonus points for being a very local change and simple to
> implement).  I wonder if with more shared buffers you tend to sweep
> more buffers per allocation.  (IIRC Jeff J was skeptical of that).

Replacing the spinning with TAS for buffer header sounds like a good idea.

Regards,

Atri



--
Regards,

Atri
l'apprenant



Re: Page replacement algorithm in buffer cache

From
Ants Aasma
Date:
On Sun, Mar 24, 2013 at 9:29 AM, Atri Sharma <atri.jiit@gmail.com> wrote:
>> I'll have to take a look.  Removing *all spinning* from from page
>> allocation though feels like it might be worthwhile to test (got to
>> give some bonus points for being a very local change and simple to
>> implement).  I wonder if with more shared buffers you tend to sweep
>> more buffers per allocation.  (IIRC Jeff J was skeptical of that).
>
> Replacing the spinning with TAS for buffer header sounds like a good idea.

Well, TAS is exactly what spinlocks are spinning on. Plain old
unlocked read-modify-write should be good enough for clock sweep usage
count update with the header lock taken only when we decide to try and
evict something. The data raciness will mean a higher or lower than
normal usage count when two an increment and decrement race each
other. The race is only likely for buffers with high contention. If we
use the value calculated locally to decide on eviction, the highly
used buffers where this is likely will get at least one clock sweep
cycle of grace time. If they are indeed highly used it's likely that
someone will manage to bump the usage_count in the meanwhile. If they
are not hot, a rare speedup or delay in eviction won't matter much.
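
To sketch the kind of unlocked update I mean (illustrative only, C11
atomics instead of the real buffer header fields):

#include <stdatomic.h>

#define BM_MAX_USAGE_COUNT 5

/* bump the usage count without taking the header lock */
static void
usage_count_bump_unlocked(atomic_int *usage_count)
{
    int cur = atomic_load(usage_count);

    if (cur < BM_MAX_USAGE_COUNT)
        atomic_store(usage_count, cur + 1);     /* may lose a racing update */
}

/* clock sweep side: decrement and report the value the decision was based on */
static int
usage_count_decay_unlocked(atomic_int *usage_count)
{
    int cur = atomic_load(usage_count);

    if (cur > 0)
        atomic_store(usage_count, cur - 1);

    return cur;     /* decide on eviction from this locally read value */
}

Because both sides read, test and then store, the count can never go below
0 or above the maximum even when updates race; the worst case is a lost
increment or decrement, which as argued above does not matter much.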

Getting rid of memory barriers associated with locking in the clock
sweep will pipeline the loads and stores and so should bring on a good
performance increase for scanning large swathes of buffer headers.

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de



Re: Page replacement algorithm in buffer cache

From
Greg Smith
Date:
On 3/22/13 8:45 AM, Ants Aasma wrote:
> However, I think the main issue isn't finding new algorithms that are
> better in some specific circumstances. The hard part is figuring out
> whether their performance is better in general.

Right.  The current page replacement method works as expected.  Many 
frequently accessed pages accumulate a usage count of 5 before the clock 
sweep hits them.  Pages that are accessed once and not again before the 
clock sweep are evicted.  There are several theoretically better ways to 
approach this.  Anyone who hasn't already been working on this for a few 
years is very unlikely to come up with a brand new idea, one that hasn't 
already been tested in the academic research.
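
For reference, the cap that produces that count of 5 is just a bounded
increment on access, roughly like this (a simplification with no locking
shown; BM_MAX_USAGE_COUNT really is 5 in the current source):

#define BM_MAX_USAGE_COUNT 5

/* what happens, in effect, each time a backend pins an already-cached page */
static void
bump_usage_count(int *usage_count)
{
    if (*usage_count < BM_MAX_USAGE_COUNT)
        (*usage_count)++;
}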

But the real blocker here isn't ideas, it's creating benchmark workloads 
to validate any change.  Right now I see the most promising work that 
could lead toward the "performance farm" idea as all of the Jenkins 
based testing that's been going on recently.  Craig Ringer has been
using that for 2ndQuadrant work here, Peter Eisentraut has been working with
it: 
http://petereisentraut.blogspot.com/2013/01/postgresql-and-jenkins.html 
and the PostGIS project uses it too.  There's some good momentum brewing 
there.

When we have regular performance testing with a mix of workloads--I have 
about 10 in mind to start--at that point we could start testing
performance changes to the buffer replacement.  Until then this whole
area is hard to touch usefully.  You have to assume that any tuning you 
do for one type of workload might accidentally slow another.  Starting 
with a lot of baseline workloads is the only way to move usefully 
forward when facing that problem.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com



Re: Page replacement algorithm in buffer cache

From
Atri Sharma
Date:
On Sun, Mar 24, 2013 at 6:11 AM, Greg Smith <greg@2ndquadrant.com> wrote:
> On 3/22/13 8:45 AM, Ants Aasma wrote:
>>
>> However, I think the main issue isn't finding new algorithms that are
>> better in some specific circumstances. The hard part is figuring out
>> whether their performance is better in general.
>
>
> Right.  The current page replacement method works as expected.  Many
> frequently accessed pages accumulate a usage count of 5 before the clock
> sweep hits them.  Pages that are accessed once and not again before the
> clock sweep are evicted.  There are several theoretically better ways to
> approach this.  Anyone who hasn't already been working on this for a few
> years is very unlikely to come up with a brand new idea, one that hasn't
> already been tested in the academic research.
>
> But the real blocker here isn't ideas, it's creating benchmark workloads to
> validate any change.  Right now I see the most promising work that could
> lead toward the "performance farm" idea as all of the Jenkins based testing
> that's been going on recently.  Craig Ringer has using that for 2ndQuadrant
> work here, Peter Eisentraut has been working with it:
> http://petereisentraut.blogspot.com/2013/01/postgresql-and-jenkins.html and
> the PostGIS project uses it too.  There's some good momentum brewing there.
>
> When we have regular performance testing with a mix of workloads--I have
> about 10 in mind to start--at that point we could start the testing
> performance changes to the buffer replacement.  Until then this whole area
> is hard to touch usefully.  You have to assume that any tuning you do for
> one type of workload might accidentally slow another.  Starting with a lot
> of baseline workloads is the only way to move usefully forward when facing
> that problem.

Wow! Jenkins based performance farm should be just what we want.

Can we add tests for lock contention (as discussed above), so that we
can move to some actual diagnostics?

Regards,

Atri


--
Regards,

Atri
l'apprenant



Re: Page replacement algorithm in buffer cache

From
Atri Sharma
Date:
> If we
> use the value calculated locally to decide on eviction, the highly
> used buffers where this is likely will get at least one clock sweep
> cycle of grace time. If they are indeed highly used it's likely that
> someone will manage to bump the usage_count in the meanwhile. If they
> are not hot, a rare speedup or delay in eviction won't matter much.

Yeah, giving a buffer page an extra clock sweep of life in exchange for a
potential performance improvement isn't much of a cost to pay.

So, essentially, we decide locally which page to evict, then try to
get a lock on the header only when we want to evict the victim page?
Sounds like the contention for the header should go down considerably.

Unlocked incrementing/decrementing of USAGE_COUNT may leave the value
of USAGE_COUNT off from what it should be by 1 or 2 counts on highly
contended buffer pages, which shouldn't matter precisely because those
pages are highly contended. I agree, it is likely that some process
will bump the usage_count again anyway.

> Getting rid of memory barriers associated with locking in the clock
> sweep will pipeline the loads and stores and so should bring on a good
> performance increase for scanning large swathes of buffer headers.
>
This does sound interesting. If the Jenkins based performance farm
gets ready, we can do some tests and see the kind of results we get.

Regards,

Atri



--
Regards,

Atri
l'apprenant



Re: Page replacement algorithm in buffer cache

From
Ants Aasma
Date:
On Sun, Mar 24, 2013 at 7:03 PM, Atri Sharma <atri.jiit@gmail.com> wrote:
> So, essentially, we decide locally which page to evict, then try to
> get a lock on the header only when we want to evict the victim page?
> Sounds like the contention for the header should go down considerably.

Not exactly locally, the idea is still to use the shared buffer header
structures. We just accept any errors in usage_count coming from data
races. This won't solve buffer header contention, as modifying
usage_count still pulls the cacheline into an exclusive state. It does
help with reducing the time BufferFreeListLock is held and it
eliminates spinning on the clocksweep side so only queries that access
the contended buffer are hurt by spinning.

> Unlocked incrementing/decrementing of USAGE_COUNT may lead to the
> values of USAGE_COUNT differing from the actual value they should be
> having by 1 or 2 counts in case of high contention buffer pages, which
> shouldn't matter, as they are in high contention. I agree, it is
> likely that a process(s) will increase the usage_count anyways.

There is no point in an unlocked increment as this is done in
conjunction with a buffer pin that requires the header spinlock
anyway.

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de



Re: Page replacement algorithm in buffer cache

From
Bruce Momjian
Date:
On Fri, Mar 22, 2013 at 06:06:18PM +0000, Greg Stark wrote:
> On Fri, Mar 22, 2013 at 2:02 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > And we definitely looked at ARC
> 
> We didn't just look at it. At least one release used it. Then patent
> issues were raised (and I think the implementation had some contention
> problems).

The problem was cache line overhead between CPUs to manage the ARC
queues.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: Page replacement algorithm in buffer cache

From
Bruce Momjian
Date:
On Fri, Mar 22, 2013 at 04:16:18PM -0400, Tom Lane wrote:
> Merlin Moncure <mmoncure@gmail.com> writes:
> > I think there is some very low hanging optimization fruit in the clock
> > sweep loop.   first and foremost, I see no good reason why when
> > scanning pages we have to spin and wait on a buffer in order to
> > pedantically adjust usage_count.  some simple refactoring there could
> > set it up so that a simple TAS (or even a TTAS with the first test in
> > front of the cache line lock as we done automatically in x86 IIRC)
> > could guard the buffer and, in the event of any lock detected, simply
> > move on to the next candidate without messing around with that buffer
> > at all.   This could construed as a 'trylock' variant of a spinlock
> > and might help out with cases where an especially hot buffer is
> > locking up the sweep.  This is exploiting the fact that from
> > StrategyGetBuffer we don't need a *particular* buffer, just *a*
> > buffer.
> 
> Hm.  You could argue in fact that if there's contention for the buffer
> header, that's proof that it's busy and shouldn't have its usage count
> decremented.  So this seems okay from a logical standpoint.
> 
> However, I'm not real sure that it's possible to do a conditional
> spinlock acquire that doesn't create just as much hardware-level
> contention as a full acquire (ie, TAS is about as bad whether it
> gets the lock or not).  So the actual benefit is a bit less clear.

Could we view the usage count, and if it is 5, and if there is someone
holding the lock, we just ignore the buffer and move on to the next
buffer?  Seems that could be done with no locking.
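
Sketched with C11 atomics purely for illustration (made-up names, not
the s_lock.h code), a conditional acquire that never spins, plus the
lock-free check described above, could look roughly like this:

#include <stdatomic.h>
#include <stdbool.h>

typedef struct ToyHdr
{
    atomic_int locked;      /* 0 = free, 1 = held */
    int        usage_count; /* protected by the lock in the real thing */
} ToyHdr;

/* TTAS-style trylock: cheap plain read first, one TAS attempt, no spinning. */
static bool
toy_trylock(ToyHdr *h)
{
    if (atomic_load_explicit(&h->locked, memory_order_relaxed) != 0)
        return false;       /* visibly held: don't even issue the locked op */

    return atomic_exchange_explicit(&h->locked, 1,
                                    memory_order_acquire) == 0;
}

static void
toy_unlock(ToyHdr *h)
{
    atomic_store_explicit(&h->locked, 0, memory_order_release);
}

/*
 * The suggestion above, roughly: peek at usage_count with no lock and
 * skip the buffer entirely if it looks hot or the header looks busy.
 */
static bool
toy_skip_buffer(ToyHdr *h)
{
    return h->usage_count >= 5 ||
           atomic_load_explicit(&h->locked, memory_order_relaxed) != 0;
}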

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: Page replacement algorithm in buffer cache

From
Merlin Moncure
Date:
On Tue, Mar 26, 2013 at 11:40 AM, Bruce Momjian <bruce@momjian.us> wrote:
> On Fri, Mar 22, 2013 at 04:16:18PM -0400, Tom Lane wrote:
>> Merlin Moncure <mmoncure@gmail.com> writes:
>> > I think there is some very low hanging optimization fruit in the clock
>> > sweep loop.   first and foremost, I see no good reason why when
>> > scanning pages we have to spin and wait on a buffer in order to
>> > pedantically adjust usage_count.  some simple refactoring there could
>> > set it up so that a simple TAS (or even a TTAS with the first test in
>> > front of the cache line lock as we done automatically in x86 IIRC)
>> > could guard the buffer and, in the event of any lock detected, simply
>> > move on to the next candidate without messing around with that buffer
>> > at all.   This could construed as a 'trylock' variant of a spinlock
>> > and might help out with cases where an especially hot buffer is
>> > locking up the sweep.  This is exploiting the fact that from
>> > StrategyGetBuffer we don't need a *particular* buffer, just *a*
>> > buffer.
>>
>> Hm.  You could argue in fact that if there's contention for the buffer
>> header, that's proof that it's busy and shouldn't have its usage count
>> decremented.  So this seems okay from a logical standpoint.
>>
>> However, I'm not real sure that it's possible to do a conditional
>> spinlock acquire that doesn't create just as much hardware-level
>> contention as a full acquire (ie, TAS is about as bad whether it
>> gets the lock or not).  So the actual benefit is a bit less clear.
>
> Could we view the usage count, and if it is 5, and if there is someone
> holding the lock, we just ignore the buffer and move on to the next
> buffer?  Seems that could be done with no locking.

The idea is that if someone is "holding the lock", we completely ignore
the buffer regardless of usage.  Quotes there because we test the lock
without taking the cacheline lock.  Now if the buffer appears unlocked but
turns out to be locked once you *do* acquire the cache line lock in
anticipation of refcounting, again immediately bail and go to the next buffer.

I see no reason whatsoever to have buffer allocator spin and wait on a
blocked buffer.  This is like jumping onto a merry-go-round being spun
by sadistic teenagers.

merlin



Re: Page replacement algorithm in buffer cache

From
Jeff Janes
Date:
On Friday, March 22, 2013, Ants Aasma wrote:
On Fri, Mar 22, 2013 at 10:22 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> well if you do a non-locking test first you could at least avoid some
> cases (and, if you get the answer wrong, so what?) by jumping to the
> next buffer immediately.  if the non locking test comes good, only
> then do you do a hardware TAS.
>
> you could in fact go further and dispense with all locking in front of
> usage_count, on the premise that it's only advisory and not a real
> refcount.  so you only then lock if/when it's time to select a
> candidate buffer, and only then when you did a non locking test first.
>  this would of course require some amusing adjustments to various
> logical checks (usage_count <= 0, heh).

Moreover, if the buffer happens to miss a decrement due to a data
race, there's a good chance that the buffer is heavily used and
wouldn't need to be evicted soon anyway. (if you arrange it to be a
read-test-inc/dec-store operation then you will never go out of
bounds) However, clocksweep and usage_count maintenance is not what is
causing contention because that workload is distributed. The issue is
pinning and unpinning.

That is one of multiple issues.  Contention on the BufFreelistLock is another one.  I agree that usage_count maintenance is unlikely to become a bottleneck unless one or both of those is fixed first (and maybe not even then).

...

 
The issue with the current buffer management algorithm is that it
seems to scale badly with increasing shared_buffers.

I do not think that this is the case.  Neither of the SELECT-only contention points (pinning/unpinning of index root blocks when all data is in shared_buffers, and BufFreelistLock when all data is not in shared_buffers) are made worse by increasing shared_buffers that I have seen.  They do scale badly with number of concurrent processes, though.

The reports of write-heavy workloads not scaling well with shared_buffers do not seem to be driven by the buffer management algorithm, or at least not the freelist part of it.  They mostly seem to center on the kernel and the IO controllers.

 Cheers,

Jeff

Re: Page replacement algorithm in buffer cache

From
Merlin Moncure
Date:
On Sun, Mar 31, 2013 at 1:27 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Friday, March 22, 2013, Ants Aasma wrote:
>>
>> On Fri, Mar 22, 2013 at 10:22 PM, Merlin Moncure <mmoncure@gmail.com>
>> wrote:
>> > well if you do a non-locking test first you could at least avoid some
>> > cases (and, if you get the answer wrong, so what?) by jumping to the
>> > next buffer immediately.  if the non locking test comes good, only
>> > then do you do a hardware TAS.
>> >
>> > you could in fact go further and dispense with all locking in front of
>> > usage_count, on the premise that it's only advisory and not a real
>> > refcount.  so you only then lock if/when it's time to select a
>> > candidate buffer, and only then when you did a non locking test first.
>> >  this would of course require some amusing adjustments to various
>> > logical checks (usage_count <= 0, heh).
>>
>> Moreover, if the buffer happens to miss a decrement due to a data
>> race, there's a good chance that the buffer is heavily used and
>> wouldn't need to be evicted soon anyway. (if you arrange it to be a
>> read-test-inc/dec-store operation then you will never go out of
>> bounds) However, clocksweep and usage_count maintenance is not what is
>> causing contention because that workload is distributed. The issue is
>> pinning and unpinning.
>
>
> That is one of multiple issues.  Contention on the BufFreelistLock is
> another one.  I agree that usage_count maintenance is unlikely to become a
> bottleneck unless one or both of those is fixed first (and maybe not even
> then)

usage_count manipulation is not a bottleneck but that is irrelevant.
It can be affected by other page contention which can lead to priority
inversion.  I don't believe there is any reasonable argument that
sitting and spinning while holding the BufFreelistLock is a good idea.

merlin



Re: Page replacement algorithm in buffer cache

From
Robert Haas
Date:
On Mon, Apr 1, 2013 at 9:28 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
>> That is one of multiple issues.  Contention on the BufFreelistLock is
>> another one.  I agree that usage_count maintenance is unlikely to become a
>> bottleneck unless one or both of those is fixed first (and maybe not even
>> then)
>
> usage_count manipulation is not a bottleneck but that is irrelevant.
> It can be affected by other page contention which can lead to priority
> inversion.  I don't be believe there is any reasonable argument that
> sitting and spinning while holding the BufFreelistLock is a good idea.

Hmm.  I'm not sure who, if anyone, I'm agreeing or disagreeing with, but
I think a big part of the reason why BufFreelistLock contention is
such a big problem is that we do other operations that involve atomics
while we're holding that lock.  You can have contention on a lock even
if you just take it, do some stuff, and release it.  But the longer
you hold the lock for, the less concurrency you need to have in order
to get a contention problem.  And atomics take a LOT longer to execute
than regular instructions - so it seems to me that usage_count
manipulation is relevant not so much because we get contention on the
buffer spinlocks as because it means we're sitting there serially
taking and releasing locks while sitting on BufFreelistLock.

In fact, BufFreelistLock is really misnamed, because for the most
part, the "free list" as we implement it is going to be empty.  What the
BufFreelistLock is really doing is serializing the process of scanning
for a free buffer.  I think THAT is the problem.  If we could arrange
things so as to hold BufFreelistLock only for the amount of time
needed to remove a buffer from a freelist ... we'd probably buy
ourselves quite a bit.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Page replacement algorithm in buffer cache

From
Merlin Moncure
Date:
On Mon, Apr 1, 2013 at 10:03 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Apr 1, 2013 at 9:28 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
>>> That is one of multiple issues.  Contention on the BufFreelistLock is
>>> another one.  I agree that usage_count maintenance is unlikely to become a
>>> bottleneck unless one or both of those is fixed first (and maybe not even
>>> then)
>>
>> usage_count manipulation is not a bottleneck but that is irrelevant.
>> It can be affected by other page contention which can lead to priority
>> inversion.  I don't be believe there is any reasonable argument that
>> sitting and spinning while holding the BufFreelistLock is a good idea.
>
> Hmm.  I'm not sure who if anyone I'm agreeing or disagreeing with, but
> I think a big part of the reason why BufFreelistLock contention is
> such a big problem is that we do other operations that involve atomics
> while we're holding that lock.  You can have contention on a lock even
> if you just take it, do some stuff, and release it.  But the longer
> you hold the lock for, the less concurrency you need to have in order
> to get a contention problem.  And atomics take a LOT longer to execute
> than regular instructions - so it seems to me that usage_count
> manipulation is relevant not so much because we get contention on the
> buffer spinlocks as because it means we're sitting there serially
> taking and releasing locks while sitting on BufFreelistLock.
>
> In fact, BufFreelistLock is really misnamed, because for the most
> part, the "free list" as we implement is going to be empty.  What the
> BufFreelistLock is really doing is serializing the process of scanning
> for a free buffer.  I think THAT is the problem.  If we could arrange
> things so as to hold BufFreelistLock only for the amount of time
> needed to remove a buffer from a freelist ... we'd probably buy
> ourselves quite a bit.

right.  I'm imagining a buffer scan loop that looks something like
(uncompiled, untested) this.  "TryLockBufHdr" does a simple TAS
without spin, returning the lock state (well, true if it acquired the
lock).  usage_count is specifically and deliberately adjusted without
having a lock on the buffer header (this would require some careful
testing and possible changes elsewhere):
for (;;)
{
    buf = &BufferDescriptors[StrategyControl->nextVictimBuffer];

    if (++StrategyControl->nextVictimBuffer >= NBuffers)
    {
        StrategyControl->nextVictimBuffer = 0;
        StrategyControl->completePasses++;
    }

    /*
     * If the buffer is pinned or has a nonzero usage_count, we cannot use
     * it; decrement the usage_count (unless pinned) and keep scanning.
     */
    if (buf->refcount == 0)
    {
        if (buf->usage_count > 0)
        {
            buf->usage_count--;
            trycounter = NBuffers;
        }
        else
        {
            if (TryLockBufHdr(buf))
            {
                /* Found a usable buffer */
                if (strategy != NULL)
                    AddBufferToRing(strategy, buf);
                return buf;
            }
        }
    }

    if (--trycounter == 0)
    {
        /*
         * We've scanned all the buffers without making any state changes,
         * so all the buffers are pinned (or were when we looked at them).
         * We could hope that someone will free one eventually, but it's
         * probably better to fail than to risk getting stuck in an
         * infinite loop.
         */
        elog(ERROR, "no unpinned buffers available");
    }
}
 

The advantages are:
*) no spinning under the free list lock (ever)
*) consequently pretty much zero chance of getting scheduled out
*) far less (in some cases drastically less) atomic operations unless
the sweep is generally returning a buffer immediately upon every
request, in which case it's about the same
*) fewer cpu cycles overall, unless you somehow miss a whole bunch of
buffers that claim to be locked when in fact they are not (which seems
improbable)

I think you could implement something approximating the above in
conjunction with your buffer nailing, or without (although your stuff
would likely reduce the number of cases where you'd really need it).
Ditto Jeff J's idea to completely replace BufFreeListLock with a
spinlock implementation.

merlin



Re: Page replacement algorithm in buffer cache

From
Bruce Momjian
Date:
On Mon, Apr  1, 2013 at 11:55:07AM -0500, Merlin Moncure wrote:
> > In fact, BufFreelistLock is really misnamed, because for the most
> > part, the "free list" as we implement is going to be empty.  What the
> > BufFreelistLock is really doing is serializing the process of scanning
> > for a free buffer.  I think THAT is the problem.  If we could arrange
> > things so as to hold BufFreelistLock only for the amount of time
> > needed to remove a buffer from a freelist ... we'd probably buy
> > ourselves quite a bit.
> 
> right.  I'm imaging a buffer scan loop that looks something like
> (uncompiled, untested) this.  "TryLockBufHdr" does a simple TAS
> without spin, returning the lock state (well, true if it acquired the
> lock).  usage_count is specifically and deliberately adjusted without
> having a lock on the buffer header (this would require some careful
> testing and possible changes elsewhere):

TAS does a CPU 'lock' instruction which affects the cpu cache.  Why not
just read the value with no lock?

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: Page replacement algorithm in buffer cache

From
Merlin Moncure
Date:
On Mon, Apr 1, 2013 at 3:32 PM, Bruce Momjian <bruce@momjian.us> wrote:
> On Mon, Apr  1, 2013 at 11:55:07AM -0500, Merlin Moncure wrote:
>> > In fact, BufFreelistLock is really misnamed, because for the most
>> > part, the "free list" as we implement is going to be empty.  What the
>> > BufFreelistLock is really doing is serializing the process of scanning
>> > for a free buffer.  I think THAT is the problem.  If we could arrange
>> > things so as to hold BufFreelistLock only for the amount of time
>> > needed to remove a buffer from a freelist ... we'd probably buy
>> > ourselves quite a bit.
>>
>> right.  I'm imaging a buffer scan loop that looks something like
>> (uncompiled, untested) this.  "TryLockBufHdr" does a simple TAS
>> without spin, returning the lock state (well, true if it acquired the
>> lock).  usage_count is specifically and deliberately adjusted without
>> having a lock on the buffer header (this would require some careful
>> testing and possible changes elsewhere):
>
> TAS does a CPU 'lock' instruction which affects the cpu cache.  Why not
> just read the value with no lock?

check again, that's exactly what it does. Note the old implementation
did a LockBufHdr() before examining refcount.  The key logic is here:

   if (buf->refcount == 0)
   {
     if (buf->usage_count > 0)
     {
       buf->usage_count--;
       trycounter = NBuffers;
     }
     else
     {
       if (TryLockBufHdr(buf))
       {
 

So we do an unlocked read of refcount and immediately bail if the
buffer is "locked" according to the cpu cache.  Then we check (still
unlocked) usage_count and decrement it; our adjustment may be lost,
but so what?  Finally, we attempt one (and only one) cache line lock
(via TAS_SPIN) of the buffer and again bail if we see any problems
there.   Thus, it's impossible to get stuck in a potentially yielding
spin while holding the free list lock.

I dub this: "The Frightened Turtle" strategy of buffer allocation.
It's an idea based on my research trying to solve Vlad's issue of
having server stalls during read-only loads (see here:
http://postgresql.1045698.n5.nabble.com/High-SYS-CPU-need-advise-td5732045.html)
for a general backgrounder.  The idea may not actually fix his issue,
or there may be other aggravating aspects such as the
always-capricious linux scheduler, but I'm suspicious.

merlin



Re: Page replacement algorithm in buffer cache

From
Andres Freund
Date:
On 2013-04-01 08:28:13 -0500, Merlin Moncure wrote:
> On Sun, Mar 31, 2013 at 1:27 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> > On Friday, March 22, 2013, Ants Aasma wrote:
> >>
> >> On Fri, Mar 22, 2013 at 10:22 PM, Merlin Moncure <mmoncure@gmail.com>
> >> wrote:
> >> > well if you do a non-locking test first you could at least avoid some
> >> > cases (and, if you get the answer wrong, so what?) by jumping to the
> >> > next buffer immediately.  if the non locking test comes good, only
> >> > then do you do a hardware TAS.
> >> >
> >> > you could in fact go further and dispense with all locking in front of
> >> > usage_count, on the premise that it's only advisory and not a real
> >> > refcount.  so you only then lock if/when it's time to select a
> >> > candidate buffer, and only then when you did a non locking test first.
> >> >  this would of course require some amusing adjustments to various
> >> > logical checks (usage_count <= 0, heh).
> >>
> >> Moreover, if the buffer happens to miss a decrement due to a data
> >> race, there's a good chance that the buffer is heavily used and
> >> wouldn't need to be evicted soon anyway. (if you arrange it to be a
> >> read-test-inc/dec-store operation then you will never go out of
> >> bounds) However, clocksweep and usage_count maintenance is not what is
> >> causing contention because that workload is distributed. The issue is
> >> pinning and unpinning.
> >
> >
> > That is one of multiple issues.  Contention on the BufFreelistLock is
> > another one.  I agree that usage_count maintenance is unlikely to become a
> > bottleneck unless one or both of those is fixed first (and maybe not even
> > then)
> 
> usage_count manipulation is not a bottleneck but that is irrelevant.
> It can be affected by other page contention which can lead to priority
> inversion.  I don't be believe there is any reasonable argument that
> sitting and spinning while holding the BufFreelistLock is a good idea.

In my experience the mere fact of (unlockedly, but still) accessing all the
buffer headers can cause noticeable slowdowns in write only/mostly workloads with
big amounts of shmem.
Due to the write only nature, large numbers of the buffers have similar
usagecounts (since they are infrequently touched after the initial insertion)
and there are no free ones around, so the search for a buffer frequently runs
through *all* buffer headers multiple times till it has decremented all usagecounts
to 0. Then comes a period where free buffers are found easily (since all
usagecounts from the current sweep point onwards are zero). After that it
starts all over.
I now have seen that scenario multiple times :(

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Page replacement algorithm in buffer cache

From
Merlin Moncure
Date:
On Mon, Apr 1, 2013 at 4:09 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-04-01 08:28:13 -0500, Merlin Moncure wrote:
>> On Sun, Mar 31, 2013 at 1:27 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> > On Friday, March 22, 2013, Ants Aasma wrote:
>> >>
>> >> On Fri, Mar 22, 2013 at 10:22 PM, Merlin Moncure <mmoncure@gmail.com>
>> >> wrote:
>> >> > well if you do a non-locking test first you could at least avoid some
>> >> > cases (and, if you get the answer wrong, so what?) by jumping to the
>> >> > next buffer immediately.  if the non locking test comes good, only
>> >> > then do you do a hardware TAS.
>> >> >
>> >> > you could in fact go further and dispense with all locking in front of
>> >> > usage_count, on the premise that it's only advisory and not a real
>> >> > refcount.  so you only then lock if/when it's time to select a
>> >> > candidate buffer, and only then when you did a non locking test first.
>> >> >  this would of course require some amusing adjustments to various
>> >> > logical checks (usage_count <= 0, heh).
>> >>
>> >> Moreover, if the buffer happens to miss a decrement due to a data
>> >> race, there's a good chance that the buffer is heavily used and
>> >> wouldn't need to be evicted soon anyway. (if you arrange it to be a
>> >> read-test-inc/dec-store operation then you will never go out of
>> >> bounds) However, clocksweep and usage_count maintenance is not what is
>> >> causing contention because that workload is distributed. The issue is
>> >> pinning and unpinning.
>> >
>> >
>> > That is one of multiple issues.  Contention on the BufFreelistLock is
>> > another one.  I agree that usage_count maintenance is unlikely to become a
>> > bottleneck unless one or both of those is fixed first (and maybe not even
>> > then)
>>
>> usage_count manipulation is not a bottleneck but that is irrelevant.
>> It can be affected by other page contention which can lead to priority
>> inversion.  I don't be believe there is any reasonable argument that
>> sitting and spinning while holding the BufFreelistLock is a good idea.
>
> In my experience the mere fact of (unlockedly, but still) accessing all the
> buffer headers can cause noticeable slowdowns in write only/mostly workloads with
> big amounts of shmem.
> Due to the write only nature large amounts of the buffers have a similar
> usagecounts (since they are infrequently touched after the initial insertion)
> and there are no free ones around so the search for a buffer frequently runs
> through *all* buffer headers multiple times till it decremented all usagecounts
> to 0. Then comes a period where free buffers are found easily (since all
> usagecounts from the current sweep point onwards are zero). After that it
> starts all over.
> I now have seen that scenario multiple times :(

Interesting -- I was thinking about that too, but it's a separate
problem with a different trigger.  Maybe a bailout should be in there
so that after X usage_count adjustments the sweeper summarily does an
eviction, or maybe the "max" declines from 5 once per hundred buffers
inspected or some such.
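
Roughly, the declining-cap variant could look like this uncompiled
standalone sketch (invented names, an illustration only, not a patch
against StrategyGetBuffer):

#define MAX_USAGE 5

typedef struct ToyBuf
{
    int refcount;
    int usage_count;
} ToyBuf;

/*
 * Visit one buffer with an effective ceiling that shrinks as the sweep
 * drags on; return 1 if this buffer may be evicted now.
 */
static int
toy_visit_with_ceiling(ToyBuf *b, int buffers_inspected)
{
    int ceiling = MAX_USAGE - buffers_inspected / 100;  /* declining "max" */

    if (ceiling < 0)
        ceiling = 0;

    if (b->refcount != 0)
        return 0;                       /* pinned buffers stay immune */

    if (b->usage_count > ceiling)
        b->usage_count = ceiling;       /* pull stragglers down to the cap */

    if (b->usage_count > 0)
    {
        b->usage_count--;
        return 0;
    }
    return 1;
}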

merlin



Re: Page replacement algorithm in buffer cache

From
Jim Nasby
Date:
On 3/23/13 7:41 AM, Ants Aasma wrote:
> On Sat, Mar 23, 2013 at 6:04 AM, Jim Nasby <jim@nasby.net> wrote:
>> Partitioned clock sweep strikes me as a bad idea... you could certainly get
>> unlucky and end up with a lot of hot stuff in one partition.
>
> Surely that is not worse than having everything in a single partition.
> Given a decent partitioning function it's very highly unlikely to have
> more than a few of the hottest buffers end up in a single partition.

One could argue that it is worse because you've added another layer of unpredictability to performance. If something
happens to suddenly put two heavily hit sets in the same partition, your previously good performance suddenly tanks.
 

Maybe that issue isn't real enough to be worth worrying about, but I still think it'd be easier and cleaner to try
keeping stuff on the free list first...
 

>> Another idea that's been brought up in the past is to have something in the
>> background keep a minimum number of buffers on the free list. That's how OS
>> VM systems I'm familiar with work, so there's precedent for it.
>>
>> I recall there were at least some theoretical concerns about this, but I
>> don't remember if anyone actually tested the idea.
>
> Yes, having bgwriter do the actual cleaning up seems like a good idea.
> The whole bgwriter infrastructure will need some serious tuning. There
> are many things that could be shifted to background if we knew it
> could keep up, like hint bit setting on dirty buffers being flushed
> out. But again, we have the issue of having good tests to see where
> the changes hurt.

I think at some point we need to stop depending on just bgwriter for all this stuff. I believe it would be much cleaner
if we had separate procs for everything we needed (although some synergies might exist; if we wanted to set hint bits
during write then bgwriter *is* the logical place to put that).
 

In this case, I don't think keeping stuff on the free list is close enough to checkpoints that we'd want bgwriter to
handle both. At most we might want them to pass some metrics back and forth.
 
--
Jim C. Nasby, Data Architect                   jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net



Re: Page replacement algorithm in buffer cache

From
Jim Nasby
Date:
On 4/1/13 4:55 PM, Merlin Moncure wrote:
> On Mon, Apr 1, 2013 at 4:09 PM, Andres Freund<andres@2ndquadrant.com>  wrote:
>> >On 2013-04-01 08:28:13 -0500, Merlin Moncure wrote:
>>> >>On Sun, Mar 31, 2013 at 1:27 PM, Jeff Janes<jeff.janes@gmail.com>  wrote:
>>>> >> >On Friday, March 22, 2013, Ants Aasma wrote:
>>>>> >> >>
>>>>> >> >>On Fri, Mar 22, 2013 at 10:22 PM, Merlin Moncure<mmoncure@gmail.com>
>>>>> >> >>wrote:
>>>>>> >> >> >well if you do a non-locking test first you could at least avoid some
>>>>>> >> >> >cases (and, if you get the answer wrong, so what?) by jumping to the
>>>>>> >> >> >next buffer immediately.  if the non locking test comes good, only
>>>>>> >> >> >then do you do a hardware TAS.
>>>>>> >> >> >
>>>>>> >> >> >you could in fact go further and dispense with all locking in front of
>>>>>> >> >> >usage_count, on the premise that it's only advisory and not a real
>>>>>> >> >> >refcount.  so you only then lock if/when it's time to select a
>>>>>> >> >> >candidate buffer, and only then when you did a non locking test first.
>>>>>> >> >> >  this would of course require some amusing adjustments to various
>>>>>> >> >> >logical checks (usage_count <= 0, heh).
>>>>> >> >>
>>>>> >> >>Moreover, if the buffer happens to miss a decrement due to a data
>>>>> >> >>race, there's a good chance that the buffer is heavily used and
>>>>> >> >>wouldn't need to be evicted soon anyway. (if you arrange it to be a
>>>>> >> >>read-test-inc/dec-store operation then you will never go out of
>>>>> >> >>bounds) However, clocksweep and usage_count maintenance is not what is
>>>>> >> >>causing contention because that workload is distributed. The issue is
>>>>> >> >>pinning and unpinning.
>>>> >> >
>>>> >> >
>>>> >> >That is one of multiple issues.  Contention on the BufFreelistLock is
>>>> >> >another one.  I agree that usage_count maintenance is unlikely to become a
>>>> >> >bottleneck unless one or both of those is fixed first (and maybe not even
>>>> >> >then)
>>> >>
>>> >>usage_count manipulation is not a bottleneck but that is irrelevant.
>>> >>It can be affected by other page contention which can lead to priority
>>> >>inversion.  I don't be believe there is any reasonable argument that
>>> >>sitting and spinning while holding the BufFreelistLock is a good idea.
>> >
>> >In my experience the mere fact of (unlockedly, but still) accessing all the
>> >buffer headers can cause noticeable slowdowns in write only/mostly workloads with
>> >big amounts of shmem.
>> >Due to the write only nature large amounts of the buffers have a similar
>> >usagecounts (since they are infrequently touched after the initial insertion)
>> >and there are no free ones around so the search for a buffer frequently runs
>> >through*all*  buffer headers multiple times till it decremented all usagecounts
>> >to 0. Then comes a period where free buffers are found easily (since all
>> >usagecounts from the current sweep point onwards are zero). After that it
>> >starts all over.
>> >I now have seen that scenario multiple times:(
> Interesting -- I was thinking about that too, but it's a separate
> problem with a different trigger.  Maybe a bailout should be in there
> so that after X usage_count adjustments the sweeper summarily does an
> eviction, or maybe the "max" declines from 5 once per hundred buffers
> inspected or some such.

What's the potential downside on that though? IE: what happens if this scheme suddenly evicts the root page on a
heavily used index? You'll suddenly have a ton of stuff blocked waiting for that page to come back in.
 

This is a use case that I think would benefit greatly from a background process that keeps pages in the free list.

That said, I now suspect that your "frightened turtle" approach would be of higher value than "bgfreelist"... but I
suspect we'll ultimately want both of them for different reasons.
 
--
Jim C. Nasby, Data Architect                   jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net



Re: Page replacement algorithm in buffer cache

From
Jim Nasby
Date:
On 3/23/13 4:43 AM, Amit Kapila wrote:
> I have tried one of the ideas: Adding the buffers background writer finds
> reusable to freelist.
> http://www.postgresql.org/message-id/6C0B27F7206C9E4CA54AE035729E9C382852FF97@szxeml509-mbs
> This can reduce the clock sweep as it can find buffers from freelist.

That's a nice potential efficiency gain, but it's not the same as having a separate bg process charged with keeping
pages on the freelist. I believe a separate process would be useful in a wider variety of workloads, because it's not
dependent on stumbling across 0 count blocks; it would actively work to "produce" zero count blocks when none existed
and then free-list them.
 

> It shows performance improvement for read loads when data can be contained
> in shared buffers,
> but when the data becomes large and (I/O) is involved, it shows some dip as
> well.

Do you remember off-hand why it slowed down with I/O so I don't have to read the whole thread? :) Was it just a matter
of it evicting dirty pages sooner than it would otherwise?
 
--
Jim C. Nasby, Data Architect                   jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net



Re: Page replacement algorithm in buffer cache

From
Jim Nasby
Date:
On 3/24/13 8:11 AM, Greg Smith wrote:
> On 3/22/13 8:45 AM, Ants Aasma wrote:
>> However, I think the main issue isn't finding new algorithms that are
>> better in some specific circumstances. The hard part is figuring out
>> whether their performance is better in general.
>
> Right.  The current page replacement method works as expected.  Many frequently accessed pages accumulate a usage
> count of 5 before the clock sweep hits them.  Pages that are accessed once and not again before the clock sweep are
> evicted.  There are several theoretically better ways to approach this.  Anyone who hasn't already been working on this
> for a few years is very unlikely to come up with a brand new idea, one that hasn't already been tested in the academic
> research.
>
> But the real blocker here isn't ideas, it's creating benchmark workloads to validate any change.  Right now I see the
> most promising work that could lead toward the "performance farm" idea as all of the Jenkins based testing that's been
> going on recently.  Craig Ringer has been using that for 2ndQuadrant work here, Peter Eisentraut has been working with it:
> http://petereisentraut.blogspot.com/2013/01/postgresql-and-jenkins.html and the PostGIS project uses it too.  There's
> some good momentum brewing there.
 
>
> When we have regular performance testing with a mix of workloads--I have about 10 in mind to start--at that point we
> could start testing performance changes to the buffer replacement.  Until then this whole area is hard to touch
> usefully.  You have to assume that any tuning you do for one type of workload might accidentally slow another.  Starting
> with a lot of baseline workloads is the only way to move usefully forward when facing that problem.
 

The other thing I think would be tremendously useful would be the ability to get performance data from systems in the
field *without having to install extra stuff or do a special build*. The last point is critical because there are so
many places where deviating from a standard package takes an act of Congress.
 

In this case, if I could run some queries to get stats about clock sweep waits and what-not then I could get our shared
buffer size changed on some hosts and see how those changes affect the numbers. But doing this with a non-standard build
is pretty much a non-starter.
 

I know there's been some improvement in this area, but I suspect there's still more to go.
-- 
Jim C. Nasby, Data Architect                   jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net



Re: Page replacement algorithm in buffer cache

From
Atri Sharma
Date:
>  I don't be believe there is any reasonable argument that
> sitting and spinning while holding the BufFreelistLock is a good idea.

I completely agree. The idea of spinning for a lock while already
holding another lock seems like a recipe for a performance hit.

Regards,

Atri




--
Regards,

Atri
l'apprenant

On Mon, Apr 1, 2013 at 6:58 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Sun, Mar 31, 2013 at 1:27 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> On Friday, March 22, 2013, Ants Aasma wrote:
>>>
>>> On Fri, Mar 22, 2013 at 10:22 PM, Merlin Moncure <mmoncure@gmail.com>
>>> wrote:
>>> > well if you do a non-locking test first you could at least avoid some
>>> > cases (and, if you get the answer wrong, so what?) by jumping to the
>>> > next buffer immediately.  if the non locking test comes good, only
>>> > then do you do a hardware TAS.
>>> >
>>> > you could in fact go further and dispense with all locking in front of
>>> > usage_count, on the premise that it's only advisory and not a real
>>> > refcount.  so you only then lock if/when it's time to select a
>>> > candidate buffer, and only then when you did a non locking test first.
>>> >  this would of course require some amusing adjustments to various
>>> > logical checks (usage_count <= 0, heh).
>>>
>>> Moreover, if the buffer happens to miss a decrement due to a data
>>> race, there's a good chance that the buffer is heavily used and
>>> wouldn't need to be evicted soon anyway. (if you arrange it to be a
>>> read-test-inc/dec-store operation then you will never go out of
>>> bounds) However, clocksweep and usage_count maintenance is not what is
>>> causing contention because that workload is distributed. The issue is
>>> pinning and unpinning.
>>
>>
>> That is one of multiple issues.  Contention on the BufFreelistLock is
>> another one.  I agree that usage_count maintenance is unlikely to become a
>> bottleneck unless one or both of those is fixed first (and maybe not even
>> then)
>
> usage_count manipulation is not a bottleneck but that is irrelevant.
> It can be affected by other page contention which can lead to priority
> inversion.  I don't be believe there is any reasonable argument that
> sitting and spinning while holding the BufFreelistLock is a good idea.
>
> merlin



-- 
Regards,

Atri
l'apprenant



Re: Page replacement algorithm in buffer cache

From
Merlin Moncure
Date:
On Mon, Apr 1, 2013 at 6:09 PM, Jim Nasby <jim@nasby.net> wrote:
> On 4/1/13 4:55 PM, Merlin Moncure wrote:
>>
>> On Mon, Apr 1, 2013 at 4:09 PM, Andres Freund<andres@2ndquadrant.com>
>> wrote:
>>>
>>> >On 2013-04-01 08:28:13 -0500, Merlin Moncure wrote:
>>>>
>>>> >>On Sun, Mar 31, 2013 at 1:27 PM, Jeff Janes<jeff.janes@gmail.com>
>>>> >> wrote:
>>>>>
>>>>> >> >On Friday, March 22, 2013, Ants Aasma wrote:
>>>>>>
>>>>>> >> >>
>>>>>> >> >>On Fri, Mar 22, 2013 at 10:22 PM, Merlin
>>>>>> >> >> Moncure<mmoncure@gmail.com>
>>>>>> >> >>wrote:
>>>>>>>
>>>>>>> >> >> >well if you do a non-locking test first you could at least
>>>>>>> >> >> > avoid some
>>>>>>> >> >> >cases (and, if you get the answer wrong, so what?) by jumping
>>>>>>> >> >> > to the
>>>>>>> >> >> >next buffer immediately.  if the non locking test comes good,
>>>>>>> >> >> > only
>>>>>>> >> >> >then do you do a hardware TAS.
>>>>>>> >> >> >
>>>>>>> >> >> >you could in fact go further and dispense with all locking in
>>>>>>> >> >> > front of
>>>>>>> >> >> >usage_count, on the premise that it's only advisory and not a
>>>>>>> >> >> > real
>>>>>>> >> >> >refcount.  so you only then lock if/when it's time to select a
>>>>>>> >> >> >candidate buffer, and only then when you did a non locking
>>>>>>> >> >> > test first.
>>>>>>> >> >> >  this would of course require some amusing adjustments to
>>>>>>> >> >> > various
>>>>>>> >> >> >logical checks (usage_count <= 0, heh).
>>>>>>
>>>>>> >> >>
>>>>>> >> >>Moreover, if the buffer happens to miss a decrement due to a data
>>>>>> >> >>race, there's a good chance that the buffer is heavily used and
>>>>>> >> >>wouldn't need to be evicted soon anyway. (if you arrange it to be
>>>>>> >> >> a
>>>>>> >> >>read-test-inc/dec-store operation then you will never go out of
>>>>>> >> >>bounds) However, clocksweep and usage_count maintenance is not
>>>>>> >> >> what is
>>>>>> >> >>causing contention because that workload is distributed. The
>>>>>> >> >> issue is
>>>>>> >> >>pinning and unpinning.
>>>>>
>>>>> >> >
>>>>> >> >
>>>>> >> >That is one of multiple issues.  Contention on the BufFreelistLock
>>>>> >> > is
>>>>> >> >another one.  I agree that usage_count maintenance is unlikely to
>>>>> >> > become a
>>>>> >> >bottleneck unless one or both of those is fixed first (and maybe
>>>>> >> > not even
>>>>> >> >then)
>>>>
>>>> >>
>>>> >>usage_count manipulation is not a bottleneck but that is irrelevant.
>>>> >>It can be affected by other page contention which can lead to priority
>>>> >>inversion.  I don't be believe there is any reasonable argument that
>>>> >>sitting and spinning while holding the BufFreelistLock is a good idea.
>>>
>>> >
>>> >In my experience the mere fact of (unlockedly, but still) accessing all
>>> > the
>>> >buffer headers can cause noticeable slowdowns in write only/mostly
>>> > workloads with
>>> >big amounts of shmem.
>>> >Due to the write only nature large amounts of the buffers have a similar
>>> >usagecounts (since they are infrequently touched after the initial
>>> > insertion)
>>> >and there are no free ones around so the search for a buffer frequently
>>> > runs
>>> >through*all*  buffer headers multiple times till it decremented all
>>> > usagecounts
>>>
>>> >to 0. Then comes a period where free buffers are found easily (since all
>>> >usagecounts from the current sweep point onwards are zero). After that
>>> > it
>>> >starts all over.
>>> >I now have seen that scenario multiple times:(
>>
>> Interesting -- I was thinking about that too, but it's a separate
>> problem with a different trigger.  Maybe a bailout should be in there
>> so that after X usage_count adjustments the sweeper summarily does an
>> eviction, or maybe the "max" declines from 5 once per hundred buffers
>> inspected or some such.
>
> What's the potential downside on that though? IE: what happens if this
> scheme suddenly evicts the root page on a heavily used index? You'll
> suddenly have a ton of stuff blocked waiting for that page to come back in.

That seems pretty unlikely because of (a) the sheer luck of hitting that
page for the dropout (if your buffer count is N the chances of losing
it would seem to be 1/N at most) and (b) highly used pages are much more
likely to be pinned and thus immune from eviction.  But my issue with
this whole line of analysis is that I've never been able to turn it
up in simulated testing.   Probably to do it you'd need very very fast
storage.

> This is a use case that I think would benefit greatly from a background
> process that keeps pages in the free list.

> That said, I now suspect that your "frightened turtle" approach would be of
> higher value than "bgfreelist"... but I suspect we'll ultimately want both
> of them for different reasons.

Well maybe.  Performance analysis of patches like this has to be
systematic and extremely thorough, so who knows what's the best way
forward? (hint: not me).  I'm hoping that adjusting clock sweep
behavior will shave a few cycles in uncontended workloads -- this will
make it a lot easier to sell performance wise.  But I'm optimistic and
maybe we can greatly reduce allocation contention without a lot of
work.

merlin



Re: Page replacement algorithm in buffer cache

From
Amit Kapila
Date:

> -----Original Message-----
> From: Jim Nasby [mailto:jim@nasby.net]
> Sent: Tuesday, April 02, 2013 4:43 AM
> To: Amit Kapila
> Cc: 'Ants Aasma'; 'Merlin Moncure'; 'Tom Lane'; 'Atri Sharma'; 'Greg
> Stark'; 'PostgreSQL-development'
> Subject: Re: [HACKERS] Page replacement algorithm in buffer cache
> 
> On 3/23/13 4:43 AM, Amit Kapila wrote:
> > I have tried one of the idea's : Adding the buffers background writer
> finds
> > reusable to freelist.
> > http://www.postgresql.org/message-
> id/6C0B27F7206C9E4CA54AE035729E9C382852FF97@szxeml509-mbs
> > This can reduce the clock swipe as it can find buffers from freelist.
> 
> That's a nice potential efficiency gain, but it's not the same as
> having a separate bg process charged with keeping pages on the
> freelist. I believe a separate process would be useful in a wider
> variety of workloads, because it's not dependent on stumbling across 0
> count blocks; it would actively work to "produce" zero count blocks
> when none existed and then free-list them.

If the bg process works to produce zero count blocks, then wouldn't it end up
reducing usage counts even when there is no demand? Or would you like the
bg process to do that work based on some statistics about buffer usage?

> > It shows performance improvement for read loads when data can be
> contained
> > in shared buffers,
> > but when the data becomes large and (I/O) is involved, it shows some
> dip as
> > well.
> 
> Do you remember off-hand why it slowed down with I/O so I don't have to
> read the whole thread? :) 

You can go through the below mail to read some of my observations regarding the
performance improvement:
http://www.postgresql.org/message-id/6C0B27F7206C9E4CA54AE035729E9C383BC76A4
C@szxeml509-mbx

For the case when I/O is involved, you can check the readings in the below mail:
http://www.postgresql.org/message-id/6C0B27F7206C9E4CA54AE035729E9C383BC76A4
C@szxeml509-mbx


> Was it just a matter of it evicting dirty pages sooner than it would
otherwise?

The exact reason was not nailed down, but as it is an I/O heavy test, there is
some chance that OS scheduling also played a role.
The reason I think the OS can be involved is that when we are flushing
buffers internally, the OS also might change their usage count based on access
of the buffer.
The other reason could be, as pointed out by you, early eviction.


With Regards,
Amit Kapila.




Re: Page replacement algorithm in buffer cache

From
Andres Freund
Date:
On 2013-04-01 17:56:19 -0500, Jim Nasby wrote:
> On 3/23/13 7:41 AM, Ants Aasma wrote:
> >Yes, having bgwriter do the actual cleaning up seems like a good idea.
> >The whole bgwriter infrastructure will need some serious tuning. There
> >are many things that could be shifted to background if we knew it
> >could keep up, like hint bit setting on dirty buffers being flushed
> >out. But again, we have the issue of having good tests to see where
> >the changes hurt.
> 
> I think at some point we need to stop depending on just bgwriter for all this stuff. I believe it would be much
> cleaner if we had separate procs for everything we needed (although some synergies might exist; if we wanted to set hint
> bits during write then bgwriter *is* the logical place to put that).
 
> 
> In this case, I don't think keeping stuff on the free list is close enough to checkpoints that we'd want bgwriter to
> handle both. At most we might want them to pass some metrics back and forth.
 

bgwriter isn't doing checkpoints anymore, there's the checkpointer since 9.2.

In my personal experience and measurement bgwriter is pretty close to
useless right now. I think - pretty similar to what Amit has done - it
should perform part of a real clock sweep instead of just looking ahead
of the current position without changing usagecounts and the sweep
position and put enough buffers on the freelist to sustain the need till
its next activity phase. I hacked around that one night in a hotel and
got impressive speedups (and quite some breakage) for bigger than s_b
workloads.

That would reduce quite a bit of pain points:
- fewer different processes/cpus looking at buffer headers ahead in the cycle
- fewer cpus changing usagecounts
- dirty pages are far more likely to be flushed out already when a new page is needed
- stuff like the relation extension lock, which right now frequently has to search far and wide for new pages while
holding the extension lock exclusively, should finish quite a bit faster
 

If the freelist lock is separated from the lock protecting the clock
sweep, this should get quite a bit of a scalability boost without the
potential unfairness you can have with partitioning the lock or such.
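
As a rough structural sketch of that division of labour (invented names,
locking reduced to comments; an illustration only, not the patch):

#define NBUFFERS 1024
#define MIN_FREE 64              /* watermark the background task maintains */

typedef struct ToyBuf
{
    int refcount;
    int usage_count;
    int on_freelist;
} ToyBuf;

static ToyBuf toy_buffers[NBUFFERS];
static int    toy_freelist[NBUFFERS];  /* ids of free buffers (LIFO) */
static int    toy_free_count = 0;      /* guarded by the freelist lock */
static int    toy_clock_hand = 0;      /* guarded by a separate sweep lock */

/*
 * Background side: run a real clock sweep (decaying usage counts) until
 * the freelist is back above the watermark.  Dirty buffers would be
 * written out before being handed over.
 */
static void
toy_bg_refill_freelist(void)
{
    int tries = NBUFFERS * 6;          /* give up rather than loop forever */

    while (toy_free_count < MIN_FREE && tries-- > 0)
    {
        int     id = toy_clock_hand;
        ToyBuf *b  = &toy_buffers[id];

        toy_clock_hand = (toy_clock_hand + 1) % NBUFFERS;

        if (b->refcount != 0 || b->on_freelist)
            continue;                  /* pinned or already handed over */
        if (b->usage_count > 0)
        {
            b->usage_count--;          /* real sweep: decay the count */
            continue;
        }
        b->on_freelist = 1;
        toy_freelist[toy_free_count++] = id;   /* push, freelist lock held */
    }
}

/* Backend side: normally just pop; -1 means fall back to a foreground sweep. */
static int
toy_get_free_buffer(void)
{
    int id;

    if (toy_free_count == 0)
        return -1;
    id = toy_freelist[--toy_free_count];
    toy_buffers[id].on_freelist = 0;
    return id;
}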

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Page replacement algorithm in buffer cache

From
Robert Haas
Date:
On Tue, Apr 2, 2013 at 6:32 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-04-01 17:56:19 -0500, Jim Nasby wrote:
>> On 3/23/13 7:41 AM, Ants Aasma wrote:
>> >Yes, having bgwriter do the actual cleaning up seems like a good idea.
>> >The whole bgwriter infrastructure will need some serious tuning. There
>> >are many things that could be shifted to background if we knew it
>> >could keep up, like hint bit setting on dirty buffers being flushed
>> >out. But again, we have the issue of having good tests to see where
>> >the changes hurt.
>>
>> I think at some point we need to stop depending on just bgwriter for all this stuff. I believe it would be much
>> cleaner if we had separate procs for everything we needed (although some synergies might exist; if we wanted to set hint
>> bits during write then bgwriter *is* the logical place to put that).
>>
>> In this case, I don't think keeping stuff on the free list is close enough to checkpoints that we'd want bgwriter to
>> handle both. At most we might want them to pass some metrics back and forth.
>
> bgwriter isn't doing checkpoints anymore, there's the checkpointer since 9.2.
>
> In my personal experience and measurement bgwriter is pretty close to
> useless right now. I think - pretty similar to what Amit has done - it
> should perform part of a real clock sweep instead of just looking ahead
> of the current position without changing usagecounts and the sweep
> position and put enough buffers on the freelist to sustain the need till
> its next activity phase. I hacked around that one night in a hotel and
> got impressive speedups (and quite some breakage) for bigger than s_b
> workloads.
>
> That would reduce quite a bit of pain points:
> - fewer different processes/cpus looking at buffer headers ahead in the cycle
> - fewer cpus changing usagecounts
> - dirty pages are far more likely to be flushed out already when a new
>   page is needed
> - stuff like the relation extension lock which right now frequently have
>   to search far and wide for new pages while holding the extension lock
>   exlusively should finish quite a bit faster
>
> If the freelist lock is separated from the lock protecting the clock
> sweep this should get quite a bit of a scalability boost without having
> potential unfairness you can have with partitioning the lock or such.

I agree.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Page replacement algorithm in buffer cache

From
Robert Haas
Date:
On Tue, Apr 2, 2013 at 1:53 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
> That seems pretty unlikely because of A sheer luck of hitting that
> page for the dropout (if your buffer count is N the chances of losing
> it would seem to be 1/N at most) and B highly used pages are much more
> likely to be pinned and thus immune from eviction.  But my issue with
> this whole line of analysis is that I've never been able to to turn it
> up in simulated testing.   Probably to do it you'd need very very fast
> storage.

Well, if you have shared_buffers=8GB, that's a million buffers.  One
in a million events happen pretty frequently on a heavily loaded
server, which, on recent versions of PostgreSQL, can support several
hundred thousand queries per second, each of which accesses multiple
buffers.

I've definitely seen evidence that poor choices of which CLOG buffer
to evict can result in a noticeable system-wide stall while everyone
waits for it to be read back in.  I don't have any similar evidence
for shared buffers, but I wouldn't be very surprised if the same
danger exists there, too.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Page replacement algorithm in buffer cache

From
Merlin Moncure
Date:
On Tue, Apr 2, 2013 at 9:55 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Apr 2, 2013 at 1:53 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
>> That seems pretty unlikely because of A sheer luck of hitting that
>> page for the dropout (if your buffer count is N the chances of losing
>> it would seem to be 1/N at most) and B highly used pages are much more
>> likely to be pinned and thus immune from eviction.  But my issue with
>> this whole line of analysis is that I've never been able to turn it
>> up in simulated testing.   Probably to do it you'd need very very fast
>> storage.
>
> Well, if you have shared_buffers=8GB, that's a million buffers.  One
> in a million events happen pretty frequently on a heavily loaded
> server, which, on recent versions of PostgreSQL, can support several
> hundred thousand queries per second, each of which accesses multiple
> buffers.
>
> I've definitely seen evidence that poor choices of which CLOG buffer
> to evict can result in a noticeable system-wide stall while everyone
> waits for it to be read back in.  I don't have any similar evidence
> for shared buffers, but I wouldn't be very surprised if the same
> danger exists there, too.

That's a very fair point, although not being able to evict pinned
buffers is a highly mitigating aspect.  Also CLOG is a different beast
entirely -- it's much more dense (2 bits!) vs a tuple, so a single page
can hold a lot of high priority things.  But you could be right anyway.

Given that, I wouldn't feel very comfortable with forced eviction
without knowing for sure high priority buffers were immune from that.
Your nailing idea is maybe the ideal solution.   Messing around with
the usage_count mechanic is tempting (like raising the cap and making
the sweeper more aggressive as it iterates), but probably really
difficult to get right, and, hopefully, ultimately moot.
merlin



Re: Page replacement algorithm in buffer cache

From
Robert Haas
Date:
On Tue, Apr 2, 2013 at 11:32 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
> That's a very fair point, although not being able to evict pinned
> buffers is a highly mitigating aspect.  Also CLOG is a different beast
> entirely -- it's much more dense (2 bits!) vs a tuple so a single page
> can hold a lot of high priority things.  But you could be right anyway.
>
> Given that, I wouldn't feel very comfortable with forced eviction
> without knowing for sure high priority buffers were immune from that.
> Your nailing idea is maybe the ideal solution.   Messing around with
> the usage_count mechanic is tempting (like raising the cap and making
> the sweeper more aggressive as it iterates), but probably really
> difficult to get right, and, hopefully, ultimately moot.

One thought I had for fiddling with usage_count is to make it grow
additively (x = x + 1) and decay exponentially (x = x >> 1).  I'm not
sure the idea is any good, but one problem with the current system is
that it's pretty trivial for a buffer to accumulate five touches, and
after that we lose all memory of what the frequency of access is, so
pages of widely varying levels of "hotness" can all have usage
count 5.  This might allow a little more refinement without letting
the time to degrade the usage count get out of control.
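
To make the shape of that concrete, here's a minimal standalone sketch (the
struct, the function names, and the USAGE_CEILING cap are invented for
illustration; this is not the actual bufmgr code):

#include <stdint.h>

#define USAGE_CEILING 64        /* hypothetical cap, much higher than today's 5 */

typedef struct SketchBufferDesc
{
    uint16_t    usage_count;
} SketchBufferDesc;

/* On every buffer access: grow additively (x = x + 1, up to the cap). */
static void
sketch_touch(SketchBufferDesc *buf)
{
    if (buf->usage_count < USAGE_CEILING)
        buf->usage_count++;
}

/* On every clock-sweep visit: decay exponentially (x = x >> 1). */
static void
sketch_sweep_visit(SketchBufferDesc *buf)
{
    buf->usage_count >>= 1;
}

With a higher cap and halving on each sweep pass, a buffer touched once per
sweep cycle hovers around 1-2 while a genuinely hot buffer can sit much
higher, so the counter keeps carrying frequency information instead of
saturating.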

But, having said that, I still think the best idea is what Andres
proposed, which pretty much matches my own thoughts: the bgwriter
needs to populate the free list, so that buffer allocations don't have
to wait for linear scans of the buffer array.  That's just plain too
slow.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Page replacement algorithm in buffer cache

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> But, having said that, I still think the best idea is what Andres
> proposed, which pretty much matches my own thoughts: the bgwriter
> needs to populate the free list, so that buffer allocations don't have
> to wait for linear scans of the buffer array.  That's just plain too
> slow.

I agree in general, though I'm not sure the bgwriter process can
reasonably handle this need along with what it's already supposed to be
doing.  We may need another background process that is just responsible
for keeping the freelist populated.
        regards, tom lane



Re: Page replacement algorithm in buffer cache

From
Andres Freund
Date:
On 2013-04-02 11:54:32 -0400, Robert Haas wrote:
> On Tue, Apr 2, 2013 at 11:32 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
> > That's a very fair point, although not being able to evict pinned
> > buffers is a highly mitigating aspect.  Also CLOG is a different beast
> > entirely -- it's much more dense (2 bits!) vs a tuple so a single page
> > can hold a lot of high priority things.  But you could be right anyway.
> >
> > Given that, I wouldn't feel very comfortable with forced eviction
> > without knowing for sure high priority buffers were immune from that.
> > Your nailing idea is maybe the ideal solution.   Messing around with
> > the usage_count mechanic is tempting (like raising the cap and making
> > the sweeper more aggressive as it iterates), but probably really
> > difficult to get right, and, hopefully, ultimately moot.
> 
> One thought I had for fiddling with usage_count is to make it grow
> additively (x = x + 1) and decay exponentially (x = x >> 1).  I'm not
> sure the idea is any good, but one problem with the current system is
> that it's pretty trivial for a buffer to accumulate five touches, and
> after that we lose all memory of what the frequency of access is, so a
> pages of varying different levels of "hotness" can all have usage
> count 5.  This might allow a little more refinement without letting
> the time to degrade the usage count get out of control.
> 
> But, having said that, I still think the best idea is what Andres
> proposed, which pretty much matches my own thoughts: the bgwriter
> needs to populate the free list, so that buffer allocations don't have
> to wait for linear scans of the buffer array.  That's just plain too
> slow.

That way the usagecount should go down far more slowly, which essentially makes
it more granular. And I think fiddling on that level before we have a more
sensible buffer acquisition implementation is pretty premature, since that will
change too much.

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Page replacement algorithm in buffer cache

From
Andres Freund
Date:
On 2013-04-02 12:22:03 -0400, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > But, having said that, I still think the best idea is what Andres
> > proposed, which pretty much matches my own thoughts: the bgwriter
> > needs to populate the free list, so that buffer allocations don't have
> > to wait for linear scans of the buffer array.  That's just plain too
> > slow.
> 
> I agree in general, though I'm not sure the bgwriter process can
> reasonably handle this need along with what it's already supposed to be
> doing.  We may need another background process that is just responsible
> for keeping the freelist populated.

What else is the bgwriter actually doing otherwise? Sure, it doesn't put the
pages on the freelist, but otherwise it's trying to do the above, isn't it?

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Page replacement algorithm in buffer cache

From
Tom Lane
Date:
Andres Freund <andres@2ndquadrant.com> writes:
> On 2013-04-02 12:22:03 -0400, Tom Lane wrote:
>> I agree in general, though I'm not sure the bgwriter process can
>> reasonably handle this need along with what it's already supposed to be
>> doing.  We may need another background process that is just responsible
>> for keeping the freelist populated.

> What else is the bgwriter actually doing otherwise? Sure, it doesn't put the
> pages on the freelist, but otherwise its trying to do the above isn't it?

I think it will be problematic to tie buffer-undirtying to putting both
clean and dirty buffers into the freelist.  It might chance to work all
right to use one scan process for both, but I'm afraid it's more likely
that we'd end up either overserving one goal or underserving the other.

Also note the entire design of BgBufferSync right now is predicated on
the assumption that the rate of motion of the scan strategy point
reflects the rate at which new buffers are needed.  If backends are
supposed to always get new buffers from the freelist, that logic becomes
circular: the bgwriter would be watching itself.  Perhaps we can
refactor the feedback control loop logic so that the buffer scan rate is
driven by changes in the length of the freelist, but I'm not sure
exactly what the consequences would be.
        regards, tom lane



Re: Page replacement algorithm in buffer cache

From
Andres Freund
Date:
On 2013-04-02 12:56:56 -0400, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > On 2013-04-02 12:22:03 -0400, Tom Lane wrote:
> >> I agree in general, though I'm not sure the bgwriter process can
> >> reasonably handle this need along with what it's already supposed to be
> >> doing.  We may need another background process that is just responsible
> >> for keeping the freelist populated.
> 
> > What else is the bgwriter actually doing otherwise? Sure, it doesn't put the
> > pages on the freelist, but otherwise its trying to do the above isn't it?
> 
> I think it will be problematic to tie buffer-undirtying to putting both
> clean and dirty buffers into the freelist.  It might chance to work all
> right to use one scan process for both, but I'm afraid it's more likely
> that we'd end up either overserving one goal or underserving the other.

Hm. I had imagined that we would only ever put clean buffers into the
freelist and that we would never write out a buffer that we don't need
for a new page. I don't see much point in randomly writing out buffers
that won't be needed for something else soon. Currently we can't do much
better than basically undirtying random buffers, since we don't really know
which page will reach a usagecount of zero, because bgwriter doesn't
manipulate usagecounts.

One other scenario I can see is the problem of strategy buffer reuse
of dirtied pages (hint bits, pruning) during seqscans, where we would
benefit from pages being written out fast, but I can't imagine that that
could be handled very well by something like the bgwriter?

Am I missing something?

> Also note the entire design of BgBufferSync right now is predicated on
> the assumption that the rate of motion of the scan strategy point
> reflects the rate at which new buffers are needed.  If backends are
> supposed to always get new buffers from the freelist, that logic becomes
> circular: the bgwriter would be watching itself.  Perhaps we can
> refactor the feedback control loop logic so that the buffer scan rate is
> driven by changes in the length of the freelist, but I'm not sure
> exactly what the consequences would be.

Yea, that will definitely need refactoring. What I am imagining is that
the pacing is basically built on top of a few StrategyControl variables
like:
* number of buffers from the freelist
* number of buffers acquired by backend because freelist was empty
* number of buffers written out by backend because freelist was empty

Those should be pretty cheap to maintain and should be enough to
implement sensible pacing for bgwriter.
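
As a rough standalone sketch (hypothetical field and function names, not a
patch), the control area could carry something like:

#include <stdint.h>

typedef struct SketchStrategyControl
{
    uint64_t    buffers_from_freelist;      /* allocations satisfied by the freelist */
    uint64_t    buffers_alloc_fallback;     /* backend had to run the clock sweep itself */
    uint64_t    buffers_backend_writes;     /* backend had to write out a dirty page */
} SketchStrategyControl;

/*
 * Pacing heuristic: if backends fell back to the sweep (or, worse, wrote
 * pages themselves) since the last cycle, refill the freelist harder.
 */
static int
sketch_next_fill_target(const SketchStrategyControl *now,
                        const SketchStrategyControl *prev,
                        int current_target)
{
    uint64_t    fallbacks = now->buffers_alloc_fallback - prev->buffers_alloc_fallback;
    uint64_t    writes = now->buffers_backend_writes - prev->buffers_backend_writes;

    if (fallbacks > 0 || writes > 0)
        return current_target * 2;
    return current_target;
}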

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Page replacement algorithm in buffer cache

From
Greg Stark
Date:
I'm confused by this thread. We *used* to maintain an LRU. The whole
reason for the clock-sweep algorithm is precisely to avoid maintaining
a linked list of least recently used buffers since the head of that
list is a point of contention.



Re: Page replacement algorithm in buffer cache

From
Andres Freund
Date:
On 2013-04-02 18:26:23 +0100, Greg Stark wrote:
> I'm confused by this thread. We *used* to maintain an LRU. The whole
> reason for the clock-sweep algorithm is precisely to avoid maintaining
> a linked list of least recently used buffers since the head of that
> list is a point of contention.

I don't think anybody is proposing to put the LRU back into a linked list;
given the frequency of usagecount manipulations, that would probably end pretty
badly. What I think Robert, Tom and I are talking about is putting
*some* buffers with usagecount = 0 into a linked list so that when a backend
requires a new page it can take one buffer from the freelist instead of

a) possibly touching quite a few buffer headers (I have seen 4 times *every*
existing header) to find one with usagecount = 0
b) having to write the page out itself

If everything is going well that would mean only the bgwriter (or a
bgfreelist or whatever) performs the clock sweep. Others take *new* pages from
the freelist instead of performing part of the sweep themselves.
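
As a rough sketch of the allocation path being described (standalone toy code
with invented names, not the real StrategyGetBuffer; locking omitted):

#include <stddef.h>

typedef struct SketchBuf
{
    int                 refcount;
    int                 usage_count;
    struct SketchBuf   *free_next;
} SketchBuf;

static SketchBuf *sketch_freelist;      /* filled by the background process */

/* Stand-in for today's behaviour: the backend runs the clock sweep itself. */
static SketchBuf *
sketch_clock_sweep(void)
{
    return NULL;
}

static SketchBuf *
sketch_get_buffer(void)
{
    while (sketch_freelist != NULL)
    {
        SketchBuf  *buf = sketch_freelist;

        sketch_freelist = buf->free_next;

        /* The buffer may have been touched since it was listed; re-check. */
        if (buf->refcount == 0 && buf->usage_count == 0)
            return buf;
    }

    /* Freelist empty: fall back to sweeping, as backends do today. */
    return sketch_clock_sweep();
}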

Makes more sense?

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Page replacement algorithm in buffer cache

From
Atri Sharma
Date:
On Tue, Apr 2, 2013 at 9:24 PM, Robert Haas <robertmhaas@gmail.com> wrote:

> One thought I had for fiddling with usage_count is to make it grow
> additively (x = x + 1) and decay exponentially (x = x >> 1).  I'm not
> sure the idea is any good, but one problem with the current system is
> that it's pretty trivial for a buffer to accumulate five touches, and
> after that we lose all memory of what the frequency of access is, so a
> pages of varying different levels of "hotness" can all have usage
> count 5.  This might allow a little more refinement without letting
> the time to degrade the usage count get out of control.

This is just off the top of my head, but one possible solution could
be to quantize the levels of hotness. Specifically, we could
categorize buffers based on hotness. All buffers start in level 1 and
usage_count 0. When a buffer has reached usage_count 5 and the next
access would increment its usage_count (hence taking it above 5), we
instead promote the buffer to the next level and reset its
usage_count to 0. The same logic applies at each level. When we
decrement usage_count and see that it is zero (for some buffer), if
it is in a level > 1, we demote the buffer to the next lower level. If
the buffer is in level 1, it is a potential candidate for replacement.

This will allow us to have a loose idea about the hotness of a page,
without actually storing the usage_count for a buffer. We can still
update usage_count without locking, as buffers in high contention
which miss an update in their usage_count won't be affected by that
missed update, in accordance with all the discussion upthread.
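
A standalone sketch of what I mean (the names, the number of levels, and the
behaviour on demotion are just assumptions for illustration):

#include <stdbool.h>
#include <stdint.h>

#define USAGE_CAP   5
#define MAX_LEVEL   3

typedef struct SketchBuf
{
    uint8_t     level;          /* 1 .. MAX_LEVEL */
    uint8_t     usage_count;    /* 0 .. USAGE_CAP */
} SketchBuf;

/* Buffer accessed: bump the count, promoting one level at the cap. */
static void
sketch_touch(SketchBuf *buf)
{
    if (buf->usage_count < USAGE_CAP)
        buf->usage_count++;
    else if (buf->level < MAX_LEVEL)
    {
        buf->level++;
        buf->usage_count = 0;
    }
}

/* Clock-sweep visit: returns true if the buffer is an eviction candidate. */
static bool
sketch_sweep_visit(SketchBuf *buf)
{
    if (buf->usage_count > 0)
    {
        buf->usage_count--;
        return false;
    }
    if (buf->level > 1)
    {
        buf->level--;
        buf->usage_count = USAGE_CAP;   /* assumed: refill the count on demotion */
        return false;
    }
    return true;                        /* level 1, count 0: replaceable */
}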

Thoughts/Comments?

Regards,

Atri


--
Regards,

Atri
l'apprenant



Re: Page replacement algorithm in buffer cache

From
Merlin Moncure
Date:
On Tue, Apr 2, 2013 at 12:50 PM, Atri Sharma <atri.jiit@gmail.com> wrote:
> On Tue, Apr 2, 2013 at 9:24 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
>> One thought I had for fiddling with usage_count is to make it grow
>> additively (x = x + 1) and decay exponentially (x = x >> 1).  I'm not
>> sure the idea is any good, but one problem with the current system is
>> that it's pretty trivial for a buffer to accumulate five touches, and
>> after that we lose all memory of what the frequency of access is, so a
>> pages of varying different levels of "hotness" can all have usage
>> count 5.  This might allow a little more refinement without letting
>> the time to degrade the usage count get out of control.
>
> This is just off the top of my head, but one possible solution could
> be to quantize the levels of hotness. Specifically, we could
> categorize buffers based on hotness. All buffers start in level 1 and
> usage_count 0. Whichever buffer reaches usage_count of 5, and next
> clock sweep which wants to increment its usage_count(hence taking it
> above 5) sees that, it promotes the buffer to the next level, and
> resets its usage_count to 0. Same logic applies for each level. When
> we decrement usage_count and see that it is zero(for some buffer), if
> it is in a level > 1, we demote the buffer to the next lower level. If
> the buffer is in level 1, it is a potential candidate for replacement.
>
> This will allow us to have a loose idea about the hotness of a page,
> without actually storing the usage_count for a buffer. We can still
> update usage_count without locking, as buffers in high contention
> which miss an update in their usage_count wont be affected by that
> missed update, in accordance with all the discussion upthread.

how is that different from usage_count itself? usage_count *is* a
measure of hotness.  the arbitrary cap at 5 is paranoia to prevent the
already considerable damage that occurs in the situation Andres is
talking about (where everything is marked 'hot' so you have to sweep a
lot more).

also, any added complexity in terms of manipulating usage_count is a
move away from the lockless maintenance I'm proposing.  maybe my idea
is a non-starter on that basis alone, but the mechanic should be kept
as simple as possible.  the idea to move it to the bgwriter is to
pre-emptively do the work that backends are now doing: try and keep
ahead of the allocations being done so that buffer requests are
satisfied quickly.

merlin



Re: Page replacement algorithm in buffer cache

From
Atri Sharma
Date:

Sent from my iPad

On 02-Apr-2013, at 23:41, Merlin Moncure <mmoncure@gmail.com> wrote:

> On Tue, Apr 2, 2013 at 12:50 PM, Atri Sharma <atri.jiit@gmail.com> wrote:
>> On Tue, Apr 2, 2013 at 9:24 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>
>>> One thought I had for fiddling with usage_count is to make it grow
>>> additively (x = x + 1) and decay exponentially (x = x >> 1).  I'm not
>>> sure the idea is any good, but one problem with the current system is
>>> that it's pretty trivial for a buffer to accumulate five touches, and
>>> after that we lose all memory of what the frequency of access is, so a
>>> pages of varying different levels of "hotness" can all have usage
>>> count 5.  This might allow a little more refinement without letting
>>> the time to degrade the usage count get out of control.
>>
>> This is just off the top of my head, but one possible solution could
>> be to quantize the levels of hotness. Specifically, we could
>> categorize buffers based on hotness. All buffers start in level 1 and
>> usage_count 0. Whichever buffer reaches usage_count of 5, and next
>> clock sweep which wants to increment its usage_count(hence taking it
>> above 5) sees that, it promotes the buffer to the next level, and
>> resets its usage_count to 0. Same logic applies for each level. When
>> we decrement usage_count and see that it is zero(for some buffer), if
>> it is in a level > 1, we demote the buffer to the next lower level. If
>> the buffer is in level 1, it is a potential candidate for replacement.
>>
>> This will allow us to have a loose idea about the hotness of a page,
>> without actually storing the usage_count for a buffer. We can still
>> update usage_count without locking, as buffers in high contention
>> which miss an update in their usage_count wont be affected by that
>> missed update, in accordance with all the discussion upthread.
>
> how is that different from usage_count itself? usage_count *is* a
> measure of hotness.  the arbitrary cap at 5 is paranoia to prevent the
> already considerable damage that occurs in the situation Andres is
> talking about (where everyhing is marked 'hot' so you have to sweep a
> lot more).
>
> also, any added complexity in terms of manipulating usage_count is a
> move away from the lockless maintenance I'm proposing.  maybe my idea
> is a non-starter on that basis alone, but the mechanic should be kept
> as simple as possible.  the idea to move it to the bgwriter is to
> pre-emptively do the work that backends are now doing: try and keep
> ahead of the allocations being done so that buffer requests are
> satisfied quickly.
>

I agree, we want to reduce the complexity of usage_count. I was only musing on the point Robert raised, about whether we
want to continue using usage_count and refine it for more accurate tracking of the hotness of a buffer.

Regards,

Atri


Re: Page replacement algorithm in buffer cache

From
Robert Haas
Date:
On Tue, Apr 2, 2013 at 1:20 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-04-02 12:56:56 -0400, Tom Lane wrote:
>> Andres Freund <andres@2ndquadrant.com> writes:
>> > On 2013-04-02 12:22:03 -0400, Tom Lane wrote:
>> >> I agree in general, though I'm not sure the bgwriter process can
>> >> reasonably handle this need along with what it's already supposed to be
>> >> doing.  We may need another background process that is just responsible
>> >> for keeping the freelist populated.
>>
>> > What else is the bgwriter actually doing otherwise? Sure, it doesn't put the
>> > pages on the freelist, but otherwise its trying to do the above isn't it?
>>
>> I think it will be problematic to tie buffer-undirtying to putting both
>> clean and dirty buffers into the freelist.  It might chance to work all
>> right to use one scan process for both, but I'm afraid it's more likely
>> that we'd end up either overserving one goal or underserving the other.
>
> Hm. I had imagined that we would only ever put clean buffers into the
> freelist and that we would never write out a buffer that we don't need
> for a new page. I don't see much point in randomly writing out buffers
> that won't be needed for something else soon. Currently we can't do much
> better than basically undirtying random buffers since we don't really know
> which page will reach a usagecount of zero since bgwriter doesn't
> manipulate usagecounts.
>
> One other scenario I can see is the problem of strategy buffer reusage
> of dirtied pages (hint bits, pruning) during seqscans where we would
> benefit from pages being written out fast, but I can't imagine that that
> could be handled very well by something like the bgwriter?
>
> Am I missing something?

I've had the same thought.  I think we should consider having a
background process that listens on a queue, sort of like the fsync
absorption queue.  When a process using a buffer access strategy
dirties a buffer, it adds it to that queue and sets the latch for the
background process, which then wakes up and starts cleaning the
buffers that have been added to its queue.  The hope is that, by the
time the ring buffer wraps around, the background process will have
cleaned the buffer, preventing the foreground process from having to
wait for the buffer write (and, perhaps, xlog flush).
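
A minimal sketch of the queue side of that (toy code with invented names; the
real thing would live in shared memory and use a spinlock plus the latch
machinery, none of which is shown here):

#include <stdbool.h>

#define CLEAN_QUEUE_SIZE 1024

typedef struct SketchCleanQueue
{
    int     buffer_ids[CLEAN_QUEUE_SIZE];
    int     head;               /* next slot the cleaner reads */
    int     tail;               /* next slot a backend writes  */
} SketchCleanQueue;

/* Backend side: the ring strategy just dirtied a buffer it will soon recycle. */
static bool
sketch_request_clean(SketchCleanQueue *q, int buf_id)
{
    int     next = (q->tail + 1) % CLEAN_QUEUE_SIZE;

    if (next == q->head)
        return false;           /* queue full: caller cleans the buffer itself */
    q->buffer_ids[q->tail] = buf_id;
    q->tail = next;
    /* here we would set the cleaner process's latch to wake it up */
    return true;
}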

The main hesitation I've had about actually implementing such a scheme
is that I find it a bit unappealing to have a background process
dedicated to just this.  But maybe it could be combined with some of
the other ideas presented here.  Perhaps we should have one process
that scans the buffer arena and populates the freelists; as a side
effect, if it runs across a dirty buffer, it kicks it over to the
process described in the previous paragraph (which could still, also,
absorb requests from other backends using buffer access strategies).
Then we'd end up with nothing that looks exactly like the background
writer we have now, but maybe no one would miss it.

I think that as we go through the process of trying to improve this,
we should also look hard at trying to make the algorithms more
self-tuning.  For example, instead of having a fixed delay between
rounds for the buffer-arena-scanning process, I think we should try to
make it adaptive.  If it sticks a bunch of buffers on the freelist and
the freelist then runs dry before it wakes up again, the backend that
notices that the list is empty (or below some low watermark) should
set a latch to wake up the buffer-arena-scanning process; and
the next time that process goes back to sleep, it should sleep for a
shorter period of time.  As things are today, what the background
writer actually does is unhelpful enough that there might not be much
point in fiddling with this, but as we get to having a more sensible
scheme overall, I think it will pay dividends.
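
For example, the wakeup interval could adapt along these lines (a sketch;
the constants are placeholders, not proposed GUCs):

#include <stdbool.h>

#define MIN_SLEEP_MS   10
#define MAX_SLEEP_MS   200

static int
sketch_next_sleep(int current_sleep_ms, bool woken_by_backend)
{
    if (woken_by_backend)
    {
        /* Freelist hit its low watermark before we woke up: speed up. */
        current_sleep_ms /= 2;
        if (current_sleep_ms < MIN_SLEEP_MS)
            current_sleep_ms = MIN_SLEEP_MS;
    }
    else
    {
        /* We woke on schedule with the freelist still stocked: relax a bit. */
        current_sleep_ms += current_sleep_ms / 4;
        if (current_sleep_ms > MAX_SLEEP_MS)
            current_sleep_ms = MAX_SLEEP_MS;
    }
    return current_sleep_ms;
}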

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Page replacement algorithm in buffer cache

From
Greg Stark
Date:

On Wed, Apr 3, 2013 at 3:00 PM, Robert Haas <robertmhaas@gmail.com> wrote:
The main hesitation I've had about actually implementing such a scheme
is that I find it a bit unappealing to have a background process
dedicated to just this.  But maybe it could be combined with some of
the other ideas presented here.  Perhaps we should have one process
that scans the buffer arena and populates the freelists; as a side
effect, if it runs across a dirty buffer, it kicks it over to the
process described in the previous paragraph (which could still, also,
absorb requests from other backends using buffer access strategies).
Then we'd end up with nothing that looks exactly like the background
writer we have now, but maybe no one would miss it.

I think the general pattern of development has led in the opposite direction. Every time there's been one daemon responsible for two things, it's ended up causing contorted code and difficult-to-tune behaviours, and we've ended up splitting the two.

In particular, in this case it seems like an especially poor choice. In the time one buffer write might take, the entire freelist might empty out. I could easily imagine this happening *every* time a write I/O happens. It seems more likely that you'll need multiple processes running the buffer cleaning to keep up with a decent I/O subsystem.

I'm still skeptical about the idea of a "freelist". That just seems like a terrible point of contention. However perhaps that's because I'm picturing an LRU linked list. Perhaps the right thing is to maintain a pool of buffers in some less contention-prone data structure which lets each backend pick buffers out more or less independently of the others.

--
greg

Re: Page replacement algorithm in buffer cache

From
Tom Lane
Date:
Greg Stark <stark@mit.edu> writes:
> I'm still skeptical about the idea of a "freelist". That just seems like a
> terrible point of contention. However perhaps that's because I'm picturing
> an LRU linked list. Perhaps the right thing is to maintain a pool of
> buffers in some less contention-prone data structure which lets each
> backend pick buffers out more or less independently of the others.

I think the original vision of the clock sweep algorithm included the
idea that different backends could be running the sweep over different
parts of the buffer ring concurrently.  If we could get rid of the need
to serialize that activity, it'd pretty much eliminate the bottleneck
I should think.  The problem is how to manage it to ensure that (a)
backends aren't actually contending on the same buffers as they do this,
and (b) there's a reasonably fair rate of usage_count decrements for
each buffer, rather than possibly everybody ganging up on the same area
sometimes.  Thoughts?
        regards, tom lane



Re: Page replacement algorithm in buffer cache

From
Ants Aasma
Date:
On Thu, Apr 4, 2013 at 3:41 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I think the original vision of the clock sweep algorithm included the
> idea that different backends could be running the sweep over different
> parts of the buffer ring concurrently.  If we could get rid of the need
> to serialize that activity, it'd pretty much eliminate the bottleneck
> I should think.  The problem is how to manage it to ensure that (a)
> backends aren't actually contending on the same buffers as they do this,
> and (b) there's a reasonably fair rate of usage_count decrements for
> each buffer, rather than possibly everybody ganging up on the same area
> sometimes.  Thoughts?

Fairness and avoiding each other imply that some coordination is
required. One wild idea I had is to partition the buffers into N partitions
and have one clock sweep for each, protected by an lwlock.

To reduce contention, the clocksweep runners use something like
sched_getcpu() to determine which partition to use to find their
buffer. Using the CPU to determine the partition makes it necessary
for the process to be scheduled out in the critical section for some
other backend to contend on the lock. And if some backend does contend
on it, it is likely to reside on the same CPU and by sleeping will
make room for the lockholder to run.

To ensure fairness for buffers, every time one of the clock sweeps
wraps around, a global offset counter is incremented. This re-assigns
all cpus/backends to the next partition, sort of like the Mad Hatter's
tea party.
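
A small sketch of the partition selection (sched_getcpu() is Linux/glibc
specific, and synchronisation of the offset counter is not shown; the names
are made up):

#define _GNU_SOURCE
#include <sched.h>

#define NUM_SWEEP_PARTITIONS 16

static unsigned int sweep_rotation_offset;  /* bumped whenever any sweep wraps */

static int
sketch_pick_partition(void)
{
    int     cpu = sched_getcpu();

    if (cpu < 0)
        cpu = 0;                /* fallback if the call is unavailable */
    return (cpu + sweep_rotation_offset) % NUM_SWEEP_PARTITIONS;
}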

The scenario that I'm most worried about here is what happens when a
process holding the clocksweep lock is migrated to another CPU and
then scheduled out. The processes on the original CPU will sooner or
later block behind the lock while the processes on the CPU where the
lock holder now waits keep hogging the CPU. This will create an
imbalance that the scheduler might try to correct, possibly creating a
nasty feedback loop. It could be that in practice the scenario works
out to be too far fetched to matter, but who knows.

I don't have a slightest idea yet how the background writer would
function in this environment. But if redesigning the bgwriter
mechanism was on the table I thought I would throw the idea out here.

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de



Re: Page replacement algorithm in buffer cache

From
Greg Smith
Date:
On 4/2/13 11:54 AM, Robert Haas wrote:
> But, having said that, I still think the best idea is what Andres
> proposed, which pretty much matches my own thoughts: the bgwriter
> needs to populate the free list, so that buffer allocations don't have
> to wait for linear scans of the buffer array.

I was hoping this one would make it to a full six years of being on the 
TODO list before it came up again, missed it by a few weeks.  The 
funniest part is that Amit even submitted a patch on this theme a few 
months ago without much feedback: 
http://www.postgresql.org/message-id/6C0B27F7206C9E4CA54AE035729E9C382852FF97@szxeml509-mbs
That stalled where a few things have, on a) needing more regression 
test workloads, and b) wondering just what the deal with large 
shared_buffers setting degrading performance was.

I saw refactoring in this area as waiting behind it being easier to 
experiment with adding new processes, but that barrier has fallen now. 
Maybe it needs a new freelist process, maybe it doesn't; today the code 
needed to try both is relatively cheap.

The other thing that always seemed to stop me was never having a typical 
Linux system big enough to hit some of these problems available all the 
time.  What I did this week on that front was just go buy a 24 core 
server with 64GB of RAM that lives in my house.  I just need to keep it 
two floors away if I want to sleep at night.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com



Re: Page replacement algorithm in buffer cache

From
Amit Kapila
Date:
On Thursday, April 04, 2013 7:19 AM Greg Smith wrote:
> On 4/2/13 11:54 AM, Robert Haas wrote:
> > But, having said that, I still think the best idea is what Andres
> > proposed, which pretty much matches my own thoughts: the bgwriter
> > needs to populate the free list, so that buffer allocations don't
> have
> > to wait for linear scans of the buffer array.
> 
> I was hoping this one would make it to a full six years of being on the
> TODO list before it came up again, missed it by a few weeks.  The
> funniest part is that Amit even submitted a patch on this theme a few
> months ago without much feedback:
> http://www.postgresql.org/message-
> id/6C0B27F7206C9E4CA54AE035729E9C382852FF97@szxeml509-mbs
>   That stalled where a few things have, on a) needing more regression
> test workloads, and b) wondering just what the deal with large
> shared_buffers setting degrading performance was.

For "b)", below are links where it decreased due to large shared buffers.

http://www.postgresql.org/message-id/attachment/27489/Results.htm
http://www.postgresql.org/message-id/6C0B27F7206C9E4CA54AE035729E9C38285442C5@szxeml509-mbx


As per my observation, it occurs when I/O starts. The dip could be due to
fluctuation, or it may be due to some OS scheduling, or it could be due to
eviction of dirty pages sooner than would happen otherwise.

I think the further investigation can be more meaningful if the results can
be taken by someone else other than me.

One idea to proceed along this line could be to start with this patch and then,
based on the results, do further experiments to make it more useful.

With Regards,
Amit Kapila.




Re: Page replacement algorithm in buffer cache

From
Robert Haas
Date:
On Wed, Apr 3, 2013 at 9:49 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> On 4/2/13 11:54 AM, Robert Haas wrote:
>> But, having said that, I still think the best idea is what Andres
>> proposed, which pretty much matches my own thoughts: the bgwriter
>> needs to populate the free list, so that buffer allocations don't have
>> to wait for linear scans of the buffer array.
>
> I was hoping this one would make it to a full six years of being on the TODO
> list before it came up again, missed it by a few weeks.  The funniest part
> is that Amit even submitted a patch on this theme a few months ago without
> much feedback:
> http://www.postgresql.org/message-id/6C0B27F7206C9E4CA54AE035729E9C382852FF97@szxeml509-mbs
> That stalled where a few things have, on a) needing more regression test
> workloads, and b) wondering just what the deal with large shared_buffers
> setting degrading performance was.

Those are impressive results.  I think we should seriously consider
doing something like that for 9.4.  TBH, although more workloads to
test is always better, I don't think this problem is so difficult that
we can't have some confidence in a theoretical analysis.  If I read
the original thread correctly (and I haven't looked at the patch
itself), the proposed patch would actually invalidate buffers before
putting them on the freelist.  That effectively amounts to reducing
shared_buffers, so workloads that are just on the edge of what can fit
in shared_buffers will be harmed, and those that benefit incrementally
from increased shared_buffers will be as well.

What I think we should do instead is collect the buffers that we think
are evictable and stuff them onto the freelist without invalidating
them.  When a backend allocates from the freelist, it can double-check
that the buffer still has usage_count 0.  The odds should be pretty
good.  But even if we sometimes notice that the buffer has been
touched again after being put on the freelist, we haven't expended all
that much extra effort, and that effort happened mostly in the
background.  Consider a scenario where only 10% of the buffers have
usage count 0 (which is not unrealistic).  We scan 5000 buffers and
put 500 on the freelist.  Now suppose that, due to some accident of
the workload, 75% of those buffers get touched again before they're
allocated off the freelist (which I believe to be a pessimistic
estimate for most workloads).  Now, that means that only 125 of those
500 buffers will succeed in satisfying an allocation request.  That's
still a huge win, because it means that each backend only has to examine
an average of 4 buffers before it finds one to allocate.  If it had
needed to do the freelist scan itself, it would have had to touch 40
buffers before finding one to allocate.
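
(To spell out the arithmetic under those assumptions: the 500 freelist entries
yield 125 usable buffers, so a successful allocation pops 500 / 125 = 4 entries
on average, whereas sweeping a pool in which only 125 of every 5000 buffers are
still evictable means examining about 5000 / 125 = 40 buffers per allocation.)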

In real life, I think the gains are apt to be, if anything, larger.
IME, it's common for most or all of the buffer pool to be pegged at
usage count 5.  So you could easily have a situation where the arena
scan has to visit millions of buffers to find one to allocate.  If
that's happening in the background instead of the foreground, it's a
huge win.  Also, note that there's nothing to prevent the arena scan
from happening in parallel with allocations off of the freelist - so
while foreground processes are emptying the freelist, the background
process can be looking for more things to add to it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Page replacement algorithm in buffer cache

From
Amit Kapila
Date:
On Thursday, April 04, 2013 6:12 PM Robert Haas wrote:
> On Wed, Apr 3, 2013 at 9:49 PM, Greg Smith <greg@2ndquadrant.com>
> wrote:
> > On 4/2/13 11:54 AM, Robert Haas wrote:
> >> But, having said that, I still think the best idea is what Andres
> >> proposed, which pretty much matches my own thoughts: the bgwriter
> >> needs to populate the free list, so that buffer allocations don't
> have
> >> to wait for linear scans of the buffer array.
> >
> > I was hoping this one would make it to a full six years of being on
> the TODO
> > list before it came up again, missed it by a few weeks.  The funniest
> part
> > is that Amit even submitted a patch on this theme a few months ago
> without
> > much feedback:
> > http://www.postgresql.org/message-
> id/6C0B27F7206C9E4CA54AE035729E9C382852FF97@szxeml509-mbs
> > That stalled where a few things have, on a) needing more regression
> test
> > workloads, and b) wondering just what the deal with large
> shared_buffers
> > setting degrading performance was.
> 
> Those are impressive results.  I think we should seriously consider
> doing something like that for 9.4.  TBH, although more workloads to
> test is always better, I don't think this problem is so difficult that
> we can't have some confidence in a theoretical analysis.  If I read
> the original thread correctly (and I haven't looked at the patch
> itself), the proposed patch would actually invalidate buffers before
> putting them on the freelist.  That effectively amounts to reducing
> shared_buffers, so workloads that are just on the edge of what can fit
> in shared_buffers will be harmed, and those that benefit incrementally
> from increased shared_buffers will be as well.
> 
> What I think we should do instead is collect the buffers that we think
> are evictable and stuff them onto the freelist without invalidating
> them.  When a backend allocates from the freelist, it can double-check
> that the buffer still has usage_count 0.  The odds should be pretty
> good.  But even if we sometimes notice that the buffer has been
> touched again after being put on the freelist, we haven't expended all
> that much extra effort, and that effort happened mostly in the
> background.  

If we just put it on the freelist, then the next time it gets allocated directly
from the buf hash table, who will remove it from the freelist?
Or do you think that, in BufferAlloc, if the buffer is found via the buf hash
table, it should verify whether it's on the freelist and remove it from there?

With Regards,
Amit Kapila.




Re: Page replacement algorithm in buffer cache

From
Robert Haas
Date:
On Fri, Apr 5, 2013 at 1:12 AM, Amit Kapila <amit.kapila@huawei.com> wrote:
> If we just put it to freelist, then next time if it get allocated directly
> from bufhash table, then who will remove it from freelist
> or do you think that, in BufferAlloc, if it gets from bufhash table, then it
> should verify if it's in freelist, then remove from freelist.

No, I don't think that's necessary.  We already have the following
guard in StrategyGetBuffer:
                if (buf->refcount == 0 && buf->usage_count == 0)
                {
                        if (strategy != NULL)
                                AddBufferToRing(strategy, buf);
                        return buf;
                }

If a buffer is allocated from the freelist and it turns out that it
actually has a non-zero reference count or a non-zero usage count, we
just discard it and pull the next buffer off the freelist instead.
So, in the scenario you describe, the buffer gets reallocated (due to
a non-NULL BufferAccessStrategy, presumably) and then somebody comes
along and pulls it off the freelist.  But, since the buffer has just
been used by someone else, it'll most likely be pinned or have a
non-zero usage count, so we'll just skip it and allocate some other
buffer instead.  No harm done.

Now, it is possible that the buffer could get added to the freelist,
then allocated via a BufferAccessStrategy, and then the clock sweep
could hit it and push the usage count back to 0.  But that's no big
deal either: if we go to put it on the freelist and see (via
buf->freeNext) that it's already there, we can just leave it where it
is (or maybe move it to the end).  On a related note, we probably need
a variant of StrategyFreeBuffer which pushes buffers onto the end of
the freelist rather than the front.  It makes sense to stick
invalidated buffers on the front of the list (which is what
StrategyFreeBuffer does), but non-invalidated buffers should be placed
at the end to more closely approximate LRU.
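
A toy sketch of the two insertion paths (a singly linked, freeNext-style list
with invented names; locking omitted):

#include <stddef.h>

typedef struct SketchBuf
{
    int                 buf_id;
    struct SketchBuf   *free_next;
} SketchBuf;

typedef struct SketchFreeList
{
    SketchBuf  *first;
    SketchBuf  *last;
} SketchFreeList;

/* Invalidated buffer: push on the front so it is reused as soon as possible. */
static void
sketch_free_front(SketchFreeList *fl, SketchBuf *buf)
{
    buf->free_next = fl->first;
    fl->first = buf;
    if (fl->last == NULL)
        fl->last = buf;
}

/* Still-valid buffer with usage_count 0: append, approximating LRU order. */
static void
sketch_free_end(SketchFreeList *fl, SketchBuf *buf)
{
    buf->free_next = NULL;
    if (fl->last != NULL)
        fl->last->free_next = buf;
    else
        fl->first = buf;
    fl->last = buf;
}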

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Page replacement algorithm in buffer cache

From
Amit Kapila
Date:
On Saturday, April 06, 2013 12:38 AM Robert Haas wrote:
> On Fri, Apr 5, 2013 at 1:12 AM, Amit Kapila <amit.kapila@huawei.com>
> wrote:
> > If we just put it to freelist, then next time if it get allocated
> directly
> > from bufhash table, then who will remove it from freelist
> > or do you think that, in BufferAlloc, if it gets from bufhash table,
> then it
> > should verify if it's in freelist, then remove from freelist.
> 
> No, I don't think that's necessary.  We already have the following
> guard in StrategyGetBuffer:
> 
>                 if (buf->refcount == 0 && buf->usage_count == 0)
>                 {
>                         if (strategy != NULL)
>                                 AddBufferToRing(strategy, buf);
>                         return buf;
>                 }
> 
> If a buffer is allocated from the freelist and it turns out that it
> actually has a non-zero reference count or a non-zero pin count, we
> just discard it and pull the next buffer off the freelist instead.
> So, in the scenario you describe, the buffer gets reallocated (due to
> a non-NULL BufferAccessStrategy, presumably) and then somebody comes a
> long and pulls it off the freelist.  But, since the buffer has just
> been used by someone else, it'll most likely be pinned or have a
> non-zero usage count, so we'll just skip it and allocate some other
> buffer instead.  No harm done.

Yes, you are right, I had missed that part of the code while thinking of this
scenario, but I was talking about a NULL BufferAccessStrategy as well.

I still have one more doubt. Consider the below scenario for the case where we
invalidate buffers while moving them to the freelist v/s just moving them to the freelist:
   The backend got a buffer from the freelist for a request of page-9 (number 9 is
random, just to explain), but it still has an association with another page-10.
   It needs to add the buffer with the new tag (new page association) in the bufhash
table and remove the buffer with the oldTag (old page association).

The benefit of just moving it to the freelist is that if we get a request for
the same page before somebody else uses the buffer for another page, it will
save a read I/O. However, on the other side, in many cases the backend will
need an extra partition lock to remove the oldTag (which can lead to some
bottleneck).

I think saving read I/O is more beneficial, but I am just not sure that it is
best, as the cases for it might be rare.
> Now, it is possible that the buffer could get added to the freelist,
> then allocated via a BufferAccessStrategy, and then the clock sweep
> could hit it and push the usage count back to 0.  But that's no big
> deal either: if we go to put it on the freelist and see (via
> buf->freeNext) that it's already there, we can just leave it where it
> is (or maybe move it to the end).  On a related note, we probably need
> a variant of StrategyFreeBuffer which pushes buffers onto the end of
> the freelist rather than the front.  It makes sense to stick
> invalidated buffers on the front of the list (which is what
> StrategyFreeBuffer does), but non-invalidated buffers should be placed
> at the end to more closely approximate LRU.

Okay.

Last time, the following tests were executed to validate the results:

Test suite - pgbench
DB Size - 16 GB 
RAM     - 24 GB
Shared Buffers - 2G, 5G, 7G, 10G
Concurrency - 8, 16, 32, 64 clients
Pre-warm the buffers before start of test

Shall we try any other scenarios, or are the above okay for an initial test of
the patch?


With Regards,
Amit Kapila.




Re: Page replacement algorithm in buffer cache

From
Robert Haas
Date:
On Fri, Apr 5, 2013 at 11:08 PM, Amit Kapila <amit.kapila@huawei.com> wrote:
> I still have one more doubt, consider the below scenario for cases when we
> Invalidate buffers during moving to freelist v/s just move to freelist
>
>    Backend got the buffer from freelist for a request of page-9 (number 9 is
> random, just to explain), it still have association with another page-10
>    It needs to add the buffer with new tag (new page association) in bufhash
> table and remove the buffer with oldTag (old page association).
>
> The benefit for just moving to freelist is that if we get request of same
> page until somebody else used it for another page, it will save read I/O.
> However on the other side for many cases
> Backend will need extra partition lock to remove oldTag (which can lead to
> some bottleneck).
>
> I think saving read I/O is more beneficial but just not sure if that is best
> as cases might be less for it.

I think saving read I/O is a lot more beneficial.  I haven't seen
evidence of a severe bottleneck updating the buffer mapping tables.  I
have seen some evidence of spinlock-level contention on read workloads
that fit in shared buffers, because in that case the system can run
fast enough for the spinlocks protecting the lwlocks to get pretty
hot.  But if you're doing writes, or if the workload doesn't fit in
shared buffers, other bottlenecks slow you down enough that this
doesn't really seem to become much of an issue.

Also, even if you *can* find some scenario where pushing the buffer
invalidation into the background is a win, I'm not convinced that
would justify doing it, because the case where it's a huge loss -
namely, working set just a tiny bit smaller than shared_buffers - is
pretty obvious. I don't think we dare fool around with that; the
townspeople will arrive with pitchforks.

I believe that the big win here is getting the clock sweep out of the
foreground so that BufFreelistLock doesn't catch fire.  The buffer
mapping locks are partitioned and, while it's not like that completely
gets rid of the contention, it sure does help a lot.  So I would view
that goal as primary, at least for now.  If we get a first round of
optimization done in this area, that doesn't preclude further
improving it in the future.

> Last time following tests have been executed to validate the results:
>
> Test suite - pgbench
> DB Size - 16 GB
> RAM     - 24 GB
> Shared Buffers - 2G, 5G, 7G, 10G
> Concurrency - 8, 16, 32, 64 clients
> Pre-warm the buffers before start of test
>
> Shall we try for any other scenario's or for initial test of patch above are
> okay.

Seems like a reasonable place to start.

...Robert



Re: Page replacement algorithm in buffer cache

From
Amit Kapila
Date:

> -----Original Message-----
> From: Robert Haas [mailto:robertmhaas@gmail.com]
> Sent: Tuesday, April 09, 2013 9:28 AM
> To: Amit Kapila
> Cc: Greg Smith; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Page replacement algorithm in buffer cache
> 
> On Fri, Apr 5, 2013 at 11:08 PM, Amit Kapila <amit.kapila@huawei.com>
> wrote:
> > I still have one more doubt, consider the below scenario for cases
> when we
> > Invalidate buffers during moving to freelist v/s just move to
> freelist
> >
> >    Backend got the buffer from freelist for a request of page-9
> (number 9 is
> > random, just to explain), it still have association with another
> page-10
> >    It needs to add the buffer with new tag (new page association) in
> bufhash
> > table and remove the buffer with oldTag (old page association).
> >
> > The benefit for just moving to freelist is that if we get request of
> same
> > page until somebody else used it for another page, it will save read
> I/O.
> > However on the other side for many cases
> > Backend will need extra partition lock to remove oldTag (which can
> lead to
> > some bottleneck).
> >
> > I think saving read I/O is more beneficial but just not sure if that
> is best
> > as cases might be less for it.
> 
> I think saving read I/O is a lot more beneficial.  I haven't seen
> evidence of a severe bottleneck updating the buffer mapping tables.  I
> have seen some evidence of spinlock-level contention on read workloads
> that fit in shared buffers, because in that case the system can run
> fast enough for the spinlocks protecting the lwlocks to get pretty
> hot.  But if you're doing writes, or if the workload doesn't fit in
> shared buffers, other bottlenecks slow you down enough that this
> doesn't really seem to become much of an issue.
> 
> Also, even if you *can* find some scenario where pushing the buffer
> invalidation into the background is a win, I'm not convinced that
> would justify doing it, because the case where it's a huge loss -
> namely, working set just a tiny bit smaller than shared_buffers - is
> pretty obvious. I don't think we dare fool around with that; the
> townspeople will arrive with pitchforks.
> 
> I believe that the big win here is getting the clock sweep out of the
> foreground so that BufFreelistLock doesn't catch fire.  The buffer
> mapping locks are partitioned and, while it's not like that completely
> gets rid of the contention, it sure does help a lot.  So I would view
> that goal as primary, at least for now.  If we get a first round of
> optimization done in this area, that doesn't preclude further
> improving it in the future.

I agree with you that this can be a first step towards improvement.

> > Last time following tests have been executed to validate the results:
> >
> > Test suite - pgbench
> > DB Size - 16 GB
> > RAM     - 24 GB
> > Shared Buffers - 2G, 5G, 7G, 10G
> > Concurrency - 8, 16, 32, 64 clients
> > Pre-warm the buffers before start of test
> >
> > Shall we try for any other scenario's or for initial test of patch
> above are
> > okay.
> 
> Seems like a reasonable place to start.

I shall work on this for first CF of 9.4.


With Regards,
Amit Kapila.