Thread: Readme of Buffer Management seems to have wrong sentence
While going through the README in backend\storage\buffer, I found a point misleading.

    Normal Buffer Replacement Strategy
    ----------------------------------
    ...
    Each buffer header contains a usage counter, which is incremented (up to a
    small limit value) whenever the buffer is unpinned. (This requires only the
    buffer header spinlock, which would have to be taken anyway to decrement the
    buffer reference count, so it's nearly free.)

I have checked the code, and the logic is actually that the usage counter is increased when the buffer is pinned.

Another doubt: why, in function BufferAlloc, does it need to hold the BufFreelistLock until it pins the buffer (which increases its reference count)? As what I understood, BufFreelistLock is to protect the StrategyControl structure, and till it finds
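For readers following along, the behavior Amit's report points at can be sketched in a few lines (a simplified model, not the actual PostgreSQL structures: only BM_MAX_USAGE_COUNT is a real constant name, and the real code does this under the buffer-header spinlock):

```c
#include <assert.h>

#define BM_MAX_USAGE_COUNT 5  /* PostgreSQL's cap on the usage counter */

typedef struct BufferDesc {
    int refcount;     /* number of concurrent pins */
    int usage_count;  /* clock-sweep popularity counter */
} BufferDesc;

/* Pinning a buffer bumps both the reference count and, contrary to the
 * old README wording, the usage count (capped at BM_MAX_USAGE_COUNT). */
static void pin_buffer(BufferDesc *buf)
{
    buf->refcount++;
    if (buf->usage_count < BM_MAX_USAGE_COUNT)
        buf->usage_count++;
}

/* Unpinning only drops the reference count. */
static void unpin_buffer(BufferDesc *buf)
{
    assert(buf->refcount > 0);
    buf->refcount--;
}
```

So the README's "incremented whenever the buffer is unpinned" was describing the pin side of the operation.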
On Tue, May 8, 2012 at 9:37 PM, Amit Kapila <amit.kapila@huawei.com> wrote: > I have checked the code and logic according to which usage counter is > increased when the buffer is pinned. Fixed, thanks for the report. > Another Doubt : Why in function BufferAlloc, it needs to hold the > BufFreelistLock till it pin the buffer which increases its reference count. Well, I think the problem is that, if we didn't do that, then, in theory, the strategy point could wrap all the way around shared_buffers and someone else could pin the buffer, and then we'd be hosed. Mind you, I think this whole area of the code needs some reengineering for better performance, but I'm not sure this is the right place to start. What I think is really bad is that we're forcing every BufferAlloc() to iterate over buffers checking whether each one is evictable. I think we ought to put only those buffers that we think are likely to be evictable on the freelist, and then the actual buffer eviction code would only need to recheck that nothing had changed, instead of needing to scan over a potentially quite large number of buffers that never had any chance of being selected in the first place. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
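The per-allocation work Robert is criticizing is essentially the GCLOCK victim-search loop. A stripped-down sketch (hypothetical names; the real code also handles locking, dirty buffers, and buffer-mapping updates, all omitted here) assuming at least one unpinned buffer exists:

```c
#include <assert.h>

#define NBUFFERS 8

typedef struct Buf {
    int refcount;     /* pinned if > 0 */
    int usage_count;
} Buf;

static Buf buffers[NBUFFERS];
static int next_victim = 0;   /* models StrategyControl->nextVictimBuffer */

/* Every allocation may sweep over many buffers, decrementing usage
 * counts as it goes, until it finds one that is unpinned and has a
 * zero usage count.  This scan is the work being criticized above. */
static int clock_sweep_victim(void)
{
    for (;;)
    {
        Buf *buf = &buffers[next_victim];
        next_victim = (next_victim + 1) % NBUFFERS;

        if (buf->refcount == 0)
        {
            if (buf->usage_count == 0)
                return (int)(buf - buffers);   /* evictable: take it */
            buf->usage_count--;                /* popular: spare it once */
        }
    }
}
```

Robert's proposal amounts to moving this loop's scanning work out of the foreground path, leaving only a cheap recheck.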
Robert Haas <robertmhaas@gmail.com> writes: > Mind you, I think this whole area of the code needs some reengineering > for better performance, but I'm not sure this is the right place to > start. What I think is really bad is that we're forcing every > BufferAlloc() to iterate over buffers checking whether each one is > evictable. Well, keep in mind that that action is not merely there to obtain a victim buffer; it is also maintaining the global LRU state (by decrementing the usage counts of buffers it passes over). I don't think you can change it to simply look only at a predetermined freelist without seriously compromising the overall quality of our buffer replacement decisions. regards, tom lane
On Tue, May 22, 2012 at 10:01 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> Mind you, I think this whole area of the code needs some reengineering >> for better performance, but I'm not sure this is the right place to >> start. What I think is really bad is that we're forcing every >> BufferAlloc() to iterate over buffers checking whether each one is >> evictable. > > Well, keep in mind that that action is not merely there to obtain a > victim buffer; it is also maintaining the global LRU state (by > decrementing the usage counts of buffers it passes over). I don't think > you can change it to simply look only at a predetermined freelist > without seriously compromising the overall quality of our buffer > replacement decisions. The idea would be to have a background process (like bgwriter) maintain the global LRU state and push candidate buffers onto the freelist. Then foreground processes can just pop them off the list and recheck that they haven't been pinned meanwhile. As long as we don't let the background sweep get too far ahead of actual allocation needs, I don't think this would change the quality of buffer allocation much at all. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Tue, May 22, 2012 at 10:01 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Well, keep in mind that that action is not merely there to obtain a >> victim buffer; it is also maintaining the global LRU state (by >> decrementing the usage counts of buffers it passes over). I don't think >> you can change it to simply look only at a predetermined freelist >> without seriously compromising the overall quality of our buffer >> replacement decisions. > The idea would be to have a background process (like bgwriter) > maintain the global LRU state and push candidate buffers onto the > freelist. Amit was trying to convince me of the same idea at PGCon, but I don't buy it. bgwriter doesn't scan the buffer array nearly fast enough to provide useful adjustment of the usage counts under load. And besides if the decrements are decoupled from the allocation requests it's no longer obvious that the algorithm is even an approximation of LRU. But the larger issue here is that if that processing is a bottleneck (which I agree it is), how does it help to force a single process to be responsible for it? Any real improvement in scalability here will need to decentralize the operation more, not less. My own thoughts about this had pointed in the direction of getting rid of the central freelist entirely, instead letting each backend run its own independent clock sweep as needed. The main problem with that is that if there's no longer any globally-visible clock sweep state, it's pretty hard to figure out what the control logic for the bgwriter should look like. Maybe it would be all right to have global variables that are just statistics counters for allocations and buffers swept over, which backends would need to spinlock for just long enough to increment the counters at the end of each buffer allocation. regards, tom lane
On Tue, May 22, 2012 at 10:25 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> The idea would be to have a background process (like bgwriter) >> maintain the global LRU state and push candidate buffers onto the >> freelist. > > Amit was trying to convince me of the same idea at PGCon, but I don't > buy it. bgwriter doesn't scan the buffer array nearly fast enough to > provide useful adjustment of the usage counts under load. And besides > if the decrements are decoupled from the allocation requests it's no > longer obvious that the algorithm is even an approximation of LRU. Well, bgwriter is *supposed* to anticipate which buffers are about to be allocated and clean any of those that are dirty. Having it decrement the usage counts and stuff the resulting list of buffers into a linked list seems like a pretty reasonable extension of that, assuming that it works in the first place. If it doesn't, then we need a rethink. > But the larger issue here is that if that processing is a bottleneck > (which I agree it is), how does it help to force a single process to > be responsible for it? Any real improvement in scalability here will > need to decentralize the operation more, not less. Sure. I think we could have the freelist and the clock sweep protected by different locks. The background writer would lock out other people running the clock sweep, but the freelist could be protected by a spinlock which no one would ever need to take for more than a few cycles. Right there, you should get a significant scalability improvement, since the critical section would be so much shorter than it is now. If that's not enough, you could have several freelists protected by different spinlocks; the bgwriter would put 1/Nth of the reusable buffers on each freelist, and backends would pick a freelist at random to pull buffers off of. 
> My own thoughts about this had pointed in the direction of getting rid > of the central freelist entirely, instead letting each backend run its > own independent clock sweep as needed. The main problem with that is > that if there's no longer any globally-visible clock sweep state, it's > pretty hard to figure out what the control logic for the bgwriter should > look like. Maybe it would be all right to have global variables that > are just statistics counters for allocations and buffers swept over, > which backends would need to spinlock for just long enough to increment > the counters at the end of each buffer allocation. Hmm, that's certainly an interesting idea. I fear that if the clock sweeps from the different backends ended up too closely synchronized, you would end up evicting whatever was in the way, be it hot or cold. It might almost be better to have individual backends choose buffers to evict at random; if the chosen buffer isn't evictable, we decrement its usage count and pick another one, also at random. With respect to the control logic for the background writer, one idea I had was to get rid of the idea that the background writer's job is to write in advance of the strategy point. Instead, every time the clock sweep passes over a dirty buffer that is otherwise evictable, we add it to a queue of things that the bgwriter should clean. Those buffers, once cleaned, go on the free list. Maybe some variant of that could work with your idea. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
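The partitioned-freelist variant Robert floats above could be sketched like this (entirely hypothetical structures and names; real PostgreSQL freelists are threaded through the buffer headers, and each partition would be guarded by its own spinlock, which is omitted here):

```c
#include <assert.h>
#include <stdlib.h>

#define NUM_FREELISTS 4
#define LIST_CAPACITY 64

/* Each partition holds its own stack of reusable buffer ids; with one
 * spinlock per partition, foreground backends would contend roughly
 * 1/Nth as often as with a single list. */
typedef struct Freelist {
    int ids[LIST_CAPACITY];
    int top;
} Freelist;

static Freelist freelists[NUM_FREELISTS];

/* bgwriter side: spread reusable buffers across the partitions */
static void freelist_push(int buf_id)
{
    Freelist *fl = &freelists[buf_id % NUM_FREELISTS];
    assert(fl->top < LIST_CAPACITY);
    fl->ids[fl->top++] = buf_id;
}

/* backend side: pick a partition at random; fall back to scanning the
 * others if the chosen one is empty */
static int freelist_pop(void)
{
    int start = rand() % NUM_FREELISTS;
    for (int i = 0; i < NUM_FREELISTS; i++)
    {
        Freelist *fl = &freelists[(start + i) % NUM_FREELISTS];
        if (fl->top > 0)
            return fl->ids[--fl->top];
    }
    return -1;   /* all partitions empty: caller must run the clock sweep */
}
```

A popped buffer would still need the recheck Robert mentions (that nobody pinned it in the meantime) before it is actually evicted.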
Robert Haas <robertmhaas@gmail.com> writes: > On Tue, May 22, 2012 at 10:25 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> My own thoughts about this had pointed in the direction of getting rid >> of the central freelist entirely, instead letting each backend run its >> own independent clock sweep as needed. > Hmm, that's certainly an interesting idea. I fear that if the clock > sweeps from the different backends ended up too closely synchronized, > you would end up evicting whatever was in the way, be it hot or cold. > It might almost be better to have individual backends choose buffers > to evict at random; if the chosen buffer isn't evictable, we decrement > its usage count and pick another one, also at random. Hmm ... yeah, in principle that could be better. It's still a clock sweep but following a hard-to-predict ordering of the buffers. Being hard to predict doesn't necessarily guarantee no bad behavior though. Maybe we could do something a bit cheaper than random() and more amenable to analysis, that would still ensure that backends couldn't end up with closely sync'd scans. I'm imagining advancing not 1 buffer each time, but K buffers where each backend will use a different value of K. Perhaps backend with backendID N could use the N'th prime number, say. You'd want something relatively prime to shared_buffers to guarantee that the backend can visit all the buffers, and just making it prime would do fine. > With respect to the control logic for the background writer, one idea > I had was to get rid of the idea that the background writer's job is > to write in advance of the strategy point. Instead, every time the > clock sweep passes over a dirty buffer that is otherwise evictable, we > add it to a queue of things that the bgwriter should clean. Those > buffers, once cleaned, go on the free list. Maybe some variant of > that could work with your idea. 
Both ends of that imply centralized data structures that could become contention bottlenecks, so it doesn't sound tremendously appealing to me. Maybe it can work as long as nobody has to lock the lists for long, but I'd rather think of approaches with no central point of contention. regards, tom lane
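Tom's prime-stride idea above rests on the stride being relatively prime to shared_buffers, so that each backend's sweep still visits every buffer; a small sketch with hypothetical helper names (the real scheme would presumably pick the N'th prime for backend N, as he suggests):

```c
#include <assert.h>

static const int primes[] = {2, 3, 5, 7, 11, 13, 17, 19};

static int gcd(int a, int b) { while (b) { int t = a % b; a = b; b = t; } return a; }

/* Give each backend a different stride, skipping any prime that
 * happens to divide nbuffers (e.g. 2 when nbuffers is even). */
static int sweep_stride(int backend_id, int nbuffers)
{
    for (int i = 0; i < 8; i++) {
        int k = primes[(backend_id + i) % 8];
        if (gcd(k, nbuffers) == 1)
            return k;
    }
    return 1;   /* fall back to the plain sweep */
}

/* How many distinct buffers does a stride visit in nbuffers steps?
 * A coprime stride covers all of them exactly once per rotation. */
static int coverage(int stride, int nbuffers)
{
    int visited[1024] = {0}, pos = 0, count = 0;
    assert(nbuffers <= 1024);
    for (int i = 0; i < nbuffers; i++) {
        if (!visited[pos]) { visited[pos] = 1; count++; }
        pos = (pos + stride) % nbuffers;
    }
    return count;
}
```

Since different backends advance by different amounts from wherever they start, their sweeps drift apart rather than marching in lockstep.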
On Tue, May 22, 2012 at 12:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> With respect to the control logic for the background writer, one idea >> I had was to get rid of the idea that the background writer's job is >> to write in advance of the strategy point. Instead, every time the >> clock sweep passes over a dirty buffer that is otherwise evictable, we >> add it to a queue of things that the bgwriter should clean. Those >> buffers, once cleaned, go on the free list. Maybe some variant of >> that could work with your idea. > > Both ends of that imply centralized data structures that could become > contention bottlenecks, so it doesn't sound tremendously appealing to > me. Maybe it can work as long as nobody has to lock the lists for long, > but I'd rather think of approaches with no central point of contention. If we're going to throw our current algorithm over wholesale, I'd rather use some approach that has been demonstrated to work well in other systems. Buffer eviction is a problem that's been around since the 1970s, and our algorithm is just about that old. I realize that there are (legitimate) concerns about what might be patented, but hopefully that doesn't mean we're not allowed to consult the literature in any way. If that were the case, we wouldn't have SSI. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > If we're going to throw our current algorithm over wholesale, I'd > rather use some approach that has been demonstrated to work well in > other systems. Buffer eviction is a problem that's been around since > the 1970s, and our algorithm is just about that old. Um, if you're claiming that that code dates from Berkeley days, you're quite mistaken. We adopted the clock sweep in 2005, after trying a few other things whose performance was worse. I've not seen any argument in this thread that suggests we should abandon clock sweep + usage counts entirely. Rather, to me the issue is that we haven't completely gotten rid of the last vestiges of the old global freelist approach. BTW, it might be worth pointing out something I was trying to explain to Amit at PGCon: the key reason that we went with usage counters rather than something like a global LRU chain is that in the fast path where ReadBuffer finds the requested block already in a buffer, it does not have to contend for any global data structure to update the buffer's usage information. It just has to bump the usage count in the buffer's header. The free list, and the contention for BufFreelistLock, only matter in the slow path where you're going to have to incur I/O anyway (or at least a visit to the kernel). That seemed plenty good enough in 2005. Our ambitions have now advanced further, so I'm on board with trying to reduce contention here too, but I think it would be a mistake to make the fast case any slower. regards, tom lane
On Tue, May 22, 2012 at 12:39 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> If we're going to throw our current algorithm over wholesale, I'd >> rather use some approach that has been demonstrated to work well in >> other systems. Buffer eviction is a problem that's been around since >> the 1970s, and our algorithm is just about that old. > > Um, if you're claiming that that code dates from Berkeley days, you're > quite mistaken. Not the code: the algorithm. I believe that what we have implemented is GCLOCK, which is very old. In the interest of full disclosure, descriptions of exactly what GCLOCK is seem to vary a bit from paper to paper, but at least some papers describe the exact algorithm we use. > We adopted the clock sweep in 2005, after trying a few > other things whose performance was worse. I've not seen any argument in > this thread that suggests we should abandon clock sweep + usage counts > entirely. Rather, to me the issue is that we haven't completely gotten > rid of the last vestiges of the old global freelist approach. Well, I think that switching from one clock sweep to a clock sweep per backend would be basically an abandonment of the current approach. The results might be better or worse, but they'd surely be different. > BTW, it might be worth pointing out something I was trying to explain > to Amit at PGCon: the key reason that we went with usage counters rather > than something like a global LRU chain is that in the fast path where > ReadBuffer finds the requested block already in a buffer, it does not > have to contend for any global data structure to update the buffer's > usage information. It just has to bump the usage count in the buffer's > header. The free list, and the contention for BufFreelistLock, only > matter in the slow path where you're going to have to incur I/O anyway > (or at least a visit to the kernel). That seemed plenty good enough > in 2005. 
> Our ambitions have now advanced further, so I'm on board with > trying to reduce contention here too, but I think it would be a mistake > to make the fast case any slower. Totally agreed. We're not the first people to think of this, either: CLOCK and GCLOCK have been extensively studied and found to be almost as good as LRU in selecting good victim pages, but with less contention. That's why people are using them. Here's a paper that defines GCLOCK to be the algorithm we use (page 2, second column, second paragraph from the bottom), and furthermore mentions PostgreSQL (top of page 3): http://staff.aist.go.jp/m.yui/publications/ICDE10_conf_full_409.pdf -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, May 22, 2012 at 8:28 PM, Robert Haas <robertmhaas@gmail.com> wrote: > Totally agreed. We're not the first people to think of this, either: > CLOCK and GCLOCK have been extensively studied and found to be almost > as good as LRU in selecting good victim pages, but with less > contention. That's why people are using them. > > Here's a paper that defines GCLOCK to be the algorithm we use (page 2, > second column, second paragraph from the bottom), and furthermore > mentions PostgreSQL (top of page 3): > > http://staff.aist.go.jp/m.yui/publications/ICDE10_conf_full_409.pdf Interesting paper, they make the whole buffer management lock-free. Getting rid of not only the BufFreelistLock but also the BufMappingLocks and buffer header spinlocks. Now AFAIK the BufMappingLocks and buffer header spinlocks aren't really a big issue for us, and at least the latter looks like turning it lock-free would entail lots of pretty hairy and bug-prone code. On the other hand, making the clock sweep lock-free looks relatively easy. As far as I can see there is no reason why nextVictimBuffer couldn't be read, incremented and atomically cmpxchg'ed to let others continue before checking for the usage count. Likewise completePasses and numBufferAllocs can easily be atomically incremented; bgwriter shouldn't mind if the values are slightly out of date. The free list itself is a bit trickier, but if it's still necessary/useful then SC->firstFreeBuffer and buf->freeNext are in effect a linked-list stack, and there should be plenty of tested lock-free algorithms floating around for that. (btw. lastFreeBuffer looks like dead code, is that correct?) As for better buffer management algorithms, have you read about CLOCK-Pro? [1] It looks like it's an adaptive variant of LIRS, the algorithm MySQL uses (or at least used some time ago). Looks like Linux kernel devs also at least thought about implementing it [2] [3] (hard to tell exactly, their docs are pretty chaotic compared to pg). 
According to [4], LIRS is almost as good as or better than LRU, by extension I'd expect CLOCK-Pro to be better than GLOCK. It still has a single clock mechanism, so better replacement policies won't help a whole lot with lock contention on buffer allocation. [1] http://www.cse.ohio-state.edu/~fchen/paper/papers/usenix05.pdf [2] http://linux-mm.org/ClockProApproximation [3] http://linux-mm.org/PageReplacementDesign [4] http://www.almaden.ibm.com/cs/people/dmodha/ARC.pdf Cheers, Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de
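The lock-free advance of nextVictimBuffer that Ants sketches in words could look roughly like this with C11 atomics (a toy model, not the actual StrategyControl code; it also simplifies his read/cmpxchg loop into an unconditional fetch-add on a free-running counter, from which a completePasses count could be derived as ticket / NBUFFERS):

```c
#include <assert.h>
#include <stdatomic.h>

#define NBUFFERS 128

/* Free-running sweep position; each claimant gets a unique ticket with
 * a single atomic instruction and maps it into the buffer array by
 * modulo, so concurrent sweepers never block one another. */
static atomic_uint next_victim = 0;

static unsigned claim_next_victim(void)
{
    unsigned ticket = atomic_fetch_add(&next_victim, 1);
    return ticket % NBUFFERS;
}
```

The buffer the claimant lands on may still be pinned or popular, so the caller would inspect it and, if it is unusable, simply claim the next ticket.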
>> And besides if the decrements are decoupled from the allocation requests it's no
>> longer obvious that the algorithm is even an approximation of LRU.

I was trying to highlight that we can do the clock sweep in bgwriter and keep the backends' logic as it is currently. The core idea is that it will reduce the work done by backends and improve their chances of getting a free buffer sooner than they do currently.

Some of the other ideas about it which I have discussed are:
1. Moving clean buffers to the freelist when any are found by bgwriter or checkpoint. This is to let backends get buffers early, without going into the clock sweep algorithm.
2. Having multiple lists of hot and cold buffers; backends will look for a buffer in the following order, if the required buffer is not already there:
   a. freelist
   b. cold list
   c. hot list
3. Experimenting with different values of the usage count under heavy-load scenarios, to check whether it makes any difference.

I will try to prototype these ideas and publish the results here, so that we can check whether they give any benefit. There are some other ideas in this thread which I shall also check, like:
1. Clock sweeps from the different backends ending up too closely synchronized. If this really causes problems, I will try to address it.
2. Not having the same lock for the whole algorithm for finding a free buffer.

I would like to build a prototype as follows:
1. Prepare or identify an existing test case where these problems would be more evident.
2. Change the code to implement the ideas in this mail thread and what I think can improve things.

Do you feel I can attempt to address this problem with some prototypes and discuss here after a few days when I have some results ready?
On 05/23/2012 11:36 AM, Amit Kapila wrote: > Do you feel I can attempt to address this problem with some prototypes and > discuss here after few days when I have some results ready. I don't think there is a clear picture yet of what benchmark to use for testing changes here. Items like "Consider adding buffers the background writer finds reusable to the free list" have been on the TODO list since 2007; neither ideas nor code are the problem here. Proving that a) a new policy helps on some workloads and b) it doesn't harm any important workload, those are the hard parts here. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
>> I don't think there is a clear picture yet of what benchmark to use for testing changes here.

I will first try to generate such a scenario (benchmark). I have still not thought it through completely. However, the idea in my mind is a scenario where the buffer list is heavily operated upon: operations where shared buffers are much smaller compared to the data on disk, and the operations are distributed such that they require accessing most of the data on disk randomly.

>> Proving that a) a new policy helps on some workloads

I thought that to prove this, I should write proof-of-concept code and then test it with appropriate tests, or do you think it needs to be proved some other way?

>> and b) it doesn't harm any important workload, those are the hard parts here

Do you have something in mind which needs to be taken care of or thought about before attempting the idea?
On Wed, May 23, 2012 at 10:33 AM, Amit Kapila <amit.kapila@huawei.com> wrote: >>>I don't think there is a clear picture yet of what benchmark to use for > testing changes here. > I will first try to generate such a scenario(benchmark). I have still not > thought completely. > However the idea in my mind is that scenario where buffer list is heavily > operated upon. > Operations where shared buffers are much less compare to the data in disk > and the operations are distributed such that > they require to access most of the data in disk randomly. If most buffer reads actually have to read from disk, then that will so throttle your throughput that you will not be able to make anything else be relevant. You need to have shared_buffers be much smaller than RAM, and have almost all the "disk" data resident in RAM but not in shared_buffers. Cheers, Jeff
On Wed, May 23, 2012 at 8:36 AM, Amit Kapila <amit.kapila@huawei.com> wrote: >>>And besides >>>if the decrements are decoupled from the allocation requests it's no >>>longer obvious that the algorithm is even an approximation of LRU. > > I was trying to highlight that we can do the clocksweep in bgwriter and keep > the backends logic as it is currently. > The core idea is that it will reduce the work of backends and chances of > them to get the free buffer early than currently will be more. Do we have any evidence that this is actually a problem which needs to be solved? I've seen plenty of evidence that contention for the lock can be a problem, but no evidence that the amount of work done under the lock, once it is obtained, is a problem. > Some of the other ideas about it which I have discussed are > 1. moving clean buffers to freelist when any found by bgwriter or > checkpoint. This is to get the buffers early by backends without going into > clock sweep algorithm. You would need to lock once to put the buffer on, and again to take it off. That increases the traffic through a contended lock, rather than decreasing it. So unless this is coupled with reducing the lock strength, I don't see how it can win. > > 2. having multiple lists of hot and cold buffers and backends will find the > buffers in following order if the required buffer is not already there > a. freelist > b. cold list > c. hot list > > 3. try to experiment with different values of usage count under heavy load > scenarios and check does it make any difference. One thing I wanted to play with is having newly read buffers get a usage count of 0 rather than 1. The problem is that there is no way to test it in enough different situations to convince people it would be a general improvement. Cheers, Jeff
>> You need to have shared_buffers be much smaller than RAM, and have almost all the "disk" data resident in RAM but not in shared_buffers.

Sure, this is a better way to generate heavy buffer activity.
Jeff Janes <jeff.janes@gmail.com> writes: > One thing I wanted to play with is having newly read buffers get a > usage count of 0 rather than 1. The problem is that there is no way > to test it in enough different situations to convince people it would > be a general improvement. Hmm ... ISTM that that was discussed back when we instituted buffer usage counts, and rejected on the grounds that a newly-read buffer could then have negligible life expectancy. The clock sweep might be just about to pass over it. By starting at 1, it's guaranteed to have at least 1 sweep cycle time in which it might accumulate more hits. In other words, we have a choice of whether a buffer's initial lifetime is between 0 and 1 sweep times, or between 1 and 2 sweep times; and the discrimination against an unlucky buffer position is infinite in the first case versus at most 2X in the second case. regards, tom lane
>> Do we have any evidence that this is actually a problem which needs to be solved?

I don't have very clear evidence; I haven't generated any profiling report yet. However, in the scenario where there are many buffers in the list and a backend has to traverse it, there should be a reasonable amount of work done under the lock.

>> You would need to lock once to put the buffer on, and again to take it off. That increases the traffic through a >> contended lock, rather than decreasing it.

But it shall reduce the time backends spend traversing the buffer list, as in some cases they will get a buffer directly from the freelist. I understand that some new contention will be created, but I wanted to see whether the contention removed can outweigh the new contention that gets created.

>> The problem is that there is no way to test it in enough different situations to convince people it would be a general >> improvement.

I think in this whole idea the main point was to have scenarios which can prove all the points are a win. This is the same point raised by Greg as well. My first thought was that generating heavy contention on shared buffers would show it up, but I still need to think more on it. Any more ideas and suggestions to generate the scenarios?
Do we have any evidence that this is actually a problem which needs to be solved? I've seen plenty of evidence that contention for the lock can be a problem, but no evidence that the amount of work done under the lock, once it is obtained, is a problem. > Some of the other ideas about it which I have discussed are > 1. moving clean buffers to freelist when any found by bgwriter or > checkpoint. This is to get the buffers early by backends without going into > clock sweep algorithm. You would need to lock once to put the buffer on, and again to take it off. That increases the traffic through a contended lock, rather than decreasing it. So unless this is coupled with reducing the lock strength, I don't see how it can win. > > 2. having multiple lists of hot and cold buffers and backends will find the > buffers in following order if the required buffer is not already there > a. freelist > b. cold list > c. hot list > > 3. try to experiment with different values of usage count under heavy load > scenarios and check does it make any difference. One thing I wanted to play with is having newly read buffers get a usage count of 0 rather than 1. The problem is that there is no way to test it in enough different situations to convince people it would be a general improvement. Cheers, Jeff
On Wed, May 23, 2012 at 11:40 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Jeff Janes <jeff.janes@gmail.com> writes:
>> One thing I wanted to play with is having newly read buffers get a
>> usage count of 0 rather than 1.  The problem is that there is no way
>> to test it in enough different situations to convince people it would
>> be a general improvement.
>
> Hmm ... ISTM that that was discussed back when we instituted buffer
> usage counts, and rejected on the grounds that a newly-read buffer could
> then have negligible life expectancy.  The clock sweep might be just
> about to pass over it.

I guess that could be the case if the buffer just came off of the
linked list, but that situation is very rare in general use (and the
sweep hand wouldn't be moving at all until the linked list is empty).
If it were allocated through the clock, then the sweep hand should
have just passed over it and so is unlikely to be just about to pass
over it again.

If the clock sweep is moving so fast that it makes nearly a complete
rotation in the time it takes to read a buffer from disk, then I think
the system is probably beyond redemption.  But I guess that that is
something to be tested for, if I can engineer a test.

> By starting at 1, it's guaranteed to have at
> least 1 sweep cycle time in which it might accumulate more hits.
>
> In other words, we have a choice of whether a buffer's initial lifetime
> is between 0 and 1 sweep times, or between 1 and 2 sweep times; and the
> discrimination against an unlucky buffer position is infinite in the
> first case versus at most 2X in the second case.

But the cost of preventing an occasional buffer from being unlucky is
that the length of the average clock sweep is almost doubled, and thus
it is also easier for hot-ish buffers to get accidentally evicted.
This last part could perhaps be ameliorated by having the usage
incremented by 2 rather than 1 each time a buffer hits.

Also, if the just-read buffer does get unlucky, either it won't be
needed again soon anyway and so it deserves to be unlucky, or the next
time it is needed it will probably still be in RAM, will be read in
quickly before the clock hand has moved much, and will have plenty of
time to accumulate new hits the next time around.

Cheers,

Jeff
On Wed, May 23, 2012 at 2:09 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Wed, May 23, 2012 at 10:33 AM, Amit Kapila <amit.kapila@huawei.com> wrote:
>>> I don't think there is a clear picture yet of what benchmark to use for
>>> testing changes here.
>> I will first try to generate such a scenario (benchmark). I have still not
>> thought completely.
>> However the idea in my mind is that scenario where buffer list is heavily
>> operated upon.
>> Operations where shared buffers are much less compare to the data in disk
>> and the operations are distributed such that
>> they require to access most of the data in disk randomly.
>
> If most buffer reads actually have to read from disk, then that will
> so throttle your throughput that you will not be able to make anything
> else be relevant.  You need to have shared_buffers be much smaller
> than RAM, and have almost all the "disk" data resident in RAM but not
> in shared_buffers.

But this is pretty common, since we advise people to set
shared_buffers relatively low as compared to physical memory.  The
problem is visible on the graph I posted here:

http://rhaas.blogspot.com/2012/03/performance-and-scalability-on-ibm.html

When the scale factor gets large enough to exceed shared_buffers,
performance peaks in the 36-44 client range.  When it's small enough
to fit in shared_buffers, performance continues to increase through 64
clients and even a bit beyond.  Note that the absolute *performance*
is not much worse with the larger scale factor, if you have only one
client.  It's the *scalability* that goes out the window.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Jeff Janes <jeff.janes@gmail.com> writes:
> On Wed, May 23, 2012 at 11:40 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Hmm ... ISTM that that was discussed back when we instituted buffer
>> usage counts, and rejected on the grounds that a newly-read buffer could
>> then have negligible life expectancy.  The clock sweep might be just
>> about to pass over it.

> I guess that could be the case if the buffer just came off of the
> linked list, but that situation is very rare in general use (and the
> sweep hand wouldn't be moving at all until the linked list is empty).
> If it were allocated through the clock, then the sweep hand should
> have just passed over it and so is unlikely to be just about to pass
> over it again.

Hmm ... that's true as long as we have one central clock sweep.  With
the nearby discussion of per-backend sweeps I'm less sure that I believe
it.  Still, you might be right that it's worth experimenting with to see
if the average clock sweep distance can be shortened.

So, back to the discussion of what would make acceptable test case(s)
for this.  You might look back at the 2005 discussions to see what we
were using at the time of the last major rejiggering in this area.  I
concur with your point that we should look at cases where the bulk of
the database fits in kernel disk buffers but not shared_buffers.

			regards, tom lane
On Tue, May 22, 2012 at 11:36 PM, Ants Aasma <ants@cybertec.at> wrote:
> ... The free list
> itself is a bit trickier, but if it's still necessary/useful then
> SC->firstFreeBuffer and buf->freeNext are in effect a linked-list
> stack; there should be plenty of tested lock-free algorithms floating
> around for that. (btw. lastFreeBuffer looks like dead code, is that
> correct?)

Thinking about it a bit more: if the freelist is mostly empty, a
simpler alternative would be to make an unprotected read to check
SC->firstFreeBuffer and only acquire BufFreelistLock if there's
anything to pop.  This would reduce the lock-free parts to just
atomically incrementing a variable.

Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de