Thread: Re: [PATCHES] update i386 spinlock for hyperthreading

Re: [PATCHES] update i386 spinlock for hyperthreading

From: Tom Lane
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Manfred Spraul wrote:
>> I remember there were patches that tried other algorithms instead of the 
>> simple LRU for the buffer manager. Has anyone tried to change the 
>> locking of the buffer manager?

> CVS HEAD already has an Adaptive Replacement Cache (ARC) for buffer
> replacement.

That's irrelevant to the problem, though.  Unless the ARC code uses data
structures that are more amenable to localized locking than the old
global buffer freelist.  (Jan?)
        regards, tom lane


Re: [PATCHES] update i386 spinlock for hyperthreading

From: Jan Wieck
Tom Lane wrote:

>Bruce Momjian <pgman@candle.pha.pa.us> writes:
>>Manfred Spraul wrote:
>>>I remember there were patches that tried other algorithms instead of the
>>>simple LRU for the buffer manager. Has anyone tried to change the
>>>locking of the buffer manager?
>
>>CVS HEAD already has an Adaptive Replacement Cache (ARC) for buffer
>>replacement.
>
>That's irrelevant to the problem, though.  Unless the ARC code uses data
>structures that are more amenable to localized locking than the old
>global buffer freelist.  (Jan?)
>
>            regards, tom lane

Not that I know of. The new strategy uses one shared hash table like the 
old, and one buffer pool as well. It grabs the same old BufMgrLock 
during the lookup+replacement decision process, gives it up during the 
IO (if any is needed), and grabs it again when the IO is done. As a matter of fact, 
the strategy itself does no locking at all. Like the old LRU code it 
simply assumes that the buffer manager holds the lock during calls.
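
In pseudo-C, the hold pattern around a buffer read looks roughly like
this (just a sketch, with a pthread mutex standing in for the BufMgrLock
and made-up helper names, not the real bufmgr functions):

#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t buf_mgr_lock = PTHREAD_MUTEX_INITIALIZER;

/* made-up stand-ins for the real lookup/replacement and I/O routines */
extern int  lookup_or_replace(int page_id, bool *found);
extern void read_page_from_disk(int page_id, int buf_id);
extern void mark_buffer_valid(int buf_id);

int
read_buffer(int page_id)
{
    bool found;
    int  buf_id;

    /* one global lock covers the lookup plus the replacement decision */
    pthread_mutex_lock(&buf_mgr_lock);
    buf_id = lookup_or_replace(page_id, &found);
    if (!found)
    {
        /* the lock is given up for the duration of the I/O ... */
        pthread_mutex_unlock(&buf_mgr_lock);
        read_page_from_disk(page_id, buf_id);
        /* ... and grabbed again when the I/O is done */
        pthread_mutex_lock(&buf_mgr_lock);
        mark_buffer_valid(buf_id);
    }
    pthread_mutex_unlock(&buf_mgr_lock);
    return buf_id;
}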


Jan

-- 

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #





Re: [PATCHES] update i386 spinlock for hyperthreading

From: Tom Lane
Jan Wieck <JanWieck@Yahoo.com> writes:
>> That's irrelevant to the problem, though.  Unless the ARC code uses data
>> structures that are more amenable to localized locking than the old
>> global buffer freelist.  (Jan?)

> the strategy itself does no locking at all. Like the old LRU code it 
> simply assumes that the buffer manager holds the lock during calls.

Okay, I suspected as much but wasn't sure.

Manfred's numbers definitely say that we need to find a way to break
down the BufMgrLock into multiple finer-grain locks.  We already have
all those per-buffer LWLocks, but I don't see how to apply those to
the problem of managing the global lookup and replacement datastructures.

Anyone see an attack path here?
        regards, tom lane


Re: [PATCHES] update i386 spinlock for hyperthreading

From: Bruce Momjian
Tom Lane wrote:
> Jan Wieck <JanWieck@Yahoo.com> writes:
> >> That's irrelevant to the problem, though.  Unless the ARC code uses data
> >> structures that are more amenable to localized locking than the old
> >> global buffer freelist.  (Jan?)
> 
> > the strategy itself does no locking at all. Like the old LRU code it 
> > simply assumes that the buffer manager holds the lock during calls.
> 
> Okay, I suspected as much but wasn't sure.
> 
> Manfred's numbers definitely say that we need to find a way to break
> down the BufMgrLock into multiple finer-grain locks.  We already have
> all those per-buffer LWLocks, but I don't see how to apply those to
> the problem of managing the global lookup and replacement datastructures.
> 
> Anyone see an attack path here?

Should we have one lock per hash bucket rather than one for the entire
hash?
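
Something along these lines, as a rough sketch (pthread rwlocks standing
in for LWLocks; the partition count and the helpers are invented for
illustration, nothing here is existing code):

#include <pthread.h>

#define NUM_BUCKET_LOCKS 16             /* invented value, would need tuning */

static pthread_rwlock_t bucket_lock[NUM_BUCKET_LOCKS];

void
init_bucket_locks(void)
{
    for (int i = 0; i < NUM_BUCKET_LOCKS; i++)
        pthread_rwlock_init(&bucket_lock[i], NULL);
}

/* map a page identifier to the lock covering its hash bucket */
static int
lock_for_page(unsigned page_id)
{
    return page_id % NUM_BUCKET_LOCKS;
}

void
lookup_page(unsigned page_id)
{
    int idx = lock_for_page(page_id);

    /* lookups in different buckets no longer contend on one global lock */
    pthread_rwlock_rdlock(&bucket_lock[idx]);
    /* ... probe this bucket's hash chain ... */
    pthread_rwlock_unlock(&bucket_lock[idx]);
}

void
insert_page(unsigned page_id)
{
    int idx = lock_for_page(page_id);

    pthread_rwlock_wrlock(&bucket_lock[idx]);
    /* ... add the entry to this bucket's chain ... */
    pthread_rwlock_unlock(&bucket_lock[idx]);
}

The lookup side partitions naturally like this; the replacement/aging
side is the harder part.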

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073


Re: [PATCHES] update i386 spinlock for hyperthreading

From: Manfred Spraul
Bruce Momjian wrote:

>>Anyone see an attack path here?
>
>Should we have one lock per hash bucket rather than one for the entire
>hash?
>
That's the simple part. The problem is the aging strategy: we need a
strategy that doesn't rely on a global list that's updated after every
lookup. If I understand the ARC code correctly, there is a
STRAT_MRU_INSERT(cdb, STRAT_LIST_T2) that happens on every lookup.


--
Manfred




Re: [PATCHES] update i386 spinlock for hyperthreading

From: Jan Wieck
Manfred Spraul wrote:
> Bruce Momjian wrote:
> 
>>>Anyone see an attack path here?
>>
>>Should we have one lock per hash bucket rather than one for the entire
>>hash?
>>
> That's the simple part. The problem is the aging strategy: we need a
> strategy that doesn't rely on a global list that's updated after every
> lookup. If I understand the ARC code correctly, there is a
> STRAT_MRU_INSERT(cdb, STRAT_LIST_T2) that happen in every lookup.

Moving the Cache Directory Block (cdb) on a hit to the MRU position of
the appropriate queue "is the bookkeeping" of this strategy. The whole
algorithm is based on it, and I don't see yet how to avoid that without
opening a huge can of worms that look like deadlocks. But I'll think
about it for a while.


Jan

-- 
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #



Re: [PATCHES] update i386 spinlock for hyperthreading

From: Bruce Momjian
Jan Wieck wrote:
> Manfred Spraul wrote:
> > Bruce Momjian wrote:
> > 
> >>>Anyone see an attack path here?
> >>
> >>Should we have one lock per hash bucket rather than one for the entire
> >>hash?
> >>
> > That's the simple part. The problem is the aging strategy: we need a
> > strategy that doesn't rely on a global list that's updated after every
> > lookup. If I understand the ARC code correctly, there is a
> > STRAT_MRU_INSERT(cdb, STRAT_LIST_T2) that happen in every lookup.
> 
> Moving the Cache Directory Block (cdb) on a hit to the MRU position of
> the appropriate queue "is the bookkeeping" of this strategy. The whole
> algorithm is based on it, and I don't see yet how to avoid that without
> opening a huge can of worms that look like deadlocks. But I'll think
> about it for a while.

If we can't eliminate the global lock, can we reduce its duration?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073


Re: [PATCHES] update i386 spinlock for hyperthreading

From: Manfred Spraul
Jan Wieck wrote:

>Moving the Cache Directory Block (cdb) on a hit to the MRU position of
>the appropriate queue "is the bookkeeping" of this strategy. The whole
>algorithm is based on it, and I don't see yet how to avoid that without
>opening a huge can of worms that look like deadlocks. But I'll think
>about it for a while.
>
I feared that.
Are there strategies that do not rely on a global lock? The Linux kernel 
uses a lazy LRU with referenced bits: on access, the referenced bit is 
set. The freespace logic takes pages from the end of a linked list, and 
checks that bit: if it's set, then the page is moved back to the top of 
the list. Otherwise it's a candidate for replacement. Pages start at the 
head of that pseudo-lru list, with the reference bit clear: that way a 
page that is accessed only once has a lower priority than a frequently 
accessed page. At least that's how I understand the algorithm.
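
A toy sketch of that referenced-bit scheme, as I understand it
(single-threaded and simplified; this is not the kernel code):

#include <stdbool.h>
#include <stddef.h>

typedef struct Page
{
    bool         referenced;    /* set on every access, no list update needed */
    struct Page *prev;
    struct Page *next;
} Page;

typedef struct
{
    Page *head;                 /* new and re-activated pages go here */
    Page *tail;                 /* replacement candidates are taken from here */
} PageList;

/* On access: just set the bit; no list manipulation, no global lock. */
void
page_accessed(Page *p)
{
    p->referenced = true;
}

/* unlink p from the list (assumes p is currently linked) */
static void
list_remove(PageList *l, Page *p)
{
    if (p->prev) p->prev->next = p->next; else l->head = p->next;
    if (p->next) p->next->prev = p->prev; else l->tail = p->prev;
    p->prev = p->next = NULL;
}

/* push p onto the head (MRU end) of the list */
static void
list_push_head(PageList *l, Page *p)
{
    p->prev = NULL;
    p->next = l->head;
    if (l->head) l->head->prev = p;
    l->head = p;
    if (l->tail == NULL) l->tail = p;
}

/*
 * Reclaim: take pages from the tail; a referenced page gets its bit
 * cleared and a second chance at the head, an unreferenced one is the
 * replacement victim.
 */
Page *
find_victim(PageList *l)
{
    Page *p;

    while ((p = l->tail) != NULL)
    {
        list_remove(l, p);
        if (p->referenced)
        {
            p->referenced = false;
            list_push_head(l, p);   /* second chance */
        }
        else
            return p;               /* candidate for replacement */
    }
    return NULL;
}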

--   Manfred



Re: [PATCHES] update i386 spinlock for hyperthreading

From: Tom Lane
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> If we can't eliminate the global lock, can we reduce its duration?

It'd be a big win even if we could just arrange that ReadBuffer and
ReleaseBuffer don't *both* grab the same global lock.

Can we logically separate the buffer lookup hashtable from the freelist
maintenance code?  I am wondering about having two locks, one for each
of those areas.  In principle I think a fast-path ReadBuffer (one that
doesn't need to do I/O because the desired page is in a buffer already)
would need to get only a share lock on the lookup table, and never lock
the freelist structure at all.  ReleaseBuffer would need a write lock
on the freelist, but need not touch the lookup table.  Only in the case
where ReadBuffer has to do I/O, and therefore needs to acquire a free
buffer to read the page into, do you need locks on both structures ---
and with sufficient care, you might not need to lock them both at the
same time, even in that path.

To make this work, we would have to recognize that the "freelist" might
be out of date --- that is, a page that had been in the freelist because
it has zero pins would be left in the freelist by ReadBuffer, and anyone
trying to remove it from the freelist for reuse would have to notice
that it has positive pin count and realize that it's not really free.
But that should be pretty simple.
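
To make that concrete, a rough sketch of the two-lock split (pthread
locks stand in for LWLocks; the struct fields and helper names are
invented, and the pin count handling is simplified to an atomic where
the real code would use the per-buffer locks):

#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct Buffer
{
    int         page_id;
    atomic_int  pin_count;      /* really protected by per-buffer locks */
    bool        on_freelist;    /* may be stale, see get_free_buffer() */
} Buffer;

static pthread_rwlock_t lookup_lock   = PTHREAD_RWLOCK_INITIALIZER;
static pthread_mutex_t  freelist_lock = PTHREAD_MUTEX_INITIALIZER;

/* made-up stand-ins for the hash table and freelist primitives */
extern Buffer *hash_lookup(int page_id);
extern Buffer *freelist_pop(void);
extern void    freelist_push(Buffer *buf);

/* Fast path: page already cached; only a share lock on the lookup table. */
Buffer *
read_buffer_fastpath(int page_id)
{
    Buffer *buf;

    pthread_rwlock_rdlock(&lookup_lock);
    buf = hash_lookup(page_id);
    if (buf)
        buf->pin_count++;
    pthread_rwlock_unlock(&lookup_lock);
    return buf;                 /* NULL: caller must take the I/O path */
}

/* ReleaseBuffer touches only the freelist, never the lookup table. */
void
release_buffer(Buffer *buf)
{
    pthread_mutex_lock(&freelist_lock);
    buf->pin_count--;
    if (buf->pin_count == 0 && !buf->on_freelist)
    {
        freelist_push(buf);
        buf->on_freelist = true;
    }
    pthread_mutex_unlock(&freelist_lock);
}

/*
 * The freelist can be out of date: a buffer on it may have been re-pinned
 * by the fast path since it was added, so a positive pin count means
 * "not really free" and the entry is simply skipped.
 */
Buffer *
get_free_buffer(void)
{
    Buffer *buf;

    pthread_mutex_lock(&freelist_lock);
    while ((buf = freelist_pop()) != NULL)
    {
        buf->on_freelist = false;
        if (buf->pin_count == 0)
            break;              /* genuinely free, reuse it */
    }
    pthread_mutex_unlock(&freelist_lock);
    return buf;
}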

Beyond that we could think about locking just parts of each of these
structures (for instance, doesn't ARC really use multiple freelists?)
but I think we ought to focus first on that basic division.
        regards, tom lane


Re: [PATCHES] update i386 spinlock for hyperthreading

From: Tom Lane
Manfred Spraul <manfred@colorfullife.com> writes:
> Are there strategies that do not rely on a global lock? The Linux kernel 
> uses a lazy LRU with referenced bits: on access, the referenced bit is 
> set. The freespace logic takes pages from the end of a linked list, and 
> checks that bit: if it's set, then the page is moved back to the top of 
> the list. Otherwise it's a candidate for replacement.

I think this is the same idea as what I was just suggesting: add an
extra check when looking for a free page, and thereby avoid having to
lock/update the global datastructure during ReadBuffer.
        regards, tom lane


Re: [PATCHES] update i386 spinlock for hyperthreading

From: Jan Wieck
Tom Lane wrote:

> Manfred Spraul <manfred@colorfullife.com> writes:
>> Are there strategies that do not rely on a global lock? The Linux kernel 
>> uses a lazy LRU with referenced bits: on access, the referenced bit is 
>> set. The freespace logic takes pages from the end of a linked list, and 
>> checks that bit: if it's set, then the page is moved back to the top of 
>> the list. Otherwise it's a candidate for replacement.
> 
> I think this is the same idea as what I was just suggesting: add an
> extra check when looking for a free page, and thereby avoid having to
> lock/update the global datastructure during ReadBuffer.

Hmmmm ... speaking without having given this one in-depth thought:

If there were an easy way to determine the approximate position of a 
CDB inside the queue, one could simply skip moving it to the MRU 
position whenever it's already in the upper third or so anyway.

The theory behind the whole ARC strategy (which is nothing more than four 
LRUs with a twist) is that the access frequency of blocks is inversely 
proportional to their number (a few are used at high frequency, some are 
requested moderately often, the vast majority rather seldom). That means 
that the buffers in the upper third or quarter of the T1 queue (and 
that's the critical one) are basically circling around each other, with 
frequent visitors from the lower regions or other queues.

So if there were a way to make a good guess at that relative queue 
position, that expensive CDB shuffling could often be avoided.
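
One cheap guess, as a sketch: stamp each CDB with the value of a counter
that is bumped on every MRU insertion. The difference between the current
counter and the stamp is an upper bound on how far the CDB can have sunk
since it was last at the MRU end (none of these names exist in the ARC
code, this is just an illustration):

typedef struct CDB
{
    unsigned long mru_stamp;    /* t1_mru_counter value at last MRU insert */
    /* ... queue links, buffer tag, and so on ... */
} CDB;

static unsigned long t1_mru_counter = 0;   /* bumped on every MRU insertion */
static unsigned long t1_length = 0;        /* current length of the T1 queue */

/* made-up stand-in for the real "unlink and re-insert at MRU" operation */
extern void strat_mru_insert(CDB *cdb);

/*
 * Every MRU insertion can push this CDB down by at most one position, so
 * the counter difference is an upper bound on how deep it can have sunk.
 * If that bound is still within the top third, skip the list shuffling.
 */
void
t1_hit(CDB *cdb)
{
    if (t1_mru_counter - cdb->mru_stamp < t1_length / 3)
        return;                             /* close enough to the MRU end */

    strat_mru_insert(cdb);
    cdb->mru_stamp = ++t1_mru_counter;
}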


Jan

-- 
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #



Re: [PATCHES] update i386 spinlock for hyperthreading

From: "Simon Riggs"
>Tom Lane
> Manfred's numbers definitely say that we need to find a way to break
> down the BufMgrLock into multiple finer-grain locks.  We already have
> all those per-buffer LWLocks, but I don't see how to apply those to
> the problem of managing the global lookup and replacement
> datastructures.
> 
> Anyone see an attack path here?

Not specifically, but some ideas that might lead to something useful...

Taking a leaf from somebody else's book is always the best way - much as
has been done with the ARC algorithm itself.
1. Scaling Oracle8i: Building Highly Scalable OLTP System Architectures
James Morle
ISBN: 0-201-32574-8
...seems like an interesting purchase in this regard (I have no
connection with the author) - but I can't vouch for its usefulness.
http://www.aw-bc.com/catalog/academic/product/0,4096,0201325748-TOC,00.html
This has nothing to do with OPS or RAC, which I definitely do not
advocate as an approach to the current challenges.
2. DB2 has long offered multiple BUFFERPOOLS. Why not explicitly define
multiple pools and then assign specific database objects to them - each
would have its own MRU list.

...that spurs some further off-the-wall ideas that follow the more usual
PostgreSQL line "design for automated operation" - how to make use of
multiple buffer pools without the need to manually assign database
objects to them???

First a question: am I right in thinking that the problem is
proportional to the number of independent actors (i.e. CPUs), but not
dependent at all on the actual size of the cache?

3. Split the T1 list into 2 (maybe more?) pieces, completely independent
of one another - T1a and T1b. When a T2, T1a, or T1b hit occurs, we
*somehow* pick one of the T1 lists in a pseudo-random fashion and move
it to the top of that list. This would clearly not be the best that ARC
can achieve, but if the buffer (and therefore the list) is big enough,
then the skew between the two lists might well be less than the loss of
performance from having a hot single MRU list. This might be regarded as
vertical partitioning.
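
As a rough sketch of (3), simplified so that the sub-list is chosen when
a block first enters T1 and a hit only reorders within that home
sub-list, which sidesteps cross-list locking (pthreads stand in for
LWLocks; none of these names exist in the code):

#include <pthread.h>
#include <stdlib.h>

typedef struct CDB
{
    int home;                   /* which T1 sub-list this block lives on */
    /* ... queue links, buffer tag, and so on ... */
} CDB;

typedef struct
{
    pthread_mutex_t lock;
    /* list head and tail omitted */
} T1Sublist;

/* T1a and T1b, each with its own lock */
static T1Sublist t1_part[2] =
{
    { PTHREAD_MUTEX_INITIALIZER },
    { PTHREAD_MUTEX_INITIALIZER }
};

/* made-up stand-ins for the raw list operations */
extern void list_unlink(T1Sublist *list, CDB *cdb);
extern void list_push_mru(T1Sublist *list, CDB *cdb);

/* Entering T1: pick a home sub-list pseudo-randomly. */
void
t1_insert(CDB *cdb)
{
    cdb->home = rand() & 1;
    pthread_mutex_lock(&t1_part[cdb->home].lock);
    list_push_mru(&t1_part[cdb->home], cdb);
    pthread_mutex_unlock(&t1_part[cdb->home].lock);
}

/*
 * On a hit only the home sub-list's lock is taken, so hits on blocks
 * that landed on different sub-lists can proceed in parallel.
 */
void
t1_hit(CDB *cdb)
{
    T1Sublist *list = &t1_part[cdb->home];

    pthread_mutex_lock(&list->lock);
    list_unlink(list, cdb);
    list_push_mru(list, cdb);
    pthread_mutex_unlock(&list->lock);
}

The random assignment should keep the two sub-lists roughly balanced on
average, at the cost of some loss in replacement quality versus a single
exact MRU order.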

>Jan Wieck writes:
>The theory of the whole ARC strategy (which is nothing more than four 
>LRU's with a twist) is that the access frequency of blocks is reverse 
>proportional to their numbers (a few are high frequency used, some are 
>medium often requested, the vast majority rather seldom). That means, 
>that the buffers in the upper third or quarter of the T1 queue (and 
>that's the critical one) are basically circling around each other with 
>frequent visitors from the lower regions or other queues.

4. Taking the analogy from (3) differently: partition T1 horizontally,
like the floors of a building, or the league divisions in Football
(Major/Minor? in the US, or 1st, 2nd, 3rd etc in the UK). The ARC
algorithm is identical, but T1 effectively has multiple MRU points, or
put another way, multiple lists all chained together. Say we give it 3
levels - T1.1 thru T1.3. When a block is first promoted from T2, it goes
to the MRU of T1.3. When a hit occurs while it is in T1.3 it gets punted
to the MRU of T1.2, and a hit while it is in T1.2 promotes it to the
T1.1 MRU. On the way down, blocks falling off of T1.1 would
go to T1.2 then T1.3. 

The T1.1 MRU would still be hotter than the rest. The way I am thinking
about this now is that the MRU points are evenly spaced, so the ARC
algorithm doesn't dynamically resize them as it does with T1/T2, but
that could change also. Doing things this way might mean that ARC
doesn't have to remember which transaction id added it - a specific ARC
tailoring feature added for PostgreSQL, since even if an UPDATE/DELETE
touches it twice, it won't move very far up the list. Would that save
time on the list reordering code?
4a) Perhaps MRU points should be spaced as 1/6, 1/3, 1/2 of the list, or
some other fairly simple but non-linear way in an attempt to balance the
hit frequency of the MRU points. Perhaps use four levels, then split
1/16, 2/16, 4/16, 8/16 (so the last level is 9/16ths of the list).

4b) Combine this with idea (3) above to split that list into two or more
equal sized lists, T1.1a and T1.1b.

4c) Promote blocks differently, according to their content type? Put
index non-leaf blocks (from fastroot down) & the rightmost leaf block
that are in T1 straight into T2.1, put permanent table heap blocks &
index leaf blocks into T2.2, and put temporary objects into T2.3. This
would require tasting the block or enquiring about its parent object's
details rather than handling it blind - without regard to its contents,
as ARC does now - and that sounds expensive.

Hope some of that helps..

Best Regards, Simon Riggs