Re: [PATCHES] update i386 spinlock for hyperthreading - Mailing list pgsql-hackers

From: Simon Riggs
Subject: Re: [PATCHES] update i386 spinlock for hyperthreading
Msg-id: 000c01c3ee9b$7178ce70$8ac887d9@LaptopDellXP
In response to: Re: [PATCHES] update i386 spinlock for hyperthreading (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-hackers
> Tom Lane writes:
> Manfred's numbers definitely say that we need to find a way to break
> down the BufMgrLock into multiple finer-grain locks.  We already have
> all those per-buffer LWLocks, but I don't see how to apply those to
> the problem of managing the global lookup and replacement
> data structures.
> 
> Anyone see an attack path here?

Not specifically, but some ideas that might lead to something useful...

Taking a leaf from somebody else's book is always the best way - much as
has been done with the ARC algorithm itself.
1. Scaling Oracle8i(tm): Building Highly Scalable OLTP System Architectures

James Morle
ISBN: 0-201-32574-8
...seems like an interesting purchase in this regard (I have no
connection with the author) - but I can't vouch for its usefulness.
http://www.aw-bc.com/catalog/academic/product/0,4096,0201325748-TOC,00.html
This is nothing to do with OPS or RAC, which I definitely do not
advocate as an approach to the current challenges.

2. DB2 has long offered multiple BUFFERPOOLS. Why not explicitly define
multiple pools and then assign specific database objects to them - each
would have its own MRU list.
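
As a minimal sketch of what that could look like - the pool struct,
the pool count and PoolForRelation() are all assumptions here, not
existing bufmgr code - each pool carries its own lock and MRU list,
and a relation's blocks always map to the same pool, so lookups in
different pools never contend:

    #define N_POOLS 4

    typedef struct BufferPool
    {
        LWLockId    lock;       /* protects only this pool's lists */
        Buffer      mruHead;    /* head of this pool's MRU chain */
    } BufferPool;

    static BufferPool pools[N_POOLS];

    /* hypothetical: route each relation to a fixed pool so that
     * traffic on one pool's lock never blocks the other pools */
    static BufferPool *
    PoolForRelation(Oid relid)
    {
        return &pools[relid % N_POOLS];
    }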

...that spurs some further off-the-wall ideas that follow the more usual
PostgreSQL line of "design for automated operation" - how might we make
use of multiple buffer pools without the need to manually assign
database objects to them?

First a question: am I right in thinking that the problem is
proportional to the number of independent actors (i.e. CPUs), but not
dependent at all on the actual size of the cache?

3. Split the T1 list into 2 (maybe more?) pieces, completely independent
of one another - T1a and T1b. When a T2, T1a, or T1b hit occurs, we
*somehow* pick one of the T1 lists in a pseudo-random fashion and move
the buffer to the MRU of that list. This would clearly not be the best
that ARC can achieve, but if the cache (and therefore each list) is big
enough, then the cost of the skew between the two lists might well be
less than the cost of a single hot MRU point. This might be regarded as
vertical partitioning.
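
A minimal sketch of that hit path - T1List, MoveToMRU() and the
per-list lock field are assumptions for illustration, not existing
code - using the low bit of the buffer id as the cheap pseudo-random
choice:

    /* hypothetical: two independent T1 sublists, each with its own lock */
    static T1List t1[2];

    static void
    T1RecordHit(BufferDesc *buf)
    {
        /* low bit of the buffer id spreads hits across the two
         * MRU points without any extra state or random() call */
        int     which = buf->buf_id & 1;

        LWLockAcquire(t1[which].lock, LW_EXCLUSIVE);
        MoveToMRU(&t1[which], buf);     /* assumed list helper */
        LWLockRelease(t1[which].lock);
    }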

> Jan Wieck writes:
>The theory of the whole ARC strategy (which is nothing more than four 
>LRU's with a twist) is that the access frequency of blocks is reverse 
>proportional to their numbers (a few are high frequency used, some are 
>medium often requested, the vast majority rather seldom). That means, 
>that the buffers in the upper third or quarter of the T1 queue (and 
>that's the critical one) are basically circling around each other with 
>frequent visitors from the lower regions or other queues.

4. Taking the analogy from (3) differently: partition T1 horizontally,
like the floors of a building, or the league divisions in football
(Major/Minor? in the US, or 1st, 2nd, 3rd etc. in the UK). The ARC
algorithm is identical, but T1 effectively has multiple MRU points, or
put another way, multiple lists all chained together. Say we give it 3
levels - T1.1 thru T1.3. When a block is first promoted from T2, it goes
to the MRU of T1.3. A hit on a block in T1.3 punts it to the MRU of
T1.2, and a hit in T1.2 promotes it to the MRU of T1.1. On the way down,
blocks falling off of T1.1 would go to T1.2, then T1.3.
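
A minimal sketch of the promotion step, assuming a per-buffer level
field (t1_level) and a MoveToLevelMRU() helper, neither of which exists
today - the point being that a hit moves a block one floor up, never
straight to the hottest MRU:

    /* hypothetical: T1 split into levels 1..3, level 1 the hottest */
    static void
    T1Promote(BufferDesc *buf)
    {
        /* one floor up per hit; a block already in T1.1 stays put */
        if (buf->t1_level > 1)
            buf->t1_level--;

        /* re-link at the MRU point of its (possibly new) level */
        MoveToLevelMRU(buf, buf->t1_level);
    }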

The T1.1 MRU would still be hotter than the rest. The way I am thinking
about this now is that the MRU points are evenly spaced, so the ARC
algorithm doesn't dynamically resize the levels as it does with T1/T2,
but that could change also. Doing things this way might mean that ARC
doesn't have to remember which transaction id added a block - a
tailoring feature added specifically for PostgreSQL - since even if an
UPDATE/DELETE touches the block twice, it won't move very far up the
list. Would that save time in the list-reordering code?
4a) Perhaps the MRU points should be spaced at 1/6, 1/3, 1/2 of the
list, or in some other fairly simple but non-linear way, in an attempt
to balance the hit frequency of the MRU points. Perhaps use four levels,
split 1/16, 2/16, 4/16 (so the last level is the remaining 9/16ths of
the list).
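
Spelling that four-level split out: the MRU insertion points sit at
cumulative offsets 0, 1/16, 3/16 and 7/16 of the list, giving level
sizes of 1/16, 2/16, 4/16 and the remaining 9/16. A trivial
illustration (the helper name is made up):

    /* hypothetical: MRU insertion offset for a list of nbufs buffers */
    static int
    LevelMRUOffset(int nbufs, int level)    /* level = 0 (hottest) .. 3 */
    {
        /* cumulative offsets 0, 1, 3, 7 sixteenths of the list,
         * so the level sizes are 1, 2, 4 and 9 sixteenths */
        static const int sixteenths[] = { 0, 1, 3, 7 };

        return (nbufs * sixteenths[level]) / 16;
    }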

4b) Combine this with idea (3) above and split the hottest list into two
or more equal-sized lists, T1.1a and T1.1b.

4c) Promote blocks differently, according to their content type? Put
index non-leaf blocks (from fastroot down) & the rightmost leaf block
that are in T1 straight into T2.1, put permanent table heap blocks &
index leaf blocks into T2.2, and put temporary objects into T2.3. This
would require tasting the block, or asking its parent object for
details, rather than handling it blind - without regard to its contents,
as ARC does now - and that sounds expensive.
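
For illustration only, the classification itself might look something
like this - BlockClass and PageIsLeaf() are inventions here, though
rd_istemp and relkind are real relcache fields, so much of the cost is
simply in getting at the parent Relation at all:

    /* hypothetical block classes for the promotion policy above */
    typedef enum
    {
        BLOCK_INDEX_INTERNAL,   /* non-leaf & rightmost leaf -> T2.1 */
        BLOCK_HEAP_OR_LEAF,     /* permanent heap & leaf pages -> T2.2 */
        BLOCK_TEMP              /* temporary objects -> T2.3 */
    } BlockClass;

    static BlockClass
    ClassifyBlock(Relation rel, Page page)
    {
        if (rel->rd_istemp)
            return BLOCK_TEMP;
        if (rel->rd_rel->relkind == RELKIND_INDEX &&
            !PageIsLeaf(page))          /* assumed helper */
            return BLOCK_INDEX_INTERNAL;
        return BLOCK_HEAP_OR_LEAF;
    }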

Hope some of that helps...

Best Regards, Simon Riggs


