Thread: BTree metapage lock and freelist structure

From: Alvaro Herrera

Hello hackers,

I'm thinking about the btree metapage locking problem.

In the current situation there are three operations that lock the
metapage:
1. when looking for the root page (share lock, "getroot")
2. when setting a new root page (exclusive lock, "setroot")
3. when extending the relation (exclusive lock, "extend").

Now, I want to add three more operations:
4. add a page to the freelist ("addfree")
5. get a page from the freelist ("getfree")
6. shrink a relation ("shrink").

The shrink operation happens only when someone tries to add the last
page of the relation to the freelist.  I don't know how common that is,
but after a relation truncation or a massive deletion it becomes
important.


What I want is to be able to do getroot and setroot without being
blocked by any of the other four.  Of course the other four are all
mutually exclusive.

There doesn't seem to be a way to acquire two different locks on the
same page, so I propose to lock InvalidBlockNumber for the btree, and
use that as the lock to be obtained before doing extend, addfree,
getfree or shrink.  The setroot/getroot operations would still use the
lock on BTREE_METAPAGE, so they can proceed whether InvalidBlockNumber
is locked or not.


On a different topic: I think the freelist should be represented in
BTMetaPageData as an int32 count (btm_numfreepages) followed by an
array of BlockNumbers.  Adding a page sequentially scans the array
until a zero slot is found or the end of the block is reached.  Getting
a page sequentially scans the same array until a block number greater
than zero is found (iff btm_numfreepages is greater than zero).  This
allows for ~2000 free pages (very unlikely to actually be reached if
the shrink operation is in place).

Comments?  Another solution would be to have a separate page for the
freelist, but it seems to be a waste.

-- 
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
"Because frankly, if you had to pass an exam to know how to manage
yourself... who'd be macho enough to hold a license?"  (Mafalda)


Re: BTree metapage lock and freelist structure

From: Tom Lane
Alvaro Herrera <alvherre@dcc.uchile.cl> writes:
> There doesn't seem to be a way to acquire two different locks on the
> same page, so I propose to lock InvalidBlockNumber for the btree, and
> use that as the lock to be obtained before doing extend, addfree,
> getfree or shrink.  The setroot/getroot operations would still use the
> lock on BTREE_METAPAGE, so they can proceed whether InvalidBlockNumber
> is locked or not.

Unfortunately, blkno == InvalidBlockNumber is already taken --- it's the
encoding of a relation-level lock.  Compare LockRelation and LockPage
in src/backend/storage/lmgr/lmgr.c.

It looks to me like nbtree.c is currently using LockPage(rel, 0) to
interlock relation extension --- this is the same convention used to
interlock heap extension.  But access to the metapage is determined by
buffer locks, which are an independent facility.

I agree with your conclusion that extend, shrink, addfree, and getfree
operations may as well all use the same exclusive lock.  I think it
could be LockPage(rel, 0), though.

Since addfree/getfree need to touch the metapage, at first glance it
would seem that they need to do LockPage(rel, 0) *plus* get a buffer
lock on the metapage.  But I think it might work for them to get only
a shared lock on the metapage; the LockPage lock will guarantee
exclusion against other addfree/getfree operations, and they don't
really care if someone else is changing the root pointer.  This is ugly
but given the limited set of operations that occur against a btree
metapage, it seems acceptable to me.  Comments anyone?


> On a different topic: I think the freelist should be represented in
> BTMetaPageData as an int32 count (btm_numfreepages) followed by an
> array of BlockNumbers.  Adding a page sequentially scans the array
> until a zero slot is found or the end of the block is reached.
> Getting a page sequentially scans the same array until a block number
> greater than zero is found (iff btm_numfreepages is greater than
> zero).  This allows for ~2000 free pages (very unlikely to actually be
> reached if the shrink operation is in place).

No more than 2000 free pages in an index?  What makes you think that's
unlikely?  It's only 16 meg of free space, which would be about 1% of
a gigabyte-sized index.  I think it's a bad idea to make such a
hardwired assumption.

The original concept, I believe, was a linked list of free pages: each
free page contains a pointer to the next one.  This allows indefinite
expansion.  It does mean you have to read in the selected free page to
get its next-free-page pointer, which you need in order to update the
metapage, so that read has to happen while you're still holding the
getfree lock.  That's annoying.

Perhaps we could use a hybrid data structure: up to 2000 free pages
stored in the metapage, with any remaining ones chained together.
Most of the time, freelist operations wouldn't need to touch the
chain, hopefully.
        regards, tom lane