Re: BTree metapage lock and freelist structure - Mailing list pgsql-hackers

From Tom Lane
Subject Re: BTree metapage lock and freelist structure
Date
Msg-id 25142.1034015464@sss.pgh.pa.us
Whole thread Raw
In response to BTree metapage lock and freelist structure  (Alvaro Herrera <alvherre@dcc.uchile.cl>)
List pgsql-hackers
Alvaro Herrera <alvherre@dcc.uchile.cl> writes:
> There doesn't seem to be a way to acquire two different locks on the
> same page, so I propose to lock the InvalidBlockNumer for the btree, and
> use that as the lock to be obtained before doing extend, addfree,
> getfree or shrink.  The setroot/getroot operations would still use the
> locking on BTREE_METAPAGE, so they can proceed whether the
> InvalidBlockNumber is blocked or not.

Unfortunately, blkno == InvalidBlockNumber is already taken --- it's the
encoding of a relation-level lock.  Compare LockRelation and LockPage
in src/backend/storage/lmgr/lmgr.c.

It looks to me like nbtree.c is currently using LockPage(rel, 0) to
interlock relation extension --- this is the same convention used to
interlock heap extension.  But access to the metapage is determined by
buffer locks, which are an independent facility.

I agree with your conclusion that extend, shrink, addfree, and getfree
operations may as well all use the same exclusive lock.  I think it
could be LockPage(rel, 0), though.

Since addfree/getfree need to touch the metapage, at first glance it
would seem that they need to do LockPage(rel, 0) *plus* get a buffer
lock on the metapage.  But I think it might work for them to get only
a shared lock on the metapage; the LockPage lock will guarantee
exclusion against other addfree/getfree operations, and they don't
really care if someone else is changing the root pointer.  This is ugly
but given the limited set of operations that occur against a btree
metapage, it seems acceptable to me.  Comments anyone?


> On a different topic:  the freelist structure I think should be
> represented as a freelist int32 number (btm_numfreepages) in
> BTMetaPageData, and a pointer to the first BlockNumber.  Adding a new
> page is done by sequentially scanning the list until a zero is found or
> the end of the block is reached.  Getting a page sequentially scans the
> same list until a blocknumber > 0 is found (iff btm_numfreepages is
> greater than zero).  This allows for ~ 2000 free pages (very unlikely to
> actually happen if the shrink operation is in place).

No more than 2000 free pages in an index?  What makes you think that's
unlikely?  It's only 16 meg of free space, which would be about 1% of
a gigabyte-sized index.  I think it's a bad idea to make such a
hardwired assumption.

The original concept I believe was to have a linked list of free pages,
ie, each free page contains the pointer to the next one.  This allows
indefinite expansion.  It does mean that you have to read in the
selected free page to get the next-free-page pointer from it, which you
have to do to update the metapage, so that read has to happen while
you're still holding the getfree lock.  That's annoying.

Perhaps we could use a hybrid data structure: up to 2000 free pages
stored in the metapage, with any remaining ones chained together.
Most of the time, freelist operations wouldn't need to touch the
chain, hopefully.
        regards, tom lane


pgsql-hackers by date:

Previous
From: "Curtis Faith"
Date:
Subject: Re: Parallel Executors [was RE: Threaded Sorting]
Next
From: "Curtis Faith"
Date:
Subject: Dirty Buffer Writing [was Proposed LogWriter Scheme]