From: Manfred Koizar
Subject: Re: BTree free pages again
Msg-id: bq3dru02bhau88dkjd2fm81pn61giahffb@4ax.com
In response to: BTree free pages again (Alvaro Herrera <alvherre@dcc.uchile.cl>)
List: pgsql-hackers
Alvaro,

some time ago I started to collect ideas for btree reorganization but
never got beyond brainstorming.  Maybe it helps if I add them to your
ideas ...

On Tue, 22 Oct 2002 00:12:30 -0300, Alvaro Herrera
<alvherre@dcc.uchile.cl> wrote:
>I propose instead an hybrid approach:  the metapage has an array of
>BlockNumbers, _and_ a pointer to a "next freelist page".  That page has
>another array of BlockNumbers and another link to next freelist page.
>This allows for easier "compaction" of the freelist, an operation which
>should be done on a regular basis (with each VACUUM FULL, for example).
>The list of freelist-pages should actually be double linked; that way,
>the compaction process can take the BlockNumbers from the last page and
>put it on the first to fill it up, etc.

>(Remember that each newpage
>operation has to sequentially scan the freelist, and put a zero when it
>takes one).

What do you mean by "sequentially scan the freelist"?  Scan the array
of page numbers on the meta page or scan the whole list of freelist
pages?  I'd take a page number from the meta page.  When the list of
free pages on the meta page is empty, read the first freelist page,
move its freelist to the metapage, let the meta page point to its
successor, and return the (formerly) first freelist page as the new
page.

Putting a page on the free list:  If there is room for one more page
number on the meta page, put the page number there and we are
finished.  Otherwise move the freelist from the meta page onto the
newly freed page itself and insert that page at the start of the
freelist chain.

To avoid freelist ping-pong you might want to move only, say, 90% or
95% of the freelist from the metapage to the new free page ...
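The take/put protocol described in the last few paragraphs can be sketched as a toy in-memory model.  This is a sketch only: `MetaPage`, `FreelistPage`, `META_SLOTS`, and all function names below are invented for illustration, not PostgreSQL structures, and locking/WAL concerns are ignored.

```c
/* Toy model of the scheme: the meta page holds a small array of free
 * page numbers plus a link to the first freelist page; each freelist
 * page holds a further array and a chain link.  Not PostgreSQL code. */
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define META_SLOTS 8

typedef struct FreelistPage
{
    unsigned    pageno;               /* block number of this freelist page */
    unsigned    freelist[META_SLOTS];
    int         nfree;
    struct FreelistPage *next;        /* "next freelist page" link */
} FreelistPage;

typedef struct MetaPage
{
    unsigned    freelist[META_SLOTS];
    int         nfree;
    FreelistPage *first;              /* head of the freelist-page chain */
} MetaPage;

/* Take a page from the free list; 0 means "extend the relation". */
static unsigned
take_free_page(MetaPage *meta)
{
    if (meta->nfree > 0)
        return meta->freelist[--meta->nfree];

    if (meta->first != NULL)
    {
        /* Move the first freelist page's array onto the meta page,
         * unlink it, and hand that page itself out as the new page. */
        FreelistPage *fp = meta->first;
        unsigned    result = fp->pageno;

        memcpy(meta->freelist, fp->freelist, sizeof(fp->freelist));
        meta->nfree = fp->nfree;
        meta->first = fp->next;
        free(fp);
        return result;
    }
    return 0;
}

/* Put a freed page back on the free list. */
static void
put_free_page(MetaPage *meta, unsigned pageno)
{
    if (meta->nfree < META_SLOTS)
    {
        meta->freelist[meta->nfree++] = pageno;
        return;
    }
    /* Meta page full: turn the freed page into a new freelist page and
     * move the meta page's array onto it.  Moving only ~90% of it, as
     * suggested above, would reduce ping-pong; for brevity we move all. */
    FreelistPage *fp = malloc(sizeof(FreelistPage));
    fp->pageno = pageno;
    memcpy(fp->freelist, meta->freelist, sizeof(meta->freelist));
    fp->nfree = meta->nfree;
    meta->nfree = 0;
    fp->next = meta->first;
    meta->first = fp;
}
```

Note how `take_free_page` returns the freelist page itself once its array has been drained onto the meta page, so freelist pages are recycled for free.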

I don't know if we need to doubly link the list.  When do we have to
scan it backwards?


>Each time a
>tuple is deleted the page is checked to see if it's empty.  If it is,
>it's added to a "candidate-empty" list.

At this point the page is marked as "dead" (by setting a new flag in
BTPageOpaqueData.btpo_flags) and the lock is released (not sure about
releasing or keeping the pin).  A dead page is not yet available for
reuse.  It is always empty, so it is immediately skipped by index
scans.  The insertion code is modified to never put an index tuple
onto a dead page (you can put it onto the right sibling without
breaking consistency).
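A minimal sketch of the dead-flag idea, assuming a hypothetical `BTP_DEAD` bit (the real flags word is `BTPageOpaqueData.btpo_flags`; the structures and names below are an illustrative model, not backend code):

```c
/* Toy model: a bit in the page flags marks an empty page as dead, so
 * insertions skip it and move right.  BTP_DEAD is an invented value. */
#include <assert.h>
#include <stddef.h>

#define BTP_DEAD 0x08              /* hypothetical flag bit */

typedef struct BtPage
{
    unsigned short btpo_flags;
    int         nitems;
    struct BtPage *right;          /* right-sibling link */
} BtPage;

static void
mark_dead(BtPage *page)
{
    assert(page->nitems == 0);     /* only empty pages may be marked dead */
    page->btpo_flags |= BTP_DEAD;
}

/* Insertion never targets a dead page: walk right until a live one.
 * This is safe because moving right never breaks btree consistency. */
static BtPage *
insertion_target(BtPage *page)
{
    while (page != NULL && (page->btpo_flags & BTP_DEAD))
        page = page->right;
    return page;
}
```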

>At the end of the
>btbulkdelete() operation, we have a list of pages that are candidates
>for adding into the freelist.

There are several possible approaches:
(a) Maintain a list of dead pages in memory
(b) Rescan the whole index for dead pages
(c) Handle each dead page immediately

Each of these has its drawbacks: (a) leaves behind dead pages when the
reorganization crashes, (b) might have to read lots of non-dead pages,
(c) is inefficient if there are adjacent dead pages.

>For each one, the page is
>exclusive-locked, and rechecked if it's empty.

The "dead" flag guarantees emptiness; just assert that the page is
still dead.  (Can VACUUMs run concurrently?  If so, this algorithm has
to be adjusted.)  Also, there is no need to lock the page now.

>If it is, the parent is
>also exclusive-locked (beware of deadlock here!) and also its left and
>right siblings.  In the parent, the item pointing to this page is
>deleted; in the siblings, the side pointers are updated (high key
>on the left sibling also?).

Do one page at a time:  Exclusively lock the parent page, remove the
index tuple pointing to the dead page, unlock parent page.  Lock one
sibling, update pointer, unlock.  Lock the other sibling, update
pointer, unlock.
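The one-page-at-a-time unlinking above could be modelled like this (locking is elided, and `LeafPage` / `has_downlink` are invented stand-ins for the real page and the parent's index tuple):

```c
/* Toy model of unlinking a dead page: drop the parent's downlink
 * first, then fix each sibling's side pointer in a separate step,
 * holding only one exclusive lock at a time.  Not backend code. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct LeafPage
{
    struct LeafPage *left;         /* left-sibling link */
    struct LeafPage *right;        /* right-sibling link */
    bool        has_downlink;      /* does the parent still point here? */
} LeafPage;

static void
unlink_dead_page(LeafPage *dead)
{
    /* Step 1: exclusively lock the parent, remove the index tuple
     * pointing to the dead page, unlock. */
    dead->has_downlink = false;

    /* Step 2: lock the left sibling alone, repoint it past the dead
     * page, unlock. */
    if (dead->left)
        dead->left->right = dead->right;

    /* Step 3: lock the right sibling alone, repoint it, unlock. */
    if (dead->right)
        dead->right->left = dead->left;
}
```

Because each step touches only one page, no two exclusive locks are held at once, which is what avoids the deadlock risk mentioned in the quoted proposal.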

Attention!  Maybe there's a problem lurking regarding recent splits of
the left sibling!  Have to think more about this ...

Now we have reached a state where the dead page will not be visited by
any new index scan.  There might still be danger from scans that were
paused just before they were about to touch the dead page.  Have to
look at the code to see whether pins overlap when a scan is sent down
from a parent page or left/right from a sibling ...

>Then this page is added to the freelist.
>On the "add-to-free-list" operation, the page is checked to see if it's
>the last page of the relation.  If it is, the page just before is
>checked for emptyness (using the BTP_FREE flag) iteratively until a
>nonfree page is found.

Special handling for freelist pages required here.

>All those pages are deleted from the freelist.

You don't have to delete them explicitly.  Just store the maximum free
page number on the meta page and let the free page search ignore (and
set to 0) all page numbers greater than this number.  Making this work
with newroot always extending the relation needs some thought ...
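The maximum-free-page-number trick might look like this in a toy model (again, `MetaPage`, `max_free_page`, and the slot layout are all assumptions made for illustration):

```c
/* Toy model: instead of deleting truncated pages from the freelist,
 * record the new relation end on the meta page and have the search
 * skip (and zero out) any entry beyond it.  Not PostgreSQL code. */
#include <assert.h>

#define META_SLOTS 8

typedef struct MetaPage
{
    unsigned    freelist[META_SLOTS];
    unsigned    max_free_page;     /* entries above this are stale */
} MetaPage;

/* Return a usable free page number, zeroing stale entries; 0 if none. */
static unsigned
take_free_page(MetaPage *meta)
{
    for (int i = 0; i < META_SLOTS; i++)
    {
        unsigned    p = meta->freelist[i];

        if (p == 0)
            continue;
        if (p > meta->max_free_page)
        {
            meta->freelist[i] = 0; /* stale: beyond the truncated end */
            continue;
        }
        meta->freelist[i] = 0;
        return p;
    }
    return 0;
}
```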

>Then the relation is shrinked in that many pages.
>[...]
>To prevent deadlocks, the newroot operation
>does not get a page from the freelist; it always extends the relation.

>I think I've put too many things in one email.  Sorry for this.

Me too :-)

Servus
 Manfred

