Re: New FSM patch - Mailing list pgsql-hackers

From Tom Lane
Subject Re: New FSM patch
Date
Msg-id 6057.1221777446@sss.pgh.pa.us
Whole thread Raw
In response to Re: New FSM patch  (Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>)
List pgsql-hackers
I did some editorializing on the FSM README file, in the line of
familiarizing myself with the contents.  Attached is an updated version.

Here are a couple of other random comments I jotted in the process:

search_avail makes me nervous: in the presence of a corrupt tree
I think it could index off the end of the page.  There needs to be
protection against trying to access a leaf node that doesn't exist.
(The search upward is okay because it cannot "miss" the root, and
you already checked the root is big enough; but a comment about
that wouldn't hurt.)  Also, I think it shouldn't throw a hard error
if the tree is corrupt, but just elog(LOG) and take corrective action.

I'd be happier if "next" were declared as int, as that makes it more
likely to be atomically fetchable/storable than if it's int16.  (On some
platforms a partial-word store is done by read, modify, write of a full
word.)  And could we name it something less generic than "next"?

The functions in fsmpage.c have names that are far too generic to be
exposed as global symbols.  If you don't want to fold that file into
freespace.c and make them static, then you need to rename them.

            regards, tom lane

$PostgreSQL$

Free Space Map
--------------

The purpose of the free space map is to quickly locate a page with enough
free space to hold a tuple to be stored; or to determine that no such page
exists and the relation must be extended by one page.  As of PostgreSQL 8.4
each relation has its own, extensible free space map stored in a separate
"fork" of its relation.  This eliminates the disadvantages of the former
fixed-size FSM.

It is important to keep the map small so that it can be searched rapidly.
Therefore, we don't attempt to record the exact free space on a page.
We allocate one map byte to each page, allowing us to record free space
at a granularity of 1/256th of a page.  Another way to say it is that
the stored value is the free space divided by BLCKSZ/256 (rounding down).
We assume that the free space must always be less than BLCKSZ, since
all pages have some overhead; so the maximum map value is 255.

To assist in fast searching, the map isn't simply an array of per-page
entries, but has a tree structure above those entries.  There is a tree
structure of pages, and a tree structure within each page, as described
below.

FSM page structure
------------------

Within each FSM page, we use a binary tree structure where leaf nodes store
the amount of free space on heap pages (or lower level FSM pages, see
"Higher-level structure" below), with one leaf node per heap page. A non-leaf
node stores the max amount of free space on any of its children.

For example:

    4
 4     2
3 4   0 2    <- This level represents heap pages

We need two basic operations: search and update.

To search for a page with X amount of free space, traverse down the tree
along a path where n >= X, until you hit the bottom. If both children of a
node satisfy the condition, you can pick either one arbitrarily.

To update the amount of free space on a page to X, first update the leaf node
corresponding to the heap page, then "bubble up" the change to upper nodes,
by walking up to each parent and recomputing its value as the max of its
two children.  Repeat until reaching the root or a parent whose value
doesn't change.

This data structure has a couple of nice properties:
- to discover that there is no page with X bytes of free space, you only
  need to look at the root node
- by varying which child to traverse to in the search algorithm, when you have
  a choice, we can implement various strategies, like preferring pages closer
  to a given page, or spreading the load across the table.

Higher-level routines that use FSM pages access them through the set_avail()
and search_avail() functions. The interface to those functions hides the
page's internal tree structure, treating the FSM page as a black box that has
a certain number of "slots" for storing free space information.  (However,
the higher routines have to be aware of the tree structure of the whole map.)

The binary tree is stored on each FSM page as an array. Because the page
header takes some space on a page, the binary tree isn't perfect. That is,
a few right-most leaf nodes are missing, and there are some useless non-leaf
nodes at the right. So the tree looks something like this:

       0
   1       2
 3   4   5   6
7 8 9 A B

where the numbers denote each node's position in the array.  Note that the
tree is guaranteed complete above the leaf level; only some leaf nodes are
missing.  This is reflected in the number of usable "slots" per page not
being an exact power of 2.

A FSM page also has a "next" pointer that determines where to start the next
search for free space within that page.  The reason for that is to spread out
the pages that are returned by FSM searches.  When several backends are
concurrently inserting into a relation, contention can be avoided by having
them insert into different pages.  The FSM is responsible for making that
happen, and the "next" pointer helps provide the desired behavior.

Higher-level structure
----------------------

To scale up the data structure described above beyond a single page, we
maintain a similar tree-structure across pages. Leaf nodes in higher level
pages correspond to lower level FSM pages. The root node within each page
has the same value as the corresponding leaf node on its parent page.

The root page is always stored at physical block 0.

For example, assuming each FSM page can hold information about 4 pages (in
reality, it holds (BLCKSZ - headers) / 2, or ~4000 with default BLCKSZ),
we get a disk layout like this:

 0     <-- page 0 at level 2 (root page)
  0     <-- page 0 at level 1
   0     <-- page 0 at level 0
   1     <-- page 1 at level 0
   2     <-- ...
   3
  1     <-- page 1 at level 1
   4
   5
   6
   7
  2
   8
   9
   10
   11
  3
   12
   13
   14
   15

where the numbers are page numbers *at that level*, starting from 0.

To find the physical block # corresponding to leaf page n, we need to
calculate

(number of preceding leaf pages) + (number of preceding upper level pages).
This turns out to be

y = n + (n / F + 1) + (n / F^2 + 1) + ... + 1

where F is the fanout (4 in the above example).

>From that, you can figure out the formulas for finding a given child page of
an upper-level page, or the parent of a page (XXX: fill in details)

To keep things simple, the tree is always constant height. To cover the
maximum relation size of 2^32 blocks, three levels is enough with the default
BLCKSZ (4000^3 > 2^32).

Addressing
----------

The higher-level routines operate on "logical" addresses, consisting of
- level,
- logical page number, and
- slot (if applicable)

Bottom level FSM pages have level of 0, the level above that 1, and root 2.
As in the diagram above, logical page number is the page number at that level,
starting from 0.

Locking
-------

When traversing down to search for free space, only one page is locked at a
time: the parent page is released before locking the child. If the child page
is concurrently modified, and there no longer is free space on the child page
when you land on it, you need to start from scratch (after correcting the
parent page, so that you don't get into an infinite loop).

When bubbling up a change, the parent page is locked before the lock on the
child is released. This ensures that the value at the root node of the child
page is always in sync with the corresponding leaf node on the parent.

We use shared buffer locks when searching, but exclusive buffer lock when
updating a page.  However, the "next" search pointer is updated during
searches even though we have only a shared lock.  "next" is just a hint and
we can easily reset it if it gets corrupted; so it seems better to accept
some risk of that type than to pay the overhead of exclusive locking.


TODO
----

- fastroot to avoid traversing upper nodes with just 1 child
- use a different system for tables that fit into one FSM page, with a
  mechanism to switch to the real thing as it grows.


pgsql-hackers by date:

Previous
From: Steve Crawford
Date:
Subject: Re: Do we really need a 7.4.22 release now?
Next
From: "David E. Wheeler"
Date:
Subject: Re: Where to Host Project