Re: Free Space Map data structure - Mailing list pgsql-hackers

From Hannu Krosing
Subject Re: Free Space Map data structure
Date
Msg-id 1207646767.8153.31.camel@huvostro
Whole thread Raw
In response to Free Space Map data structure  (Heikki Linnakangas <heikki@enterprisedb.com>)
Responses Re: Free Space Map data structure
List pgsql-hackers
On Tue, 2008-04-08 at 09:33 +0100, Heikki Linnakangas wrote:
> The last thread about Free Space Map evolved into discussion about 
> whether the FSM and other kinds of auxiliary data should be stored 
> within the heap pages, in "map forks", or auxiliary relfilenodes 
> attached to the relation. It seems the consensus was to go with the map 
> forks, but what I really wanted to discuss is the data structure needed 
> for the Free Space Map.
> 
> The FSM data structure needs to support two basic operations:
> 
> 1. Fast lookup of page with >= X bytes of free space
> 2. Update of arbitrary, individual pages.
> 
> Our current code doesn't support 2, as we always update the FSM in bulk 
> after vacuum, but we will need that capability to be able to do partial 
> vacuums in the future.
> 
> Additionally, the data structure needs to be "pageable", so that we can 
> efficiently store it in pages that can be loaded in the buffer cache on 
> demand, and not require a fixed size shared memory block.
> 
> The simplest conceivable data structure is a straight array, where nth 
> element is the free space on page n. That's easily pageable, and 
> provides O(1) lookup/update of arbitrary page. Unfortunately it's slow 
> to search for a page with X bytes of free space in that.



> One brilliant idea I had, is a binary heap/tree kind of structure, where 
> each heap page is represented by one leaf node. Each leaf node stores 
> the amount of free space on the corresponding heap apge. Each non-leaf 
> node stores the max. amount of free space in any of its children. So the 
> top-most node immediately tells the max. amount of free space on *any* 
> page, which means that to find out that there's no suitable page is a 
> O(1) operation, which is good when you're inserting to a full relation. 
> When you're looking for X bytes, you traverse the tree down a path with 
> nodes > X.
> 
> For example:
> 
>      9
>   4     9
> 2 4   0 9
> 
> The leaf nodes correspond the heap pages, so page #0 has 2 units of free 
> space, page #1 has 4, page #1 is full and page has 9.
> 
> Let's work through a couple of examples:
> 
> To search for a page with 3 bytes of free space, start from the top. We 
> see that the topmost node has value 9, so we know there is a page 
> somewhere with enough space. Traverse down the tree, to the node where X 
>  >= 3. In this case, that's the left child, but if it was true for both, 
> we could pick either one. Traverse down from that node similarly, until 
> we hit the bottom.
> 
> To update the free space on page #1 to 3, you look up the leaf node 
> corresponding that page, which is easy if we store the tree in an array. 
> We update the 4 on that node to 3, and walk up the tree updating the 
> parents. In this case, we update the parent of that node to 3, and stop, 
> because the value in top node is higher than that. The lookup/update is 
> therefore O(log N) operation.
> 
> 
> Unfortunately, this data structure isn't easily pageable. It still seems 
> like a good and simple structure within a page, but we need a way to 
> scale it.

Also, we should aim at not needing write anything for "initially full"
pages, so that there will be no lock contention for parallel bulkloads.

That should be easily achieved, if we initialize all pages to "full" and
only do the tree update only if there is some room (above a certain
threshold) left after doing an insert.

In other words, pages filled by bulkload should not go into FSM.

> If we use one byte to store the amount of free space on a page (I think 
> that's enough granularity), you can fit a tree like that with about 8000 
> nodes, with 4000 leaf nodes, in one 8k block. That means that we can 
> address 4000 heap pages, or 32MB of heap, with one FSM page like that.
> 
> To go above that, we can introduce upper pages, where the leaf nodes 
> refer not to heap pages but to other FSM pages. The addressing gets 
> tricky, though. Because the number of upper pages required depends on 
> the depth of the tree, we can't just divide heap page number by 4000 to 
> get to the right FSM page. I think that's solvable, by always storing 
> the upper pages at predetermined places in the FSM file, but you'll need 
> a bit of logic to get from heap page # to FSM page #.

If we could rely on sparse files, then we could just extend the
tree-in-array idea to pages. Two levels of of pages (4000) can cover
256Gb of data ((8k*8k)/2) leaves * 8kb page size

The pages actually containing any meaningful data can be computed, the
rest should not be touched. 

Same is true for in-page updates - if we have only 4 pages in heap, then
there is no need to update the whole first branch (12 levels) just
update the page and its 2 parents, and when searching, start from parent
when adding 5th page, then extend the tree upward and start future
searches from there 3rd level grandparent

Probably we could do without sparse files, if we find an efficient way
to compute the "add order" of leaf and parent pages for above algorithm.

> Thoughts? Any other data structures that better fill the bill?



> -- 
>    Heikki Linnakangas
>    EnterpriseDB   http://www.enterprisedb.com
> 



pgsql-hackers by date:

Previous
From: Zoltan Boszormenyi
Date:
Subject: Re: TRUNCATE TABLE with IDENTITY
Next
From: "Pavan Deolasee"
Date:
Subject: Re: Free Space Map data structure