Free Space Map rewrite - Mailing list pgsql-patches
From | Heikki Linnakangas |
---|---|
Subject | Free Space Map rewrite |
Date | |
Msg-id | 4836F44B.9070207@enterprisedb.com Whole thread Raw |
Responses |
Re: Free Space Map rewrite
|
List | pgsql-patches |
Hi, Here's a new snapshot of the FSM rewrite I've been working on. The "map fork" stuff hasn't been changed since last patch (I have some work to do there based on Tom's recent comments), but the FSM implementation itself is now starting to get in shape. So the thing to look at in this patch is freespace.c. It's unreadable in diff format because the whole file has basically been rewritten, you'll have to apply the patch. I've also attached a README, which is also part of the patch. Still a lot of work to be done, like ironing out race conditions between updates and searches, the approach I'm planning to take there is explained in the README, and WAL-logging, but I'm fairly happy with what's there now. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com FSM page structure ------------------ Within an FSM page, we use a binary tree structure where leaf nodes store the amount of free space on heap pages (or lower level FSM pages, see Higher-level Structure below), with one leaf node per heap page. Intermediate nodes store the Max amount of free space on any of its children. For example: 4 4 2 3 4 0 2 <- This level represents heap pages There's two basic operations: search and update. To search for a page with X amount of free space, traverse down the tree along a path where n >= X, until you hit the bottom. If both children of a node satisfy the condition, you can pick either one arbitrarily. To update the amount of free space on a page to X, first update the leaf node corresponding the heap page, and "bubble up" the change to upper nodes, until you hit a node where n is already >= X. This data structure has a couple of nice properties: - to determine that there is no page with X bytes of free space, you only need to look at the root node - by varying which child to traverse to in the search algorithm, when you have a choice, we can implement various strategies, like preferring pages closer to a given page, or spreading the load across the table. Higher-level structure ---------------------- To scale up the data structure described above beyond a single page, we maintain a similar tree-structure across pages. Leaf nodes in higher level pages correspond to lower level FSM pages. The root node within each page has the same value as the corresponding leaf node on the parent page. Root page is always stored at physical block 0. For example, assuming each FSM page can hold information about 4 pages (in reality, that's (BLCKSZ - headers) / 2, or ~4000 with default BLCKSZ), we get a disk layout like this: 0 <-- page 0 at level 2 (root page) 0 <-- page 0 at level 1 0 <-- page 0 at level 0 1 <-- page 1 at level 0 2 <-- ... 3 1 <-- page 1 at level 1 4 5 6 7 2 8 9 10 11 3 12 13 14 15 where the numbers are page numbers *at that level*, starting from 0. To find the physical block # corresponding leaf page n, we need to calculate (number of preceding leaf pages) + (number of preceding upper level pages). This turns out to be y = n + (n / F + 1) + (n / F^2 + 1) + ... + 1 where F is the fanout (4 in the above example). From that, you can figure out the formulas for finding a given child page of upper-level page, or the parent of a page (XXX: explain) To keep things simple, the tree is always constant height. To cover the max. relation size of 2^31 blocks, three levels is enough with the default BLCKSZ (4000^3 >= 2^31). Locking ------- When traversing down, lock only one page at a time. Release lock on parent page before locking child page. That means that you will need to start from scratch if the node is concurrently updated and the free space that was supposed to be on a page is no longer there. When bubbling up, lock and update parent page before releaseing lock on child. This ensures that the TODO ---- - fastroot to avoid traversing upper nodes with just 1 child - use a different system for tables that fit into one FSM page, with a mechanism to switch to the real thing as it grows.
Attachment
pgsql-patches by date: