Re: Free Space Map data structure - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: Free Space Map data structure
Date
Msg-id 47FD142B.1010407@enterprisedb.com
Whole thread Raw
In response to Re: Free Space Map data structure  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
Tom Lane wrote:
> Heikki Linnakangas <heikki@enterprisedb.com> writes:
>> ... what I really wanted to discuss is the data structure needed 
>> for the Free Space Map.
> 
>> The FSM data structure needs to support two basic operations:
>> 1. Fast lookup of page with >= X bytes of free space
>> 2. Update of arbitrary, individual pages.
>> Our current code doesn't support 2, as we always update the FSM in bulk 
>> after vacuum, but we will need that capability to be able to do partial 
>> vacuums in the future.
> 
> Note that the lack of such infrastructure is mainly because we didn't
> need it, not because it couldn't be done.  

Yep. I actually remember seeing a function in freespace.c in CVS history 
to do partial updates, but it was removed at some point because it 
wasn't needed.

>> One brilliant idea I had, is a binary heap/tree kind of structure, where 
>> each heap page is represented by one leaf node.
> 
> I'm worried about a couple of things here:
> 
> * Loss of concurrency in access to the FSM itself, due to updates having
> to change large amounts of interrelated entries, and due to all
> processes always wanting to touch the root first.

When searching for free space, we start from the top, and go down from 
there. We only need to take a shared lock on the (page that contains) 
the upper nodes, so that shouldn't become a bottleneck.

When we update a node, we can stop propagating the change upwards as 
soon as we hit a node where the other child has less space than the 
current node. For example:
         5    5         3 4    5     3
4 3  2 5   2 3

To update the first leaf node from 4 to 2, we need to update it's parent 
to 3. But we don't need to update it's parent (5), and we can stop at 
that point without accessing the root.

We probably want to mark new pages as full, as Hannu suggested, to avoid 
repeatedly updating the FSM during bulk loads.

Given that the current FSM is protected by a single lock, shared by all 
relations, and always taken in exclusive mode, I'm not too worried. I'm 
not sure how long an FSM page needs to be locked in my scheme compared 
to the current FSM, but at least there will be a separate FSM for each 
relation. I do want to beat the current FSM in terms of scalability, 
though, I think I've seen FreeSpaceLock become contented on some tests.

As a little optimization, we could optimistically start from somewhere 
else than the top node when searching for free space. Perhaps from the 
top of the FSM page that contained our previous rd_targetblock. I don't 
think we need to, though.

> * Loss of concurrency in subsequent heap updates.  The current FSM gets
> one thing right: if different backends ask for space, they will (if at
> all possible) be given different heap pages to insert into, and
> therefore will not suffer page-level contention while they do their
> insertions.  I don't see what you'll do to ensure a comparable behavior
> with this structure.

Whenever we have a choice to traverse to either left or right child 
because there's heap pages with enough free space on both sides, we can 
randomize that decision to achieve the spreading.

This actually brings up another nice property this data structure has: 
instead of randomizing, we can choose the child node that's "closer" to 
some specific heap page. That can be used to keep an updated tuple 
version close to the old version, or to maintain cluster order. That's a 
goal that's at least somewhat at odds with the goal of spreading the 
inserts, of course, but the data structure has the flexibility, which is 
nice.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


pgsql-hackers by date:

Previous
From: "Claudio Rossi"
Date:
Subject: Re: Segfault using heap_form_tuple
Next
From: Decibel!
Date:
Subject: Re: Concurrent psql API