
From: Heikki Linnakangas
Subject: Rewriting Free Space Map
Date:
Msg-id: 47DD97B5.5020309@enterprisedb.com
List: pgsql-hackers
I've started working on revamping the Free Space Map, using the approach 
where we store a map of the heap pages' free space on every nth heap 
page. What we need now is a discussion of the details of how exactly it 
should work.

Here's my rough plan for the implementation. It's long, sorry. I'm fairly 
confident in it, but please let me know if you see any problems or have 
any suggestions or better ideas.

Heap FSM
--------

The FSM is stored in the special area of every nth heap page. When 
extending the relation, the heapam checks if the block number of the new 
page is one that belongs to the FSM. If it is, it lets the FSM initialize 
it by calling InitFSMPage() on it, and extends the relation again to get 
another, normal heap page.
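
For illustration, here's a minimal standalone sketch of that block-number 
check. It assumes (hypothetically) that block 0 and every FSM_FACTOR'th 
block after it hold FSM data; the actual spacing and starting block are 
whatever the patch ends up using:

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t BlockNumber;   /* stand-in for PostgreSQL's BlockNumber */

#define FSM_FACTOR 4096         /* hypothetical: one FSM page per N heap pages */

static bool
blkno_is_fsm_page(BlockNumber blkno)
{
    return (blkno % FSM_FACTOR) == 0;
}

On extension, if blkno_is_fsm_page() is true for the new block, the 
heapam would hand the page to InitFSMPage() and extend again to get a 
regular heap page.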

I chose the "every nth page is an FSM page" approach, rather than using 
a separate relfilenode, which I also considered. The separate 
relfilenode approach has some advantages, like being able to scan all 
FSM pages in a sequential fashion, but it involves a fair amount of 
catalog and buffer manager changes.

It's convenient that the FSM uses up the whole page, leaving no room for 
heap tuples. It simplifies the locking, as within the FSM we don't need 
to worry about the possibility that the caller is already holding a lock 
on the same page.

In an FSM page, there's one byte for each of the next N heap pages, 
starting from the FSM page. That one byte stores the amount of free 
space on the corresponding heap page, in BLCKSZ/256 byte precision (32 
bytes with default block size).

The mapping of free space to these 256 "buckets" wouldn't necessarily 
have to be linear; we could, for example, have a single bucket for pages 
with more than BLCKSZ/2 bytes of free space and divide the rest linearly 
into 16-byte buckets, but let's keep it simple for now. Of course, we 
could also just use 2 bytes per page and store the free space exactly, 
but 32-byte precision seems like enough to me.
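
As a sketch of the simple linear mapping (my reading of the proposal, not 
code from the patch), the stored byte would be derived roughly like this:

#include <stdint.h>
#include <stddef.h>

#define BLCKSZ 8192     /* default block size */

/* One byte per heap page: 256 buckets of BLCKSZ/256 bytes each, i.e.
 * 32-byte precision with the default block size.  Rounding down is the
 * safe direction: we never promise more space than the page has. */
static uint8_t
freespace_to_bucket(size_t freespace)
{
    size_t bucket = freespace / (BLCKSZ / 256);

    return (uint8_t) (bucket > 255 ? 255 : bucket);
}

/* Lower bound of the free space a stored byte guarantees. */
static size_t
bucket_to_freespace(uint8_t bucket)
{
    return (size_t) bucket * (BLCKSZ / 256);
}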


Index FSM
---------

Indexes use a similar scheme, but since we only need to keep track of 
whether a page is used or not, we only need one bit per page. If the 
amount of free space on index pages becomes interesting for an indexam 
in the future, it can use the heap FSM implementation instead. Or no FSM 
at all, like the hash indexam.

To use the index FSM, the indexam needs to leave every nth page alone, 
like in the heap. The B-tree assumes that the metapage is at block 0, 
but that's also where we would like to store the first index FSM page. 
To overcome that, we can make the index FSM special area a little bit 
smaller, so that the B-tree metadata fits on the same page as the FSM 
information. That will be wasted space on FSM pages other than block 0, 
but we're only talking about 24 bytes per FSM page, and we only need one 
FSM page per ~512 MB of index pages (with the default BLCKSZ).
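
The ~512 MB figure follows directly from the one-bit-per-page scheme; a 
quick back-of-the-envelope check (ignoring page header and special-area 
overhead):

#include <stdio.h>

int
main(void)
{
    const long blcksz = 8192;               /* default BLCKSZ */
    const long pages_covered = blcksz * 8;  /* one bit per index page */
    const long bytes_covered = pages_covered * blcksz;

    printf("%ld pages, ~%ld MB per index FSM page\n",
           pages_covered, bytes_covered / (1024 * 1024));
    /* prints: 65536 pages, ~512 MB per index FSM page */
    return 0;
}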


Performance
-----------

The patch I'm working on currently uses a naive way to find a page in 
the FSM. To find a page with X bytes of free space, it scans the FSM 
pages until one is found, and it always starts the scan from the 
beginning, which is far from optimal. And when there's no page with 
enough free space, it still needs to scan all the FSM pages just to find 
out that we need to extend the relation.

To speed things up, we're going to need some mechanism to avoid that. 
First of all, we need to somehow cache the information that "there's no 
page with >= X bytes left", to avoid fruitless scanning. To speed up the 
case when there's only a few pages with enough free space, we can keep a 
limited size list of such pages in addition to the map.

This information needs to be in shared memory, either on heap pages like 
the FSM pages, managed by the buffer manager, or in a separate shmem 
block. I would like to go with normal bufmgr-managed pages, as fixed-size 
shared memory blocks have their problems, and the cached information 
should be stored to disk as well.

Let's have one special page in the heap file, called the Free Space List 
(FSL) page, in addition to the normal FSM pages. It has the following 
structure:

struct {
    bit anypages[256];
    struct {
        BlockNumber blockno;
        uint8 freespace;
    } freespacelist[];      /* as large as fits on the page */
}

Remember that we track the free space on each page using one byte; IOW, 
each page falls into one of 256 buckets of free space. In the anypages 
bitmap, we have one bit per bucket, indicating "are there any pages with 
this much free space". When we look for a page with X bytes, we check the 
bits from the bucket corresponding to X bytes upwards, and if none of 
them is set, we know not to bother scanning the FSM pages.
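
Here's a rough sketch of that lookup, with invented helper names and the 
page layout details left out; the bitmap is only a hint, so a later FSM 
scan still has to verify the actual free space:

#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

#define BLCKSZ 8192

typedef struct
{
    uint8_t anypages[256 / 8];  /* one bit per free-space bucket */
    /* ... the freespacelist entries would follow ... */
} FSLPageContents;

static bool
fsl_bucket_is_set(const FSLPageContents *fsl, int bucket)
{
    return (fsl->anypages[bucket / 8] >> (bucket % 8)) & 1;
}

/* Returns false if no page in the relation is known to have at least
 * 'needed' bytes free, so the caller can skip the FSM scan and just
 * extend the relation. */
static bool
fsl_may_have_space(const FSLPageContents *fsl, size_t needed)
{
    int needed_bucket = (int) (needed / (BLCKSZ / 256));

    for (int bucket = needed_bucket; bucket < 256; bucket++)
    {
        if (fsl_bucket_is_set(fsl, bucket))
            return true;
    }
    return false;
}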

To speed up the scan when there is space, we keep a simple list of pages 
with free space. This list is actually like the current FSM, but here we 
only use it as a small cache of the FSM pages. VACUUM and any other 
operations that update the FSM can add pages to the list when there are 
free slots.
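
A toy version of that cache, with made-up sizes and names, might look 
like this:

#include <stdint.h>

typedef uint32_t BlockNumber;

typedef struct
{
    BlockNumber blockno;
    uint8_t     freespace;  /* bucketed free space, same encoding as the FSM byte */
} FSLEntry;

#define FSL_CACHE_SIZE 512  /* hypothetical: however many entries fit on the FSL page */

typedef struct
{
    int      nentries;
    FSLEntry entries[FSL_CACHE_SIZE];
} FSLCache;

/* VACUUM etc. can add a page if there's a free slot; if the list is
 * full we simply skip it, since it's only a cache of the FSM bytes. */
static void
fsl_try_add(FSLCache *fsl, BlockNumber blkno, uint8_t freespace)
{
    if (fsl->nentries < FSL_CACHE_SIZE)
    {
        fsl->entries[fsl->nentries].blockno = blkno;
        fsl->entries[fsl->nentries].freespace = freespace;
        fsl->nentries++;
    }
}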

We can store the FSL page at a magic location, say block #100. For 
relations smaller than that, there's no need for the FSL and we might as 
well scan the FSM page directly. I'm not sure if we should have more than 
one FSL page for large tables.

I'm not sure yet how the locking of FSL and FSM pages should work. It 
shouldn't be too hard, though, as the FSM/FSL information is just a hint 
anyway. We do need to take care that we don't permanently lose track of 
any significant amount of free space, because once we can do partial 
vacuums using the visibility map, VACUUM might not visit one part of a 
table even if other parts of it are frequently updated.


Page allocation algorithm
-------------------------

There's many different ways we can hand out pages from the FSM. Possible 
strategies, some of which are at odds with each other, include:

1. Try to spread out inserts of different backends, to avoid contention
2. Prefer low-numbered pages, to increase the chance of being able to 
truncate in VACUUM.
3. Reserve pages with lots of free space for big allocations, and prefer 
almost-full pages for small allocations, to use all space more efficiently.
4. If the table is clustered, try to use pages close to those with 
similar values.
5. On UPDATE, try to use pages close to the old tuple.
6. Prefer pages that are currently in cache, to avoid I/O.

The current FSM tries to achieve only 1, and there haven't been many 
complaints, so I wouldn't worry too much about the other goals. 4 and 6 
would need some major changes to the indexam and buffer manager 
interfaces, respectively, and 3 could lead to more I/O when you do a lot 
of small inserts.

We can spread inserts of different backends in the Free Space List by 
moving a page to the end of the list when it's handed out, or removing 
it from there altogether. When the FSL is empty, we can vary the place 
where we start to scan the FSM.

To prefer low-numbered pages, and so increase the chance of being able 
to truncate the relation later, we can favor lower-numbered blocks in 
VACUUM when we decide which blocks to put into the FSL. And we can also 
bias the point where we start to scan the FSM when the FSL is empty.
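
Purely as an illustration of those two heuristics (none of this is in 
the patch), the scan start point could be picked along these lines:

#include <stdint.h>
#include <stdlib.h>

typedef uint32_t BlockNumber;

/* Spread concurrent inserters: start each backend's FSM scan at a
 * different pseudo-random block. */
static BlockNumber
scan_start_spread(BlockNumber nblocks)
{
    return (BlockNumber) (rand() % nblocks);
}

/* Favor low-numbered blocks: squaring a uniform [0,1) variate skews
 * the starting point towards the beginning of the relation, which
 * helps later truncation. */
static BlockNumber
scan_start_low_bias(BlockNumber nblocks)
{
    double u = (double) rand() / ((double) RAND_MAX + 1.0);

    return (BlockNumber) (u * u * (double) nblocks);
}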


Visibility Map
--------------

So far I've only talked about the FSM, but it's important to consider 
how the Visibility Map fits into the scheme. My current thinking is that 
there will be one bit per heap page in the visibility map. The exact 
semantics of that one bit is still not totally clear, but that's not 
important right now.

There are a few alternatives:
1. Mix the visibility map with the FSM, stealing one bit of every FSM 
byte. There would then be 7 bits for storing how much space there is on 
each page, and 1 bit indicating the visibility.

2. Allocate part of each FSM page for the visibility map. For example, 
the first 1/9 of the page for the VM, and 8/9 for the FSM. The VM part 
would be a straight bitmap with one bit per page, and the FSM part would 
use one byte per page.

3. Use different pages for the VM, for example every nth page for VM, 
and every (n+1)th page for FSM.

I'm leaning towards 2 at the moment. 3 is intriguing as well, though,
because it would help with potential lock contention on the VM/FSM pages.
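
For alternative 2, the 1/9 vs. 8/9 split comes from each covered heap 
page needing one FSM byte plus one VM bit; ignoring header overhead for 
simplicity:

#include <stdio.h>

int
main(void)
{
    const int usable = 8192;    /* pretend the whole page is usable */

    /* n covered pages need n FSM bytes + n/8 VM bytes <= usable,
     * so n = usable * 8 / 9. */
    const int pages_covered = usable * 8 / 9;

    printf("pages covered: %d, FSM bytes: %d, VM bytes: %d\n",
           pages_covered, pages_covered, pages_covered / 8);
    /* prints: pages covered: 7281, FSM bytes: 7281, VM bytes: 910 */
    return 0;
}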


Per-chunk relfrozenxid
----------------------
I'm imagining that we would set a bit in the VM if a page has no dead 
tuples at all, making it possible to use it for index-only scans. Such a 
page is also uninteresting for VACUUM. However, we would still need to 
scan the whole table to freeze tuples.

To alleviate that, we could allocate some space in the FSM pages to 
indicate the smallest Xid in a range of heap pages. IOW, like 
relfrozenxid, but with a granularity of N pages rather than the whole 
relation. You would still have to scan that range of pages to update it, 
but that's much better than scanning the whole relation.
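
Structurally, that could be as simple as one extra field per FSM page 
(field names invented here for illustration):

#include <stdint.h>

typedef uint32_t TransactionId;

typedef struct
{
    TransactionId chunk_frozenxid;  /* oldest unfrozen xid among the heap
                                     * pages this FSM page covers */
    uint8_t       freespace[];      /* one byte per covered heap page */
} FSMChunkContents;

VACUUM could then freeze and rescan just the chunks whose 
chunk_frozenxid is old enough, instead of the whole relation.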

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

