Rewriting Free Space Map - Mailing list pgsql-hackers
From | Heikki Linnakangas |
---|---|
Subject | Rewriting Free Space Map |
Date | |
Msg-id | 47DD97B5.5020309@enterprisedb.com |
Responses | Re: Rewriting Free Space Map |
List | pgsql-hackers |
I've started working on revamping the Free Space Map, using the approach where we store a map of heap pages on every nth heap page. What we need now is discussion on the details of how exactly it should work. Here's my rough plan for the implementation. It's long, sorry. I'm fairly confident with it, but please let me know if you see any problems or have any suggestions or better ideas.

Heap FSM
--------

The FSM is stored in the special area of every nth heap page. When extending the relation, the heapam checks if the block number of the new page is one that belongs to the FSM. If it is, it lets the FSM initialize it by calling InitFSMPage() on it, and extends the relation again to get another, normal heap page.

I chose the "every nth page is an FSM page" approach rather than using a separate relfilenode, which I also considered. The separate relfilenode approach has some advantages, like being able to scan all FSM pages in a sequential fashion, but it involves a fair amount of catalog and buffer manager changes.

It's convenient that the FSM uses up the whole page, leaving no room for heap tuples. It simplifies the locking, as we don't need to worry about the possibility that the caller of the FSM is already holding a lock on the same page.

In an FSM page, there's one byte for each of the next N heap pages, starting from the FSM page. That one byte stores the amount of free space on the corresponding heap page, in BLCKSZ/256 byte precision (32 bytes with the default block size). The mapping of free space to these 256 "buckets" wouldn't necessarily have to be linear; we could for example have a single bucket for pages with more than BLCKSZ/2 bytes of free space and divide the rest linearly into 16 byte buckets, but let's keep it simple for now. Of course, we could also just use 2 bytes per page and store the free space exactly, but 32 byte precision seems like enough to me. (A sketch of this layout and encoding follows after the Performance section below.)

Index FSM
---------

Indexes use a similar scheme, but since we only need to keep track of whether a page is in use or not, we only need one bit per page. If the amount of free space on pages becomes interesting for an indexam in the future, it can use the heap FSM implementation instead. Or no FSM at all, like the hash indexam.

To use the index FSM, the indexam needs to leave every nth page alone, like in the heap. The B-tree assumes that the metapage is at block 0, but that's also where we would like to store the first index FSM page. To overcome that, we can make the index FSM special area a little bit smaller, so that the B-tree metadata fits on the same page as the FSM information. That space will be wasted on FSM pages other than block 0, but we're only talking about 24 bytes per FSM page, and we only need one FSM page per ~512 MB of index pages (with the default BLCKSZ).

Performance
-----------

The patch I'm working on currently uses a naive way to find a page in the FSM. To find a page with X bytes of free space, it scans the FSM pages until one is found. It always starts the scan from the beginning, which is far from optimal, and when there's no page with enough free space, it still needs to scan all FSM pages just to find out that we need to extend the relation.

To speed things up, we're going to need some mechanism to avoid that. First of all, we need to somehow cache the information that "there's no page with >= X bytes left", to avoid fruitless scanning. To speed up the case when there are only a few pages with enough free space, we can keep a limited-size list of such pages in addition to the map.
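To make the layout and encoding described above concrete, here is a minimal sketch in C. The names (FSM_INTERVAL, BlockIsFSMPage and friends) and the interval value are assumptions invented for illustration, not taken from the actual patch; with one byte per heap page, one FSM page can in reality cover a little less than BLCKSZ heap pages because of the page header.

    #include "postgres.h"
    #include "storage/block.h"

    /* Hypothetical: distance between FSM pages, roughly one byte per heap page. */
    #define FSM_INTERVAL    BLCKSZ

    /* Is this block one of the FSM pages (blocks 0, FSM_INTERVAL, 2*FSM_INTERVAL, ...)? */
    static inline bool
    BlockIsFSMPage(BlockNumber blkno)
    {
        return (blkno % FSM_INTERVAL) == 0;
    }

    /* Map the free space on a heap page to one of 256 buckets, BLCKSZ/256 bytes apart. */
    static inline uint8
    FreeSpaceToBucket(Size avail)
    {
        return (uint8) Min(avail / (BLCKSZ / 256), 255);
    }

    /* Lower bound of the free space represented by a bucket value. */
    static inline Size
    BucketToFreeSpace(uint8 bucket)
    {
        return (Size) bucket * (BLCKSZ / 256);
    }

With the default 8 kB block size this gives the 32-byte buckets mentioned above.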
This information needs to be in shared memory, either on heap pages like the FSM pages, managed by the buffer manager, or in a separate shmem block. I would like to go with normal bufmgr-managed pages, as fixed-size memory blocks have their problems, and the cached information should be stored to disk as well.

Let's have one special page in the heap file, called the Free Space List (FSL) page, in addition to the normal FSM pages. It has the following structure:

    struct
    {
        uint8       anypages[256 / 8];  /* one bit per bucket: any pages in this bucket? */

        struct
        {
            BlockNumber blockno;
            uint8       freespace;
        } freespacelist[1];             /* in reality, as large as fits on the page */
    };

Remember that we track the free space on each page using one byte, IOW, each page falls into one of 256 buckets of free space. In the anypages bitmap, we have one bit per bucket indicating whether there are any pages with that much free space. When we look for a page with X bytes, we check the bits of the buckets that could satisfy the request, and if none of them are set we know not to bother scanning the FSM pages (see the sketch after the allocation discussion below). To speed up the scan when there is space, we keep a simple list of pages with free space. This list is actually like the current FSM, but here we only use it as a small cache of the FSM pages. VACUUM and any other operations that update the FSM can add pages to the list when there are free slots.

We can store the FSL page at a magic location, say block #100. For relations smaller than that, there's no need for the FSL and we might as well scan the FSM page. I'm not sure if we should have more than one FSL page for large tables.

I'm not sure yet how the locking of FSL and FSM pages should work. It shouldn't be too hard, though, as the FSM/FSL information is just hints anyway. We do need to take care that we don't permanently lose track of any significant amount of free space, because once we can do partial vacuums using the visibility map, VACUUM might not visit one part of a table even if other parts of it are frequently updated.

Page allocation algorithm
-------------------------

There are many different ways we can hand out pages from the FSM. Possible strategies, some of which are at odds with each other, include:

1. Try to spread out inserts of different backends, to avoid contention.
2. Prefer low-numbered pages, to increase the chance of being able to truncate in VACUUM.
3. Reserve pages with lots of free space for big allocations, and prefer almost-full pages for small allocations, to use all space more efficiently.
4. If the table is clustered, try to use pages close to those with similar values.
5. On UPDATE, try to use pages close to the old tuple.
6. Prefer pages that are currently in cache, to avoid I/O.

The current FSM tries to achieve only 1, and there haven't been many complaints, so I wouldn't worry too much about the other goals. 4 and 6 would need major changes to the indexam and buffer manager interfaces, respectively, and 3 could lead to more I/O when you do a lot of small inserts.

We can spread out the inserts of different backends by moving a page to the end of the Free Space List when it's handed out, or removing it from there altogether. When the FSL is empty, we can vary the place where we start to scan the FSM. To prefer low-numbered pages, and so increase the chance of being able to truncate the relation later, we can favor lower-numbered blocks in VACUUM when we decide which blocks to put into the FSL, and we can also bias the starting point of the FSM scan when the FSL is empty.
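Returning to the FSL's anypages bitmap described above, here is a minimal sketch of the "is it worth scanning the FSM at all?" test, reusing the bucket encoding from the earlier sketch. The function name and the packed-bitmap representation are assumptions for illustration only:

    #include "postgres.h"

    /*
     * Hypothetical check against the FSL's anypages bitmap: could any page
     * satisfy a request for 'needed' bytes?  Buckets are BLCKSZ/256 bytes apart.
     */
    static bool
    FSLMayHaveSpace(const uint8 *anypages, Size needed)
    {
        /* round up: a page in a lower bucket may hold less than 'needed' bytes */
        int     minbucket = (needed + (BLCKSZ / 256) - 1) / (BLCKSZ / 256);
        int     bucket;

        for (bucket = minbucket; bucket < 256; bucket++)
        {
            if (anypages[bucket / 8] & (1 << (bucket % 8)))
                return true;    /* at least one page might fit the tuple */
        }

        return false;           /* don't bother scanning the FSM pages */
    }

A caller that gets false here can go straight to extending the relation, which avoids exactly the fruitless full scan of the FSM pages described under Performance.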
Visibility Map
--------------

So far I've only talked about the FSM, but it's important to consider how the Visibility Map fits into the scheme. My current thinking is that there will be one bit per heap page in the visibility map. The exact semantics of that one bit are still not totally clear, but that's not important right now. There are a few alternatives:

1. Mix the visibility map with the FSM, stealing one bit of every FSM byte. There would then be 7 bits for storing how much free space there is on each page, and 1 bit indicating the visibility.

2. Allocate part of each FSM page for the visibility map. For example, the first 1/9 of the page for the VM, and 8/9 for the FSM. The VM part would be a straight bitmap with one bit per page, and the FSM part would use one byte per page.

3. Use different pages for the VM, for example every nth page for the VM, and every (n+1)th page for the FSM.

I'm leaning towards 2 at the moment. 3 is intriguing as well, though, because it would help with potential lock contention on the VM/FSM pages.

Per-chunk relfrozenxid
----------------------

I'm imagining that we would set a bit in the VM if a page has no dead tuples at all, making it possible to use it for index-only scans. Such a page is also uninteresting for VACUUM. However, we would still need to scan the whole table to freeze tuples. To alleviate that, we could allocate some space in the FSM pages, indicating the smallest Xid in a range of heap pages. IOW, like relfrozenxid, but with a granularity of N pages rather than the whole relation. You would still have to scan that range of pages to update it, but that's much better than the whole relation.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
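For what it's worth, here is some rough arithmetic for alternative 2 above (part of each map page used as a visibility bitmap, the rest as the one-byte-per-page FSM), with the per-chunk frozen Xid folded in. The structure, the header allowance, and all names are assumptions made for illustration, not part of the proposal:

    #include "postgres.h"

    /*
     * Hypothetical layout for alternative 2: each map page spends 1 bit per
     * heap page on visibility and 8 bits per heap page on free space, i.e.
     * 9 bits of the usable area per heap page covered.  The 100-byte header
     * allowance is a guess.
     */
    #define MAPPAGE_USABLE          (BLCKSZ - 100)
    #define HEAPPAGES_PER_MAPPAGE   (((MAPPAGE_USABLE - sizeof(TransactionId)) * 8) / 9)

    typedef struct MapPageContents
    {
        /* ~1/9 of the space: one visibility bit per heap page */
        uint8       vm_bits[HEAPPAGES_PER_MAPPAGE / 8 + 1];

        /* ~8/9 of the space: one free-space byte per heap page */
        uint8       fsm_bytes[HEAPPAGES_PER_MAPPAGE];

        /* per-chunk relfrozenxid: oldest unfrozen Xid among these heap pages */
        TransactionId chunk_frozenxid;
    } MapPageContents;

With the default 8 kB block size this works out to roughly 7000 heap pages, or about 56 MB of heap, covered by each map page.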