Thread: FSM rewrite: doc changes
To keep everyone who's interested up-to-date, attached is the latest patch. The big addition is documentation: I added a section about the new FSM to the "Physical Database Layout" chapter, removed references to the max_fsm_relations/max_fsm_pages options, and made some other small changes. I also moved the one debugging function for peeking into an FSM page at a low level to contrib/pageinspect, and updated the documentation of pg_freespacemap and pageinspect to match the code.

I find it a bit disturbing that a documentation patch actually removes more lines from the manual than it adds, but it's quite understandable, because it's no longer necessary to explain the two GUC options that used to be quite important :-).

Comments welcome.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> To keep everyone who's interested up-to-date, attached is the latest
> patch. ...
> I find it a bit disturbing that a documentation patch actually removes
> more lines from the manual than adds, but it's quite understandable
> because it's no longer necessary to explain the two GUC options that
> used to be quite important :-). Comments welcome.

Well, this patch isn't actually supposed to have user-visible impact other than eliminating a couple of troublesome configuration settings. So it's entirely expected for the docs to get shorter ;-)

I did another pass of code-reading, and found a lot of nitpicks and some not-so-trivial issues. In no particular order:

Copyright in indexfsm.c is a year off.

InitIndexFreeSpaceMap should have a comment.

The comment for RecordFreeIndexPage gives the function's name incorrectly.

InitFreeSpaceMap() should be explicitly declared as taking void in its definition.

FreeSpaceMapTruncateRel seems to have a bug in its early-exit test: in the case where the number of FSM blocks stays the same, it fails to zero out slots in the last block. I also think it's got an off-by-one problem in figuring the number of FSM blocks: for the normal case where the new heap end is in the middle of an FSM block, shouldn't new_nfsmblocks be one larger than it is? The case where nblocks is an exact multiple of SlotsPerFSMPage would need to be special-cased to be exactly correct, though I see no real harm in letting the FSM be left one page too big in that case.

The patch shouldn't be touching bufmgr.c at all any more --- or at least, none of the diffs there are improvements.

Docs for contrib/pageinspect still need work: the 3-parameter form of get_raw_page isn't documented, nor the fork behavior of the 2-parameter form.

In gistvacuum.c, you've removed the code that adjusts totFreePages to not count pages truncated away.
I think you could just subtract the number of truncated pages from it, since they must have been counted in it earlier. (ginvacuum.c seems to get this right.)

I do not like the kluge in heap_xlog_clean one bit, and think it's unnecessary anyway since we are not relying on the FSM to be accurate. Suggest reverting the heapam.c changes except for heap_sync().

rd_fsm_nblocks_cache should be reset in the places where rd_targblock is. You seem to have tracked the clearings of rd_smgr, which is not the right thing at all.

I see you renamed "next", which is good, but the README isn't up to speed on it, and a lot of the comments aren't either.

Since fp_next_slot is signed, the sanity check in fsm_search_avail had better include "target < 0".

The new search algorithm in fsm_search_avail still doesn't work. Consider what happens when the target is the rightmost slot on the page; it certainly won't wrap properly.

fsm_truncate_avail seems quite broken: it's clearing the whole page always.

In fsm_rebuild_page, surely we needn't check "if (lchild < NodesPerPage)". Also, you probably ought to make it

    if (fsmpage->fp_nodes[nodeno] != newvalue)
    {
        fsmpage->fp_nodes[nodeno] = newvalue;
        changed = true;
    }

to avoid useless write traffic into a shared buffer.

I think DEPTH should be a macro, not a static int; it's certainly reducible to a compile-time constant. Also, I wonder whether you really need the SlotsPerFSMPagePowers[] array at all (and if not, you could get rid of InitFreeSpaceMap). It's used in only one place, and it seems a bit hard to argue that a multiplication loop really needs to be avoided there --- the division loop that comes after it will cost a lot more, and in any case both are negligible compared to the shared buffer fetch that's about to occur.
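[Editor's note: the two fsm_search_avail complaints above --- the missing "target < 0" guard and the failure to wrap from the rightmost slot --- can be made concrete with a simplified linear stand-in for the search. The real function descends the binary tree rather than scanning, and the names here are illustrative, not the patch's actual code.]

```c
#include <assert.h>

/*
 * Simplified model of fsm_search_avail: find a slot with at least
 * `needed` free space, starting the scan at `target` and wrapping
 * around once.  fp_next_slot is signed in the patch, so the sanity
 * check must reject negative values as well as too-large ones.
 */
static int
search_avail(const unsigned char *slots, int nslots,
             int target, unsigned char needed)
{
    int     i;

    if (target < 0 || target >= nslots) /* sanitize the signed hint */
        target = 0;

    for (i = 0; i < nslots; i++)
    {
        /* modulo makes the scan wrap even when target is the rightmost slot */
        int     slot = (target + i) % nslots;

        if (slots[slot] >= needed)
            return slot;
    }
    return -1;                  /* nothing big enough on this page */
}
```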
This test in fsm_space_needed_to_cat:

    if (needed >= (FSM_CATEGORIES - 1) * FSM_CAT_STEP)
        elog(ERROR, "invalid FSM request size");

reveals a rather fundamental problem: it is clearly possible for this test to fail on valid request sizes, because the page header overhead is less than FSM_CAT_STEP (especially if BLCKSZ is more than 8K). I'm not sure about a really clean solution here. We could offset the needed_to_cat and avail_to_cat calculations so that category 255 corresponds exactly to the maximum possible free space, but that requires assuming that FSM knows exactly what that is, which is a bit unpleasant. Thoughts?

It seems a bit schizophrenic that fsm_search_avail takes a Buffer when all the other functions in fsmpage.c take Page arguments. I see why fsm_search_avail needs to do that, but maybe it'd be better if the other functions did too?

fsm_search() should not take addr as an argument, since it has a built-in assumption that it is started at the root.

I find the use of eof as both a local variable and a parameter in fsm_vacuum_page to be pretty poor programming practice. Maybe call the parameter eof_p?

Shouldn't fsm_redo include a FreeFakeRelcacheEntry call?

			regards, tom lane
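[Editor's note: with the default 8K block size the failure mode Tom describes is plain arithmetic. The sketch below assumes FSM_CATEGORIES = 256, FSM_CAT_STEP = BLCKSZ / FSM_CATEGORIES = 32, and roughly 24 bytes of page header; these constants and names are illustrative, not taken from the patch.]

```c
#include <assert.h>

#define BLCKSZ          8192
#define FSM_CATEGORIES  256
#define FSM_CAT_STEP    (BLCKSZ / FSM_CATEGORIES)   /* 32 bytes per category */
#define PAGE_HEADER     24      /* assumed page header size */

/*
 * Round a request up to a category, so that any page stored in that
 * category is guaranteed to have enough room for the request.
 */
static int
needed_to_cat(int needed)
{
    return (needed + FSM_CAT_STEP - 1) / FSM_CAT_STEP;
}
```

The error threshold is (FSM_CATEGORIES - 1) * FSM_CAT_STEP = 8160 bytes, but the largest possible free space on a page is BLCKSZ minus the header, about 8168 bytes --- so a perfectly valid 8164-byte request maps above category 255 and trips the elog(ERROR).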
Tom Lane wrote:
> FreeSpaceMapTruncateRel seems to have a bug in its early-exit test: in
> the case where the number of FSM blocks stays the same, it fails to
> zero out slots in the last block. I also think it's got an off-by-one
> problem in figuring the number of FSM blocks: for the normal case where
> the new heap end is in the middle of an FSM block, shouldn't
> new_nfsmblocks be one larger than it is? The case where nblocks is an
> exact multiple of SlotsPerFSMPage would need to be special-cased to be
> exactly correct, though I see no real harm in letting the FSM be left
> one page too big in that case.
>
> fsm_truncate_avail seems quite broken: it's clearing the whole page
> always.

Yep, I noticed these myself on Friday after sending the patch.

> I do not like the kluge in heap_xlog_clean one bit, and think it's
> unnecessary anyway since we are not relying on the FSM to be accurate.
> Suggest reverting the heapam.c changes except for heap_sync().

The point is to have a more up-to-date FSM after recovery --- PITR recovery in a warm stand-by server in particular. I'll take it out for now, but it needs more discussion. In fact, I think we should update the FSM even more aggressively, on inserts and updates as well as vacuums. Probably not on all inserts and updates, though, to keep the overhead minimal, but there's a tradeoff somewhere in between.

> The new search algorithm in fsm_search_avail still doesn't work.
> Consider what happens when the target is the rightmost slot on the
> page; it certainly won't wrap properly.

Crap, you're right.

> In fsm_rebuild_page, surely we needn't check "if (lchild < NodesPerPage)".

Yes, we do. There can be a number of completely unused upper nodes on the right. The upper levels of the tree are complete, but the bottom level is missing enough nodes to make room for the page header. So the tree looks something like this:

           X
       X       X
     X   X   X   X
    X X X X X . . .

where . is a missing node. The parents that miss both children will always be zero.
> This test in fsm_space_needed_to_cat:
>     if (needed >= (FSM_CATEGORIES - 1) * FSM_CAT_STEP)
>         elog(ERROR, "invalid FSM request size");
> reveals a rather fundamental problem: it is clearly possible
> for this test to fail on valid request sizes, because the page
> header overhead is less than FSM_CAT_STEP (especially if BLCKSZ
> is more than 8K). I'm not sure about a really clean solution
> here. We could offset the needed_to_cat and avail_to_cat
> calculations so that category 255 corresponds exactly to the
> maximum possible free space, but that requires assuming that FSM
> knows exactly what that is, which is a bit unpleasant. Thoughts?

Hmph. The other alternative is to use 2 bytes instead of one per page, and track the free space exactly. But I'd rather not do that just to deal with the very special case of huge requests.

Or we could just return -1 instead of throwing an error. Requests higher than the limit would then always have to extend the heap. That's not good, but I think we already have that problem for tuples of exactly MaxHeapTupleSize bytes: since PageGetFreeSpace subtracts the size of a new line pointer, only newly extended pages that have never had any tuples on them have enough space, as determined by PageGetFreeSpace, to fit a tuple of MaxHeapTupleSize bytes.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
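[Editor's note: Heikki's MaxHeapTupleSize point rests on PageGetFreeSpace deducting room for one new line pointer from the raw hole between the line-pointer array and the tuple data. A simplified model of that behavior, with illustrative numbers (ItemIdData is 4 bytes in PostgreSQL; the function name and signature here are a sketch, not the real API):]

```c
#include <assert.h>

#define SIZEOF_ITEMID 4         /* sizeof(ItemIdData) */

/*
 * Model of PageGetFreeSpace: the free hole is pd_upper - pd_lower,
 * minus room for one new line pointer.  A page that has ever held a
 * tuple has pd_lower past at least one line pointer, so it can never
 * report MaxHeapTupleSize bytes free again.
 */
static int
page_free_space(int pd_upper, int pd_lower)
{
    int     space = pd_upper - pd_lower;

    if (space <= SIZEOF_ITEMID)
        return 0;
    return space - SIZEOF_ITEMID;
}
```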
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> Tom Lane wrote:
>> In fsm_rebuild_page, surely we needn't check "if (lchild < NodesPerPage)".

> Yes, we do.

But the loop starting point is such that you must be visiting a parent with at least one child, no?

>> reveals a rather fundamental problem: it is clearly possible
>> for this test to fail on valid request sizes, because the page
>> header overhead is less than FSM_CAT_STEP (especially if BLCKSZ
>> is more than 8K). I'm not sure about a really clean solution
>> here.

> Hmph. The other alternative is to use 2 bytes instead of one per page,
> and track the free space exactly. But I'd rather not do that just to
> deal with the very special case of huge requests.

Yeah, I thought about that too. It's got another problem besides the sheer space cost: it would result in a whole lot more update traffic for upper levels of the tree. The quantization of possible values in the current design is good because it avoids updates of parents for relatively small deltas of free space.

> Or we could just return -1 instead of throwing an error. Requests higher
> than the limit would then always have to extend the heap. That's not
> good, but I think we already have that problem for tuples of exactly
> MaxHeapTupleSize bytes. Since PageGetFreeSpace subtracts the size of a
> new line pointer, only newly extended pages that have never had any
> tuples on them have enough space, as determined by PageGetFreeSpace, to
> fit a tuple of MaxHeapTupleSize bytes.

That seems like something we'll want to fix sometime, rather than hardwiring it into the FSM design. I suppose an alternative possibility is to set MaxHeapTupleSize at 255/256's of a block by definition, so that no request will ever exceed what the FSM stuff can handle. But I'm sure that'd make somebody unhappy --- somewhere out there is a table with tuples wider than that.
Probably the least bad alternative here is to allow FSM's category scaling to depend on MaxHeapTupleSize.

			regards, tom lane
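[Editor's note: Tom's quantization argument is easy to see numerically. With one byte per page and 32-byte categories, a free-space delta smaller than FSM_CAT_STEP usually leaves the page's category --- and hence every upper-level tree node --- untouched; an exact 2-byte count would propagate almost every change upward. A sketch assuming 8K blocks and 256 categories (illustrative constants, not the patch's code):]

```c
#include <assert.h>

#define BLCKSZ          8192
#define FSM_CATEGORIES  256
#define FSM_CAT_STEP    (BLCKSZ / FSM_CATEGORIES)   /* 32 bytes */

/*
 * Round available space down to a category, so the FSM never promises
 * more space than the page actually has.  Parent nodes only need an
 * update when this value changes, i.e. when the free space crosses a
 * 32-byte boundary.
 */
static int
avail_to_cat(int avail)
{
    return avail / FSM_CAT_STEP;
}
```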
Tom Lane wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>> Tom Lane wrote:
>>> In fsm_rebuild_page, surely we needn't check "if (lchild < NodesPerPage)".
>
>> Yes, we do.
>
> But the loop starting point is such that you must be visiting a parent
> with at least one child, no?

Hmm, true, and that means that the comment above it is wrong. I'll change the code to do what the comment says. That way, if the page is completely garbled, and a parent node with no children has a value other than zero, that gets fixed too.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
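[Editor's note: the rebuild behavior the thread settles on can be sketched as a bottom-up pass over an array-packed binary tree, where children of node i live at 2i+1 and 2i+2. All names here are illustrative, not the patch's actual code. Per the resolution above, every non-leaf slot is visited: a child index past the end of the array counts as zero free space, so a parent with no children at all is forced back to zero, repairing a garbled page as a side effect; and per Tom's earlier suggestion, the node is only written when its value actually changes.]

```c
#include <assert.h>

/*
 * Recompute upper nodes from the leaves.  `nnodes` is the total number
 * of stored nodes; `nonleaf` is how many of the leading slots are
 * non-leaf nodes.  Returns 1 if any node was modified.
 */
static int
rebuild_upper_nodes(unsigned char *nodes, int nnodes, int nonleaf)
{
    int     changed = 0;
    int     i;

    for (i = nonleaf - 1; i >= 0; i--)
    {
        int             l = 2 * i + 1;
        int             r = 2 * i + 2;
        /* a missing child (index past the end) contributes zero */
        unsigned char   lv = (l < nnodes) ? nodes[l] : 0;
        unsigned char   rv = (r < nnodes) ? nodes[r] : 0;
        unsigned char   mx = (lv > rv) ? lv : rv;

        /* skip the write when nothing changed, to avoid dirtying the buffer */
        if (nodes[i] != mx)
        {
            nodes[i] = mx;
            changed = 1;
        }
    }
    return changed;
}
```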