Thread: FSM rewrite: doc changes

FSM rewrite: doc changes

From: Heikki Linnakangas
To keep everyone who's interested up-to-date, attached is the latest
patch. The big addition is documentation. I added a section in the
"Physical Database Layout" chapter about the new FSM, and removed
references to the max_fsm_relations/pages options, plus some other small
changes.

I also moved the one debugging function for peeking into an FSM page at
a low level to contrib/pageinspect, and updated the documentation of
pg_freespacemap and pageinspect to match the code.

I find it a bit disturbing that a documentation patch actually removes
more lines from the manual than it adds, but it's quite understandable
because it's no longer necessary to explain the two GUC options that
used to be quite important :-). Comments welcome.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com


Re: FSM rewrite: doc changes

From: Tom Lane
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> To keep everyone who's interested up-to-date, attached is the latest 
> patch. ...
> I find it a bit disturbing that a documentation patch actually removes 
> more lines from the manual than it adds, but it's quite understandable 
> because it's no longer necessary to explain the two GUC options that 
> used to be quite important :-). Comments welcome.

Well, this patch isn't actually supposed to have user-visible impact
other than eliminating a couple of troublesome configuration settings.
So it's entirely expected for the docs to get shorter ;-)

I did another pass of code-reading, and found a lot of nitpicks and
some not-so-trivial issues.  In no particular order:


Copyright in indexfsm.c is a year off.

InitIndexFreeSpaceMap should have a comment.

The comment for RecordFreeIndexPage gives the function's name incorrectly.

InitFreeSpaceMap() should be explicitly declared as taking void in its
definition.
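
For clarity, the difference between the two declarator forms (just a
sketch; the function name is from the patch):

    void InitFreeSpaceMap()        /* old-style: parameters unspecified */
    void InitFreeSpaceMap(void)    /* prototype: explicitly no arguments */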

FreeSpaceMapTruncateRel seems to have a bug in its early-exit test: in the
case where the number of FSM blocks stays the same, it fails to zero out slots
in the last block.  I also think it's got an off-by-one problem in figuring
the number of FSM blocks: for the normal case where the new heap end is in
the middle of a FSM block, shouldn't new_nfsmblocks be one larger than it
is?  The case where nblocks is an exact multiple of SlotsPerFSMPage would
need to be special-cased to be exactly correct, though I see no real harm in
letting the FSM be left one page too big in that case.
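
Rounding up rather than down should do it, something like this (a sketch
only, ignoring the FSM's upper-level pages, which complicate the real
calculation):

    /* one leaf slot per heap page; round up so a heap ending in the
     * middle of an FSM page still gets that partial page counted */
    new_nfsmblocks = (nblocks + SlotsPerFSMPage - 1) / SlotsPerFSMPage;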

The patch shouldn't be touching bufmgr.c at all any more --- or at least, none
of the diffs there are improvements.

Docs for contrib/pageinspect still need work: the 3-parameter form of
get_raw_page isn't documented, nor the fork behavior of the 2-parameter form.

In gistvacuum.c, you've removed the code that adjusts totFreePages to not
count pages truncated away.  I think you could just subtract the number of
truncated pages from it, since they must have been counted in it earlier.
(ginvacuum.c seems to get this right.)
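
That is, roughly (variable names assumed, not taken from the patch):

    /* the truncated-away pages were counted as free earlier,
     * so just uncount them */
    totFreePages -= npages - new_npages;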

I do not like the kluge in heap_xlog_clean one bit, and think it's unnecessary
anyway since we are not relying on the FSM to be accurate.  Suggest reverting
the heapam.c changes except for heap_sync().

rd_fsm_nblocks_cache should be reset in the places where rd_targblock is.
You seem to have tracked the clearings of rd_smgr which is not the right
thing at all.

I see you renamed "next", which is good, but the README isn't up to speed on
it and a lot of the comments aren't either.

Since fp_next_slot is signed, the sanity check in fsm_search_avail had better
include "target < 0".

The new search algorithm in fsm_search_avail still doesn't work.  Consider
what happens when the target is the rightmost slot on the page; it certainly
won't wrap properly.
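
For illustration, the wrap semantics wanted are those of a naive linear
scan like this (the real code descends the tree instead; leaf_value() is
a made-up helper):

    for (i = 0; i < LeafNodesPerPage; i++)
    {
        int     slot = (target + i) % LeafNodesPerPage;

        if (leaf_value(page, slot) >= minvalue)
            return slot;
    }
    return -1;          /* no slot on this page is big enough */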

fsm_truncate_avail seems quite broken: it's clearing the whole page always.

In fsm_rebuild_page, surely we needn't check "if (lchild < NodesPerPage)".
Also you probably ought to make it

    if (fsmpage->fp_nodes[nodeno] != newvalue)
    {
        fsmpage->fp_nodes[nodeno] = newvalue;
        changed = true;
    }

to avoid useless write traffic into a shared buffer.

I think DEPTH should be a macro not a static int; it's certainly
reducible to a compile-time constant.  Also I wonder whether you
really need the SlotsPerFSMPagePowers[] array at all (and if not,
you could get rid of InitFreeSpaceMap).  It's used in only one
place and it seems a bit hard to argue that a multiplication loop
really needs to be avoided there --- the division loop that comes
after it will cost a lot more, and in any case both are negligible
compared to the shared buffer fetch that's about to occur.
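
E.g. (illustrative only; the exact cutoff would need to be derived from
SlotsPerFSMPage):

    /* The tree must be deep enough that SlotsPerFSMPage^DEPTH can
     * address any block number up to 2^32 - 1; 1626 is the smallest
     * fanout whose cube exceeds that. */
    #define FSM_TREE_DEPTH  ((SlotsPerFSMPage >= 1626) ? 3 : 4)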

This test in fsm_space_needed_to_cat:

    if (needed >= (FSM_CATEGORIES - 1) * FSM_CAT_STEP)
        elog(ERROR, "invalid FSM request size");

reveals a rather fundamental problem: it is clearly possible
for this test to fail on valid request sizes, because the page
header overhead is less than FSM_CAT_STEP (especially if BLCKSZ
is more than 8K).  I'm not sure about a really clean solution
here.  We could offset the needed_to_cat and avail_to_cat
calculations so that category 255 corresponds exactly to the
maximum possible free space, but that requires assuming that FSM
knows exactly what that is, which is a bit unpleasant.  Thoughts?
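
To put numbers on it (assuming BLCKSZ = 8192, FSM_CATEGORIES = 256,
FSM_CAT_STEP = BLCKSZ / FSM_CATEGORIES = 32, and a 24-byte page header):

    threshold = (FSM_CATEGORIES - 1) * FSM_CAT_STEP;  /* 255 * 32 = 8160 */
    max_free  = BLCKSZ - 24;                          /* 8168 when empty */

so a valid request of 8160..8168 bytes trips the elog, exactly because
the header overhead (24) is less than FSM_CAT_STEP (32).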

It seems a bit schizophrenic that fsm_search_avail takes a Buffer
when all the other functions in fsmpage.c take Page arguments.
I see why fsm_search_avail needs to do that, but maybe it'd be
better if the other functions did too?

fsm_search() should not take addr as an argument, since it has a
built-in assumption that it is started at the root.

I find the use of eof as both a local variable and a parameter in
fsm_vacuum_page to be pretty poor programming practice.  Maybe call
the parameter eof_p?

Shouldn't fsm_redo include a FreeFakeRelcacheEntry call?
        regards, tom lane


Re: FSM rewrite: doc changes

From: Heikki Linnakangas
Tom Lane wrote:
> FreeSpaceMapTruncateRel seems to have a bug in its early-exit test: in the
> case where the number of FSM blocks stays the same, it fails to zero out slots
> in the last block.  I also think it's got an off-by-one problem in figuring
> the number of FSM blocks: for the normal case where the new heap end is in
> the middle of a FSM block, shouldn't new_nfsmblocks be one larger than it
> is?  The case where nblocks is an exact multiple of SlotsPerFSMPage would
> need to be special-cased to be exactly correct, though I see no real harm in
> letting the FSM be left one page too big in that case.
> 
> fsm_truncate_avail seems quite broken: it's clearing the whole page always.

Yep, I noticed these myself on Friday after sending the patch.

> I do not like the kluge in heap_xlog_clean one bit, and think it's unnecessary
> anyway since we are not relying on the FSM to be accurate.  Suggest reverting
> the heapam.c changes except for heap_sync().

The point is to have a more up-to-date FSM after recovery, particularly 
after PITR recovery in a warm stand-by server.

I'll take it out for now, but it needs more discussion. In fact, I think 
we should update the FSM even more aggressively, on inserts and updates 
as well as vacuums. Probably not on all inserts and updates, though, to 
keep the overhead minimal; there's a tradeoff somewhere in between.

> The new search algorithm in fsm_search_avail still doesn't work.  Consider
> what happens when the target is the rightmost slot on the page; it certainly
> won't wrap properly.

Crap, you're right.

> In fsm_rebuild_page, surely we needn't check "if (lchild < NodesPerPage)".

Yes, we do. There can be a number of completely unused upper nodes on 
the right. The upper levels of the tree are complete, but the bottom 
level is missing enough nodes to make room for the page header. So the 
tree looks something like this:
           X
       X       X
     X   X   X   X
    X X X X X . . .

Where . is a missing node. The parents that miss both children will 
always be zero.

> This test in fsm_space_needed_to_cat:
>     if (needed >= (FSM_CATEGORIES - 1) * FSM_CAT_STEP)
>         elog(ERROR, "invalid FSM request size");
> reveals a rather fundamental problem: it is clearly possible
> for this test to fail on valid request sizes, because the page
> header overhead is less than FSM_CAT_STEP (especially if BLCKSZ
> is more than 8K).  I'm not sure about a really clean solution
> here.  We could offset the needed_to_cat and avail_to_cat
> calculations so that category 255 corresponds exactly to the
> maximum possible free space, but that requires assuming that FSM
> knows exactly what that is, which is a bit unpleasant.  Thoughts?

Hmph. The other alternative is to use 2 bytes instead of one per page, 
and track the free space exactly. But I'd rather not do that just to 
deal with the very special case of huge requests.

Or we could just return -1 instead of throwing an error. Requests higher 
than the limit would then always have to extend the heap. That's not 
good, but I think we already have that problem for tuples of exactly 
MaxHeapTupleSize bytes. Since PageGetFreeSpace subtracts the size of a 
new line pointer, only newly extended pages that have never had any 
tuples on them have enough space, as determined by PageGetFreeSpace, to 
fit a tuple of MaxHeapTupleSize bytes.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com


Re: FSM rewrite: doc changes

From: Tom Lane
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> Tom Lane wrote:
>> In fsm_rebuild_page, surely we needn't check "if (lchild < NodesPerPage)".

> Yes, we do.

But the loop starting point is such that you must be visiting a parent
with at least one child, no?

>> reveals a rather fundamental problem: it is clearly possible
>> for this test to fail on valid request sizes, because the page
>> header overhead is less than FSM_CAT_STEP (especially if BLCKSZ
>> is more than 8K).  I'm not sure about a really clean solution
>> here.

> Hmph. The other alternative is to use 2 bytes instead of one per page, 
> and track the free space exactly. But I'd rather not do that just to 
> deal with the very special case of huge requests.

Yeah, I thought about that too.  It's got another problem besides the
sheer space cost: it would result in a whole lot more update traffic for
upper levels of the tree.  The quantization of possible values in the
current design is good because it avoids updates of parents for
relatively small deltas of free space.

> Or we could just return -1 instead of throwing an error. Requests higher 
> than the limit would then always have to extend the heap. That's not 
> good, but I think we already have that problem for tuples of exactly 
> MaxHeapTupleSize bytes. Since PageGetFreeSpace subtracts the size of a 
> new line pointer, only newly extended pages that have never had any 
> tuples on them have enough space, as determined by PageGetFreeSpace, to 
> fit a tuple of MaxHeapTupleSize bytes.

That seems like something we'll want to fix sometime, rather than
hardwiring into the FSM design.

I suppose an alternative possibility is to set MaxHeapTupleSize at
255/256's of a block by definition, so that no request will ever exceed
what the FSM stuff can handle.  But I'm sure that'd make somebody
unhappy --- somewhere out there is a table with tuples wider than that.

Probably the least bad alternative here is to allow FSM's category
scaling to depend on MaxHeapTupleSize.
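
Sketch of what I mean (one possible shape, not a worked-out patch):

    #define MaxFSMRequestSize   MaxHeapTupleSize

    static uint8
    fsm_space_needed_to_cat(Size needed)
    {
        int     cat;

        /* anything up to the largest legal request must be representable */
        if (needed > MaxFSMRequestSize)
            elog(ERROR, "invalid FSM request size %lu",
                 (unsigned long) needed);

        if (needed == 0)
            return 1;

        cat = (needed + FSM_CAT_STEP - 1) / FSM_CAT_STEP;

        /* let the top category stand for "up to MaxFSMRequestSize" */
        if (cat > 255)
            cat = 255;

        return (uint8) cat;
    }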
        regards, tom lane


Re: FSM rewrite: doc changes

From: Heikki Linnakangas
Tom Lane wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>> Tom Lane wrote:
>>> In fsm_rebuild_page, surely we needn't check "if (lchild < NodesPerPage)".
> 
>> Yes, we do.
> 
> But the loop starting point is such that you must be visiting a parent
> with at least one child, no?

Hmm, true, and that means that the comment above it is wrong. I'll 
change the code to do what the comment says. That way, if the page is 
completely garbled, and a parent node with no children has a value other 
than zero, that gets fixed too.
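
Roughly like this (a sketch of the loop doing what the comment says;
identifiers as in the patch, from memory):

    for (nodeno = NonLeafNodesPerPage - 1; nodeno >= 0; nodeno--)
    {
        int     lchild = leftchild(nodeno);
        int     rchild = lchild + 1;
        uint8   newvalue = 0;

        /* out-of-range children count as zero, so even a garbled
         * parent with no children gets reset */
        if (lchild < NodesPerPage)
            newvalue = fsmpage->fp_nodes[lchild];
        if (rchild < NodesPerPage)
            newvalue = Max(newvalue, fsmpage->fp_nodes[rchild]);

        if (fsmpage->fp_nodes[nodeno] != newvalue)
        {
            fsmpage->fp_nodes[nodeno] = newvalue;
            changed = true;
        }
    }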

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com