Thread: New FSM patch

New FSM patch

From
Heikki Linnakangas
Date:
Here's an updated FSM patch. Changes since last patch:

- Per comments and discussion with Simon, I've changed the "bubble up"
behavior so that when a bottom-level page is updated, if the amount of
free space was decreased, the change is not immediately bubbled up to
upper page. Instead, searchers that traverse down the tree will update
the upper pages when they see that they're out of sync. This should
alleviate the worry that we need to keep a bottom-level page exclusively
locked across I/O.

- Page-level routines have been split to a separate file fsmpage.c.
While there isn't that much code in it, it makes the separation between
functions that operate on a single page and functions that operate
across pages more clear.

- Fixed some minor bugs.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Attachment

Re: New FSM patch

From
Simon Riggs
Date:
On Fri, 2008-08-29 at 10:47 +0300, Heikki Linnakangas wrote:

> - Per comments and discussion with Simon, I've changed the "bubble up" 
> behavior so that when a bottom-level page is updated, if the amount of 
> free space was decreased, the change is not immediately bubbled up to 
> upper page. Instead, searchers that traverse down the tree will update 
> the upper pages when they see that they're out of sync. This should 
> alleviate the worry that we need to keep a bottom-level page exclusively 
> locked across I/O.

Thanks for taking time with the new FSM.

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support



Re: New FSM patch

From
Simon Riggs
Date:
On Fri, 2008-08-29 at 10:47 +0300, Heikki Linnakangas wrote:
> Here's an updated FSM patch. 

Can I check some aspects of this related to Hot Standby? Some of them
sound obvious, but worth double checking.

* There will be no need to read FSM by any normal operation of a
read-only transaction, so locking correctness considerations can
possibly be ignored during recovery. pg_freespacemap exists though:
would we need to prevent that from executing during recovery, or will
the FSM be fully readable? i.e. does redo take appropriate locks already
(I don't see any Cleanup locks being required).

* FSM will be continuously maintained during recovery, so FSM will now
be correct and immediately available when recovery completes?

* There are no cases where a screwed-up FSM will crash either recovery
(FATAL+) or halt normal operation (PANIC)?

* incomplete action cleanup is fairly cheap and doesn't rely on the FSM
being searchable to correct the error? This last is a hard one...

Do we have the concept of a invalid/corrupt FSM? What happens if the
logic goes wrong and we have a corrupt page? Will that mean we can't
complete actions against the heap? 

Are there really any changes to these files?
src/include/storage/bufmgr.h
src/include/postmaster/bgwriter.h

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support



Re: New FSM patch

From
Heikki Linnakangas
Date:
Thanks for the review!

Simon Riggs wrote:
> Can I check some aspects of this related to Hot Standby? Some of them
> sound obvious, but worth double checking.
> 
> * There will be no need to read FSM by any normal operation of a
> read-only transaction, so locking correctness considerations can
> possibly be ignored during recovery.

Correct.

A HOT prune operation doesn't currently update the FSM, but if it did, 
that would be a case of a read-only transaction updating the FSM. But we 
can't prune anyway in a hot standby.

>  pg_freespacemap exists though:
> would we need to prevent that from executing during recovery, or will
> the FSM be fully readable? i.e. does redo take appropriate locks already
> (I don't see any Cleanup locks being required).

pg_freespacemap, the contrib module? Yeah, the FSM should be fully readable.

During normal operation, when a bottom level page is updated, and the 
update needs to be bubbled up, the upper level page is pinned and locked 
before the lock on the lower level page is released. That interlocking 
is not done during WAL replay, and the lock on the lower level page is 
released before locking the upper page. It's not required during WAL 
replay, as there's no concurrent updates to the FSM.

> * FSM will be continuously maintained during recovery, so FSM will now
> be correct and immediately available when recovery completes?

Correct,

> * There are no cases where a screwed-up FSM will crash either recovery
> (FATAL+) or halt normal operation (PANIC)?

Hmm, there's no explicit elog(FATAL/PANIC) calls, but if the FSM is 
really corrupt, you can probably get a segfault. That should be fixable 
by adding more sanity checks, though.

> * incomplete action cleanup is fairly cheap and doesn't rely on the FSM
> being searchable to correct the error? This last is a hard one...

Correct.

> Do we have the concept of a invalid/corrupt FSM?  What happens if the
> logic goes wrong and we have a corrupt page? Will that mean we can't> complete actions against the heap?

Some scenarios I can think of:

Scenario: The binary tree on a page is corrupt, so that the value of an 
upper node is < Max(leftchild, rightchild).
Consequence: Searchers won't see the free space below that node, and 
will look elsewhere.

Scenario: The binary tree on a page is corrupt, so that the value of an 
upper node is > Max(leftchild, rightchild).
Consequence: Searchers will notice the corruption while trying to 
traverse down that path, and throw an elog(WARNING) in search_avail(). 
fsm_search will retry the search, and will in worst case go into an 
infinite loop. That's obviously not good. We could automatically fix the 
upper nodes of the tree, but that would wipe evidence that would be 
useful in debugging.

Scenario: An upper level page is corrupt, claiming that there's no free 
space on a lower level page, while there actually is. (the opposite, 
where an upper level page claims that there *is* free space on a lower 
level page, while there actually isn't, is now normal. The searcher will 
update the upper level page in that case)
Consequence: A searcher won't see that free space, and will look elsewhere.

Scenario: An upper level page is corrupt, claiming that there is free 
space on a lower level page that doesn't exist.
Consequence: Searchers will elog(ERROR), trying to read the non-existent 
FSM page.

The 3rd scenario would lead to heap inserts/updates failing. We could 
avoid that by checking that the page exists with 
RelationGetNumberOfBlocks(), but I'm not sure if it's worth the cost.

> Are there really any changes to these files?
> src/include/storage/bufmgr.h
> src/include/postmaster/bgwriter.h

Hmm, apparently not.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: New FSM patch

From
Simon Riggs
Date:
On Thu, 2008-09-04 at 11:07 +0300, Heikki Linnakangas wrote:
> Thanks for the review!

Not as thorough as I would have liked, I must admit.

Thanks for the other confirmations.

> Scenario: The binary tree on a page is corrupt, so that the value of an 
> upper node is > Max(leftchild, rightchild).
> Consequence: Searchers will notice the corruption while trying to 
> traverse down that path, and throw an elog(WARNING) in search_avail(). 
> fsm_search will retry the search, and will in worst case go into an 
> infinite loop. That's obviously not good. We could automatically fix the 
> upper nodes of the tree, but that would wipe evidence that would be 
> useful in debugging.

We probably need to break out of infinite loops, especially ones that
output warning messages on each loop. :-)

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support



Re: New FSM patch

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> On Thu, 2008-09-04 at 11:07 +0300, Heikki Linnakangas wrote:
>> Scenario: The binary tree on a page is corrupt, so that the value of an 
>> upper node is > Max(leftchild, rightchild).
>> Consequence: Searchers will notice the corruption while trying to 
>> traverse down that path, and throw an elog(WARNING) in search_avail(). 
>> fsm_search will retry the search, and will in worst case go into an 
>> infinite loop. That's obviously not good. We could automatically fix the 
>> upper nodes of the tree, but that would wipe evidence that would be 
>> useful in debugging.
> 
> We probably need to break out of infinite loops, especially ones that
> output warning messages on each loop. :-)

Yep. I turned that warning into an error in the latest patch I just 
posted elsewhere in this thread.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: New FSM patch

From
Zdenek Kotala
Date:
Heikki Linnakangas napsal(a):
> Here's an updated FSM patch. Changes since last patch:
>

Yesterday, I started to reviewing your patch. At the beginning I have 
general questions:

1) If I understand correctly the main goal is to improve FSM to cover 
all pages in file which is useful for huge database.

2) Did you perform any benchmark? Is there any performance improvement 
or penalty?

3) How it works when database has many active parallel connections?
        Zdenek


Re: New FSM patch

From
Heikki Linnakangas
Date:
Zdenek Kotala wrote:
> Yesterday, I started to reviewing your patch. 

Thanks!

> 1) If I understand correctly the main goal is to improve FSM to cover 
> all pages in file which is useful for huge database.

That's not a goal per se, though it's true that the new FSM does cover 
all pages. The goals are to:
- eliminate max_fsm_pages and max_fsm_relations GUC variables, so that 
there's one thing less to configure
- make the FSM immediately available and useful after recovery (eg. warm 
standby)
- make it possible to retail update the FSM, which will be needed for 
partial vacuum

> 2) Did you perform any benchmark? Is there any performance improvement 
> or penalty?

Working on it.. I've benchmarked some bulk-insertion scenarios, and the 
new FSM is now comparable to the current implementation on those tests. 
See the o

I've also been working on a low level benchmark using a C user-defined 
function that exercises just the FSM, showing the very raw CPU 
performance vs. current implementation. More on that later, but ATM it 
looks like the new implementation can be faster or slower than the 
current one, depending on the table size.

The biggest potential performance issue, however, is the fact that the 
new FSM implementation is WAL-logged. That shows up dramatically in the 
raw test where there's no other activity than FSM lookups and updates, 
but will be much less interesting in real life where FSM lookups are 
always related to some other updates which are WAL-logged anyway.

I also ran some DBT-2 tests without think times, with a small number of 
warehouses. But the results of that had such a high variability from 
test to test, that any difference in FSM speed would've been lost in the 
noise.

Do you still have the iGen setup available? Want to give it a shot?

> 3) How it works when database has many active parallel connections?

The new FSM should in principle scale better than the old one. However, 
Simon raised a worry about the WAL-logging: WALInserLock can already 
become the bottleneck in OLTP-scenarios with very high load and many 
CPUs. The FSM isn't any worse than other actions that generate WAL, but 
naturally if you're bottlenecked by the WAL lock or bandwidth, any 
increase in WAL traffic will show up as an overall performance loss.

I'm not too worried about that, myself, because in typical scenarios the 
extra WAL traffic generated by the FSM should be insignificant in volume 
compared to all the other WAL traffic. But Simon will probably demand 
some hard evidence of that ;-).

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: New FSM patch

From
Zdenek Kotala
Date:
Heikki Linnakangas napsal(a):
> Zdenek Kotala wrote:

> Do you still have the iGen setup available? Want to give it a shot?

Not sure if I have it configured, need to check. But I'll try it or I'll ask 
Jignesh or Paul if they have a free time. They are real benchmark gurus.
    Zdenek


-- 
Zdenek Kotala              Sun Microsystems
Prague, Czech Republic     http://sun.com/postgresql



Re: New FSM patch

From
Zdenek Kotala
Date:
I have question about FSM structure on the page. If I look into the code it 
seems to me that tree structure is "degenerated" on the right side. It is 
probably space optimization because complete tree needs BLCKSZ/2 space on the 
page and rest is empty.

Is my assumption correct? If yes maybe extra note in fsm_internal.h about it 
could be helpful.

    Zdenek




-- 
Zdenek Kotala              Sun Microsystems
Prague, Czech Republic     http://sun.com/postgresql



Re: New FSM patch

From
Heikki Linnakangas
Date:
Zdenek Kotala wrote:
> I have question about FSM structure on the page. If I look into the code 
> it seems to me that tree structure is "degenerated" on the right side. 
> It is probably space optimization because complete tree needs BLCKSZ/2 
> space on the page and rest is empty.
> 
> Is my assumption correct? If yes maybe extra note in fsm_internal.h 
> about it could be helpful.

Yes, that's right. I'll add a note on that if it helps.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: New FSM patch

From
Zdenek Kotala
Date:
Another question:

Does we need random_bool to spread workload? It seems to me a useless, because 
it also invokes one backend to use more pages instead of using one which is 
already in buffer cache. I think that it should generate a lot of extra i/o. Do 
not forget WAL full page write for firstime modified page.
    Zdenek


-- 
Zdenek Kotala              Sun Microsystems
Prague, Czech Republic     http://sun.com/postgresql



Re: New FSM patch

From
Heikki Linnakangas
Date:
Zdenek Kotala wrote:
> Does we need random_bool to spread workload? It seems to me a useless,
> because it also invokes one backend to use more pages instead of using
> one which is already in buffer cache.I think that it should generate a
> lot of extra i/o. Do not forget WAL full page write for firstime
> modified page.

random_bool() is gone in the latest version of the patch, in favor of a
"next pointer". You must be looking at an old version, and I must've
forgotten to update the link in the Wiki. That change was discussed in
the "New FSM allocation policy" thread.

Anyway, here's is the new version for your convenience, and I also added
a paragraph to the README, mentioning that the tree is degenerated from
the right.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Attachment

Re: New FSM patch

From
Heikki Linnakangas
Date:
Heikki Linnakangas wrote:
> I've also been working on a low level benchmark using a C user-defined 
> function that exercises just the FSM, showing the very raw CPU 
> performance vs. current implementation. More on that later, but ATM it 
> looks like the new implementation can be faster or slower than the 
> current one, depending on the table size.

Let me describe this test case first:
- The test program calls RecordAndGetPageWithFreeSpace in a tight loop, 
with random values. There's no activity to the heap. In normal usage, 
the time spent in RecordAndGetWithFreeSpace is minuscule compared to the 
heap and index updates that cause RecordAndGetWithFreeSpace to be called.
- WAL was placed on a RAM drive. This is of course not how people set up 
their database servers, but the point of this test was to measure CPU 
speed and scalability. The impact of writing extra WAL is significant 
and needs to be taken into account, but that's a separate test and 
discussion, and needs to be considered in comparison to the WAL written 
by heap and index updates.

That said, the test results are pretty interesting.

I ran the test using a custom scripts with pgbench. I ran it with 
different table sizes, and with 1 or 2 clients, on CVS HEAD and a 
patched version. The unit is "thousands of RecordAndGetPageWithFreeSpace 
calls per second":

Table size    Patched            CVS HEAD    1 clnt    2 clnts    1 clnt    2 clients
8 kB        4.59    3.45    62.83    26.85
336 kB        13.85    6.43    41.8    16.55
3336 kB        14.96    6.3    22.45    10.55
33336 kB    14.85    6.56    5.44    4.08
333336 kB    14.48    11.04    0.79    0.74
3333336 kB    12.68    11.5    0.07    0.07
33333336 kB    7.67    5.37    0.05    0.05

The big surprise to me was that performance on CVS HEAD tanks as the 
table size increases. One possible explanation is that searches for X 
bytes of free space, for a very high X, will not find any matches, and 
the current FSM implementation ends up scanning through the whole FSM 
list for that relation.

Another surprise was how badly both implementations scale. On CVS HEAD, 
I expected the performance to be roughly the same with 1 and 2 clients, 
because all access to the FSM is serialized on the FreeSpaceLock. But 
adding the 2nd client not only didn't help, but it actually made the 
performance much worse than with a single client. Context switching or 
cache line contention, perhaps? The new FSM implementation shows the 
same effect, which was an even bigger surprise. At table sizes > 32 MB, 
the FSM no longer fits on a single FSM page, so I expected almost a 
linear speed up with bigger table sizes from using multiple clients. 
That's not happening, and I don't know why. Although, going from 33MB to 
333 MB, the performance with 2 clients almost doubles, but it still 
doesn't exceed that with 1 client.

Going from 3 GB to 33 GB, the performance of the new implementation 
drops. I don't know why, I think I'll run some more tests with big table 
sizes to investigate that a bit more. The performance in the old 
implementation stays almost the same at that point; I believe that's 
because max_fsm_pages is exceeded at that point.

All in all, this isn't a very realistic test case, but it's interesting 
nevertheless. I'm happy with the performance of the new FSM on this 
test, as it's in the same ballpark as the old one, even though it's not 
quite what I expected.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: New FSM patch

From
Zdenek Kotala
Date:
Heikki Linnakangas napsal(a):
> Heikki Linnakangas wrote:
>> I've also been working on a low level benchmark using a C user-defined 
>> function that exercises just the FSM, showing the very raw CPU 
>> performance vs. current implementation. More on that later, but ATM it 
>> looks like the new implementation can be faster or slower than the 
>> current one, depending on the table size.
> 
> Let me describe this test case first:
> - The test program calls RecordAndGetPageWithFreeSpace in a tight loop, 
> with random values. There's no activity to the heap. In normal usage, 
> the time spent in RecordAndGetWithFreeSpace is minuscule compared to the 
> heap and index updates that cause RecordAndGetWithFreeSpace to be called.
> - WAL was placed on a RAM drive. This is of course not how people set up 
> their database servers, but the point of this test was to measure CPU 
> speed and scalability. The impact of writing extra WAL is significant 
> and needs to be taken into account, but that's a separate test and 
> discussion, and needs to be considered in comparison to the WAL written 
> by heap and index updates.
> 
> That said, the test results are pretty interesting.
> 
> I ran the test using a custom scripts with pgbench. I ran it with 
> different table sizes, and with 1 or 2 clients, on CVS HEAD and a 
> patched version. The unit is "thousands of RecordAndGetPageWithFreeSpace 
> calls per second":
> 
> Table size    Patched            CVS HEAD
>         1 clnt    2 clnts    1 clnt    2 clients
> 8 kB        4.59    3.45    62.83    26.85
> 336 kB        13.85    6.43    41.8    16.55
> 3336 kB        14.96    6.3    22.45    10.55
> 33336 kB    14.85    6.56    5.44    4.08
> 333336 kB    14.48    11.04    0.79    0.74
> 3333336 kB    12.68    11.5    0.07    0.07
> 33333336 kB    7.67    5.37    0.05    0.05
> 
> The big surprise to me was that performance on CVS HEAD tanks as the 
> table size increases. One possible explanation is that searches for X 
> bytes of free space, for a very high X, will not find any matches, and 
> the current FSM implementation ends up scanning through the whole FSM 
> list for that relation.
> 
> Another surprise was how badly both implementations scale. On CVS HEAD, 
> I expected the performance to be roughly the same with 1 and 2 clients, 
> because all access to the FSM is serialized on the FreeSpaceLock. But 
> adding the 2nd client not only didn't help, but it actually made the 
> performance much worse than with a single client. Context switching or 
> cache line contention, perhaps? 

> The new FSM implementation shows the 
> same effect, which was an even bigger surprise. At table sizes > 32 MB, 
> the FSM no longer fits on a single FSM page, so I expected almost a 
> linear speed up with bigger table sizes from using multiple clients. 
> That's not happening, and I don't know why. Although, going from 33MB to 
> 333 MB, the performance with 2 clients almost doubles, but it still 
> doesn't exceed that with 1 client.

It looks likes that there are lot of lock issues on FSM pages. When number of 
FSM pages is increased then number of collisions is lower. It is probably why 2 
clients significantly speed up between 33MB and 333MB. I think it is time to 
take DTrace ;-).
Do you have any machine with DTrace support? If not send me your test suit and I 
will try it run on my machine.
    Zdenek





-- 
Zdenek Kotala              Sun Microsystems
Prague, Czech Republic     http://sun.com/postgresql



Re: New FSM patch

From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> Let me describe this test case first:
> - The test program calls RecordAndGetPageWithFreeSpace in a tight loop, 
> with random values.

What's the distribution of the random values, exactly?  In particular,
how do the request sizes compare to available free space per-page?

The design intent for FSM was that we'd not bother to record pages that
have less free space than the average request size, so as to (usually)
avoid the problem of uselessly searching a lot of entries.  I can't tell
whether your test case models that behavior at all.  If it does then
there may be something else that needs fixing.
        regards, tom lane


Re: New FSM patch

From
Heikki Linnakangas
Date:
Zdenek Kotala wrote:
> It looks likes that there are lot of lock issues on FSM pages. When
> number of FSM pages is increased then number of collisions is lower. It
> is probably why 2 clients significantly speed up between 33MB and 333MB.

Yes, that's what I thought as well. With table size under 33 MB, the FSM
consists of just one (bottom-level) FSM page,

> I think it is time to take DTrace ;-).
> Do you have any machine with DTrace support?

No.

> If not send me your test
> suit and I will try it run on my machine.

Sure, here you are. tests.sh is the main script to run. You'll need to
adjusts the paths there for your environment.

As it is, the tests will take many hours to run, so you'll probably want
to modify tests.sh and pgbenchtests.sh to reduce the number of
iterations. At least on my server, the variance in the numbers was very
small, so repeating the tests 4 times in tests.sh is probably overkill.

Thanks!

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Attachment

Re: New FSM patch

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>> Let me describe this test case first:
>> - The test program calls RecordAndGetPageWithFreeSpace in a tight loop, 
>> with random values.
> 
> What's the distribution of the random values, exactly?  In particular,
> how do the request sizes compare to available free space per-page?

The request, and "old avail" sizes are in the range of 0-8100 
(random()%8100).

> The design intent for FSM was that we'd not bother to record pages that
> have less free space than the average request size, so as to (usually)
> avoid the problem of uselessly searching a lot of entries.  I can't tell
> whether your test case models that behavior at all.  If it does then
> there may be something else that needs fixing.

Probably not. The test case starts with a table that's practically 
empty, so all pages are put into the FSM.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: New FSM patch

From
Zdenek Kotala
Date:
Hi Heikki,

I'm still work on performance test, but I have following 
comments/questions/suggestion:

1)
#define NodesPerPage (BLCKSZ - SizeOfPageHeaderData - offsetof(FSMPageData, 
fp_nodes))

should be

#define NodesPerPage (BLCKSZ - MAXALIGN(SizeOfPageHeaderData) - 
offsetof(FSMPageData, fp_nodes))

See how PageGetContents is defined

2) I suggest to renema following functions:

GetParentNo -> FSMGetParentPageNo
GetChildNo ->  FSMGetChildPageNo
GetFSMBlockNumber -> FSMGetBlockNumber


3) I'm not happy much with idea that page contains data and they are 
"invisible". special, lower or upper is unset. It seems like empty page. I know 
that it is used in hash index implementation as well, but it could be fixed too.

I suggest to set special and upper correctly (or only upper). lower should 
indicate that there are not linp.

4) I suggest to create structure

struct foo {    int level;int logpageno;int slot;
}

5) I see potential infinite recursive loop in fsm_search.

6) Does FANOUT^4 fit into int? (by the way what FANOUT means?)


And there are more comments on rcodereview:

pgsql/src/backend/catalog/index.c
<http://reviewdemo.postgresql.org/r/27/#comment45>
    Strange comment? Looks like copy paste error.



pgsql/src/backend/catalog/index.c
<http://reviewdemo.postgresql.org/r/27/#comment47>
    ?RELKIND_INDEX?



pgsql/src/backend/storage/buffer/bufmgr.c
<http://reviewdemo.postgresql.org/r/27/#comment40>
    Extra space



pgsql/src/backend/storage/buffer/bufmgr.c
<http://reviewdemo.postgresql.org/r/27/#comment41>
    Extra space



pgsql/src/backend/storage/buffer/bufmgr.c
<http://reviewdemo.postgresql.org/r/27/#comment42>
    Extra space



pgsql/src/backend/storage/buffer/bufmgr.c
<http://reviewdemo.postgresql.org/r/27/#comment43>
    Extra space



pgsql/src/backend/storage/buffer/bufmgr.c
<http://reviewdemo.postgresql.org/r/27/#comment44>
    Extra space



pgsql/src/backend/storage/freespace/freespace.c
<http://reviewdemo.postgresql.org/r/27/#comment37>
    Use shift, however compileer could optimize it anyway.



pgsql/src/backend/storage/freespace/freespace.c
<http://reviewdemo.postgresql.org/r/27/#comment38>
    Why?  ;-)



pgsql/src/backend/storage/freespace/freespace.c
<http://reviewdemo.postgresql.org/r/27/#comment39>
    What's happen if FSM_FORKNUM does not exist?



pgsql/src/include/storage/bufmgr.h
<http://reviewdemo.postgresql.org/r/27/#comment36>
    Need consolidate - forknum vs blockNum, zeroPage



pgsql/src/include/storage/freespace.h
<http://reviewdemo.postgresql.org/r/27/#comment35>
    Cleanup



pgsql/src/include/storage/lwlock.h
<http://reviewdemo.postgresql.org/r/27/#comment49>
    Maybe better to use RESERVED to preserve lock numbers. It helps to DTrace 
script be more backward compatible.




-- 
Zdenek Kotala              Sun Microsystems
Prague, Czech Republic     http://sun.com/postgresql



Re: New FSM patch

From
Heikki Linnakangas
Date:
Zdenek Kotala wrote:
> I'm still work on performance test, but I have following
> comments/questions/suggestion:

Thanks!

> 1)
> #define NodesPerPage (BLCKSZ - SizeOfPageHeaderData -
> offsetof(FSMPageData, fp_nodes))
>
> should be
>
> #define NodesPerPage (BLCKSZ - MAXALIGN(SizeOfPageHeaderData) -
> offsetof(FSMPageData, fp_nodes))
>
> See how PageGetContents is defined

Yes, good catch.

> 2) I suggest to renema following functions:
>
> GetParentNo -> FSMGetParentPageNo
> GetChildNo ->  FSMGetChildPageNo
> GetFSMBlockNumber -> FSMGetBlockNumber

Well, they're just static functions. But sure, why not.

> 3) I'm not happy much with idea that page contains data and they are
> "invisible". special, lower or upper is unset. It seems like empty page.
> I know that it is used in hash index implementation as well, but it
> could be fixed too.
>
> I suggest to set special and upper correctly (or only upper). lower
> should indicate that there are not linp.

Hmm. In B-tree metapage, pd_lower is set to point to the end of the
struct that's stored there. It's because that allows the xlog code to
skip the unused space there in full page images, but that's not
applicable for FSM pages.

I think I'll fix that so that the data is stored in the special part,
and the special part fills the whole page.

> 4) I suggest to create structure
>
> struct foo {
>     int level;
>     int logpageno;
>     int slot;
> }

Yes, that might be helpful.

> 5) I see potential infinite recursive loop in fsm_search.

Yes, that's quite subtle. The recursion should end eventually, because
whenever we reach a dead-end when we descend the tree, we fix the upper
nodes so that we shouldn't take that dead-end path on the next iteration.

That said, perhaps it would be a good idea to put a counter there and
stop after say a few thousand iterations, just in case.. In any case,
looks like it needs more comments.

I think I'll restructure that into a loop, instead of recursion.

> 6) Does FANOUT^4 fit into int? (by the way what FANOUT means?)

FANOUT is just an alias for LeafNodesPerPage. It's the number of child
pages a non-leaf FSM page has (or can have).

No, FANOUT^4 doesn't fit in int, good catch. Actually, FANOUTPOWERS
table doesn't need to go that far, so that's just a leftover. It only
needs to have DEPTH elements. However, we have the same problem if
DEPTH==3, FANOUT^4 will not fit into int. I put a comment there.
Ideally, the 4th element would be #iffed out, but I couldn't immediately
figure out how to do that.

> And there are more comments on rcodereview:
>
> pgsql/src/backend/catalog/index.c
> <http://reviewdemo.postgresql.org/r/27/#comment45>
>
>     Strange comment? Looks like copy paste error.

That function, setNewRelfilenode() is used for heaps as well, even
though it's in index.c. I'll phrase the comment better..

> pgsql/src/backend/catalog/index.c
> <http://reviewdemo.postgresql.org/r/27/#comment47>
>
>     ?RELKIND_INDEX?

No, that's correct, see above. The FSM is only created for heaps there,
indexes are responsible for creating their own FSM if they need one.
Hash indexes for example don't need one.

> pgsql/src/backend/storage/freespace/freespace.c
> <http://reviewdemo.postgresql.org/r/27/#comment37>
>
>     Use shift, however compileer could optimize it anyway.

I think I'll leave it to the compiler, for the sake of readibility.

> pgsql/src/backend/storage/freespace/freespace.c
> <http://reviewdemo.postgresql.org/r/27/#comment38>
>
>     Why?  ;-)

:-) Comment updated to:
  /* Can't ask for more space than the highest category represents */

> pgsql/src/backend/storage/freespace/freespace.c
> <http://reviewdemo.postgresql.org/r/27/#comment39>
>
>     What's happen if FSM_FORKNUM does not exist?

smgrnblocks() will throw an error, I believe. Users of the FSM are
responsible to create the FSM fork if they need it.

> pgsql/src/include/storage/bufmgr.h
> <http://reviewdemo.postgresql.org/r/27/#comment36>
>
>     Need consolidate - forknum vs blockNum, zeroPage

What do you mean?

> pgsql/src/include/storage/lwlock.h
> <http://reviewdemo.postgresql.org/r/27/#comment49>
>
>     Maybe better to use RESERVED to preserve lock numbers. It helps to
> DTrace script be more backward compatible.

Hmm, each lightweight lock uses a few bytes of shared memory, but I
guess that's insignificant. I'll do that and add a comment explaining
why that's done.


Here's a new patch, updated per your comments.

PS. ReviewBoard seems to be quite nice for pointing out small changes
like that.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Attachment

Re: New FSM patch

From
Zdenek Kotala
Date:
Heikki Linnakangas napsal(a):
> Heikki Linnakangas wrote:

<snip>

> 
> Let me describe this test case first:
> - The test program calls RecordAndGetPageWithFreeSpace in a tight loop, 
> with random values. There's no activity to the heap. In normal usage, 
> the time spent in RecordAndGetWithFreeSpace is minuscule compared to the 
> heap and index updates that cause RecordAndGetWithFreeSpace to be called.
> - WAL was placed on a RAM drive. This is of course not how people set up 
> their database servers, but the point of this test was to measure CPU 
> speed and scalability. The impact of writing extra WAL is significant 
> and needs to be taken into account, but that's a separate test and 
> discussion, and needs to be considered in comparison to the WAL written 
> by heap and index updates.
> 

<snip>

> 
> Another surprise was how badly both implementations scale. On CVS HEAD, 
> I expected the performance to be roughly the same with 1 and 2 clients, 
> because all access to the FSM is serialized on the FreeSpaceLock. But 
> adding the 2nd client not only didn't help, but it actually made the 
> performance much worse than with a single client. Context switching or 
> cache line contention, perhaps? The new FSM implementation shows the 
> same effect, which was an even bigger surprise. At table sizes > 32 MB, 
> the FSM no longer fits on a single FSM page, so I expected almost a 
> linear speed up with bigger table sizes from using multiple clients. 
> That's not happening, and I don't know why. Although, going from 33MB to 
> 333 MB, the performance with 2 clients almost doubles, but it still 
> doesn't exceed that with 1 client.


I tested it with DTrace on Solaris 10 and 8CPUs SPARC machine. I got 
similar result as you. Main problem in your new implementation is 
locking. On small tables where FSM fits on one page clients spend about 
3/4 time to waiting on page lock. On medium tables (2level FSM) then 
InsertWal lock become significant - it takes 1/4 of waiting time. Page 
waiting takes "only" 1/3.

I think the main reason of scalability problem is that locking invokes 
serialization.

Suggestions:

1) remove WAL logging. I think that FSM record should be recovered 
during processing of others WAL records (like insert, update). Probably 
only we need full page write on first modification after checkpoint.

2) break lock - use only share lock for page locking and divide page for 
smaller part for exclusive locking (at least for root page)


However, your test case is too artificial. I'm going to run OLTP 
workload and test it with "real" workload.
    Zdenek







Re: New FSM patch

From
Heikki Linnakangas
Date:
Zdenek Kotala wrote:
> I tested it with DTrace on Solaris 10 and 8CPUs SPARC machine. I got 
> similar result as you. Main problem in your new implementation is 
> locking. On small tables where FSM fits on one page clients spend about 
> 3/4 time to waiting on page lock.

That in itself is not that surprising. The test case does nothing else 
than exercise the FSM, so it's pretty obvious that all the time will be 
spent in the FSM. And they will be serialized on the lock on the single 
FSM page, like they are currently serialized on the FreeSpaceMapLock.

I think it's pretty unlikely we'd run into FSM congestion on a small 
table in real life. If you're (non-HOT) updating a table at such a high 
rate, it's not going to stay small for long.

> On medium tables (2level FSM) then 
> InsertWal lock become significant - it takes 1/4 of waiting time. Page 
> waiting takes "only" 1/3.

That's a more interesting case from scalability point of view.

For the case of a medium size table, we could implement the optimization 
of a "fast-root", similar to B-tree, to speed up the searches. If the 
heap is small enough that not all FSM levels are needed, we could simply 
start searches from the lower levels. That should reduce the page 
locking significantly. However, when searching, the upper levels are 
locked in shared, not exclusive, mode, so it's not clear if that would 
help with that 1/3 that's now spent in page waiting.

> Suggestions:
> 
> 1) remove WAL logging. I think that FSM record should be recovered 
> during processing of others WAL records (like insert, update). Probably 
> only we need full page write on first modification after checkpoint.

Hmm, we don't have the infrastructure to do that, but I guess it's 
doable. In case of a crash, the FSM information after recovery wouldn't 
obviously be up-to-date. And the FSM information in a warm standby would 
also lag behind.

One option would be to put RecordPageWithFreeSpace() calls into 
heap_redo, so that the FSM would be updated based on the WAL records we 
write anyway.

I think we'd still need to WAL log operations that decrease the amount 
of free space on page. Otherwise, after we have partial vacuum, we might 
never revisit a page, and update the FSM, even though there's usable 
space on the page, leaving the space forgotten forever.

And the torn page problem would need to be handled somehow.

> 2) break lock - use only share lock for page locking and divide page for 
> smaller part for exclusive locking (at least for root page)

That sounds difficult. But I don't think it's very likely to have 
congestion on a single FSM page in real life. Unless the root page 
becomes a bottleneck. Can you figure out how much of the lock waits come 
from the upper level and root pages?

> However, your test case is too artificial.

> I'm going to run OLTP 
> workload and test it with "real" workload.

Agreed. It should be possible to emulate the pattern the FSM really sees:

- tables become fuller and fuller, until no more tuples fill. Vacuum 
sweeps through them periodically, adding a lot of free space to each page.
- RecordPageWithFreeSpace is only called when a tuple being inserted 
doesn't fit on the previous page the backend inserted to.

However, even if these artificial tests show that the new implementation 
is X times slower than the old one, the question that they can't answer 
is whether it matters or not. If only say 0.01% of the time was spent in 
FSM, a 10x slowdown would be acceptable. A full-blown OLTP benchmark 
should give some clue on that.

Other important test cases are various bulk insert kind of tests.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: New FSM patch

From
Decibel!
Date:
On Sep 17, 2008, at 9:30 AM, Heikki Linnakangas wrote:
> I think we'd still need to WAL log operations that decrease the  
> amount of free space on page. Otherwise, after we have partial  
> vacuum, we might never revisit a page, and update the FSM, even  
> though there's usable space on the page, leaving the space  
> forgotten forever.


ISTM that it would be better to deal with such corner cases via  
periodic non-partial vacuums, done by something like autovac, and  
probably done with an ever higher-than-normal vacuum_cost_delay  
setting so as to minimize performance. That's likely a lot less  
wasteful than further compounding lock contention for the WAL. Even  
if it does result in more overall IO, you have to trade a *lot* of IO  
to balance out the impact of lock contention.
-- 
Decibel!, aka Jim C. Nasby, Database Architect  decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828



Re: New FSM patch

From
Zdenek Kotala
Date:
Heikki Linnakangas napsal(a):
> Zdenek Kotala wrote:

>> Suggestions:
>>
>> 1) remove WAL logging. I think that FSM record should be recovered 
>> during processing of others WAL records (like insert, update). 
>> Probably only we need full page write on first modification after 
>> checkpoint.
> 
> Hmm, we don't have the infrastructure to do that, but I guess it's 
> doable. In case of a crash, the FSM information after recovery wouldn't 
> obviously be up-to-date. And the FSM information in a warm standby would 
> also lag behind.
> 
> One option would be to put RecordPageWithFreeSpace() calls into 
> heap_redo, so that the FSM would be updated based on the WAL records we 
> write anyway.

Yes, it seems to be a good place.

> I think we'd still need to WAL log operations that decrease the amount 
> of free space on page. Otherwise, after we have partial vacuum, we might 
> never revisit a page, and update the FSM, even though there's usable 
> space on the page, leaving the space forgotten forever.

I think, if you update actual free space on each page which is recorded in the 
WAL then you should have actual FSM. Something like this:

RecordPageWithFreeSpace(PageGetFreeSpace(..))

<snip>

> 
> Other important test cases are various bulk insert kind of tests.

Parallel bulkload could be problem. Another problem could be CREATE TEMP TABLE 
command which updates catalog, which is usually small.
    Zdenek



-- 
Zdenek Kotala              Sun Microsystems
Prague, Czech Republic     http://sun.com/postgresql



Re: New FSM patch

From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> Here's a new patch, updated per your comments.

I did a read-through of the portions of this patch that change the rest
of the system (ie, not the guts of the new FSM itself).  Mostly it looks
pretty nice, but I have a few gripes:


Does smgrimmedsync at the bottom of nbtsort.c need to cover FSM too?
Maybe not, since index's FSM should be empty, but in that case you
still should add a comment saying so.

Likewise for smgrimmedsync in tablecmds.c's copy_relation_data

Grepping for P_NEW suggests that you missed some places where
FreeSpaceMapExtendRel or IndexFreeSpaceMapExtend calls should be added.
In particular GiST/GIN.  (I assume hash indexes still don't use FSM.
I wonder whether it'd be a good idea to get rid of hash bitmap pages
in favor of using FSM?  TODO item, not material for this patch.)

The change in catalog/heap.c invalidates the comment immediately
above it.

In vacuum.c's vac_update_fsm, the outPages counter is now useless.

In vacuumlazy.c, the num_free_pages, max_free_pages, and tot_free_pages
members of LVRelStats are now useless, as is the comment above them.
You need to take out the reporting of tot_free_pages if you are not
going to track it.

I think it's a modularity violation for bufmgr.c to be calling FSM.
Furthermore, it's pretty useless to have RelationTruncate doing
the fixup only for heaps and not indexes.  Please move that out
to the callers instead.

Does smgr.c still need to include storage/freespace.h at all?
Again, I think it would be a modularity violation to have it
calling FSM, now that FSM lives on top of storage.

RESOURCES_FSM needs to be removed from utils/guc_tables.h

The NOTE in the enum ForkNumber declaration was wrong before and
still is.

GetFreeSpaceOnPage() seems a bit misleadingly named; it's not obvious
from the name that it's not giving you the *true* free space on the page
but rather what FSM thinks it is.  Maybe call it something like
GetRecordedFreeSpace().  Also, please do not use the declaration style
that omits parameter names; I think those are important for
documentation/readability.

Doc updates are missing, but you knew that.
        regards, tom lane


Re: New FSM patch

From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> No, FANOUT^4 doesn't fit in int, good catch. Actually, FANOUTPOWERS 
> table doesn't need to go that far, so that's just a leftover. It only 
> needs to have DEPTH elements. However, we have the same problem if 
> DEPTH==3, FANOUT^4 will not fit into int. I put a comment there. 
> Ideally, the 4th element would be #iffed out, but I couldn't immediately 
> figure out how to do that.

This is a "must fix" IMHO --- I don't plan to tolerate a scary compiler
warning ...

BTW, the comment about and computation of DEPTH are wrong anyway.
We support up to 2^32-1 pages, so I think the cutoff should be 1626.

I did a bit of testing and immediately got an Assert failure:

regression=# create table foo as select x from generate_series(1,100000) x;
SELECT
regression=# create index fooi on foo(x);
CREATE INDEX
regression=# delete from foo;
DELETE 100000
regression=# vacuum foo;
VACUUM
regression=# vacuum foo;
server closed the connection unexpectedly       This probably means the server terminated abnormally       before or
whileprocessing the request.
 
The connection to the server was lost. Attempting reset: Failed.

The reason is that the Assert in FSM_CATEGORY_AVAIL is failing:
TRAP: FailedAssertion("!(x < 8192)", File: "freespace.c", Line: 46)
LOG:  server process (PID 17691) was terminated by signal 6: Aborted

because RecordFreeIndexPage passes in BLCKSZ which is an illegal
value.  Maybe use BLCKSZ-1 instead?

The scary part of that is that it gets through the regression tests ---
doesn't leave one with a warm feeling about how much of VACUUM gets
exercised by regression.

I take it the comment at the top of indexfsm.c about using one bit per
page should be recast as a possible future improvement?
        regards, tom lane


Re: New FSM patch

From
Tom Lane
Date:
Zdenek Kotala <Zdenek.Kotala@Sun.COM> writes:
>>> 1) remove WAL logging. I think that FSM record should be recovered 
>>> during processing of others WAL records (like insert, update). 

Why are we WAL-logging FSM operations at all?  It's only a hint.

>> I think we'd still need to WAL log operations that decrease the amount 
>> of free space on page. Otherwise, after we have partial vacuum, we might 
>> never revisit a page, and update the FSM, even though there's usable 
>> space on the page, leaving the space forgotten forever.

Well, it'd be recovered eventually since sooner or later we'd have to
visit the page for tuple freezing purposes.  I'm not that worried about
losing space on individual pages anyway.  A more serious issue would be
if corruption of an upper-level FSM page caused a large swath of the
table to be effectively "forgotten", because the upper page had too
small a value in its entry.  But wouldn't this be more or less
self-repairing?  Assuming that the underlying table is active (if it
isn't you probably haven't got free space in it anyway) then once VACUUM
records free space on any page in the lost range, that would bubble up.

I think we could give serious consideration to not WAL-logging FSM,
with maybe a tweak here or there to improve the odds of self-repair.
        regards, tom lane


Re: New FSM patch

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>> Here's a new patch, updated per your comments.
> 
> I did a read-through of the portions of this patch that change the rest
> of the system (ie, not the guts of the new FSM itself).  Mostly it looks
> pretty nice, but I have a few gripes:

Thanks. It's probably not worthwhile to dig too deep into the FSM 
internals, until the performance problem is solved.

> Does smgrimmedsync at the bottom of nbtsort.c need to cover FSM too?
> Maybe not, since index's FSM should be empty, but in that case you
> still should add a comment saying so.

Right, the FSM should be empty at that point, it's extended in btbuild 
after that. nbtsort.c isn't concerned about FSM at all. I can add a 
comment on that.

> Likewise for smgrimmedsync in tablecmds.c's copy_relation_data

Well, no, copy_relation_data() copies and syncs only one fork at a time. 
Hmm, perhaps it should be renamed to reflect that, copy_fork() might be 
more appropriate.

> Grepping for P_NEW suggests that you missed some places where
> FreeSpaceMapExtendRel or IndexFreeSpaceMapExtend calls should be added.
> In particular GiST/GIN. 

Hmm, actually I think there's a problem with this approach to extending. 
If we crash after extending the file, but before the FSM extension has 
been WAL-logged, the next time the relation is vacuumed, vacuum will try 
to mark the page that FSM doesn't know about as free, and 
RecordPageWithFreeSpace doesn't like that.

I think we'll have to chance that rule so that the  FSM is extended 
automatically when RecordPageWithFreeSpace() is called on a page that's 
not in the FSM yet. That way we also won't need to remember to extend 
the FSM whenever the relation is extended.

That's easy to do by always calling FreeSpaceMapExtendRel from 
RecordPageWithFreeSpace(), but I'm afraid of the performance implication 
of that. Perhaps we could store the previously observed size of the FSM 
in RelationData, and only try to extend it when we get a request for a 
page beyond that. I think we'll then need locking to avoid extending the 
FSM from two backends at the same time, though, I'm relying on the 
extension lock of the main relation at the moment.

> The change in catalog/heap.c invalidates the comment immediately
> above it.

Thanks. I'm actually a bit unhappy about creating the FSM fork there.

> In vacuumlazy.c, the num_free_pages, max_free_pages, and tot_free_pages
> members of LVRelStats are now useless, as is the comment above them.
> You need to take out the reporting of tot_free_pages if you are not
> going to track it.

I took them all out now, though it would be nice to keep tracking it.

> Does smgr.c still need to include storage/freespace.h at all?
> Again, I think it would be a modularity violation to have it
> calling FSM, now that FSM lives on top of storage.

Hmm, seems to work fine without it.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: New FSM patch

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> I did a bit of testing and immediately got an Assert failure:
> 
> ...
> 
> The scary part of that is that it gets through the regression tests ---
> doesn't leave one with a warm feeling about how much of VACUUM gets
> exercised by regression.

Ouch..

> I take it the comment at the top of indexfsm.c about using one bit per
> page should be recast as a possible future improvement?

Yep. I also noted that the code in GetIndexFreeSpace() didn't match the 
comment above it.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: New FSM patch

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> Zdenek Kotala <Zdenek.Kotala@Sun.COM> writes:
>>>> 1) remove WAL logging. I think that FSM record should be recovered 
>>>> during processing of others WAL records (like insert, update). 
> 
> Why are we WAL-logging FSM operations at all?  It's only a hint.

- to ensure self-consistency of the tree, so that if an upper-level page 
says there's no free space on pages in range X-Z, there really isn't
- to avoid in-page corruption from torn pages
- to have the FSM useful immediately after recovery (warm standby, mainly)

> I think we could give serious consideration to not WAL-logging FSM,
> with maybe a tweak here or there to improve the odds of self-repair.

Yeah, I'm starting to lean towards that option too.

Hmm, we could have a vacuum pass over the FSM, fixing any corruption 
from the torn-page problem, as well as updating the upper-level pages. 
That leaves us just the problem of propagating the FSM information to 
the standby, and that we could handle by updating the FSM in the redo 
functions of heap and indexes.

That sounds like a good idea anyway, but we still haven't actually 
established that the WAL-logging is causing the performance degradation 
Zdenek observed.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: New FSM patch

From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> ... but we still haven't actually 
> established that the WAL-logging is causing the performance degradation 
> Zdenek observed.

Yeah, that's a good point.  I did some simple performance testing on
bulk inserts and updates, and found that while they indeed tended to be
WALInsertLock heavy, the FSM traffic seemed to be only a small part of
it.  Here are some xlog record type counts from a bulk update test:
686555 XLogInsert: rm 10 info 20    HEAP_UPDATE 89117 XLogInsert: rm 10 info 29    HEAP_UPDATE + bkp blk + removable
24526XLogInsert: rm 10 info 25    HEAP_UPDATE + bkp blk + removable  3199 XLogInsert: rm 10 info 2d    HEAP_UPDATE + 2
bkpblks + removable 27676 XLogInsert: rm 7 info 00    FSM_SET_AVAIL    35 XLogInsert: rm 7 info 09    SET_AVAIL + bkp
blk+ removable
 

So either by record count or by volume, the FSM traffic doesn't seem to
be much.  I wonder whether Zdenek knows what the xlog traffic is like
for his test in an unpatched database ...
        regards, tom lane


Re: New FSM patch

From
Tom Lane
Date:
I did some editorializing on the FSM README file, in the line of
familiarizing myself with the contents.  Attached is an updated version.

Here are a couple of other random comments I jotted in the process:

search_avail makes me nervous: in the presence of a corrupt tree
I think it could index off the end of the page.  There needs to be
protection against trying to access a leaf node that doesn't exist.
(The search upward is okay because it cannot "miss" the root, and
you already checked the root is big enough; but a comment about
that wouldn't hurt.)  Also, I think it shouldn't throw a hard error
if the tree is corrupt, but just elog(LOG) and take corrective action.

I'd be happier if "next" were declared as int, as that makes it more
likely to be atomically fetchable/storable than if it's int16.  (On some
platforms a partial-word store is done by read, modify, write of a full
word.)  And could we name it something less generic than "next"?

The functions in fsmpage.c have names that are far too generic to be
exposed as global symbols.  If you don't want to fold that file into
freespace.c and make them static, then you need to rename them.

            regards, tom lane

$PostgreSQL$

Free Space Map
--------------

The purpose of the free space map is to quickly locate a page with enough
free space to hold a tuple to be stored; or to determine that no such page
exists and the relation must be extended by one page.  As of PostgreSQL 8.4
each relation has its own, extensible free space map stored in a separate
"fork" of its relation.  This eliminates the disadvantages of the former
fixed-size FSM.

It is important to keep the map small so that it can be searched rapidly.
Therefore, we don't attempt to record the exact free space on a page.
We allocate one map byte to each page, allowing us to record free space
at a granularity of 1/256th of a page.  Another way to say it is that
the stored value is the free space divided by BLCKSZ/256 (rounding down).
We assume that the free space must always be less than BLCKSZ, since
all pages have some overhead; so the maximum map value is 255.

To assist in fast searching, the map isn't simply an array of per-page
entries, but has a tree structure above those entries.  There is a tree
structure of pages, and a tree structure within each page, as described
below.

FSM page structure
------------------

Within each FSM page, we use a binary tree structure where leaf nodes store
the amount of free space on heap pages (or lower level FSM pages, see
"Higher-level structure" below), with one leaf node per heap page. A non-leaf
node stores the max amount of free space on any of its children.

For example:

    4
 4     2
3 4   0 2    <- This level represents heap pages

We need two basic operations: search and update.

To search for a page with X amount of free space, traverse down the tree
along a path where n >= X, until you hit the bottom. If both children of a
node satisfy the condition, you can pick either one arbitrarily.

To update the amount of free space on a page to X, first update the leaf node
corresponding to the heap page, then "bubble up" the change to upper nodes,
by walking up to each parent and recomputing its value as the max of its
two children.  Repeat until reaching the root or a parent whose value
doesn't change.

This data structure has a couple of nice properties:
- to discover that there is no page with X bytes of free space, you only
  need to look at the root node
- by varying which child to traverse to in the search algorithm, when you have
  a choice, we can implement various strategies, like preferring pages closer
  to a given page, or spreading the load across the table.

Higher-level routines that use FSM pages access them through the set_avail()
and search_avail() functions. The interface to those functions hides the
page's internal tree structure, treating the FSM page as a black box that has
a certain number of "slots" for storing free space information.  (However,
the higher routines have to be aware of the tree structure of the whole map.)

The binary tree is stored on each FSM page as an array. Because the page
header takes some space on a page, the binary tree isn't perfect. That is,
a few right-most leaf nodes are missing, and there are some useless non-leaf
nodes at the right. So the tree looks something like this:

       0
   1       2
 3   4   5   6
7 8 9 A B

where the numbers denote each node's position in the array.  Note that the
tree is guaranteed complete above the leaf level; only some leaf nodes are
missing.  This is reflected in the number of usable "slots" per page not
being an exact power of 2.

A FSM page also has a "next" pointer that determines where to start the next
search for free space within that page.  The reason for that is to spread out
the pages that are returned by FSM searches.  When several backends are
concurrently inserting into a relation, contention can be avoided by having
them insert into different pages.  The FSM is responsible for making that
happen, and the "next" pointer helps provide the desired behavior.

Higher-level structure
----------------------

To scale up the data structure described above beyond a single page, we
maintain a similar tree-structure across pages. Leaf nodes in higher level
pages correspond to lower level FSM pages. The root node within each page
has the same value as the corresponding leaf node on its parent page.

The root page is always stored at physical block 0.

For example, assuming each FSM page can hold information about 4 pages (in
reality, it holds (BLCKSZ - headers) / 2, or ~4000 with default BLCKSZ),
we get a disk layout like this:

 0     <-- page 0 at level 2 (root page)
  0     <-- page 0 at level 1
   0     <-- page 0 at level 0
   1     <-- page 1 at level 0
   2     <-- ...
   3
  1     <-- page 1 at level 1
   4
   5
   6
   7
  2
   8
   9
   10
   11
  3
   12
   13
   14
   15

where the numbers are page numbers *at that level*, starting from 0.

To find the physical block # corresponding to leaf page n, we need to
calculate

(number of preceding leaf pages) + (number of preceding upper level pages).
This turns out to be

y = n + (n / F + 1) + (n / F^2 + 1) + ... + 1

where F is the fanout (4 in the above example).

>From that, you can figure out the formulas for finding a given child page of
an upper-level page, or the parent of a page (XXX: fill in details)

To keep things simple, the tree is always constant height. To cover the
maximum relation size of 2^32 blocks, three levels is enough with the default
BLCKSZ (4000^3 > 2^32).

Addressing
----------

The higher-level routines operate on "logical" addresses, consisting of
- level,
- logical page number, and
- slot (if applicable)

Bottom level FSM pages have level of 0, the level above that 1, and root 2.
As in the diagram above, logical page number is the page number at that level,
starting from 0.

Locking
-------

When traversing down to search for free space, only one page is locked at a
time: the parent page is released before locking the child. If the child page
is concurrently modified, and there no longer is free space on the child page
when you land on it, you need to start from scratch (after correcting the
parent page, so that you don't get into an infinite loop).

When bubbling up a change, the parent page is locked before the lock on the
child is released. This ensures that the value at the root node of the child
page is always in sync with the corresponding leaf node on the parent.

We use shared buffer locks when searching, but exclusive buffer lock when
updating a page.  However, the "next" search pointer is updated during
searches even though we have only a shared lock.  "next" is just a hint and
we can easily reset it if it gets corrupted; so it seems better to accept
some risk of that type than to pay the overhead of exclusive locking.


TODO
----

- fastroot to avoid traversing upper nodes with just 1 child
- use a different system for tables that fit into one FSM page, with a
  mechanism to switch to the real thing as it grows.