
Free-space-map management thoughts

From: Tom Lane
I've been thinking about improving the algorithm that the free space map
(FSM) uses to decide what to store when it's not got enough shared
memory to keep track of everything.  The present design uses a dynamically
adjusted threshold for each relation, throwing away pages whose free
space amount is less than the threshold.  This unfortunately seems not
to work as well as I'd hoped when I wrote it :-(.  In particular it
cannot cope effectively with situations where many pages have exactly
the same amount of free space --- it degenerates to all-or-nothing.
This problem has become acute now that btree indexes use the FSM to
keep track of free pages: by definition, all those pages have exactly
the same amount of free space.

I had some off-list discussions last fall with Steve Marshall, who
was trying to improve the thresholding algorithm to work better, but
what he ended up with seemed to me to have a lot of ad-hoc logic and
magic numbers in it.  So I wasn't real satisfied with that.

This is a request for comments about the following redesign:

1. Continue to keep track of the average request size seen by
GetPageWithFreeSpace(), but make it a pure request-size average; don't
muck it up with thresholding adjustments.  Add an entry point to make
the average request size for a relation available to callers.  VACUUM
can use this as a starting point for its internal decisions about which
pages are even worth reporting to the FSM.

2. Eliminate "retail" addition of FSM entries.  RecordFreeSpace() isn't
being used anyway, and we can restrict RecordAndGetPageWithFreeSpace()
to only update existing entries, not make new ones.  Only wholesale
replacement of a relation's FSM data, via MultiRecordFreeSpace(), need
be supported as a way of adding page entries to FSM.

3. With the above two changes, the numbers of pages passed to the FSM
by MultiRecordFreeSpace() calls become useful statistics in themselves.
We can keep track of those numbers in the FSM's per-relation statistics
independently of the number of pages actually stored in FSM.

4. In particular, the sum of the MultiRecordFreeSpace (henceforth MRFS)
page counts gives us the total number of pages we would *like* to keep
track of; the ratio of this number to the actual allocated max_fsm_pages
is our "oversubscription ratio".  (One thing we can do in passing is
make VACUUM VERBOSE print these numbers, so that people finally have
some intelligent way of adjusting their FSM parameters.)

5. When FSM space is oversubscribed, we can divide each relation's MRFS
requested page count by the oversubscription ratio to arrive at an exact
target page count for each relation.  This page count is stable as long
as the space-allocation picture isn't changing much, which is a big
improvement over the existing inherently history-dependent thresholding
algorithm.

6. A reasonably effective way to reduce a relation's stored page list
to the target (or select out the pages to actually remember from an MRFS
request) is as follows:
  * Prescan the page data to compute a histogram of available-space
    values, with maybe 32 bins.
  * Find the histogram bin whose inclusion would make us exceed the target
    page count.  Set thresholdL = its lower edge value, thresholdU = its
    upper edge value, and bincount = target page count minus sum of counts
    in higher histogram bins.
  * Scan pages.  Keep all those >= thresholdU; drop those < thresholdL;
    of the pages between the thresholds, keep the first bincount entries.
  This approach will drop some pages with more free space while keeping
  some with less --- but the difference between those dropped and those
  kept is no more than the bin width.
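
Roughly, the selection could look like the following minimal C sketch.
The FSMPageEntry type and the flat in-memory array are illustrative
assumptions; the real FSM keeps chunked structures in shared memory:

    #define HIST_BINS 32

    /* Hypothetical entry: block number plus free bytes on that page. */
    typedef struct
    {
        unsigned    page;           /* block number */
        unsigned    avail;          /* free bytes on page */
    } FSMPageEntry;

    /*
     * Reduce entries[0..n-1] to at most target pages, preferring pages
     * with more free space.  maxAvail is the largest avail value seen.
     * Kept entries are packed at the front; returns the number kept.
     */
    static int
    prune_to_target(FSMPageEntry *entries, int n, int target,
                    unsigned maxAvail)
    {
        int         hist[HIST_BINS] = {0};
        unsigned    binWidth = maxAvail / HIST_BINS + 1;
        unsigned    thresholdL,
                    thresholdU;
        int         i, bin, keptAbove, bincount, nkept;

        if (n <= target)
            return n;           /* not oversubscribed for this rel */

        /* Pass 1: histogram of available-space values */
        for (i = 0; i < n; i++)
            hist[entries[i].avail / binWidth]++;

        /* Find the bin whose inclusion would exceed the target,
         * scanning from most free space down. */
        keptAbove = 0;
        for (bin = HIST_BINS - 1; bin > 0; bin--)
        {
            if (keptAbove + hist[bin] > target)
                break;
            keptAbove += hist[bin];
        }
        thresholdL = bin * binWidth;        /* lower edge of boundary bin */
        thresholdU = thresholdL + binWidth; /* upper edge */
        bincount = target - keptAbove;      /* boundary-bin pages to keep */

        /* Pass 2: keep all >= thresholdU, drop all < thresholdL, and
         * keep the first bincount pages in between. */
        nkept = 0;
        for (i = 0; i < n; i++)
        {
            if (entries[i].avail >= thresholdU)
                entries[nkept++] = entries[i];
            else if (entries[i].avail >= thresholdL && bincount > 0)
            {
                entries[nkept++] = entries[i];
                bincount--;
            }
        }
        return nkept;
    }

Two passes over the page data, a constant-size histogram, and no sorting;
the only imprecision is within the boundary bin, i.e. one bin width.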

This algorithm still requires some heuristics around the edges --- in
particular, in step 5 we probably don't want a purely linear division of
available space, but should allow some minimum number of pages for each
relation before divvying up the leftover space.  But it doesn't seem to
need nearly as much ad-hoc tuning as the thresholding approach does.
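
As a concrete (and deliberately simplistic) sketch of that division ---
with names like requested[] and min_per_rel being assumptions rather than
actual FSM code, and the corner case of the minimums alone exceeding
max_fsm_pages glossed over:

    /*
     * requested[i] is the page count from relation i's latest
     * MultiRecordFreeSpace call; target[i] receives its FSM allotment.
     */
    static void
    compute_targets(int nrels, const double *requested, double *target,
                    double max_fsm_pages, double min_per_rel)
    {
        double      totalReq = 0.0;
        double      totalFloor = 0.0;
        double      ratio;
        int         i;

        for (i = 0; i < nrels; i++)
            totalReq += requested[i];

        if (totalReq <= max_fsm_pages)
        {
            for (i = 0; i < nrels; i++)
                target[i] = requested[i];   /* not oversubscribed */
            return;
        }

        /* reserve each relation's minimum allotment first */
        for (i = 0; i < nrels; i++)
            totalFloor += (requested[i] < min_per_rel) ?
                requested[i] : min_per_rel;

        /* divide the leftover space by the residual oversubscription ratio */
        ratio = (totalReq - totalFloor) / (max_fsm_pages - totalFloor);
        for (i = 0; i < nrels; i++)
        {
            double  flr = (requested[i] < min_per_rel) ?
                requested[i] : min_per_rel;

            target[i] = flr + (requested[i] - flr) / ratio;
        }
    }

For instance, with max_fsm_pages = 10000 and MRFS requests summing to
40000 pages, the oversubscription ratio is 4; ignoring the floors, a
relation that reported 2000 pages would be told to keep about 500 of them.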

One thing this does not do that the original code tries to do is give
preference to more-heavily-used relations --- in this approach, each
table's share of FSM is determined only by its number of pages with
free space.  However, one could argue that that is an indirect measure
of activity, since it certainly reflects past deletion/update activity.
So it may be okay not to give any explicit preference to active tables.

Comments anyone?
        regards, tom lane

PS: Another idea I'm toying with is to dump out the FSM contents at
postmaster shutdown and reload them at restart, so that the FSM doesn't
have to start from ground zero on every restart cycle.  But that's
independent of the management algorithm...
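
A bare-bones version of that dump could just serialize the page entries
to a flat file with a version word up front, reloading symmetrically at
startup.  Purely a sketch with a hypothetical record layout, ignoring
fsync and crash-safety questions:

    #include <stdio.h>

    /* Hypothetical on-disk record: one FSM page entry. */
    typedef struct
    {
        unsigned    relNode;        /* relation identifier */
        unsigned    page;           /* block number */
        unsigned    avail;          /* free bytes */
    } FsmDumpRec;

    /* Write FSM contents to "path" at postmaster shutdown; 0 on success. */
    static int
    fsm_dump(const char *path, const FsmDumpRec *recs, size_t n)
    {
        FILE       *fp = fopen(path, "wb");
        unsigned    version = 1;    /* bump if the layout changes */

        if (fp == NULL)
            return -1;
        if (fwrite(&version, sizeof(version), 1, fp) != 1 ||
            fwrite(recs, sizeof(FsmDumpRec), n, fp) != n)
        {
            fclose(fp);
            return -1;
        }
        return fclose(fp);
    }

At reload, a version mismatch or a short read would simply mean starting
with an empty FSM, same as today.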


Re: Free-space-map management thoughts

From: "Matthew T. O'Connor"
> PS: Another idea I'm toying with is to dump out the FSM contents at
> postmaster shutdown and reload them at restart, so that the FSM doesn't
> have to start from ground zero on every restart cycle.  But that's
> independent of the management algorithm...

Correct me if I'm wrong, but the FSM is only populated by vacuum, so there
is no FSM information for any given table / database until it's vacuumed.
In a long-running production environment this may not be that important,
but it could result in a large increase in file size any time the database
is restarted.

I think this change, while independent of your proposal, is important.



Re: Free-space-map management thoughts

From: Tom Lane
"Matthew T. O'Connor" <matthew@zeut.net> writes:
> Correct me if I'm wrong, but the FSM is only populated by vacuum, so there
> is no FSM information for any given table / database until it's vacuumed, in
> a long running production enviornment this may not be that important, but it
> could result in a large increase in file size any time the database is
> restarted.

Right.  The original assumption was that people don't restart production
servers, so startup-transient behavior isn't very important.  But that's
obviously not a great assumption.  It seemed expedient at the time
(partly because I wasn't sure FSM would fly at all) --- but now it's
time to go back and fill in the holes.
        regards, tom lane


Re: Free-space-map management thoughts

From: Tom Lane
Stephen Marshall <smarshall@wsi.com> writes:
> 1. When the FSM is oversubscribed and one is trying to decide which 
> pages to keep, remember that page info is stored in groups of
> CHUNK_SIZE pages, where CHUNK_SIZE is currently 32.

Right, oversubscription would actually need to be measured in chunks
not single pages.

> 2. The histogram concept is a neat idea, but I think some reorganization 
> of the page information might make it unnecessary.  Currently the FSM 
> pages are sorted by BlockNumber.  This was particularly useful for 
> adding information about a single page, but since that interface is no 
> longer to be supported, perhaps the decision to sort by BlockNumber 
> should also be revisited.  

I was thinking about that, but we do still need to handle
RecordAndGetFreeSpace --- in fact that should be the most common
operation.  The histogram approximation seems an okay price to pay for
not slowing down RecordAndGetFreeSpace.  If you wanted to depend on
the ordering-by-free-space property to any large extent,
RecordAndGetFreeSpace would actually have to move the old page down in
the list after adjusting its free space :-(

> If we sort the page info by available space, we could then use binary 
> search to find space thresholds when we are handling oversubscription. 

The list-of-chunks storage layout provides only limited traction for
searching anyway, and none at all for a binary search AFAICS.  I toyed
with smarter data structures such as hashes or btrees, but couldn't
convince myself that the extra space would be justified.

> Am I missing something that requires the FSM to be ordered by block number?

We could doubtless make it work either way, but I think optimizing
VACUUM at the expense of RecordAndGetFreeSpace is probably not the way
to go.

Another factor to consider is that the round-robin algorithm for handing
out pages during GetFreeSpace would behave considerably differently if
the block list is sorted by free space not block number.  I'm not sure
offhand that it would be worse, but we'd have to think about the
consequences.
        regards, tom lane


Re: Free-space-map management thoughts

From: Robert Treat
On Wed, 2003-02-26 at 15:54, Tom Lane wrote:
<snip>
> 
> Comments anyone?
> 

Now that indexes are getting some reporting, my understanding is that an
index would report fewer pages overall than its associated table, but
those pages would be completely empty.  However, given that indexes don't
report non-empty pages, the percentage of freeable space to total space
would be unfairly low (if I'm right in thinking that the backend will
assume that non-reported pages don't have empty space in them).  This
would tend to hurt index management even though an index's pages are the
best candidates for removal (100% empty).  Is this a valid concern, or am
I misreading something?

Robert Treat  




Re: Free-space-map management thoughts

From: Tom Lane
Robert Treat <xzilla@users.sourceforge.net> writes:
> Now that indexes are getting some reporting, my understanding is that an
> index would report fewer pages overall than its associated table, but
> those pages would be completely empty.  However, given that indexes don't
> report non-empty pages, the percentage of freeable space to total space
> would be unfairly low (if I'm right in thinking that the backend will
> assume that non-reported pages don't have empty space in them).  This
> would tend to hurt index management even though an index's pages are the
> best candidates for removal (100% empty).  Is this a valid concern, or am
> I misreading something?

I'm not following your point... across relations, the proposed scheme
only considers numbers of pages, not how much space is believed free in
each such page.  If anything I suspect it would over-favor the indexes.
        regards, tom lane


Re: Free-space-map management thoughts

From: Robert Treat
On Thu, 2003-02-27 at 11:00, Tom Lane wrote:
> Robert Treat <xzilla@users.sourceforge.net> writes:
> > Now that indexes are getting some reporting, my understanding is that
> > an index would report fewer pages overall than its associated table,
> > but those pages would be completely empty.  However, given that
> > indexes don't report non-empty pages, the percentage of freeable space
> > to total space would be unfairly low (if I'm right in thinking that
> > the backend will assume that non-reported pages don't have empty space
> > in them).  This would tend to hurt index management even though an
> > index's pages are the best candidates for removal (100% empty).  Is
> > this a valid concern, or am I misreading something?
> 
> I'm not following your point... across relations, the proposed scheme
> only considers numbers of pages, not how much space is believed free in
> each such page.  If anything I suspect it would over-favor the indexes.
> 

I think I was thinking that a given table will always report more pages
than an index on that table, since tables can report 50% empty pages
while indexes only report 100% empty pages. This would cause tables to
generally be favored over indexes, even though the index pages have the
most to gain.

Robert Treat  



Re: Free-space-map management thoughts

From: Tom Lane
Robert Treat <xzilla@users.sourceforge.net> writes:
> I think I was thinking that a given table will always report more pages
> than an index on that table, since tables can report 50% empty pages
> while indexes only report 100% empty pages. This would cause tables to
> generally be favored over indexes, even though the index pages have the
> most to gain.

But the other side of that coin is that the table needs the FSM service
more: it will be accepting larger insertions and trying to stuff them
into pages that aren't wholly empty.  So it just naturally needs to keep
track of more free pages.  (This argument is rigorously true if you
compare a table with its own index; it's probably qualitatively okay
when considering unrelated indexes.)  So like I said, I'm really more
concerned that the indexes may be over-favored.

Also keep in mind that, as Steve pointed out, we'll really be allocating
space on a chunk basis not a page basis.  For an index there is no need
to store free-space-per-page at all; this means we could fit 48 page
numbers into the same size chunk that normally holds 32 page numbers and
32 free space counts.  (I'm not sure I will do this, but I will if it
can be done without uglifying the code excessively.)  So if we allocate
space on a proportionate-number-of-chunks basis, the indexes should get
some additional win there.
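
(To spell out the arithmetic under one plausible layout --- a 6-byte
packed entry, 4-byte block number plus 2-byte free-space word, which is
an assumption about the chunk format rather than a quote of the code:

    /* 6 bytes per heap-page entry, assuming no padding */
    typedef struct
    {
        unsigned char blkid[4];     /* block number */
        unsigned char avail[2];     /* free space, scaled to 16 bits */
    } HeapChunkEntry;

    /*
     * One chunk: 32 heap entries = 32 * (4 + 2) = 192 bytes.
     * Index pages need no free-space word, so the same 192 bytes
     * hold 192 / 4 = 48 bare 4-byte block numbers.
     */

Any layout with the same 3:2 ratio of entry sizes gives the same
32-to-48 win.)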

We could throw in a fudge-factor multiplier that discriminates for or
against indexes while choosing how much space to give them.  Without any
clear idea where to set it, I'm not eager to do so --- but that'd provide
a solution if it becomes apparent that one side or the other is being
favored too much.
        regards, tom lane


Re: Free-space-map management thoughts

From: Tom Lane
I wrote:
> Stephen Marshall <smarshall@wsi.com> writes:
>> If we sort the page info by available space, we could then use binary 
>> search to find space thresholds when we are handling oversubscription. 

> The list-of-chunks storage layout provides only limited traction for
> searching anyway, and none at all for a binary search AFAICS.  I toyed
> with smarter data structures such as hashes or btrees, but couldn't
> convince myself that the extra space would be justified.

Here's a possibly off-the-wall idea: maybe the list-of-chunks
representation is not too simple, but too complex.  Suppose we stored
all the FSM page data as one big array (of ItemPointerData elements).
Any given relation would own a section of this array in which its page
data is sorted in page-number order.  Then RecordAndGetFreeSpace could
use a binary search to locate the old page it needs to update.

This would be noticeably faster as far as lookup operations go, but the
Achilles' heel would be inserting new data: in general you'd need to
push stuff around in the page array to make room.  Given fast memcpy()
this might not be too bad though.  Alternatively, the data-copying could
be combined with the scan we'd likely be making to remove undersized
pages.
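
A sketch of the lookup under that scheme, assuming a flattened
FSMPageEntry array (illustrative, not the actual structures) with each
relation owning a slice sorted by block number:

    #include <stdlib.h>

    typedef struct
    {
        unsigned    page;           /* block number */
        unsigned    avail;          /* free bytes */
    } FSMPageEntry;

    static int
    entry_cmp(const void *a, const void *b)
    {
        unsigned    pa = ((const FSMPageEntry *) a)->page;
        unsigned    pb = ((const FSMPageEntry *) b)->page;

        return (pa > pb) - (pa < pb);
    }

    /*
     * Find the entry for "page" in a relation's slice of the big array,
     * or NULL if the page isn't tracked.  O(log n) per lookup, which is
     * what RecordAndGetFreeSpace would use to update an existing entry.
     */
    static FSMPageEntry *
    fsm_lookup(FSMPageEntry *slice, size_t n, unsigned page)
    {
        FSMPageEntry key;

        key.page = page;
        key.avail = 0;              /* not used by the comparison */
        return (FSMPageEntry *) bsearch(&key, slice, n,
                                        sizeof(FSMPageEntry), entry_cmp);
    }

The cost shows up on insert: making room for a new page means memmoving
the tail of the relation's slice (and possibly its neighbors'), which is
the Achilles' heel noted above.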

Anyone like this idea?  Or should I leave well enough alone?
        regards, tom lane


Re: Free-space-map management thoughts

From: Tom Lane
Stephen Marshall <smarshall@wsi.com> writes:
> If I understand the concept correctly, the histogram will only be 
> calculated when MultiRecordFreeSpace is called AND the FSM is 
> oversubscribed.  However, when it is called, we will need to calculate a 
> histogram for, and potentially trim data from, all relations that have 
> entries in the FSM.  

Right --- at least for calls where the particular relation's request has
gone up from before.  We could skip examining the other rels when the
requested number of pages is the same or less.

> When vacuuming the entire database, we will end up with an N-squared 
> loop where we iterate over all the relations in vacuum, and iterate over 
> them again in each call to MultiRecordFreeSpace that occurs within each 
> vacuum.

Hmm ... good point.  I was thinking it was linear, but that's per
MultiRecordFreeSpace call.  The total work involved for a whole-database
vacuum does look like O(N^2).  Is there a way around that?

It may be that this isn't really a problem; the behavior of the existing
code could be characterized the same way.  (In fact I think it could be
worse than O(N^2) depending on how often acquire_free_space ends up
getting called.)  But it's something to think about.

> In any event, I don't really think this is a problem, just something to 
> pay attention to.  It also highlights the need to make the histogram 
> calculation and free space adjustment as efficient as possible.

Yeah.  We need that anyway since we'll be doing it with the FSM lock
held.  (I was trying to think of ways to avoid holding the lock
throughout MultiRecordFreeSpace, but came up dry.)
        regards, tom lane


Re: Free-space-map management thoughts

From: Stephen Marshall
Tom Lane wrote:
> Stephen Marshall <smarshall@wsi.com> writes:
>> 2. The histogram concept is a neat idea, but I think some reorganization
>> of the page information might make it unnecessary.  Currently the FSM
>> pages are sorted by BlockNumber.  This was particularly useful for
>> adding information about a single page, but since that interface is no
>> longer to be supported, perhaps the decision to sort by BlockNumber
>> should also be revisited.
>
> I was thinking about that, but we do still need to handle
> RecordAndGetFreeSpace --- in fact that should be the most common
> operation.  The histogram approximation seems an okay price to pay for
> not slowing down RecordAndGetFreeSpace.  If you wanted to depend on
> the ordering-by-free-space property to any large extent,
> RecordAndGetFreeSpace would actually have to move the old page down in
> the list after adjusting its free space :-(

I hadn't considered the needs of RecordAndGetFreeSpace.  It is called so
much more than MultiRecordFreeSpace that it makes much better sense to
optimize it, and hence organize the page information by BlockNumber.

I think you just sold me on the histogram idea :)  but I still have some
thoughts about its behavior in the oversubscribed state.

If I understand the concept correctly, the histogram will only be
calculated when MultiRecordFreeSpace is called AND the FSM is
oversubscribed.  However, when it is called, we will need to calculate a
histogram for, and potentially trim data from, all relations that have
entries in the FSM.

When vacuuming the entire database, we will end up with an N-squared loop
where we iterate over all the relations in vacuum, and iterate over them
again in each call to MultiRecordFreeSpace that occurs within each vacuum.
If each relation consistently requests the storage of the same amount of
page info during each vacuum, the extra work of this N-squared loop will
probably disappear after the system settles into an equilibrium, but
inconsistent requests could cause more oscillations in the free space
adjustment.

Do I understand how this will work properly, or did I miss something?

In any event, I don't really think this is a problem, just something to
pay attention to.  It also highlights the need to make the histogram
calculation and free space adjustment as efficient as possible.

By the way, I think your other suggestions are great (e.g. changes to the
public API, maintaining more internal statistics, reporting more info in
VACUUM VERBOSE, ensuring that a minimum amount of free space info is
retained for all relations).  I think this will be a nice improvement to
how postgres reclaims disk space.

Re: Free-space-map management thoughts

From: Stephen Marshall
Tom,

I'm happy to see your attentions turning back to the FSM.  I like the 
design, but I do have a few suggestions, particularly about how to 
handle oversubscription of the FSM.

1. When the FSM is oversubscribed and one is trying to decide which 
pages to keep, remember that page info is stored in groups of
CHUNK_SIZE pages, where CHUNK_SIZE is currently 32.  Thus, if you need to
store info for 1 page, you have already committed space in the FSM for 
CHUNK_SIZE pages, so you might as well fill that chunk with valid page 
information.

Such logic is not needed just for optimization, but also to prevent the 
oversubscription logic from trying to use more chunks than the FSM has.

2. The histogram concept is a neat idea, but I think some reorganization 
of the page information might make it unnecessary.  Currently the FSM 
pages are sorted by BlockNumber.  This was particularly useful for 
adding information about a single page, but since that interface is no 
longer to be supported, perhaps the decision to sort by BlockNumber 
should also be revisited.  

If we sort the page info by available space, we could then use binary
search to find space thresholds when we are handling oversubscription.
I think this would be both faster and more exact than the histogram
approach.

Sorting by available space would make the smgr code a bit less
efficient, as we would not be able to use binary search to skip to the 
min block number provided in MultiRecordFreeSpace.  However, lazy 
vacuuming would be more efficient, as this function starts with pages 
ordered by available space, then sorts the page info by block number 
just prior to the call of MultiRecordFreeSpace.

Am I missing something that requires the FSM to be ordered by block number?

Yours,
Steve Marshall