Thread: New FSM allocation policy

New FSM allocation policy

From: Heikki Linnakangas
The way my rewritten FSM works at the moment is that pages are handed 
out of the FSM in random order, to spread the load of multiple backends 
across different pages. That's good for spreading the load, but while 
observing a test case that inserts a lot of rows into an almost empty 
table with plenty of free space, it occurred to me that this sucks for 
the case of a single backend populating a table. Currently, the FSM 
hands out pages in sequential order, so from the OS's point of view 
we're reading and writing the pages sequentially. If the pages are 
handed out in random order, we'll do random I/O instead.

Fortunately there's an easy fix for that. If we optimize 
RecordAndGetPageWithFreeSpace so that it will always return the next 
page if it has enough space, we'll be doing sequential I/O again. That's 
trivial as long as the next heap page is on the same FSM page, and 
probably not too hard even if it's not. If we limit this optimization to 
within the same FSM page, we'll effectively be filling 32MB stripes fully.
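
Just to illustrate the shape of it (helper names other than
RecordAndGetPageWithFreeSpace are made up here, and 4096 heap pages per
FSM page is simply the assumption that gives 32MB stripes with 8k blocks):

#define HEAP_PAGES_PER_FSM_PAGE 4096    /* assumed: 4096 * 8kB = 32MB */

BlockNumber
RecordAndGetPageWithFreeSpace(Relation rel, BlockNumber oldPage,
                              Size oldSpaceAvail, Size spaceNeeded)
{
    BlockNumber nextPage = oldPage + 1;

    /* Record the new amount of free space on the page we just filled. */
    RecordPageFreeSpace(rel, oldPage, oldSpaceAvail);

    /*
     * Shortcut: if the next heap page is covered by the same FSM page
     * and has enough space, hand it out; the I/O stays sequential.
     */
    if (nextPage / HEAP_PAGES_PER_FSM_PAGE == oldPage / HEAP_PAGES_PER_FSM_PAGE &&
        GetRecordedFreeSpace(rel, nextPage) >= spaceNeeded)
        return nextPage;

    /* Otherwise fall back to the normal (randomizing) FSM search. */
    return GetPageWithFreeSpace(rel, spaceNeeded);
}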

Thoughts?

I'm running more tests, and there's more issues that need discussion, 
but I'll start separate threads for each. I'll also post an updated 
patch separately.

-- 
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com


Re: New FSM allocation policy

From: Gregory Stark
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

> Fortunately there's an easy fix for that. If we optimize
> RecordAndGetPageWithFreeSpace so that it will always return the next page if it
> has enough space, we'll be doing sequential I/O again. That's trivial as long
> as the next heap page is on the same FSM page, and probably not too hard even
> if it's not. If we limit this optimization to within the same FSM page, we'll
> effectively be filling 32MB stripes fully.

Starting from an arbitrary block, that would be on average a 16MB stripe,
since you would typically land halfway through a 32MB stripe.

One idea: we could scan the rest of the current FSM page and use the first match.

Another: given the way your tree structure works, you can also descend the tree
with a "target" page, finding the first page with enough free space after the
target page if there is one. (Take the left branch if it covers pages past the
target and has enough free space; else take the right branch if it has enough
free space; else fall back to the left branch.)
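
To sketch that in self-contained C (a toy in-memory node layout of my own,
not your actual packed page representation):

#include <stddef.h>

/* One node of a binary max-tree over heap pages; leaves carry pages. */
typedef struct FsmNode
{
    struct FsmNode *left;
    struct FsmNode *right;
    int         page;       /* heap page number (valid at leaves) */
    int         hiPage;     /* highest page number in this subtree */
    int         maxAvail;   /* most free space on any page below */
} FsmNode;

/*
 * Return the first leaf at or after 'target' with at least 'needed'
 * bytes free, or NULL if there is none; the caller can then retry
 * with target 0 to accept any page at all.
 */
static FsmNode *
fsm_search_from(FsmNode *node, int target, int needed)
{
    FsmNode    *hit;

    /* Prune subtrees with too little space or only pages before target. */
    if (node == NULL || node->maxAvail < needed || node->hiPage < target)
        return NULL;
    if (node->left == NULL)
        return node;                /* a qualifying leaf */

    /* Prefer the left (lower-numbered) side to stay close to the target. */
    hit = fsm_search_from(node->left, target, needed);
    return hit ? hit : fsm_search_from(node->right, target, needed);
}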

-- 
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's On-Demand Production Tuning


Re: New FSM allocation policy

From: Tom Lane
Gregory Stark <stark@enterprisedb.com> writes:
> One idea: we could scan the rest of the current FSM page and use the first match.

> Another: given the way your tree structure works, you can also descend the tree
> with a "target" page, finding the first page with enough free space after the
> target page if there is one. (Take the left branch if it covers pages past the
> target and has enough free space; else take the right branch if it has enough
> free space; else fall back to the left branch.)

I think the trick here is how to also preserve the property that
different backends tend to be inserting into different pages.  There may
be no very good way to do that without maintaining some shared state,
ie the last page handed out.

However, it would probably be sufficient to do that for some pretty
small number of tables, allowing a fixed-size shared memory area to
suffice.
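
For concreteness, a rough shape for that shared area (all names and the
slot count are invented, in the spirit of the synchronized-seqscan hints):

#define FSM_LASTPAGE_SLOTS 64       /* assumed small, fixed size */

typedef struct FsmLastPageSlot
{
    Oid         relid;          /* table this slot tracks; InvalidOid if free */
    BlockNumber lastPage;       /* last page handed out for that table */
} FsmLastPageSlot;

typedef struct FsmLastPageTable
{
    slock_t     mutex;          /* spinlock guarding the slots */
    FsmLastPageSlot slots[FSM_LASTPAGE_SLOTS];
} FsmLastPageTable;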
        regards, tom lane


Re: New FSM allocation policy

From: Heikki Linnakangas
Tom Lane wrote:
> Gregory Stark <stark@enterprisedb.com> writes:
>> One idea: we could scan the rest of the current FSM page and use the first match.
> 
>> Another: given the way your tree structure works, you can also descend the tree
>> with a "target" page, finding the first page with enough free space after the
>> target page if there is one. (Take the left branch if it covers pages past the
>> target and has enough free space; else take the right branch if it has enough
>> free space; else fall back to the left branch.)
> 
> I think the trick here is how to also preserve the property that
> different backends tend to be inserting into different pages. 

Yep. If we just always look at the next page, there's the danger of 
having multiple backends compete for the same pages again.

> There may
> be no very good way to do that without maintaining some shared state,
> ie the last page handed out.
> 
> However, it would probably be sufficient to do that for some pretty
> small number of tables, allowing a fixed-size shared memory area to
> suffice.

... similar to how we handle synchronized seqscans. Yep, that's one 
option. I wish we could get rid of the shared memory area and lock 
altogether, though.

We could put an extra field on the FSM root page. Hmm, it doesn't 
actually need to be a single global field, so we could have a field like 
that on every FSM page, for better concurrency. So the algorithm for the 
set+search operation becomes:

1. Lock FSM page for the old heap page X
2. Set the value for X
3. See if the page pointed to by (fsmpage->nextPtr++) has enough free 
space. If it does, return that.
4. Otherwise, start searching from root, with random traversal.
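
In rough C, with made-up fsm_* helpers standing in for the page locking
and slot access, that would look something like:

static BlockNumber
fsm_set_and_search(Relation rel, BlockNumber oldPage,
                   Size oldAvail, Size needed)
{
    /* 1. Lock the FSM page covering the old heap page. */
    FSMPage     fsmpage = fsm_lock_page_for(rel, oldPage);
    BlockNumber candidate;

    /* 2. Record the remaining free space on the old page. */
    fsm_set_avail(fsmpage, oldPage, oldAvail);

    /*
     * 3. Try the page the shared next-pointer designates.  The
     * post-increment makes concurrent backends target successive
     * pages instead of piling onto the same one.
     */
    candidate = fsmpage->nextPtr++;
    if (fsm_get_avail(fsmpage, candidate) >= needed)
    {
        fsm_unlock_page(fsmpage);
        return candidate;
    }
    fsm_unlock_page(fsmpage);

    /* 4. Fall back to a search from the root, with random traversal. */
    return fsm_search_from_root(rel, needed);
}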

-- 
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com


Re: New FSM allocation policy

From: Tom Lane
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> Tom Lane wrote:
>> There may
>> be no very good way to do that without maintaining some shared state,
>> ie the last page handed out.

> We could put an extra field on the FSM root page. Hmm, it doesn't 
> actually need to be a single global field, so we could have a field like 
> that on every FSM page, for better concurrency.

Yeah, that would work, and it is only a hint so you needn't WAL-log
changing it.
        regards, tom lane


Re: New FSM allocation policy

From: Simon Riggs
On Fri, 2008-08-29 at 18:55 +0300, Heikki Linnakangas wrote:
> Tom Lane wrote:
> > Gregory Stark <stark@enterprisedb.com> writes:
> >> One idea: we could scan the rest of the current FSM page and use the first match.
> > 
> >> Another: given the way your tree structure works, you can also descend the tree
> >> with a "target" page, finding the first page with enough free space after the
> >> target page if there is one. (Take the left branch if it covers pages past the
> >> target and has enough free space; else take the right branch if it has enough
> >> free space; else fall back to the left branch.)
> > 
> > I think the trick here is how to also preserve the property that
> > different backends tend to be inserting into different pages. 
> 
> Yep. If we just always look at the next page, there's the danger of 
> having multiple backends compete for the same pages again.

Can the FSM hand out page ranges? That way we would be able to use the
next page logic without fear of competition.

-- 
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support



Re: New FSM allocation policy

From: Heikki Linnakangas
Tom Lane wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>> Tom Lane wrote:
>>> There may
>>> be no very good way to do that without maintaining some shared state,
>>> ie the last page handed out.
>
>> We could put an extra field on the FSM root page. Hmm, it doesn't
>> actually need to be a single global field, so we could have a field like
>> that on every FSM page, for better concurrency.
>
> Yeah, that would work, and it is only a hint so you needn't WAL-log
> changing it.

Done. That seems to have helped a lot with the case of two concurrent
backends bulk-inserting into a table.

I'm now happy with the bulk insert performance and behavior. I've been
using the attached test case to measure that. It uses pgbench to measure
the speed of bulk inserting one million rows, with these variations:
- one client, starting with a freshly-created table (i.e. the FSM is
  empty, and the table is extended)
- same, but with two clients
- one client, starting with a pre-existing but empty table, so all
  requests for free space are satisfied by the FSM and the table is not
  extended
- same, but with two clients

Running that test on my laptop, with data directory on a RAM drive to
measure just the CPU overhead and concurrency, it looks like the patched
version is now on par with the current implementation. Looking at the
pattern in which pages are filled, by polling pg_freespace('tmp') while the
test is running, the pages are being filled sequentially, with the
target pages of the two backends interleaved, which is as expected and
emulates the behavior of the current implementation. I haven't tried
larger I/O-bound tests yet, but because the access pattern is now the
same as without the patch, I would expect the performance to be comparable.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com


Re: New FSM allocation policy

From: Bruce Momjian
Heikki Linnakangas wrote:
> The way my rewritten FSM works at the moment is that pages are handed 
> out of the FSM in random order, to spread the load of multiple backends 
> across different pages. That's good for spreading the load, but while 
> observing a test case that inserts a lot of rows into an almost empty 
> table with plenty of free space, it occurred to me that this sucks for 
> the case of a single backend populating a table. Currently, the FSM 
> hands out pages in sequential order, so from the OS's point of view 
> we're reading and writing the pages sequentially. If the pages are 
> handed out in random order, we'll do random I/O instead.
> 
> Fortunately there's an easy fix for that. If we optimize 
> RecordAndGetPageWithFreeSpace so that it will always return the next 
> page if it has enough space, we'll be doing sequential I/O again. That's 
> trivial as long as the next heap page is on the same FSM page, and 
> probably not too hard even if it's not. If we limit this optimization to 
> within the same FSM page, we'll effectively be filling 32MB stripes fully.
> 
> Thoughts?
> 
> I'm running more tests, and there's more issues that need discussion, 
> but I'll start separate threads for each. I'll also post an updated 
> patch separately.

One other thing to keep in mind is that VACUUM can reduce a table's size
if the trailing blocks are empty, so there is some gain if the earlier
parts of the table are preferred for inserts.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +


Re: New FSM allocation policy

From: Decibel!
On Sep 5, 2008, at 9:43 PM, Bruce Momjian wrote:
>> Fortunately there's an easy fix for that. If we optimize
>> RecordAndGetPageWithFreeSpace so that it will always return the next
>> page if it has enough space, we'll be doing sequential I/O again. That's
>> trivial as long as the next heap page is on the same FSM page, and
>> probably not too hard even if it's not. If we limit this optimization to
>> within the same FSM page, we'll effectively be filling 32MB stripes fully.
>>
>> Thoughts?
>>
>> I'm running more tests, and there's more issues that need discussion,
>> but I'll start separate threads for each. I'll also post an updated
>> patch separately.
>
> One other thing to keep in mind is that VACUUM can reduce a table's size
> if the trailing blocks are empty, so there is some gain if the earlier
> parts of the table are preferred for inserts.


Yeah; I would actually really, really like to see a mode you could set
on a table that says "I want to try and shrink this table". One of the
things that would mean is that the FSM should prefer pages at the
beginning of the heap.

Also related to this is the idea of asking the FSM for pages within a
specific range, so that you can try to maintain cluster order on a
table. You would look in the clustering index for the closest value to
your key and where it is in the heap, and then ask for a page in that
neighborhood. (You'd probably want to look at more than just one index
tuple, but you get the idea.)
-- 
Decibel!, aka Jim C. Nasby, Database Architect  decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828



Re: New FSM allocation policy

From: Heikki Linnakangas
Decibel! wrote:
> On Sep 5, 2008, at 9:43 PM, Bruce Momjian wrote:
>> One other thing to keep in mind is that VACUUM can reduce a table's size
>> if the trailing blocks are empty, so there is some gain if the earlier
>> parts of the table are preferred for inserts.
>
> Yeah; I would actually really, really like to see a mode you could set 
> on a table that says "I want to try and shrink this table". One of the 
> things that would mean is that the FSM should prefer pages at the 
> beginning of the heap.

Not sure how exactly that should work, but it should be pretty easy to 
do with the new FSM data structure. Perhaps VACUUM should just reset the 
"next pointers" as it goes.

> Also related to this is the idea of asking the FSM for pages within a 
> specific range, so that you can try to maintain cluster order on a 
> table. You would look in the clustering index for the closest value to 
> your key and where it is in the heap, and then ask for a page in that 
> neighborhood. (You'd probably want to look at more than just one index 
> tuple, but you get the idea.)

The new FSM data structure should make that much easier to implement as 
well, as it naturally supports the operation "give me a page with X bytes 
of free space, as close as possible to page Y".
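
To illustrate, assuming a hypothetical fsm_search_to() as the mirror
image of the descend-with-target search sketched upthread, the "nearest
page" operation is just two descents and a comparison:

static FsmNode *
fsm_search_near(FsmNode *root, int target, int needed)
{
    /* Closest qualifying pages at-or-after and before the target. */
    FsmNode    *after = fsm_search_from(root, target, needed);
    FsmNode    *before = fsm_search_to(root, target, needed);

    if (after == NULL)
        return before;
    if (before == NULL)
        return after;
    /* Return whichever qualifying page lies closer to the target. */
    return (after->page - target <= target - before->page) ? after : before;
}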

-- 
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com