Thread: New FSM allocation policy
The way my rewritten FSM works at the moment is that pages are handed out of
the FSM in random order, to spread the load of multiple backends to different
pages. That's good for spreading the load, but it occurred to me, while
observing a test case that inserts a lot of rows into an almost empty table
with plenty of free space, that it sucks for the case of a single backend
populating a table. Currently, the FSM hands out pages in sequential order, so
from the OS point of view we're reading and writing the pages sequentially. If
the pages are handed out in random order, we'll do random I/O instead.

Fortunately there's an easy fix for that. If we optimize
RecordAndGetPageWithFreeSpace so that it always returns the next page if it
has enough space, we'll be doing sequential I/O again. That's trivial as long
as the next heap page is on the same FSM page, and probably not too hard even
if it's not. If we limit this optimization to within the same FSM page, we'll
effectively be filling the table in 32 MB stripes.

Thoughts?

I'm running more tests, and there are more issues that need discussion, but
I'll start separate threads for each. I'll also post an updated patch
separately.

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
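To make the next-page idea concrete, here is a minimal sketch of how
RecordAndGetPageWithFreeSpace could check the following heap page first. It is
a sketch only, not the patch code: fsm_same_fsm_page() and
fsm_get_avail_for_block() are placeholder helpers, while RecordPageWithFreeSpace
and GetPageWithFreeSpace are the ordinary FSM entry points.

#include "postgres.h"
#include "storage/freespace.h"
#include "utils/rel.h"

/*
 * Sketch: after recording the free space left on oldPage, prefer the
 * physically next heap page when its FSM information lives on the same FSM
 * page and it has enough room, so a single backend keeps writing
 * sequentially.  Placeholder helpers are marked below.
 */
BlockNumber
RecordAndGetPageWithFreeSpace(Relation rel, BlockNumber oldPage,
							  Size oldSpaceAvail, Size spaceNeeded)
{
	BlockNumber nextPage = oldPage + 1;

	RecordPageWithFreeSpace(rel, oldPage, oldSpaceAvail);

	/* Stay within one FSM page: that limits the sequential run to a stripe */
	if (fsm_same_fsm_page(oldPage, nextPage) &&			/* placeholder */
		fsm_get_avail_for_block(rel, nextPage) >= spaceNeeded)	/* placeholder */
		return nextPage;

	/* Otherwise fall back to the normal (randomized) search */
	return GetPageWithFreeSpace(rel, spaceNeeded);
}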
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

> Fortunately there's an easy fix for that. If we optimize
> RecordAndGetPageWithFreeSpace so that it always returns the next page if it
> has enough space, we'll be doing sequential I/O again. That's trivial as long
> as the next heap page is on the same FSM page, and probably not too hard even
> if it's not. If we limit this optimization to within the same FSM page, we'll
> effectively be filling the table in 32 MB stripes.

Starting from an arbitrary block, that would be on average a 16 MB stripe.

One idea: we could scan the rest of the current FSM page and use the first
match.

Another: given the way your tree structure works, you could also descend the
tree with a "target" page and find the first page after the target with enough
free space, if there is one. (Take the left branch if its range extends past
the target and it has enough free space, else take the right branch if it has
enough free space, else fall back to the left branch.)

-- 
Gregory Stark
EnterpriseDB   http://www.enterprisedb.com
Ask me about EnterpriseDB's On-Demand Production Tuning
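A self-contained sketch of that targeted descent over one FSM page's tree. The
array layout (a max-tree with children of node i at 2*i+1 and 2*i+2) and the
function names are assumptions for illustration, not the patch's actual data
structures.

#include <stdint.h>

/*
 * tree[node] holds the largest free-space category among leaf slots [lo, hi].
 * Return the lowest leaf slot >= minslot whose category is at least 'needed',
 * or -1 if there is none.
 */
static int
search_from(const uint8_t *tree, int node, int lo, int hi,
			int minslot, uint8_t needed)
{
	int		mid;
	int		found;

	if (hi < minslot || tree[node] < needed)
		return -1;				/* subtree out of range, or too little space */
	if (lo == hi)
		return lo;				/* a qualifying leaf */

	mid = (lo + hi) / 2;
	found = search_from(tree, 2 * node + 1, lo, mid, minslot, needed);
	if (found < 0)
		found = search_from(tree, 2 * node + 2, mid + 1, hi, minslot, needed);
	return found;
}

/*
 * Targeted search: first look for a page at or after the target slot, so a
 * single backend keeps filling forward, then wrap around to the start.
 */
static int
search_with_target(const uint8_t *tree, int nleaves, int target,
				   uint8_t needed)
{
	int		slot = search_from(tree, 0, 0, nleaves - 1, target, needed);

	if (slot < 0 && target > 0)
		slot = search_from(tree, 0, 0, nleaves - 1, 0, needed);
	return slot;
}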
Gregory Stark <stark@enterprisedb.com> writes:

> One idea: we could scan the rest of the current FSM page and use the first
> match.
>
> Another: given the way your tree structure works, you could also descend the
> tree with a "target" page and find the first page after the target with
> enough free space, if there is one. (Take the left branch if its range
> extends past the target and it has enough free space, else take the right
> branch if it has enough free space, else fall back to the left branch.)

I think the trick here is how to also preserve the property that different
backends tend to be inserting into different pages. There may be no very good
way to do that without maintaining some shared state, ie the last page handed
out.

However, it would probably be sufficient to do that for some pretty small
number of tables, allowing a fixed-size shared memory area to be sufficient.

			regards, tom lane
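As a rough structural sketch of that fixed-size shared area, modeled loosely on
the synchronized-seqscan scan-location array; every name below is invented for
illustration and is not from the patch.

#include "postgres.h"
#include "storage/block.h"
#include "storage/relfilenode.h"
#include "storage/spin.h"

/*
 * A small, fixed array of (relation, last page handed out) hints kept in
 * shared memory, overwritten round-robin when full.  It is only a hint, so
 * losing an entry is harmless.
 */
#define FSM_HINT_SLOTS 64

typedef struct FSMInsertHint
{
	RelFileNode rnode;			/* which relation the hint applies to */
	BlockNumber last_page;		/* last heap page handed out for inserts */
} FSMInsertHint;

typedef struct FSMInsertHintArray
{
	slock_t		mutex;			/* protects the entries below */
	int			next_victim;	/* round-robin replacement pointer */
	FSMInsertHint hints[FSM_HINT_SLOTS];
} FSMInsertHintArray;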
Tom Lane wrote:
> Gregory Stark <stark@enterprisedb.com> writes:
>> One idea: we could scan the rest of the current FSM page and use the first
>> match.
>>
>> Another: given the way your tree structure works, you could also descend the
>> tree with a "target" page and find the first page after the target with
>> enough free space, if there is one. (Take the left branch if its range
>> extends past the target and it has enough free space, else take the right
>> branch if it has enough free space, else fall back to the left branch.)
>
> I think the trick here is how to also preserve the property that
> different backends tend to be inserting into different pages.

Yep. If we just always look at the next page, there's the danger of having
multiple backends compete for the same pages again.

> There may be no very good way to do that without maintaining some shared
> state, ie the last page handed out.
>
> However, it would probably be sufficient to do that for some pretty
> small number of tables, allowing a fixed-size shared memory area to be
> sufficient.

... similar to how we handle synchronized seqscans. Yep, that's one option. I
wish we could get rid of the shared memory area and lock altogether, though.

We could put an extra field on the FSM root page. Hmm, it doesn't actually
need to be a single global field, so we could have a field like that on every
FSM page, for better concurrency. So the algorithm for the set+search
operation becomes:

1. Lock the FSM page for the old heap page X.
2. Set the value for X.
3. See if the page pointed to by (fsmpage->nextPtr++) has enough free space.
   If it does, return that.
4. Otherwise, start searching from the root, with random traversal.

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
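A sketch of those four steps in code. The FSMPageData layout, the nextPtr
field, and the fsm_* helpers are placeholders rather than the patch's actual
names; only the buffer-locking calls are existing PostgreSQL APIs.

#include "postgres.h"
#include "storage/block.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"
#include "utils/rel.h"

BlockNumber
fsm_set_and_search(Relation rel, Buffer fsmbuf, BlockNumber oldHeapPage,
				   uint8 newAvail, uint8 needed)
{
	FSMPageData *fsmpage;
	BlockNumber result = InvalidBlockNumber;
	uint16		slot;

	/* 1. Lock the FSM page that covers the old heap page */
	LockBuffer(fsmbuf, BUFFER_LOCK_EXCLUSIVE);
	fsmpage = (FSMPageData *) PageGetContents(BufferGetPage(fsmbuf));

	/* 2. Record the new amount of free space on the old page */
	fsm_set_avail(fsmpage, fsm_slot_for(oldHeapPage), newAvail);

	/* 3. Try the page the per-FSM-page hint points at, and advance the hint */
	slot = fsmpage->nextPtr;
	fsmpage->nextPtr = (slot + 1) % FSM_SLOTS_PER_PAGE;
	if (fsm_get_avail(fsmpage, slot) >= needed)
		result = fsm_heap_block_for(fsmbuf, slot);

	LockBuffer(fsmbuf, BUFFER_LOCK_UNLOCK);

	/* 4. Otherwise fall back to a search from the root, with random traversal */
	if (result == InvalidBlockNumber)
		result = fsm_search_from_root(rel, needed);

	return result;
}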
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> Tom Lane wrote:
>> There may be no very good way to do that without maintaining some shared
>> state, ie the last page handed out.

> We could put an extra field on the FSM root page. Hmm, it doesn't
> actually need to be a single global field, so we could have a field like
> that on every FSM page, for better concurrency.

Yeah, that would work, and it is only a hint so you needn't WAL-log changing
it.

			regards, tom lane
On Fri, 2008-08-29 at 18:55 +0300, Heikki Linnakangas wrote:
> Tom Lane wrote:
> > Gregory Stark <stark@enterprisedb.com> writes:
> >> One idea: we could scan the rest of the current FSM page and use the
> >> first match.
> >
> >> Another: given the way your tree structure works, you could also descend
> >> the tree with a "target" page and find the first page after the target
> >> with enough free space, if there is one. (Take the left branch if its
> >> range extends past the target and it has enough free space, else take
> >> the right branch if it has enough free space, else fall back to the
> >> left branch.)
> >
> > I think the trick here is how to also preserve the property that
> > different backends tend to be inserting into different pages.
>
> Yep. If we just always look at the next page, there's the danger of
> having multiple backends compete for the same pages again.

Can the FSM hand out page ranges? That way we would be able to use the
next-page logic without fear of competition.

-- 
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Tom Lane wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>> Tom Lane wrote:
>>> There may be no very good way to do that without maintaining some shared
>>> state, ie the last page handed out.
>
>> We could put an extra field on the FSM root page. Hmm, it doesn't
>> actually need to be a single global field, so we could have a field like
>> that on every FSM page, for better concurrency.
>
> Yeah, that would work, and it is only a hint so you needn't WAL-log
> changing it.

Done. That seems to have helped a lot with the case of two concurrent backends
bulk inserting into a table. I'm now happy with the bulk insert performance
and behavior.

I've been using the attached test case to measure it. It uses pgbench to
measure the speed of bulk inserting one million rows, with these variations:

- one client, starting with a freshly created table (i.e. the FSM is empty and
  the table is extended)
- same, but with two clients
- one client, starting with a pre-existing but empty table, so all requests
  for free space are satisfied by the FSM and the table is not extended
- same, but with two clients

Running that test on my laptop, with the data directory on a RAM drive to
measure just the CPU overhead and concurrency, the patched version now looks
to be on par with the current implementation. Looking at the pattern in which
pages are filled, by polling pg_freespace('tmp') while the test is running,
the pages are being filled sequentially, with the target pages of the two
backends interleaved, which is as expected and emulates the behavior of the
current implementation.

I haven't tried larger, I/O-bound tests yet, but because the access pattern is
now the same as without the patch, I would expect the performance to be
comparable.

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Heikki Linnakangas wrote:
> The way my rewritten FSM works at the moment is that pages are handed
> out of the FSM in random order, to spread the load of multiple backends
> to different pages. That's good for spreading the load, but it occurred
> to me, while observing a test case that inserts a lot of rows into an
> almost empty table with plenty of free space, that it sucks for the case
> of a single backend populating a table. Currently, the FSM hands out
> pages in sequential order, so from the OS point of view we're reading
> and writing the pages sequentially. If the pages are handed out in
> random order, we'll do random I/O instead.
>
> Fortunately there's an easy fix for that. If we optimize
> RecordAndGetPageWithFreeSpace so that it always returns the next
> page if it has enough space, we'll be doing sequential I/O again. That's
> trivial as long as the next heap page is on the same FSM page, and
> probably not too hard even if it's not. If we limit this optimization to
> within the same FSM page, we'll effectively be filling the table in
> 32 MB stripes.
>
> Thoughts?
>
> I'm running more tests, and there are more issues that need discussion,
> but I'll start separate threads for each. I'll also post an updated
> patch separately.

One other thing to keep in mind is that VACUUM can reduce a table's size if
the trailing blocks are empty, so there is some gain if the earlier parts of
the table are preferred for inserts.

-- 
Bruce Momjian <bruce@momjian.us>   http://momjian.us
EnterpriseDB                       http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +
On Sep 5, 2008, at 9:43 PM, Bruce Momjian wrote:
>> Fortunately there's an easy fix for that. If we optimize
>> RecordAndGetPageWithFreeSpace so that it always returns the next
>> page if it has enough space, we'll be doing sequential I/O again. That's
>> trivial as long as the next heap page is on the same FSM page, and
>> probably not too hard even if it's not. If we limit this optimization to
>> within the same FSM page, we'll effectively be filling the table in
>> 32 MB stripes.
>>
>> Thoughts?
>>
>> I'm running more tests, and there are more issues that need discussion,
>> but I'll start separate threads for each. I'll also post an updated
>> patch separately.
>
> One other thing to keep in mind is that VACUUM can reduce a table's size
> if the trailing blocks are empty, so there is some gain if the earlier
> parts of the table are preferred for inserts.

Yeah; I would actually really, really like to see a mode you could set on a
table that says "I want to try and shrink this table". One of the things that
would mean is that the FSM should prefer pages at the beginning of the heap.

Also related to this is the idea of asking the FSM for pages within a specific
range so that you can try and maintain cluster order on a table. You would
look in the clustering index for the closest value to your key and where it is
in the heap, and then ask for a page in that neighborhood. (You'd probably
want to look at more than just one index tuple, but you get the idea.)

-- 
Decibel!, aka Jim C. Nasby, Database Architect  decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828
Decibel! wrote:
> On Sep 5, 2008, at 9:43 PM, Bruce Momjian wrote:
>> One other thing to keep in mind is that VACUUM can reduce a table's size
>> if the trailing blocks are empty, so there is some gain if the earlier
>> parts of the table are preferred for inserts.
>
> Yeah; I would actually really, really like to see a mode you could set
> on a table that says "I want to try and shrink this table". One of the
> things that would mean is that the FSM should prefer pages at the
> beginning of the heap.

Not sure how exactly that should work, but it should be pretty easy to do with
the new FSM data structure. Perhaps VACUUM should just reset the "next
pointers" as it goes.

> Also related to this is the idea of asking the FSM for pages within a
> specific range so that you can try and maintain cluster order on a
> table. You would look in the clustering index for the closest value to
> your key and where it is in the heap, and then ask for a page in that
> neighborhood. (You'd probably want to look at more than just one index
> tuple, but you get the idea.)

The new FSM data structure should make that much easier to implement as well,
as it naturally supports the operation "give me a page with X bytes of free
space, as close as possible to page Y".

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
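For illustration, a purely hypothetical wrapper for that operation:
GetPageWithFreeSpaceNear and fsm_search_near are invented names that exist
neither in the patch nor in PostgreSQL, while GetPageWithFreeSpace is the
ordinary whole-relation FSM lookup.

#include "postgres.h"
#include "storage/freespace.h"
#include "utils/rel.h"

/*
 * "Give me a page with at least spaceNeeded bytes free, as close as possible
 * to nearPage."  First search only the FSM page that covers nearPage,
 * preferring slots at or after it (the same targeted descent sketched earlier
 * in the thread), then fall back to the global search.
 */
BlockNumber
GetPageWithFreeSpaceNear(Relation rel, BlockNumber nearPage, Size spaceNeeded)
{
	BlockNumber page;

	page = fsm_search_near(rel, nearPage, spaceNeeded);		/* placeholder */
	if (page != InvalidBlockNumber)
		return page;

	/* Nothing in the neighborhood: take whatever the FSM offers */
	return GetPageWithFreeSpace(rel, spaceNeeded);
}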