Thread: Rewriting Free Space Map
I've started working on revamping Free Space Map, using the approach where we store a map of heap pages on every nth heap page. What we need now is discussion on the details of how exactly it should work. Here's my rough plan on the implementation. It's long, sorry. I'm fairly confident with it, but please let me know if you see any problems or have any suggestions or better ideas. Heap FSM -------- The FSM is stored in the special area of every nth heap page. When extending the relation, the heapam checks if the block number of the new page is one that belongs to the FSM. If it is, it let's the FSM to initialize it by calling InitFSMPage() on it, and extends the relation again to get another, normal heap page. I chose the "every nth page is an FSM page" approach, rather than using a separate relfilenode, which I also considered. The separate relfilenode approach has some advantages, like being able to scan all FSM pages in a sequential fashion, but it involves a fair amount of catalog and buffer manager changes. It's convenient that the FSM uses up the whole page, leaving no room for heap tuples. It simplifies the locking, as we don't need to worry with the possibility in the FSM that the caller is already holding a lock on the same page. In an FSM page, there's one byte for each of the next N heap pages, starting from the FSM page. That one byte stores the amount of free space on the corresponding heap page, in BLCKSZ/256 byte precision (32 bytes with default block size). The mapping of free space to these 256 "buckets" wouldn't necessarily have to be a linear one, we could for example have a single bucket for pages with more than BLCKSZ/2 bytes of free space and divide the rest linearly into 16 byte buckets, but let's keep it simple for now. Of course, we could also just use 2 bytes per page, and store the page size exactly, but 32 byte precision seems like enough to me. Index FSM --------- Indexes use a similar scheme, but since we only need to keep track whether a page is used or not, we only need one bit per page. If the amount of free space on pages is interesting for an indexam in the future, it can use the heap FSM implementation instead. Or no FSM at all, like the hash indexam. To use the index FSM, the indexam needs to leave every nth page alone, like in the heap. The B-tree assumes that the metapage is at block 0, but that's also where we would like to store the first index FSM page. To overcome that, we can make the index FSM special area a little bit smaller, so that the B-tree metadata fits on the same page as the FSM information. That will be wasted space on other FSM pages than block 0, but we're only talking about 24 bytes per FSM page, and we only need one FSM page per ~512 MB of index pages (with default BLCKSZ). Performance ----------- The patch I'm working on currently uses a naive way to find a page in the FSM. To find a page with X bytes of free space, it scans the FSM pages until one is found. And it always starts the scan from the beginning, which is far from optimal. And when there's page with enough free space, it still needs to scan all FSM pages just to find out that we need to extend the relation. To speed things up, we're going to need some mechanism to avoid that. First of all, we need to somehow cache the information that "there's no page with >= X bytes left", to avoid fruitless scanning. To speed up the case when there's only a few pages with enough free space, we can keep a limited size list of such pages in addition to the map. 
Performance
-----------

The patch I'm working on currently uses a naive way to find a page in the FSM. To find a page with X bytes of free space, it scans the FSM pages until one is found. And it always starts the scan from the beginning, which is far from optimal. And when there's no page with enough free space, it still needs to scan all the FSM pages just to find out that we need to extend the relation.

To speed things up, we're going to need some mechanism to avoid that. First of all, we need to somehow cache the information that "there's no page with >= X bytes left", to avoid fruitless scanning. To speed up the case when there are only a few pages with enough free space, we can keep a limited-size list of such pages in addition to the map.

This information needs to be in shared memory, either on heap pages like the FSM pages and managed by the buffer manager, or in a separate shmem block. I would like to go with normal bufmgr-managed pages, as fixed-size memory blocks have their problems, and the cached information should be stored to disk as well.

Let's have one special page in the heap file, called the Free Space List (FSL) page, in addition to the normal FSM pages. It has the following structure:

    struct {
        bit anypages[256]
        struct {
            BlockNumber blockno;
            uint8 freespace;
        } freespacelist[as large as fits on page]
    }

Remember that we track the free space on each page using one byte, IOW, each page falls into one of 256 buckets of free space. In the anypages bitmap, we have one bit per bucket indicating "are there any pages with this much free space". When we look for a page with X bytes, we check the bits up to the bucket corresponding to X bytes, and if there are no set bits we know not to bother scanning the FSM pages (a sketch of this lookup appears below).

To speed up the scan where there is space, we keep a simple list of pages with free space. This list is actually like the current FSM, but here we only use it as a small cache of the FSM pages. VACUUM and any other operations that update the FSM can put pages into the list when there are free slots.

We can store the FSL page at a magic location, say block #100. For relations smaller than that, there's no need for the FSL and we might as well scan the FSM page. I'm not sure if we should have more than one FSL page for large tables.

I'm not sure yet how the locking of FSL and FSM pages should work. It shouldn't be too hard, though, as the FSM/FSL information is just a hint anyway. We do need to take care that we don't permanently lose track of any significant amount of free space, as once we can do partial vacuums using the visibility map, VACUUM might not visit one part of a table even if other parts of it are frequently updated.

Page allocation algorithm
-------------------------

There are many different ways we can hand out pages from the FSM. Possible strategies, some of which are at odds with each other, include:

1. Try to spread out inserts of different backends, to avoid contention.
2. Prefer low-numbered pages, to increase the chance of being able to truncate in VACUUM.
3. Reserve pages with lots of free space for big allocations, and prefer almost-full pages for small allocations, to use all space more efficiently.
4. If the table is clustered, try to use pages close to those with similar values.
5. On UPDATE, try to use pages close to the old tuple.
6. Prefer pages that are currently in cache, to avoid I/O.

The current FSM tries to achieve only 1, and there haven't been many complaints, so I wouldn't worry too much about the other goals. 4 and 6 would need major changes to the indexam and buffer manager interfaces, respectively, and 3 could lead to more I/O when you do a lot of small inserts.

We can spread inserts of different backends in the Free Space List by moving a page to the end of the list when it's handed out, or removing it from there altogether. When the FSL is empty, we can vary the place where we start to scan the FSM. To prefer low-numbered pages, and so increase the chance of being able to truncate the relation later, we can favor lower-numbered blocks in VACUUM when we decide which blocks to put into the FSL. And we can also bias the starting point of where we start to scan the FSM when the FSL is empty.
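Putting the FSL layout and the search order together, a lookup might look roughly like this; the type and function names are invented for the sketch, which assumes the same 256-bucket encoding and a fixed-size list:

    #include <stdint.h>

    typedef uint32_t BlockNumber;

    typedef enum
    {
        FSL_EXTEND,         /* no page anywhere has enough free space */
        FSL_FOUND,          /* a cached page with enough free space was found */
        FSL_SCAN_FSM        /* space exists somewhere; go scan the FSM pages */
    } FSLResult;

    typedef struct
    {
        uint8_t     anypages[256 / 8];      /* one bit per free-space bucket */
        uint16_t    nlist;                  /* entries used in freespacelist */
        struct
        {
            BlockNumber blockno;
            uint8_t     freespace;          /* bucketed free space */
        } freespacelist[64];                /* "as large as fits on page" */
    } FSLPageContents;

    static FSLResult
    fsl_search(const FSLPageContents *fsl, uint8_t minbucket, BlockNumber *blkno)
    {
        int     b;
        int     i;
        int     anyset = 0;

        /* Is there any page at all with >= minbucket worth of free space? */
        for (b = minbucket; b < 256; b++)
        {
            if (fsl->anypages[b / 8] & (1 << (b % 8)))
            {
                anyset = 1;
                break;
            }
        }
        if (!anyset)
            return FSL_EXTEND;

        /* Try the small cached list before falling back to an FSM scan. */
        for (i = 0; i < fsl->nlist; i++)
        {
            if (fsl->freespacelist[i].freespace >= minbucket)
            {
                *blkno = fsl->freespacelist[i].blockno;
                return FSL_FOUND;
            }
        }
        return FSL_SCAN_FSM;
    }

Handing out a page would then also move or drop its list entry, along the lines of the spreading strategy above.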
Visibility Map
--------------

So far I've only talked about the FSM, but it's important to consider how the Visibility Map fits into the scheme. My current thinking is that there will be one bit per heap page in the visibility map. The exact semantics of that one bit are still not totally clear, but that's not important right now.

There are a few alternatives:

1. Mix the visibility map with the FSM, stealing one bit of every FSM byte. There would then be 7 bits for storing how much space there is on each page, and 1 bit indicating the visibility.

2. Allocate part of the FSM pages for the visibility map. For example, the first 1/9 of the page for the VM, and 8/9 for the FSM. The VM part would be a straight bitmap with one bit per page, and the FSM part would use one byte per page.

3. Use different pages for the VM, for example every nth page for the VM, and every (n+1)th page for the FSM.

I'm leaning towards 2 at the moment. 3 is intriguing as well, though, because it would help with potential lock contention on the VM/FSM pages.

Per-chunk relfrozenxid
----------------------

I'm imagining that we would set a bit in the VM if a page has no dead tuples at all, making it possible to use it for index-only scans. Such a page is also uninteresting for VACUUM. However, we would still need to scan the whole table to freeze tuples. To alleviate that, we could allocate some space in the FSM pages indicating the smallest Xid in a range of heap pages. IOW, like relfrozenxid, but with a granularity of N pages, rather than the whole relation. You would still have to scan that range of pages to update it, but that's much better than the whole relation.

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
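For a sense of scale under alternative 2, a back-of-the-envelope calculation of how much heap one combined VM/FSM page would cover, ignoring page header and special-space overhead (the numbers are only illustrative):

    #include <stdio.h>

    int
    main(void)
    {
        const int blcksz = 8192;    /* default block size */

        /*
         * N heap pages need N/8 bytes for the VM bitmap plus N bytes for the
         * FSM, so N * 9/8 = blcksz, i.e. N = blcksz * 8 / 9.
         */
        int pages = blcksz * 8 / 9;

        printf("pages covered per map page: %d (~%d MB of heap)\n",
               pages, pages * blcksz / (1024 * 1024));
        return 0;
    }

This prints roughly 7281 pages, or about 56 MB of heap per combined map page at the default block size.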
"Heikki Linnakangas" <heikki@enterprisedb.com> writes: > I've started working on revamping Free Space Map, using the approach > where we store a map of heap pages on every nth heap page. What we need > now is discussion on the details of how exactly it should work. You're cavalierly waving away a whole boatload of problems that will arise as soon as you start trying to make the index AMs play along with this :-(. Hash for instance has very narrow-minded ideas about page allocation within its indexes. Also, I don't think that "use the special space" will scale to handle other kinds of maps such as the proposed dead space map. (This is exactly why I said the other day that we need a design roadmap for all these ideas.) The idea that's becoming attractive to me while contemplating the multiple-maps problem is that we should adopt something similar to the old Mac OS idea of multiple "forks" in a relation. In addition to the main data fork which contains the same info as now, there could be one or more map forks which are separate files in the filesystem. They are named by relfilenode plus an extension, for instance a relation with relfilenode NNN would have a data fork in file NNN (plus perhaps NNN.1, NNN.2, etc) and a map fork named something like NNN.map (plus NNN.map.1 etc as needed). We'd have to add one more field to buffer lookup keys (BufferTag) to disambiguate which fork the referenced page is in. Having bitten that bullet, though, the idea trivially scales to any number of map forks with potentially different space requirements and different locking and WAL-logging requirements. Another possible advantage is that a new map fork could be added to an existing table without much trouble. Which is certainly something we'd need if we ever hope to get update-in-place working. The main disadvantage I can see is that for very small tables, the percentage overhead from multiple map forks of one page apiece is annoyingly high. However, most of the point of a map disappears if the table is small, so we might finesse that by not creating any maps until the table has reached some minimum size. regards, tom lane
Tom Lane wrote:
> The idea that's becoming attractive to me while contemplating the
> multiple-maps problem is that we should adopt something similar to
> the old Mac OS idea of multiple "forks" in a relation. In addition
> to the main data fork which contains the same info as now, there could
> be one or more map forks which are separate files in the filesystem.

I think something similar could be used to store tuple visibility bits separately from the heap tuple data itself, so +1 to this idea.

(The rough idea in my head was that you can do an indexscan and look up visibility bits without having to pull the whole heap along; and visibility updates are also cheaper, whether they come from indexscans or heap scans. Of course, the implicit cost is that a seqscan needs to fetch the visibility pages, too; and the locking is more complex.)

-- 
Alvaro Herrera                          http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
On Sun, 2008-03-16 at 21:33 -0300, Alvaro Herrera wrote:
> Tom Lane wrote:
>
> > The idea that's becoming attractive to me while contemplating the
> > multiple-maps problem is that we should adopt something similar to
> > the old Mac OS idea of multiple "forks" in a relation. In addition
> > to the main data fork which contains the same info as now, there could
> > be one or more map forks which are separate files in the filesystem.

Aren't we in a way doing this for indexes?

> I think something similar could be used to store tuple visibility bits
> separately from heap tuple data itself, so +1 to this idea.

Not just "bits", but the whole visibility info (xmin, xmax, tmin, tmax, plus bits) should be stored separately.

A separate "fork" for visibility should be organized as a b-tree index (as we already have well-honed mechanisms for dealing with those effectively), but the visibility fork is stored in a compressed form, by storing ranges of all-visible or all-deleted tuples as two endpoints only; the tree is also reorganized when possible, similar to what we currently do for HOT updates. This will keep the visibility index really small for cases with few updates, most likely one or two pages regardless of table size.

One important difference from indexes is that visibility info should be stored first, before writing data to the heap and creating ordinary index entries.

> (The rough idea in my head was that you can do an indexscan and look
> up visibility bits without having to pull the whole heap along; and
> visibility updates are also cheaper, whether they come from indexscans
> or heap scans. Of course, the implicit cost is that a seqscan needs to
> fetch the visibility pages, too; and the locking is more complex.)

Another cost is heavy inserting/updating, where there will probably be more lock contention, as visibility info for new tuples will more often land on the same visibility pages due to the visibility info being generally smaller.

Of course, with visibility info in a separate fork, very narrow tables will have the ratios reversed - for a one-byte-wide table the visibility info will be a few times bigger than the actual data, at least initially, before compression has kicked in.

--------------------
Hannu
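One way to picture the compressed representation described above is a b-tree whose leaf entries cover ranges of tuple identifiers rather than single tuples. This is only a sketch of that idea; every type and name in it is invented:

    #include <stdint.h>

    /* A tuple identifier: heap block plus line pointer offset. */
    typedef struct VisTid
    {
        uint32_t    block;
        uint16_t    offset;
    } VisTid;

    typedef enum VisRangeState
    {
        VISRANGE_ALL_VISIBLE,   /* every tuple in the range is visible to all */
        VISRANGE_ALL_DELETED    /* every tuple in the range is dead */
    } VisRangeState;

    /* One compressed entry: a contiguous run of tuples in the same state,
       stored as its two endpoints only. */
    typedef struct VisRange
    {
        VisTid          first;
        VisTid          last;
        VisRangeState   state;
    } VisRange;

    /* Tuples whose visibility is still in flux would instead carry explicit
       entries, which the tree could later collapse into ranges. */
    typedef struct VisEntry
    {
        VisTid      tid;
        uint32_t    xmin;
        uint32_t    xmax;
        uint16_t    infobits;
    } VisEntry;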
Hannu Krosing <hannu@krosing.net> writes:
> On Sun, 2008-03-16 at 21:33 -0300, Alvaro Herrera wrote:
>> Tom Lane wrote:
>>> The idea that's becoming attractive to me while contemplating the
>>> multiple-maps problem is that we should adopt something similar to
>>> the old Mac OS idea of multiple "forks" in a relation.

> Aren't we in a way doing this for indexes?

Not really --- indexes are closer to being independent entities, since they have their own relfilenode values, own pg_class entries, etc. What I'm imagining here is something that's so tightly tied to the core heap that there's no value in managing it as a distinct entity, thus the idea of same relfilenode with a different extension. The existence of multiple forks in a relation wouldn't be exposed at all at the SQL level.

>> I think something similar could be used to store tuple visibility bits
>> separately from heap tuple data itself, so +1 to this idea.

> Not just "bits", but whole visibility info (xmin, xmax, tmin, tmax, plus
> bits) should be stored separately.

I'm entirely un-sold on this idea, but yeah, it would be something that would be possible to experiment with once we have a multi-fork infrastructure.

			regards, tom lane
Tom Lane wrote: > "Heikki Linnakangas" <heikki@enterprisedb.com> writes: >> I've started working on revamping Free Space Map, using the approach >> where we store a map of heap pages on every nth heap page. What we need >> now is discussion on the details of how exactly it should work. > > You're cavalierly waving away a whole boatload of problems that will > arise as soon as you start trying to make the index AMs play along > with this :-(. It doesn't seem very hard. An indexam wanting to use FSM needs a little bit of code where the relation is extended, to let the FSM initialize FSM pages. And then there's the B-tree metapage issue I mentioned. But that's all, AFAICS. > Hash for instance has very narrow-minded ideas about > page allocation within its indexes. Hash doesn't use FSM at all. > Also, I don't think that "use the special space" will scale to handle > other kinds of maps such as the proposed dead space map. (This is > exactly why I said the other day that we need a design roadmap for all > these ideas.) It works for anything that scales linearly with the relation itself. The proposed FSM and visibility map both fall into that category. A separate file is certainly more flexible. I was leaning towards that option originally (http://archives.postgresql.org/pgsql-hackers/2007-11/msg00142.php) for that reason. > The idea that's becoming attractive to me while contemplating the > multiple-maps problem is that we should adopt something similar to > the old Mac OS idea of multiple "forks" in a relation. In addition > to the main data fork which contains the same info as now, there could > be one or more map forks which are separate files in the filesystem. > They are named by relfilenode plus an extension, for instance a relation > with relfilenode NNN would have a data fork in file NNN (plus perhaps > NNN.1, NNN.2, etc) and a map fork named something like NNN.map (plus > NNN.map.1 etc as needed). We'd have to add one more field to buffer > lookup keys (BufferTag) to disambiguate which fork the referenced page > is in. Having bitten that bullet, though, the idea trivially scales to > any number of map forks with potentially different space requirements > and different locking and WAL-logging requirements. Hmm. You also need to teach at least xlog.c and xlogutils.c about the map forks, for full page images and the invalid page tracking. I also wonder what the performance impact of extending BufferTag is. My original thought was to have a separate RelFileNode for each of the maps. That would require no smgr or xlog changes, and not very many changes in the buffer manager, though I guess you'd more catalog changes. You had doubts about that on the previous thread (http://archives.postgresql.org/pgsql-hackers/2007-11/msg00204.php), but the "map forks" idea certainly seems much more invasive than that. I like the "map forks" idea; it groups the maps nicely at the filesystem level, and I can see it being useful for all kinds of things in the future. The question is, is it really worth the extra code churn? If you think it is, I can try that approach. > Another possible advantage is that a new map fork could be added to an > existing table without much trouble. Which is certainly something we'd > need if we ever hope to get update-in-place working. Yep. > The main disadvantage I can see is that for very small tables, the > percentage overhead from multiple map forks of one page apiece is > annoyingly high. 
However, most of the point of a map disappears if > the table is small, so we might finesse that by not creating any maps > until the table has reached some minimum size. Yeah, the map fork idea is actually better than the "every nth heap page" approach from that point of view. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
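Since the naming convention carries most of the weight here (NNN, NNN.1, ... for the data fork; NNN.map, NNN.map.1, ... for a map fork), a rough sketch of how the path for a given fork and segment could be formed; the function name and the single ".map" extension are assumptions for illustration:

    #include <stdio.h>
    #include <stddef.h>

    /* Build the file name for one segment of one fork of a relation. */
    static void
    fork_file_name(char *buf, size_t buflen,
                   unsigned int relfilenode, int forknum, int segno)
    {
        const char *ext = (forknum == 0) ? "" : ".map";

        if (segno == 0)
            snprintf(buf, buflen, "%u%s", relfilenode, ext);
        else
            snprintf(buf, buflen, "%u%s.%d", relfilenode, ext, segno);
    }

    /* e.g. fork_file_name(buf, sizeof(buf), 16384, 1, 2) yields "16384.map.2" */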
"Heikki Linnakangas" <heikki@enterprisedb.com> writes: > Tom Lane wrote: >> You're cavalierly waving away a whole boatload of problems that will >> arise as soon as you start trying to make the index AMs play along >> with this :-(. > It doesn't seem very hard. The problem is that the index AMs are no longer in control of what goes where within their indexes, which has always been their prerogative to determine. The fact that you think you can kluge btree to still work doesn't mean that it will work for other AMs. >> Also, I don't think that "use the special space" will scale to handle >> other kinds of maps such as the proposed dead space map. (This is >> exactly why I said the other day that we need a design roadmap for all >> these ideas.) > It works for anything that scales linearly with the relation itself. The > proposed FSM and visibility map both fall into that category. It can work only with prearrangement among all the maps that are trying to coexist in the same special space. Every time a new one comes along, we'd have to reconsider the space allocation, re-optimize tradeoffs, and force an initdb that could not possibly be implemented in-place. If we had a short list of maps in mind with no real prospect of changing, then I think this would be acceptable; but that doesn't seem to be the case. It's certainly foolish to start detail design on something like this when we don't even have the "roadmap" I asked for about what sorts of maps we are agreed we want to have. >> The idea that's becoming attractive to me while contemplating the >> multiple-maps problem is that we should adopt something similar to >> the old Mac OS idea of multiple "forks" in a relation. > Hmm. You also need to teach at least xlog.c and xlogutils.c about the > map forks, for full page images and the invalid page tracking. Well, you'd have to teach them something anyway, for any incarnation of maps that they might need to update. > I also wonder what the performance impact of extending BufferTag is. That's a fair objection, and obviously something we'd need to check. But I don't recall seeing hash_any so high on any profile that I think it'd be a big problem. > My original thought was to have a separate RelFileNode for each of the > maps. That would require no smgr or xlog changes, and not very many > changes in the buffer manager, though I guess you'd more catalog > changes. You had doubts about that on the previous thread > (http://archives.postgresql.org/pgsql-hackers/2007-11/msg00204.php), but > the "map forks" idea certainly seems much more invasive than that. The main problems with that are (a) the need to expose every type of map in pg_class and (b) the need to pass all those relfilenode numbers down to pretty low levels of the system. The nice thing about the fork idea is that you don't need any added info to uniquely identify what relation you're working on. The fork numbers would be hard-wired into whatever code needed to know about particular forks. (Of course, these same advantages apply to using special space in an existing file. I'm just suggesting that we can keep these advantages without buying into the restrictions that special space would have.) regards, tom lane
On Sun, 2008-03-16 at 21:33 -0300, Alvaro Herrera wrote:
> Tom Lane wrote:
>
> > The idea that's becoming attractive to me while contemplating the
> > multiple-maps problem is that we should adopt something similar to
> > the old Mac OS idea of multiple "forks" in a relation. In addition
> > to the main data fork which contains the same info as now, there could
> > be one or more map forks which are separate files in the filesystem.
>
> I think something similar could be used to store tuple visibility bits
> separately from heap tuple data itself, so +1 to this idea.
>
> (The rough idea in my head was that you can do an indexscan and look
> up visibility bits without having to pull the whole heap along; and
> visibility updates are also cheaper, whether they come from indexscans
> or heap scans. Of course, the implicit cost is that a seqscan needs to
> fetch the visibility pages, too; and the locking is more complex.)

I very much like the idea of a generic method for including additional bulk metadata for a relation (heap or index). That neatly provides a general infrastructure for lots of clever things such as dead space, visibility or other properties, while at the same time maintaining modularity.

Can we call them "maps" or "metadata maps"? "forks" sounds weird.

We don't need to assume anything about the maps themselves at this stage, so we might imagine tightly coupled maps that are always updated as a relation changes, or loosely coupled maps that are lazily updated by background processes. Autovacuum then becomes the vehicle by which we execute "map maintenance" procedures, defined according to which AMs are installed and what relation options are set.

So we have a completely generalised data/metadata storage infrastructure. Sensibly arranged, this could provide an entry point for powerful new features within existing and future index AMs. It also sounds like it might avoid a whole class of bugs and special cases that I regrettably foresee would be unavoidable in Heikki's proposal.

-- 
 Simon Riggs
 2ndQuadrant  http://www.2ndQuadrant.com

 PostgreSQL UK 2008 Conference: http://www.postgresql.org.uk
"Tom Lane" <tgl@sss.pgh.pa.us> writes: > "Heikki Linnakangas" <heikki@enterprisedb.com> writes: >> My original thought was to have a separate RelFileNode for each of the >> maps. That would require no smgr or xlog changes, and not very many >> changes in the buffer manager, though I guess you'd more catalog >> changes. You had doubts about that on the previous thread >> (http://archives.postgresql.org/pgsql-hackers/2007-11/msg00204.php), but >> the "map forks" idea certainly seems much more invasive than that. > > The main problems with that are (a) the need to expose every type of map > in pg_class and (b) the need to pass all those relfilenode numbers down > to pretty low levels of the system. The nice thing about the fork idea > is that you don't need any added info to uniquely identify what relation > you're working on. The fork numbers would be hard-wired into whatever > code needed to know about particular forks. (Of course, these same > advantages apply to using special space in an existing file. I'm > just suggesting that we can keep these advantages without buying into > the restrictions that special space would have.) One advantage of using separate relfilenodes would be that if we need to regenerate a map we could do it in a new relfilenode and swap it in like we do with heap rewrites. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's PostGIS support!
Gregory Stark wrote:
> "Tom Lane" <tgl@sss.pgh.pa.us> writes:
>
>> "Heikki Linnakangas" <heikki@enterprisedb.com> writes:
>>
>>> My original thought was to have a separate RelFileNode for each of the
>>> maps. That would require no smgr or xlog changes, and not very many
>>> changes in the buffer manager, though I guess you'd need more catalog
>>> changes. You had doubts about that on the previous thread
>>> (http://archives.postgresql.org/pgsql-hackers/2007-11/msg00204.php), but
>>> the "map forks" idea certainly seems much more invasive than that.
>>
>> The main problems with that are (a) the need to expose every type of map
>> in pg_class and (b) the need to pass all those relfilenode numbers down
>> to pretty low levels of the system. The nice thing about the fork idea
>> is that you don't need any added info to uniquely identify what relation
>> you're working on. The fork numbers would be hard-wired into whatever
>> code needed to know about particular forks. (Of course, these same
>> advantages apply to using special space in an existing file. I'm
>> just suggesting that we can keep these advantages without buying into
>> the restrictions that special space would have.)
>
> One advantage of using separate relfilenodes would be that if we need to
> regenerate a map we could do it in a new relfilenode and swap it in like we do
> with heap rewrites.

Why can't you just do that with a different extension and a file rename? You'd need an exclusive lock while swapping in the new map, but you need that anyway, IIRC, and this way you don't even need a catalog change.

cheers

andrew
Gregory Stark <stark@enterprisedb.com> writes:
> One advantage of using separate relfilenodes would be that if we need to
> regenerate a map we could do it in a new relfilenode and swap it in like we do
> with heap rewrites.

You could probably do that using a temporary fork number, if the situation ever came up.

			regards, tom lane
Simon Riggs <simon@2ndquadrant.com> writes:
>> Tom Lane wrote:
>>> The idea that's becoming attractive to me while contemplating the
>>> multiple-maps problem is that we should adopt something similar to
>>> the old Mac OS idea of multiple "forks" in a relation.

> Can we call them "maps" or "metadata maps"? "forks" sounds weird.

I'm not wedded to "forks", that's just the name that was used in the only previous example I've seen. Classic Mac had a "resource fork" and a "data fork" within each file.

Don't think I like "maps" though, as (a) that prejudges what the alternate forks might be used for, and (b) the name fails to be inclusive of the data fork. Other suggestions anyone?

BTW, thinking about the Mac precedent a little more, I believe the way they grafted that Classic concept onto BSD was that applications (which the user thinks of as single objects) are now directories with multiple files inside them. Probably it'd be overkill to think of turning each relation into a subdirectory, but then again maybe not?

			regards, tom lane
On Mon, Mar 17, 2008 at 01:23:46PM -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> >> Tom Lane wrote:
> >>> The idea that's becoming attractive to me while contemplating
> >>> the multiple-maps problem is that we should adopt something
> >>> similar to the old Mac OS idea of multiple "forks" in a
> >>> relation.
>
> > Can we call them "maps" or "metadata maps"? "forks" sounds weird.
>
> I'm not wedded to "forks", that's just the name that was used in the
> only previous example I've seen. Classic Mac had a "resource fork"
> and a "data fork" within each file.
>
> Don't think I like "maps" though, as (a) that prejudges what the
> alternate forks might be used for, and (b) the name fails to be
> inclusive of the data fork. Other suggestions anyone?

Segment? Section? Module?

Cheers,
David.

-- 
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david.fetter@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
On Mon, 2008-03-17 at 13:23 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> >> Tom Lane wrote:
> >>> The idea that's becoming attractive to me while contemplating the
> >>> multiple-maps problem is that we should adopt something similar to
> >>> the old Mac OS idea of multiple "forks" in a relation.
>
> > Can we call them "maps" or "metadata maps"? "forks" sounds weird.
>
> I'm not wedded to "forks", that's just the name that was used in the
> only previous example I've seen. Classic Mac had a "resource fork"
> and a "data fork" within each file.

Layer? Slab? Sheet? Strata/um? Overlay?

Layer makes sense to me because of the way GIS and CAD systems work.

-- 
 Simon Riggs
 2ndQuadrant  http://www.2ndQuadrant.com

 PostgreSQL UK 2008 Conference: http://www.postgresql.org.uk
On Mon, Mar 17, 2008 at 6:23 PM, Tom Lane wrote:
> Simon Riggs writes:
>>
>> Can we call them "maps" or "metadata maps"? "forks" sounds weird.
>
> I'm not wedded to "forks", that's just the name that was used in the
> only previous example I've seen. Classic Mac had a "resource fork"
> and a "data fork" within each file.

Microsoft / NTFS calls them Data Streams: http://www.wikistc.org/wiki/Alternate_data_streams

Jochem
On Mon, 2008-03-17 at 13:23 -0400, Tom Lane wrote:
> I'm not wedded to "forks", that's just the name that was used in the
> only previous example I've seen. Classic Mac had a "resource fork"
> and a "data fork" within each file.
>
> Don't think I like "maps" though, as (a) that prejudges what the
> alternate forks might be used for, and (b) the name fails to be
> inclusive of the data fork. Other suggestions anyone?

I believe that in the world of NTFS the concept is called "streams": http://support.microsoft.com/kb/105763.

HTH,

Mark.

-- 
Mark Cave-Ayland
Sirius Corporation - The Open Source Experts
http://www.siriusit.co.uk
T: +44 870 608 0063
"Tom Lane" <tgl@sss.pgh.pa.us> writes: > I'm not wedded to "forks", that's just the name that was used in the > only previous example I've seen. Classic Mac had a "resource fork" > and a "data fork" within each file. fwiw forks are not unique to MacOS, c.f.: http://en.wikipedia.org/wiki/Fork_%28filesystem%29 However I'm not sure reusing any of these terms is such a hot idea. All it's going to do is confuse someone into thinking we're actually talking about HFS forks or NTFS data streams or whatever. Better to pick a term that isn't already being used for such things so people don't get misled. > BTW, thinking about the Mac precedent a little more, I believe > the way they grafted that Classic concept onto BSD was that > applications (which the user thinks of as single objects) are > now directories with multiple files inside them. Probably it'd be > overkill to think of turning each relation into a subdirectory, but > then again maybe not? Well there are upsides and downsides. Many OSes have difficulties when you have many files in a single directory. This would tend to reduce that. On the other hand it would drastically increase the number of directory files the OS has to keep track of and the total number of inodes being referenced. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's PostGIS support!
Tom Lane wrote: > "Heikki Linnakangas" <heikki@enterprisedb.com> writes: >> Tom Lane wrote: >>> You're cavalierly waving away a whole boatload of problems that will >>> arise as soon as you start trying to make the index AMs play along >>> with this :-(. > >> It doesn't seem very hard. > > The problem is that the index AMs are no longer in control of what goes > where within their indexes, which has always been their prerogative to > determine. The fact that you think you can kluge btree to still work > doesn't mean that it will work for other AMs. Well, it does work with all the existing AMs AFAICS. I do agree with the general point; it'd certainly be cleaner, more modular and more flexible if the AMs didn't need to know about the existence of the maps. >>> The idea that's becoming attractive to me while contemplating the >>> multiple-maps problem is that we should adopt something similar to >>> the old Mac OS idea of multiple "forks" in a relation. > >> Hmm. You also need to teach at least xlog.c and xlogutils.c about the >> map forks, for full page images and the invalid page tracking. > > Well, you'd have to teach them something anyway, for any incarnation > of maps that they might need to update. Umm, the WAL code doesn't care where the pages it operates on came from. Sure, we'll need rmgr-specific code that know what to do with the maps, but the full page image code would work without changes with the multiple RelFileNode approach. The essential change with the map fork idea is that a RelFileNode no longer uniquely identifies a file on disk (ignoring the segmentation which is handled in smgr for now). Anything that operates on RelFileNodes, without any higher level information of what it is, needs to be modified to use RelFileNode+forkid instead. That includes at least the buffer manager, smgr, and the full page image code in xlog.c. It's probably a pretty mechanical change, even though it affects a lot of code. We'd probably want to have a new struct, let's call it PhysFileId for now, for RelFileNode+forkid, and basically replace all occurrences of RelFileNode with PhysFileId in smgr, bufmgr and xlog code. >> I also wonder what the performance impact of extending BufferTag is. > > That's a fair objection, and obviously something we'd need to check. > But I don't recall seeing hash_any so high on any profile that I think > it'd be a big problem. I do remember seeing hash_any in some oprofile runs. But that's fairly easy to test: we don't need to actually implement any of the stuff, other than add a field to BufferTag, and run pgbench. >> My original thought was to have a separate RelFileNode for each of the >> maps. That would require no smgr or xlog changes, and not very many >> changes in the buffer manager, though I guess you'd more catalog >> changes. You had doubts about that on the previous thread >> (http://archives.postgresql.org/pgsql-hackers/2007-11/msg00204.php), but >> the "map forks" idea certainly seems much more invasive than that. > > The main problems with that are (a) the need to expose every type of map > in pg_class and (b) the need to pass all those relfilenode numbers down > to pretty low levels of the system. (a) is certainly a valid point. Regarding (b), I don't think the low level stuff (I assume you mean smgr, bufmgr, bgwriter, xlog by that) would need to be passed any additional relfilenode numbers. 
Or rather, they already work with relfilenodes, and they don't need to know whether the relfilenode is for an index, a heap, or an FSM attached to something else. The relfilenodes would be in RelationData, and we already have that around whenever we do anything that needs to differentiate between those. Another consideration is which approach is easiest to debug. The "map fork" approach seems better on that front, as you can immediately see from the PhysFileId if a page is coming from an auxiliary map or the main data portion. That might turn out to be handy in the buffer manager or bgwriter as well; they don't currently have any knowledge of what a page contains. > The nice thing about the fork idea > is that you don't need any added info to uniquely identify what relation > you're working on. The fork numbers would be hard-wired into whatever > code needed to know about particular forks. (Of course, these same > advantages apply to using special space in an existing file. I'm > just suggesting that we can keep these advantages without buying into > the restrictions that special space would have.) I don't see that advantage. All the higher-level code that care which relation you're working on already have Relation around. All the lower-level stuff don't care. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
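A minimal sketch of the PhysFileId struct suggested above, assuming the fork id is just a small integer with 0 reserved for the main data fork; the equality macro and field names are invented for the example:

    typedef unsigned int Oid;

    /* The existing tablespace/database/relation triple. */
    typedef struct RelFileNode
    {
        Oid         spcNode;
        Oid         dbNode;
        Oid         relNode;
    } RelFileNode;

    /* RelFileNode plus fork id: what smgr, bufmgr and xlog would pass around
       instead of a bare RelFileNode. */
    typedef struct PhysFileId
    {
        RelFileNode rnode;
        int         forkid;         /* 0 = main data fork, >0 = a map fork */
    } PhysFileId;

    #define PhysFileIdEquals(a, b) \
        ((a).rnode.spcNode == (b).rnode.spcNode && \
         (a).rnode.dbNode  == (b).rnode.dbNode  && \
         (a).rnode.relNode == (b).rnode.relNode && \
         (a).forkid        == (b).forkid)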
On Mon, Mar 17, 2008 at 6:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I'm not wedded to "forks", that's just the name that was used in the
> only previous example I've seen. Classic Mac had a "resource fork"
> and a "data fork" within each file.
>
> Don't think I like "maps" though, as (a) that prejudges what the
> alternate forks might be used for, and (b) the name fails to be
> inclusive of the data fork. Other suggestions anyone?

Shadow? As each, err, fork trails each relfilenode? (Or perhaps "shade".)

Hints? As something more generic than "map"?

Regards,
Dawid
On Mon, Mar 17, 2008 at 01:23:46PM -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> >> Tom Lane wrote:
> >>> The idea that's becoming attractive to me while contemplating the
> >>> multiple-maps problem is that we should adopt something similar to
> >>> the old Mac OS idea of multiple "forks" in a relation.
>
> > Can we call them "maps" or "metadata maps"? "forks" sounds weird.

Actually, I do like "forks", but to add a little bit of diversity: facets? aspects?

FWIW, the idea of mapping a relation to a directory is quite compelling.

Regards
-- tomás
"Heikki Linnakangas" <heikki@enterprisedb.com> writes: > More precisely, on CVS HEAD it takes between 7.1-7.2%. After extending > BufferTag with one uint32, it takes 7.4-7.5%. So the effect is > measurable if you try hard enough, but not anything to get worried about. And if we adopt the allegedly-faster hash_any that's in the patch queue, the difference should get smaller yet. regards, tom lane
"Heikki Linnakangas" <heikki@enterprisedb.com> writes:
> More precisely, on CVS HEAD it takes between 7.1-7.2%. After extending
> BufferTag with one uint32, it takes 7.4-7.5%. So the effect is
> measurable if you try hard enough, but not anything to get worried about.

And if we adopt the allegedly-faster hash_any that's in the patch queue, the difference should get smaller yet.

			regards, tom lane