Thread: heap metapages
At dinner on Friday night at PGCon, the end of the table that included Tom Lane, Stephen Frost, and myself got to talking about the idea of including some kind of metapage in every relation, including heap relations. At least some index relations already have something like this (cf _bt_initmetapage, _hash_metapinit). I believe that adding this for all relations, including heaps, would allow us to make improvements in several areas.

1. Tom was interested in the idea of trying to make the system catalog entries which describe the system catalogs themselves completely immutable, so that they can potentially be shared between databases. For example, we might have shared catalogs pg_class_shared and pg_attribute_shared, describing the structure of all the system catalogs; and then we might also have pg_class and pg_attribute within each database, describing the structure of tables which exist only within that database. Right now, this is not possible, because values like relpages, reltuples, and relfrozenxid can vary from database to database. However, if those values were stored in a metapage associated with the heap relation rather than in the system catalogs, then potentially we could make this work. The most obvious benefit of this is that it would reduce the on-disk footprint of a new database, but there are other possible benefits as well. For example, a process not bound to a database could read a shared catalog even if it weren't nailed, and if we ever implement a prefork system for backends, they'd be able to do more of their initialization steps before learning which database they were to target.

2. I'm interested in having a cleaner way to associate non-transactional state with a relation. This has come up a few times. We currently handle this by having lazy VACUUM do in-place heap updates to replace values like relpages, reltuples, and relfrozenxid, but this feels like a kludge. It's particularly scary to think about relying on this for anything critical given that non-inplace heap updates can be happening simultaneously, and the consequences of losing an update to relfrozenxid in particular are disastrous. Plus, it requires hackery in pg_upgrade to preserve the value between the old and new clusters; we've already had to fix two data-destroying bugs in that logic. There are several other things that we might want to do that have similar requirements. For example, Pavan's idea of folding VACUUM's second heap pass into the next vacuum cycle requires a relation-wide piece of state which can probably be represented as a single bit, but putting that bit in pg_class would require the same sorts of hacks there that we already have for relfrozenxid, with similar consequences if it's not properly preserved. Making unlogged tables logged or the other way around appears to require some piece of relation-level state *that can be accessed during recovery*, and pg_class is not going to work for that. Page checksums have a similar requirement if the granularity for turning them on and off is anything less than the entire cluster. Whenever we decide to roll out a new page version, we'll want a place to record the oldest page version that might be present in a particular relation, so that we can easily check whether a cluster can be upgraded to a new release that has dropped support for an old page version. Having a common framework for all of these things seems like it will probably be easier than solving each problem individually, and a metapage is a good place to store non-transactional state.
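To make that concrete, here is a rough sketch of the kind of relation-level state I have in mind. To be clear, every name here is invented for illustration, not a proposal for the actual layout:

#include "postgres.h"
#include "storage/block.h"

/*
 * Illustrative sketch only: non-transactional, relation-level state
 * that could move out of pg_class and into a heap metapage.
 */
typedef struct HeapMetaState
{
    BlockNumber   hms_relpages;     /* what pg_class.relpages holds today */
    float4        hms_reltuples;    /* what pg_class.reltuples holds today */
    TransactionId hms_relfrozenxid; /* what pg_class.relfrozenxid holds today */
    uint16        hms_flags;        /* e.g. "VACUUM second pass pending",
                                     * "unlogged->logged conversion in
                                     * progress", "checksums enabled" */
    uint16        hms_oldest_page_version;  /* oldest page layout version
                                             * that might appear in this
                                             * relation */
} HeapMetaState;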
3. Right now, a new table uses up a minimum of 3 inodes, even if it has no indexes: one for the main fork, one for the visibility map, and one for the free space map. For people who have lots and lots of little tiny tables, this is quite inefficient. The amount of information we'd have to store in a heap metapage would presumably not be very big, so we could potentially move the first, say, 1K of the visibility map into the heap metapage, meaning that tables less than 64MB would no longer require a separate visibility map fork (1K of visibility map is 8192 bits, and each bit covers one 8K heap page, so 8192 * 8K = 64MB). Something similar could possibly be done with the free-space map, though I am unsure of the details. Right now, a relation containing just one tuple consumes 5 8k blocks on disk (1 for the main fork, 3 for the FSM, and 1 for the VM) and 3 inodes; getting that down to 8kB and 1 inode would be very nice. The case of a completely-empty relation is a bit annoying; that right now takes 1 inode and 0 blocks, and I suspect we'd end up with 1 inode and 1 block, but I think it might still be a win overall.

4. Every once in a while, somebody's database ends up in pieces in lost+found. We could make this a bit easier to recover from by including the database OID, relfilenode, and table OID in the metapage. This wouldn't be perfect, since a relation over one GB would still only have one metapage, so additional relation segments would still be a problem. But it would still be a huge improvement over the status quo: some very large percentage of the work of putting everything back where it goes could probably be done by a Perl script that read all the metapages, and if you needed to know, say, which file contained pg_class, that would be a whole lot easier, too.

Now, there are a couple of obvious problems here, the biggest of which is probably that we want to avoid breaking pg_upgrade. I don't have a great solution to that problem. The most realistic option I can think of at the moment is to fudge things so that existing features can continue to work even if the metapage isn't present. Any table rewrite would add a metapage; if you want to use a new feature that requires a metapage, you have to rewrite the table first to get one. However, that's pretty unfortunate in terms of goal #1 and some parts of goal #2, because if you can't be certain of having the metapage present then you can't really store data there in lieu of pg_class; the best you'll be able to do is have it both places, at least until you're ready to deprecate upgrades from releases that don't contain metapage support. Hopefully someone has a better idea...

Thoughts?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, May 21, 2012 at 12:56 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> At dinner on Friday night at PGCon, the end of the table that included Tom Lane, Stephen Frost, and myself got to talking about the idea of including some kind of metapage in every relation, including heap relations. At least some index relations already have something like this (cf _bt_initmetapage, _hash_metapinit). I believe that adding this for all relations, including heaps, would allow us to make improvements in several areas.

The first thing that jumps to mind is: why can't the metapage be extended to span multiple pages if necessary? I've often wondered why the visibility map isn't stored within the heap itself...

merlin
On Mon, May 21, 2012 at 2:22 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Mon, May 21, 2012 at 12:56 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> At dinner on Friday night at PGCon, the end of the table that included Tom Lane, Stephen Frost, and myself got to talking about the idea of including some kind of metapage in every relation, including heap relations. At least some index relations already have something like this (cf _bt_initmetapage, _hash_metapinit). I believe that adding this for all relations, including heaps, would allow us to make improvements in several areas.
>
> The first thing that jumps to mind is: why can't the metapage be extended to span multiple pages if necessary? I've often wondered why the visibility map isn't stored within the heap itself...

Well, the idea of a metapage, almost by definition, is that it stores a small amount of information whose size is pretty much fixed and which can be reasonably anticipated to always fit in one page. If you're trying to store some data that can get bigger than that (or even come close to filling that up), you need a different system. I'm anticipating that the amount of relation metadata we need to store will fit into a 512-byte sector with significant room left over, leaving us with the rest of the block for whatever we'd like to use it for (e.g. bits of the FSM or VM). If at some point in the future we need some kind of relation-level metadata that can grow beyond a handful of bytes, we can either put it in its own fork, or store one or more block pointers in the metapage indicating the blocks where the information is stored - but right now I'm not seeing the need for anything that fancy. (A sketch of what I mean by the block-pointer fallback is at the end of this mail.)

Now, that having been said, I don't think there's any particular reason why we couldn't multiplex all the relation forks onto a single physical file if we were so inclined. The FSM and VM are small enough that interleaving them with the actual data probably wouldn't slow down seq scans materially. But on the other hand, I am not sure that we'd gain much by it in general. I see the value of doing it for small relations: it saves inodes, potentially quite a lot of inodes if you're on a system that uses schemas to implement multi-tenancy. But it's not clear to me that it's worthwhile in general. Sticking all the FSM stuff in its own relation may allow the OS to lay out those pages physically closer to each other on disk, whereas interleaving them with the data blocks would probably give up that advantage, and it's not clear to me what we'd be getting in exchange.
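Here is that block-pointer sketch; again, everything here is invented purely for illustration:

#include "postgres.h"
#include "storage/block.h"

#define HEAP_META_MAX_EXT   4   /* hypothetical cap on extension pointers */

typedef struct HeapMetaHeader
{
    uint32      hmh_magic;      /* identifies block 0 as a metapage */
    uint32      hmh_version;    /* metapage layout version */
    /* ... the fixed relation-level fields, well under 512 bytes ... */
    BlockNumber hmh_ext[HEAP_META_MAX_EXT]; /* InvalidBlockNumber if unused;
                                             * else the first block of an
                                             * overflow chain */
} HeapMetaHeader;

/*
 * Everything in block 0 past sizeof(HeapMetaHeader) is then free for
 * other uses, e.g. the first 1kB of the visibility map.
 */

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company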
* Robert Haas (robertmhaas@gmail.com) wrote:
> The FSM and VM are small enough that interleaving them with the actual data probably wouldn't slow down seq scans materially.

Wouldn't that end up potentially causing lots of random i/o if you need to look at many parts of the FSM or VM..?

Also, wouldn't having it at the start of the heap reduce the changes needed to the SM? Along with making such things easier to find, when talking about forensics?

Of course, the real challenge here is dealing with such an on-disk format change... If we were starting from scratch, I doubt there would be much resistance, but figuring out how to do this and still support pg_upgrade could be quite ugly.

Thanks,

Stephen
On 21 May 2012 13:56, Robert Haas <robertmhaas@gmail.com> wrote:
> At dinner on Friday night at PGCon, the end of the table that included Tom Lane, Stephen Frost, and myself got to talking about the idea of including some kind of metapage in every relation, including heap relations. At least some index relations already have something like this (cf _bt_initmetapage, _hash_metapinit). I believe that adding this for all relations, including heaps, would allow us to make improvements in several areas.

The only thing against these ideas is that you're putting the design before the requirements, which always makes me nervous.

I very much like the idea of a common framework to support multiple requirements. If we can view a couple of other designs as well, it may quickly become clear this is the right way. In any case, the topics discussed here are important ones, so thanks for covering them.

What springs immediately to mind is why this would not be just another fork.

> 1. Tom was interested in the idea of trying to make the system catalog entries which describe the system catalogs themselves completely immutable, so that they can potentially be shared between databases. For example, we might have shared catalogs pg_class_shared and pg_attribute_shared, describing the structure of all the system catalogs; and then we might also have pg_class and pg_attribute within each database, describing the structure of tables which exist only within that database. Right now, this is not possible, because values like relpages, reltuples, and relfrozenxid can vary from database to database. However, if those values were stored in a metapage associated with the heap relation rather than in the system catalogs, then potentially we could make this work. The most obvious benefit of this is that it would reduce the on-disk footprint of a new database, but there are other possible benefits as well. For example, a process not bound to a database could read a shared catalog even if it weren't nailed, and if we ever implement a prefork system for backends, they'd be able to do more of their initialization steps before learning which database they were to target.

This is important. I like the idea of breaking down the barriers between databases to allow it to be an option for one backend to access tables in multiple databases. The current mechanism doesn't actually prevent looking at data from other databases using internal APIs, so full security doesn't exist. It's a very common user requirement to wish to join tables stored in different databases, which ought to be possible more cleanly with correct privileges.

> 2. I'm interested in having a cleaner way to associate non-transactional state with a relation. This has come up a few times. We currently handle this by having lazy VACUUM do in-place heap updates to replace values like relpages, reltuples, and relfrozenxid, but this feels like a kludge. It's particularly scary to think about relying on this for anything critical given that non-inplace heap updates can be happening simultaneously, and the consequences of losing an update to relfrozenxid in particular are disastrous. Plus, it requires hackery in pg_upgrade to preserve the value between the old and new clusters; we've already had to fix two data-destroying bugs in that logic. There are several other things that we might want to do that have similar requirements. For example, Pavan's idea of folding VACUUM's second heap pass into the next vacuum cycle requires a relation-wide piece of state which can probably be represented as a single bit, but putting that bit in pg_class would require the same sorts of hacks there that we already have for relfrozenxid, with similar consequences if it's not properly preserved. Making unlogged tables logged or the other way around appears to require some piece of relation-level state *that can be accessed during recovery*, and pg_class is not going to work for that. Page checksums have a similar requirement if the granularity for turning them on and off is anything less than the entire cluster. Whenever we decide to roll out a new page version, we'll want a place to record the oldest page version that might be present in a particular relation, so that we can easily check whether a cluster can be upgraded to a new release that has dropped support for an old page version. Having a common framework for all of these things seems like it will probably be easier than solving each problem individually, and a metapage is a good place to store non-transactional state.

I thought there was a patch that put that info in a separate table 1:1 with pg_class.

Not very sure why a metapage is better than a catalog table. We would still want a view that allows us to access that data as if it were a catalog table.

> 3. Right now, a new table uses up a minimum of 3 inodes, even if it has no indexes: one for the main fork, one for the visibility map, and one for the free space map. For people who have lots and lots of little tiny tables, this is quite inefficient. The amount of information we'd have to store in a heap metapage would presumably not be very big, so we could potentially move the first, say, 1K of the visibility map into the heap metapage, meaning that tables less than 64MB would no longer require a separate visibility map fork. Something similar could possibly be done with the free-space map, though I am unsure of the details. Right now, a relation containing just one tuple consumes 5 8k blocks on disk (1 for the main fork, 3 for the FSM, and 1 for the VM) and 3 inodes; getting that down to 8kB and 1 inode would be very nice. The case of a completely-empty relation is a bit annoying; that right now takes 1 inode and 0 blocks and I suspect we'd end up with 1 inode and 1 block, but I think it might still be a win overall.

Again, there are other ways to optimise the FSM for small tables.

> 4. Every once in a while, somebody's database ends up in pieces in lost+found. We could make this a bit easier to recover from by including the database OID, relfilenode, and table OID in the metapage. This wouldn't be perfect, since a relation over one GB would still only have one metapage, so additional relation segments would still be a problem. But it would still be a huge improvement over the status quo: some very large percentage of the work of putting everything back where it goes could probably be done by a Perl script that read all the metapages, and if you needed to know, say, which file contained pg_class, that would be a whole lot easier, too.

That sounds like the requirement that is driving this idea.

> Now, there are a couple of obvious problems here, the biggest of which is probably that we want to avoid breaking pg_upgrade. I don't have a great solution to that problem. The most realistic option I can think of at the moment is to fudge things so that existing features can continue to work even if the metapage isn't present. Any table rewrite would add a metapage; if you want to use a new feature that requires a metapage, you have to rewrite the table first to get one. However, that's pretty unfortunate in terms of goal #1 and some parts of goal #2, because if you can't be certain of having the metapage present then you can't really store data there in lieu of pg_class; the best you'll be able to do is have it both places, at least until you're ready to deprecate upgrades from releases that don't contain metapage support. Hopefully someone has a better idea...

You don't have to rewrite the table, you just need to update the rows so they migrate to another block.

That seems easy enough, but still not sure why you wouldn't just use another fork. Or another idea would be to have the first page have a non-zero pd_special.

I know you were recording what was discussed as an initial starting point. Looks like a good set of problems to solve.

--
Simon Riggs
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
* Simon Riggs (simon@2ndQuadrant.com) wrote:
> The only thing against these ideas is that you're putting the design before the requirements, which always makes me nervous.
[...]
> What springs immediately to mind is why this would not be just another fork.

One of the requirements, though perhaps it wasn't made very clear, really is to reduce the on-disk footprint, both in terms of inodes and actual disk usage, if possible.

> This is important. I like the idea of breaking down the barriers between databases to allow it to be an option for one backend to access tables in multiple databases. The current mechanism doesn't actually prevent looking at data from other databases using internal APIs, so full security doesn't exist. It's a very common user requirement to wish to join tables stored in different databases, which ought to be possible more cleanly with correct privileges.

That's really a whole different ball of wax, and I don't believe what Robert was proposing would actually allow that to happen, due to the other database-level things which are needed to keep everything consistent... That's my understanding, anyway. I'd be happy as anyone if we could actually make it work, but isn't the SysCache stuff per-database? Also, cross-database queries would actually make it more difficult to have per-database roles, which is one thing that I was hoping we might be able to work into this, though perhaps we could have a shared roles table and a per-database roles table, and only 'global' roles would be able to issue cross-database queries..

> Not very sure why a metapage is better than a catalog table. We would still want a view that allows us to access that data as if it were a catalog table.

Right, we were discussing that, and what would happen if someone did a 'select *' against it... Having to pass through all of the files on disk wouldn't be good, but if we could make it use a cache to return that information, perhaps it'd work.

> Again, there are other ways to optimise the FSM for small tables.

Sure, but is there one where we also reduce the number of inodes we allocate for tiny tables..?

> That sounds like the requirement that is driving this idea.

Regarding forensics, it's a nice bonus, but I think the real requirement is the reduction of inode and disk usage, both for the per-database catalog and for tiny tables.

> You don't have to rewrite the table, you just need to update the rows so they migrate to another block.

Well, that depends on exactly how it gets implemented, but that's an interesting idea, certainly..

Thanks,

Stephen
On Mon, May 21, 2012 at 3:15 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> I very much like the idea of a common framework to support multiple requirements. If we can view a couple of other designs as well, it may quickly become clear this is the right way. In any case, the topics discussed here are important ones, so thanks for covering them.

I considered a couple of other possibilities:

- We could split pg_class into pg_class and pg_class_nt (non-transactional). This would solve problem #1 (allowing pg_class/pg_attribute entries for system catalogs to be shared across all databases) but it doesn't do anything for problem #3 (excessive inode consumption) or problem #4 (watermarking for crash recovery), and isn't very good for problem #2 (maintenance of non-transactional state) either, since part of the hope here is that we'd be able to get at this state during recovery even when HS is not used.

- In lieu of adding an entire metapage, we could just add some special space to the first page, or maybe to every N'th page. Adding space to every N'th page would be the best solution to problem #4 (watermarking), and adding even a small amount of state to the first page would be enough for problems #1 and #2. However, I don't think it would work for problem #3 (reducing inode consumption), because even if the special space is pretty big, you won't really be able to mix tuples and visibility map information (for example) on the same page without complicating the buffer locking regimen unbearably. The dance we have to do to make the visibility map crash-safe is already a lot hairier than I'd really prefer. Also, I think we really need a lot of this info for both tables and indexes, and I think it will be simpler to decide that everything has a metapage rather than to decide that some things have a metapage and some things just have a little extra stuff crammed into the special space.

- I considered the idea of designing a crash-safe persistent hash table, that would be sort of like a table but really more like a key-value store with keys and values being C structs. This would be similar to the pg_class/pg_class_nt split idea, except that pg_class_nt would be one of these new crash-safe persistent hash table objects, rather than a normal table; and there's a decent possibility we'd find other applications for such a beast. However, it wouldn't help with problem #3 or problem #4; and Tom seemed to be gravitating toward the design in my OP rather than this idea.

One point that was raised is that btree and hash indexes already have a metapage, so sticking a little more data into it doesn't really cost anything; and heap relations are pretty much going to end up nailing the visibility map and free space map pages in cache, so it's not clear that this is any less cache-efficient in those cases either.

For all that, I kind of like the idea of a persistent hash table object, which I suspect could be used to solve some problems not on the list in my OP as well as some of the ones that are there, but I don't feel too bad laying that idea aside for now. If it's really a good idea, it'll come up again.

> What springs immediately to mind is why this would not be just another fork.

This was pretty much the first thing I considered, but it makes problem #3 worse, and I really don't want to do that. I think 3 inodes per table is already too many, and I expect the problem to get worse. I feel like every third crazy feature idea I come up with involves creating yet another relation fork, and I'm pretty sure I won't be the last person to think about such things, and so we're probably headed that way, but I think we'd better try to hold the line as much as is reasonably possible.

One random idea would be to have pg_upgrade create a special one-block relation fork for the heap metapage that would get folded into the main fork the first time the table gets rewritten. So we'd add another fork, but only as a hack to facilitate in-place upgrade.

> This is important. I like the idea of breaking down the barriers between databases to allow it to be an option for one backend to access tables in multiple databases. The current mechanism doesn't actually prevent looking at data from other databases using internal APIs, so full security doesn't exist. It's a very common user requirement to wish to join tables stored in different databases, which ought to be possible more cleanly with correct privileges.

As Stephen says, this would require a lot more than just making pg_class_shared/pg_attribute_shared work, and I don't particularly believe it's a good idea anyway. That having been said, if we decided we wanted to go this way in some future release, having done this first couldn't but help.

> I thought there was a patch that put that info in a separate table 1:1 with pg_class.
>
> Not very sure why a metapage is better than a catalog table.

Mostly because there's no chance of the startup process accessing a catalog table during recovery, but it can read a metapage.

> We would still want a view that allows us to access that data as if it were a catalog table.

Agreed. Tom said the same.

> Again, there are other ways to optimise the FSM for small tables.

True, but that doesn't make this a bad one.

>> 4. Every once in a while, somebody's database ends up in pieces in lost+found. We could make this a bit easier to recover from by including the database OID, relfilenode, and table OID in the metapage. This wouldn't be perfect, since a relation over one GB would still only have one metapage, so additional relation segments would still be a problem. But it would still be a huge improvement over the status quo: some very large percentage of the work of putting everything back where it goes could probably be done by a Perl script that read all the metapages, and if you needed to know, say, which file contained pg_class, that would be a whole lot easier, too.
>
> That sounds like the requirement that is driving this idea.

No, I listed it fourth because I think it's the least interesting benefit. It IS a benefit, but if this were the primary goal it would be a LOT simpler to shove a few bytes into every N'th heap special space. I coded up a patch for that on my other laptop, and then reformatted the hard drive without saving the patch (brilliant!), so I no longer have working code for this. But it's not that hard. I am much more interested in benefit #2, the ability to maintain non-transactional state that can be read by the startup process during recovery, than I am in this goal. Unfortunately that's harder, but I think it's worth the effort.

> You don't have to rewrite the table, you just need to update the rows so they migrate to another block.

True.

> That seems easy enough, but still not sure why you wouldn't just use another fork. Or another idea would be to have the first page have a non-zero pd_special.

See above for a discussion of these points.

> I know you were recording what was discussed as an initial starting point. Looks like a good set of problems to solve.

Thanks.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, May 21, 2012 at 3:15 PM, Stephen Frost <sfrost@snowman.net> wrote:
> * Robert Haas (robertmhaas@gmail.com) wrote:
>> The FSM and VM are small enough that interleaving them with the actual data probably wouldn't slow down seq scans materially.
>
> Wouldn't that end up potentially causing lots of random i/o if you need to look at many parts of the FSM or VM..?

I doubt it. They probably stay in core anyway.

> Also, wouldn't having it at the start of the heap reduce the changes needed to the SM? Along with making such things easier to find, when talking about forensics?

The metapage, surely yes. If we wanted to fold the FSM and VM into the main fork in their entirety, probably not. But I don't have much desire to do that. I think it's fine for a BIG relation to eat a couple of inodes. I just don't want a little one to do that.

> Of course, the real challenge here is dealing with such an on-disk format change... If we were starting from scratch, I doubt there would be much resistance, but figuring out how to do this and still support pg_upgrade could be quite ugly.

That does seem to be the ten million dollar question, but already we've batted around a few solutions on this thread, so I suspect we'll find a way to make it work. I think my next step is going to be to spend some more time studying what the various index AMs already have in terms of metapages.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 22 May 2012 02:50, Robert Haas <robertmhaas@gmail.com> wrote:
>> Not very sure why a metapage is better than a catalog table.
>
> Mostly because there's no chance of the startup process accessing a catalog table during recovery, but it can read a metapage.

OK, sounds reasonable.

Based upon all you've said, I'd suggest that we make a new kind of fork, in a separate file, for this: .meta. But we also optimise the VM and FSM in the way you suggest, so that we can replace .vm and .fsm with just .meta in most cases. Big tables would get a .vm and .fsm appearing when they get big enough, but that won't challenge the inode limits. When .vm and .fsm do appear, we remove that info from the metapage - that means we keep all code as it is currently, except for an optimisation of .vm and .fsm when those are small enough to do so.

We can watermark data files using special space on block zero, using some code to sneak that in when the page is next written, but that is regarded as optional, rather than an essential aspect of an upgrade/normal operation.

Having pg_upgrade touch data files is both dangerous and difficult to back out in case of mistake, so I am wary of putting the metapage at block 0. Doing it the way I suggest means the .meta files would be wholly new and can be deleted as a back-out. We can also clean away any unnecessary .vm/.fsm files as a later step.

--
Simon Riggs
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Tue, May 22, 2012 at 4:52 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> Based upon all you've said, I'd suggest that we make a new kind of fork, in a separate file, for this: .meta. But we also optimise the VM and FSM in the way you suggest, so that we can replace .vm and .fsm with just .meta in most cases. Big tables would get a .vm and .fsm appearing when they get big enough, but that won't challenge the inode limits. When .vm and .fsm do appear, we remove that info from the metapage - that means we keep all code as it is currently, except for an optimisation of .vm and .fsm when those are small enough to do so.

Well, let's see. That would mean that a small heap relation has 2 forks instead of 3, and a large relation has 4 forks instead of 3. In my proposal, a small relation has 1 fork instead of 3, and a large relation still has 3 forks. So I like mine better.

Also, I think that we need a good chunk of the metadata here for both tables and indexes. For example, if we use the metapage to store information about whether a relation is logged, unlogged, being converted from logged to unlogged, or being converted from unlogged to logged, we need that information both for tables and for indexes. Now, there's no absolute reason why those cases have to be handled symmetrically, but I think things will be a lot simpler if they are. If we settle on the rule that block 0 of every relation contains a certain chunk of metadata at a certain byte offset, then the code to retrieve that data when needed is pretty darn simple (see the sketch at the end of this mail). If tables put it in a separate fork and indexes put it in the main fork inside the metablock somewhere, then things are not so simple. And I sure don't want to add a separate fork for every index just to hold the metadata: that would be a huge hit in terms of total inode consumption.

> We can watermark data files using special space on block zero, using some code to sneak that in when the page is next written, but that is regarded as optional, rather than an essential aspect of an upgrade/normal operation.
>
> Having pg_upgrade touch data files is both dangerous and difficult to back out in case of mistake, so I am wary of putting the metapage at block 0. Doing it the way I suggest means the .meta files would be wholly new and can be deleted as a back-out. We can also clean away any unnecessary .vm/.fsm files as a later step.

It seems pretty clear to me that making pg_upgrade responsible for emptying block zero is a non-starter. But I don't think that's a reason to throw out the design; I think it's a problem we can work around.
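Here's the retrieval sketch I promised. HeapMetaState is the invented struct from my earlier mail, and this glosses over caching and any WAL considerations; it's an illustration, not a patch:

#include "postgres.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"
#include "utils/rel.h"

#define METAPAGE_BLKNO  0       /* metadata always lives in block 0 */

static HeapMetaState
relation_get_meta(Relation rel)
{
    Buffer          buf;
    HeapMetaState   meta;

    /* Read block 0 and copy the fixed-size metadata out of it. */
    buf = ReadBuffer(rel, METAPAGE_BLKNO);
    LockBuffer(buf, BUFFER_LOCK_SHARE);
    memcpy(&meta, PageGetContents(BufferGetPage(buf)), sizeof(meta));
    UnlockReleaseBuffer(buf);

    return meta;
}

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company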
On 22 May 2012 13:52, Robert Haas <robertmhaas@gmail.com> wrote:
> It seems pretty clear to me that making pg_upgrade responsible for emptying block zero is a non-starter. But I don't think that's a reason to throw out the design; I think it's a problem we can work around.

I like your design better as well *if* you can explain how we can get to it. My proposal was a practical alternative that would allow the idea to proceed.

--
Simon Riggs
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Tue, May 22, 2012 at 09:52:30AM +0100, Simon Riggs wrote:
> Having pg_upgrade touch data files is both dangerous and difficult to back out in case of mistake, so I am wary of putting the metapage at block 0. Doing it the way I suggest means the .meta files would be wholly new and can be deleted as a back-out. We can also clean away any unnecessary .vm/.fsm files as a later step.

Pg_upgrade never modifies the old cluster, except to lock it in link mode, so there is never anything to back out.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +
On 24 May 2012 23:02, Bruce Momjian <bruce@momjian.us> wrote:
> On Tue, May 22, 2012 at 09:52:30AM +0100, Simon Riggs wrote:
>> Having pg_upgrade touch data files is both dangerous and difficult to back out in case of mistake, so I am wary of putting the metapage at block 0. Doing it the way I suggest means the .meta files would be wholly new and can be deleted as a back-out. We can also clean away any unnecessary .vm/.fsm files as a later step.
>
> Pg_upgrade never modifies the old cluster, except to lock it in link mode, so there is never anything to back out.

Agreed. Robert's proposal was to make pg_upgrade modify the cluster, which I was observing wasn't a good plan.

--
Simon Riggs
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 5/22/12 12:09 PM, Simon Riggs wrote:
> On 22 May 2012 13:52, Robert Haas <robertmhaas@gmail.com> wrote:
>> It seems pretty clear to me that making pg_upgrade responsible for emptying block zero is a non-starter. But I don't think that's a reason to throw out the design; I think it's a problem we can work around.
>
> I like your design better as well *if* you can explain how we can get to it. My proposal was a practical alternative that would allow the idea to proceed.

It occurred to me that having a metapage with information useful to recovery operations in *every segment* would be useful; it certainly seems worth the extra block. It then occurred to me that we've basically been stuck with 2 places to store relation data; either at the relation level in pg_class or on each page. Sometimes neither one is a good fit.

ISTM that a lot of problems we've faced in the past few years are because there's not a good abstraction between a (mostly) linear tuplespace and the physical storage that goes underneath it.

- pg_upgrade progress is blocked because we can't deal with a new page that's > BLKSZ
- There's no good way to deal with table (or worse, index) bloat
- There's no good way to add the concept of a heap metapage
- Forks are being used to store data that might not belong there only because there's no other choice (visibility info)

Would it make sense to take a step back and think about ways to abstract between logical tuplespace and physical storage? What if 1GB segments had their own metadata? Or groups of segments? Could certain operations that currently have to rewrite an entire table be changed so that they slowly moved pages from one group of segments to another, with a means of marking old pages as having been moved?

Einstein said that "problems cannot be solved by the same level of thinking that created them." Perhaps we're at the point where we need to take a step back from our current storage organization and look for a bigger picture?

--
Jim C. Nasby, Database Architect jim@nasby.net
512.569.9461 (cell) http://jim.nasby.net
On Fri, May 25, 2012 at 5:57 PM, Jim Nasby <jim@nasby.net> wrote:
> It occurred to me that having a metapage with information useful to recovery operations in *every segment* would be useful; it certainly seems worth the extra block. It then occurred to me that we've basically been stuck with 2 places to store relation data; either at the relation level in pg_class or on each page. Sometimes neither one is a good fit.

AFAICS, having metadata in every segment is mostly only helpful for recovering from the situation where files have become disassociated from their filenames, i.e. database -> lost+found. From the viewpoint of virtually the entire server, the block number space is just a continuous sequence that starts at 0 and counts up forever (or, anyway, until 2^32-1). While it wouldn't be impossible to allow that knowledge to percolate up to other parts of the server, it would basically involve drilling a fairly arbitrary hole through an abstraction boundary that has been intact for a very long time, and it's not clear that there's anything magical about 1GB. Notwithstanding the foregoing...

> ISTM that a lot of problems we've faced in the past few years are because there's not a good abstraction between a (mostly) linear tuplespace and the physical storage that goes underneath it.

...I agree with this. I'm not sure exactly what the replacement model would look like, but it's definitely worth some thought - e.g. perhaps there ought to be another mapping layer between logical block numbers and files on disk, so that we can effectively delete blocks out of the middle of a relation without requiring any special OS support, and so that we can multiplex many small relation forks onto a single physical file to minimize inode consumption. Purely by way of illustration, such a layer might look something like the sketch below.
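Every name here is invented, and nothing like this exists today; it's just to make the shape of the idea visible:

#include "postgres.h"
#include "storage/block.h"
#include "storage/relfilenode.h"    /* ForkNumber */

/*
 * Callers keep using (fork, logical block) addresses; a per-relation
 * map translates those to a slot in some physical file.  That would
 * let us free blocks out of the middle of a relation, and let many
 * small forks share one physical file.
 */
typedef struct PhysicalLoc
{
    int         pl_fileno;      /* which physical file */
    BlockNumber pl_blkno;       /* block within that file */
} PhysicalLoc;

typedef struct BlockMap BlockMap;   /* opaque: could be a simple array
                                     * persisted near the metapage, or
                                     * something b-tree-ish for big
                                     * relations */

extern PhysicalLoc blockmap_lookup(BlockMap *map, ForkNumber forknum,
                                   BlockNumber logblkno);

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company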