heap metapages - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | heap metapages |
Date | |
Msg-id | CA+TgmoZ9zn-eaa9Vt7zTd13G1JujcQtaXwG9xNAsm9EyVEs8UQ@mail.gmail.com Whole thread Raw |
Responses |
Re: heap metapages
Re: heap metapages |
List | pgsql-hackers |
At dinner on Friday night at PGCon, the end of the table that included Tom Lane, Stephen Frost, and myself got to talking about the idea of including some kind of metapage in every relation, including heap relations. At least some index relations already have something like this (cf _bt_initmetapage, _hash_metapinit). I believe that adding this for all relations, including heaps, would allow us to make improvements in several areas. 1. Tom was interested in the idea of trying to make the system catalog entries which describe the system catalogs themselves completely immutable, so that they can potentially be shared between databases. For example, we might have shared catalogs pg_class_shared and pg_attribute_shared, describing the structure of all the system catalogs; and then we might also have pg_class and pg_attribute within each database, describing the structure of tables which exist only within that database. Right now, this is not possible, because values like relpages, reltuples, and relfrozenxid can vary from database to database. However, if those values were stored in a metapage associated with the heap relation rather than in the system catalogs, then potentially we could make this work. The most obvious benefit of this is that it would reduce the on-disk footprint of a new database, but there are other possible benefits as well. For example, a process not bound to a database could read a shared catalog even if it weren't nailed, and if we ever implement a prefork system for backends, they'd be able to do more of their initialization steps before learning which database they were to target. 2. I'm interested in having a cleaner way to associate non-transactional state with a relation. This has come up a few times. We currently handle this by having lazy VACUUM do in-place heap updates to replace values like relpages, reltuples, and relfrozenxid, but this feels like a kludge. It's particularly scary to think about relying on this for anything critical given that non-inplace heap updates can be happening simultaneously, and the consequences of losing an update to relfrozenxid in particular are disastrous. Plus, it requires hackery in pg_upgrade to preserve the value between the old and new clusters; we've already had to fix two data-destroying bugs in that logic. There are several other things that we might want to do that have similar requirements. For example, Pavan's idea of folding VACUUM's second heap pass into the next vacuum cycle requires a relation-wide piece of state which can probably be represented as a single bit, but putting that bit in pg_class would require the same sorts of hacks there that we already have for relfrozenxid, with similar consequences if it's not properly preserved. Making unlogged tables logged or the other way around appears to require some piece of relation-level state *that can be accessed during recovery*, and pg_class is not going to work for that.Page checksums have a similar requirement if the granularityfor turning them on and off is anything less than the entire cluster. Whenever we decide to roll out a new page version, we'll want a place to record the oldest page version that might be present in a particular relation, so that we can easily check whether a cluster can be upgraded to a new release that has dropped support for an old page version. Having a common framework for all of these things seems like it will probably be easier than solving each problem individually, and a metapage is a good place to store non-transactional state. 3. Right now, a new table uses up a minimum of 3 inodes, even if it has no indexes: one for the main fork, one for the visibility map, and one for the free space map. For people who have lots and lots of little tiny tables, this is quite inefficient. The amount of information we'd have to store in a heap metapage would presumably not be very big, so we could potentially move the first, say, 1K of the visibility map into the heap metapage, meaning that tables less than 64MB would no longer require a separate visibility map fork. Something similar could possibly be done with the free-space map, though I am unsure of the details. Right now, a relation containing just one tuple consumes 5 8k blocks on disk (1 for the main fork, 3 for the FSM, and 1 for the VM) and 3 inodes; getting that down to 8kB and 1 inode would be very nice. The case of a completely-empty relation is a bit annoying; that right now takes 1 inode and 0 blocks and I suspect we'd end up with 1 inode and 1 block, but I think it might still be a win overall. 4. Every once in a while, somebody's database ends up in pieces in lost+found. We could make this a bit easier to recover from by including the database OID, relfilenode, and table OID in the metapage. This wouldn't be perfect, since a relation over one GB would still only have one metapage, so additional relation segments would still be a problem. But it would be still be a huge improvement over the status quo: some very large percentage of the work of putting everything back where it goes could probably be done by a Perl script that read all the metapages, and if you needed to know, say, which file contained pg_class, that would be a whole lot easier, too. Now, there are a couple of obvious problems here, the biggest of which is probably that we want to avoid breaking pg_upgrade. I don't have a great solution to that problem. The most realistic option I can think of at the moment is to fudge things so that existing features can continue to work even if the metapage isn't present. Any table rewrite would add a metapage; if you want to use a new feature that requires a metapage, you have to rewrite the table first to get one. However, that's pretty unfortunate in terms of goal #1 and some parts of goal #2, because if you can't be certain of having the metapage present then you can't really store data there in lieu of pg_class; the best you'll be able to do is have it both places, at least until you're ready to deprecate upgrades from releases that don't contain metapage support. Hopefully someone has a better idea... Thoughts? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
pgsql-hackers by date: