Thread: heap metapages
At dinner on Friday night at PGCon, the end of the table that included Tom Lane, Stephen Frost, and myself got to talking about the idea of including some kind of metapage in every relation, including heap relations. At least some index relations already have something like this (cf _bt_initmetapage, _hash_metapinit). I believe that adding this for all relations, including heaps, would allow us to make improvements in several areas.

1. Tom was interested in the idea of trying to make the system catalog entries which describe the system catalogs themselves completely immutable, so that they can potentially be shared between databases. For example, we might have shared catalogs pg_class_shared and pg_attribute_shared, describing the structure of all the system catalogs; and then we might also have pg_class and pg_attribute within each database, describing the structure of tables which exist only within that database. Right now, this is not possible, because values like relpages, reltuples, and relfrozenxid can vary from database to database. However, if those values were stored in a metapage associated with the heap relation rather than in the system catalogs, then potentially we could make this work. The most obvious benefit of this is that it would reduce the on-disk footprint of a new database, but there are other possible benefits as well. For example, a process not bound to a database could read a shared catalog even if it weren't nailed, and if we ever implement a prefork system for backends, they'd be able to do more of their initialization steps before learning which database they were to target.

2. I'm interested in having a cleaner way to associate non-transactional state with a relation. This has come up a few times. We currently handle this by having lazy VACUUM do in-place heap updates to replace values like relpages, reltuples, and relfrozenxid, but this feels like a kludge. It's particularly scary to think about relying on this for anything critical given that non-inplace heap updates can be happening simultaneously, and the consequences of losing an update to relfrozenxid in particular are disastrous. Plus, it requires hackery in pg_upgrade to preserve the value between the old and new clusters; we've already had to fix two data-destroying bugs in that logic. There are several other things that we might want to do that have similar requirements. For example, Pavan's idea of folding VACUUM's second heap pass into the next vacuum cycle requires a relation-wide piece of state which can probably be represented as a single bit, but putting that bit in pg_class would require the same sorts of hacks there that we already have for relfrozenxid, with similar consequences if it's not properly preserved. Making unlogged tables logged or the other way around appears to require some piece of relation-level state *that can be accessed during recovery*, and pg_class is not going to work for that. Page checksums have a similar requirement if the granularity for turning them on and off is anything less than the entire cluster. Whenever we decide to roll out a new page version, we'll want a place to record the oldest page version that might be present in a particular relation, so that we can easily check whether a cluster can be upgraded to a new release that has dropped support for an old page version. Having a common framework for all of these things seems like it will probably be easier than solving each problem individually, and a metapage is a good place to store non-transactional state.
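To make that concrete, here is a rough sketch of the kind of relation-level state I have in mind. To be clear, every name here is invented for illustration, not a proposal for the actual layout:

#include "postgres.h"
#include "storage/block.h"

/*
 * Illustrative sketch only: non-transactional, relation-level state
 * that could move out of pg_class and into a heap metapage.
 */
typedef struct HeapMetaState
{
    BlockNumber   hms_relpages;     /* what pg_class.relpages holds today */
    float4        hms_reltuples;    /* what pg_class.reltuples holds today */
    TransactionId hms_relfrozenxid; /* what pg_class.relfrozenxid holds today */
    uint16        hms_flags;        /* e.g. "VACUUM second pass pending",
                                     * "unlogged->logged conversion in
                                     * progress", "checksums enabled" */
    uint16        hms_oldest_page_version;  /* oldest page layout version
                                             * that might appear in this
                                             * relation */
} HeapMetaState;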
3. Right now, a new table uses up a minimum of 3 inodes, even if it has no indexes: one for the main fork, one for the visibility map, and one for the free space map. For people who have lots and lots of little tiny tables, this is quite inefficient. The amount of information we'd have to store in a heap metapage would presumably not be very big, so we could potentially move the first, say, 1K of the visibility map into the heap metapage, meaning that tables less than 64MB would no longer require a separate visibility map fork (1K of visibility map is 8192 bits, and each bit covers one 8K heap page, so 8192 * 8K = 64MB). Something similar could possibly be done with the free-space map, though I am unsure of the details. Right now, a relation containing just one tuple consumes 5 8k blocks on disk (1 for the main fork, 3 for the FSM, and 1 for the VM) and 3 inodes; getting that down to 8kB and 1 inode would be very nice. The case of a completely-empty relation is a bit annoying; that right now takes 1 inode and 0 blocks, and I suspect we'd end up with 1 inode and 1 block, but I think it might still be a win overall.

4. Every once in a while, somebody's database ends up in pieces in lost+found. We could make this a bit easier to recover from by including the database OID, relfilenode, and table OID in the metapage. This wouldn't be perfect, since a relation over one GB would still only have one metapage, so additional relation segments would still be a problem. But it would still be a huge improvement over the status quo: some very large percentage of the work of putting everything back where it goes could probably be done by a Perl script that read all the metapages, and if you needed to know, say, which file contained pg_class, that would be a whole lot easier, too.

Now, there are a couple of obvious problems here, the biggest of which is probably that we want to avoid breaking pg_upgrade. I don't have a great solution to that problem. The most realistic option I can think of at the moment is to fudge things so that existing features can continue to work even if the metapage isn't present. Any table rewrite would add a metapage; if you want to use a new feature that requires a metapage, you have to rewrite the table first to get one. However, that's pretty unfortunate in terms of goal #1 and some parts of goal #2, because if you can't be certain of having the metapage present then you can't really store data there in lieu of pg_class; the best you'll be able to do is have it both places, at least until you're ready to deprecate upgrades from releases that don't contain metapage support. Hopefully someone has a better idea...

Thoughts?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, May 21, 2012 at 12:56 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> At dinner on Friday night at PGCon, the end of the table that included Tom Lane, Stephen Frost, and myself got to talking about the idea of including some kind of metapage in every relation, including heap relations. At least some index relations already have something like this (cf _bt_initmetapage, _hash_metapinit). I believe that adding this for all relations, including heaps, would allow us to make improvements in several areas.

The first thing that jumps to mind is: why can't the metapage be extended to span multiple pages if necessary? I've often wondered why the visibility map isn't stored within the heap itself...

merlin
On Mon, May 21, 2012 at 2:22 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Mon, May 21, 2012 at 12:56 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> At dinner on Friday night at PGCon, the end of the table that included Tom Lane, Stephen Frost, and myself got to talking about the idea of including some kind of metapage in every relation, including heap relations. At least some index relations already have something like this (cf _bt_initmetapage, _hash_metapinit). I believe that adding this for all relations, including heaps, would allow us to make improvements in several areas.
>
> The first thing that jumps to mind is: why can't the metapage be extended to span multiple pages if necessary? I've often wondered why the visibility map isn't stored within the heap itself...

Well, the idea of a metapage, almost by definition, is that it stores a small amount of information whose size is pretty much fixed and which can be reasonably anticipated to always fit in one page. If you're trying to store some data that can get bigger than that (or even come close to filling that up), you need a different system. I'm anticipating that the amount of relation metadata we need to store will fit into a 512-byte sector with significant room left over, leaving us with the rest of the block for whatever we'd like to use it for (e.g. bits of the FSM or VM). If at some point in the future we need some kind of relation-level metadata that can grow beyond a handful of bytes, we can either put it in its own fork, or store one or more block pointers in the metapage indicating the blocks where the information is stored - but right now I'm not seeing the need for anything that fancy. (A sketch of what I mean by the block-pointer fallback is at the end of this mail.)

Now, that having been said, I don't think there's any particular reason why we couldn't multiplex all the relation forks onto a single physical file if we were so inclined. The FSM and VM are small enough that interleaving them with the actual data probably wouldn't slow down seq scans materially. But on the other hand, I am not sure that we'd gain much by it in general. I see the value of doing it for small relations: it saves inodes, potentially quite a lot of inodes if you're on a system that uses schemas to implement multi-tenancy. But it's not clear to me that it's worthwhile in general. Sticking all the FSM stuff in its own relation may allow the OS to lay out those pages physically closer to each other on disk, whereas interleaving them with the data blocks would probably give up that advantage, and it's not clear to me what we'd be getting in exchange.
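Here is that block-pointer sketch; again, everything here is invented purely for illustration:

#include "postgres.h"
#include "storage/block.h"

#define HEAP_META_MAX_EXT   4   /* hypothetical cap on extension pointers */

typedef struct HeapMetaHeader
{
    uint32      hmh_magic;      /* identifies block 0 as a metapage */
    uint32      hmh_version;    /* metapage layout version */
    /* ... the fixed relation-level fields, well under 512 bytes ... */
    BlockNumber hmh_ext[HEAP_META_MAX_EXT]; /* InvalidBlockNumber if unused;
                                             * else the first block of an
                                             * overflow chain */
} HeapMetaHeader;

/*
 * Everything in block 0 past sizeof(HeapMetaHeader) is then free for
 * other uses, e.g. the first 1kB of the visibility map.
 */

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company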
* Robert Haas (robertmhaas@gmail.com) wrote:
> The FSM and VM are small enough that interleaving them with the actual data probably wouldn't slow down seq scans materially.

Wouldn't that end up potentially causing lots of random i/o if you need to look at many parts of the FSM or VM..?

Also, wouldn't having it at the start of the heap reduce the changes needed to the SM? Along with making such things easier to find, when talking about forensics?

Of course, the real challenge here is dealing with such an on-disk format change... If we were starting from scratch, I doubt there would be much resistance, but figuring out how to do this and still support pg_upgrade could be quite ugly.

Thanks,

Stephen
On 21 May 2012 13:56, Robert Haas <robertmhaas@gmail.com> wrote:
> At dinner on Friday night at PGCon, the end of the table that included Tom Lane, Stephen Frost, and myself got to talking about the idea of including some kind of metapage in every relation, including heap relations. At least some index relations already have something like this (cf _bt_initmetapage, _hash_metapinit). I believe that adding this for all relations, including heaps, would allow us to make improvements in several areas.

The only thing against these ideas is that you're putting the design before the requirements, which always makes me nervous.

I very much like the idea of a common framework to support multiple requirements. If we can view a couple of other designs as well, it may quickly become clear this is the right way. In any case, the topics discussed here are important ones, so thanks for covering them.

What springs immediately to mind is why this would not be just another fork.

> 1. Tom was interested in the idea of trying to make the system catalog entries which describe the system catalogs themselves completely immutable, so that they can potentially be shared between databases. For example, we might have shared catalogs pg_class_shared and pg_attribute_shared, describing the structure of all the system catalogs; and then we might also have pg_class and pg_attribute within each database, describing the structure of tables which exist only within that database. Right now, this is not possible, because values like relpages, reltuples, and relfrozenxid can vary from database to database. However, if those values were stored in a metapage associated with the heap relation rather than in the system catalogs, then potentially we could make this work. The most obvious benefit of this is that it would reduce the on-disk footprint of a new database, but there are other possible benefits as well. For example, a process not bound to a database could read a shared catalog even if it weren't nailed, and if we ever implement a prefork system for backends, they'd be able to do more of their initialization steps before learning which database they were to target.

This is important. I like the idea of breaking down the barriers between databases to allow it to be an option for one backend to access tables in multiple databases. The current mechanism doesn't actually prevent looking at data from other databases using internal APIs, so full security doesn't exist. It's a very common user requirement to wish to join tables stored in different databases, which ought to be possible more cleanly with correct privileges.

> 2. I'm interested in having a cleaner way to associate non-transactional state with a relation. This has come up a few times. We currently handle this by having lazy VACUUM do in-place heap updates to replace values like relpages, reltuples, and relfrozenxid, but this feels like a kludge. It's particularly scary to think about relying on this for anything critical given that non-inplace heap updates can be happening simultaneously, and the consequences of losing an update to relfrozenxid in particular are disastrous. Plus, it requires hackery in pg_upgrade to preserve the value between the old and new clusters; we've already had to fix two data-destroying bugs in that logic. There are several other things that we might want to do that have similar requirements. For example, Pavan's idea of folding VACUUM's second heap pass into the next vacuum cycle requires a relation-wide piece of state which can probably be represented as a single bit, but putting that bit in pg_class would require the same sorts of hacks there that we already have for relfrozenxid, with similar consequences if it's not properly preserved. Making unlogged tables logged or the other way around appears to require some piece of relation-level state *that can be accessed during recovery*, and pg_class is not going to work for that. Page checksums have a similar requirement if the granularity for turning them on and off is anything less than the entire cluster. Whenever we decide to roll out a new page version, we'll want a place to record the oldest page version that might be present in a particular relation, so that we can easily check whether a cluster can be upgraded to a new release that has dropped support for an old page version. Having a common framework for all of these things seems like it will probably be easier than solving each problem individually, and a metapage is a good place to store non-transactional state.

I thought there was a patch that put that info in a separate table 1:1 with pg_class.

Not very sure why a metapage is better than a catalog table. We would still want a view that allows us to access that data as if it were a catalog table.

> 3. Right now, a new table uses up a minimum of 3 inodes, even if it has no indexes: one for the main fork, one for the visibility map, and one for the free space map. For people who have lots and lots of little tiny tables, this is quite inefficient. The amount of information we'd have to store in a heap metapage would presumably not be very big, so we could potentially move the first, say, 1K of the visibility map into the heap metapage, meaning that tables less than 64MB would no longer require a separate visibility map fork. Something similar could possibly be done with the free-space map, though I am unsure of the details. Right now, a relation containing just one tuple consumes 5 8k blocks on disk (1 for the main fork, 3 for the FSM, and 1 for the VM) and 3 inodes; getting that down to 8kB and 1 inode would be very nice. The case of a completely-empty relation is a bit annoying; that right now takes 1 inode and 0 blocks and I suspect we'd end up with 1 inode and 1 block, but I think it might still be a win overall.

Again, there are other ways to optimise the FSM for small tables.

> 4. Every once in a while, somebody's database ends up in pieces in lost+found. We could make this a bit easier to recover from by including the database OID, relfilenode, and table OID in the metapage. This wouldn't be perfect, since a relation over one GB would still only have one metapage, so additional relation segments would still be a problem. But it would still be a huge improvement over the status quo: some very large percentage of the work of putting everything back where it goes could probably be done by a Perl script that read all the metapages, and if you needed to know, say, which file contained pg_class, that would be a whole lot easier, too.

That sounds like the requirement that is driving this idea.

> Now, there are a couple of obvious problems here, the biggest of which is probably that we want to avoid breaking pg_upgrade. I don't have a great solution to that problem. The most realistic option I can think of at the moment is to fudge things so that existing features can continue to work even if the metapage isn't present. Any table rewrite would add a metapage; if you want to use a new feature that requires a metapage, you have to rewrite the table first to get one. However, that's pretty unfortunate in terms of goal #1 and some parts of goal #2, because if you can't be certain of having the metapage present then you can't really store data there in lieu of pg_class; the best you'll be able to do is have it both places, at least until you're ready to deprecate upgrades from releases that don't contain metapage support. Hopefully someone has a better idea...

You don't have to rewrite the table, you just need to update the rows so they migrate to another block.

That seems easy enough, but still not sure why you wouldn't just use another fork. Or another idea would be to have the first page have a non-zero pd_special.

I know you were recording what was discussed as an initial starting point. Looks like a good set of problems to solve.

--
Simon Riggs
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
* Simon Riggs (simon@2ndQuadrant.com) wrote:
> The only thing against these ideas is that you're putting the design before the requirements, which always makes me nervous.
[...]
> What springs immediately to mind is why this would not be just another fork.

One of the requirements, though perhaps it wasn't made very clear, really is to reduce the on-disk footprint, both in terms of inodes and actual disk usage, if possible.

> This is important. I like the idea of breaking down the barriers between databases to allow it to be an option for one backend to access tables in multiple databases. The current mechanism doesn't actually prevent looking at data from other databases using internal APIs, so full security doesn't exist. It's a very common user requirement to wish to join tables stored in different databases, which ought to be possible more cleanly with correct privileges.

That's really a whole different ball of wax, and I don't believe what Robert was proposing would actually allow that to happen, due to the other database-level things which are needed to keep everything consistent... That's my understanding, anyway. I'd be happy as anyone if we could actually make it work, but isn't the SysCache stuff per-database? Also, cross-database queries would actually make it more difficult to have per-database roles, which is one thing that I was hoping we might be able to work into this, though perhaps we could have a shared roles table and a per-database roles table, and only 'global' roles would be able to issue cross-database queries..

> Not very sure why a metapage is better than a catalog table. We would still want a view that allows us to access that data as if it were a catalog table.

Right, we were discussing that, and what would happen if someone did a 'select *' against it... Having to pass through all of the files on disk wouldn't be good, but if we could make it use a cache to return that information, perhaps it'd work.

> Again, there are other ways to optimise the FSM for small tables.

Sure, but is there one where we also reduce the number of inodes we allocate for tiny tables..?

> That sounds like the requirement that is driving this idea.

Regarding forensics, it's a nice bonus, but I think the real requirement is the reduction of inode and disk usage, both for the per-database catalog and for tiny tables.

> You don't have to rewrite the table, you just need to update the rows so they migrate to another block.

Well, that depends on exactly how it gets implemented, but that's an interesting idea, certainly..

Thanks,

Stephen
On Mon, May 21, 2012 at 3:15 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> I very much like the idea of a common framework to support multiple requirements. If we can view a couple of other designs as well, it may quickly become clear this is the right way. In any case, the topics discussed here are important ones, so thanks for covering them.

I considered a couple of other possibilities:

- We could split pg_class into pg_class and pg_class_nt (non-transactional). This would solve problem #1 (allowing pg_class/pg_attribute entries for system catalogs to be shared across all databases) but it doesn't do anything for problem #3 (excessive inode consumption) or problem #4 (watermarking for crash recovery), and isn't very good for problem #2 (maintenance of non-transactional state) either, since part of the hope here is that we'd be able to get at this state during recovery even when HS is not used.

- In lieu of adding an entire metapage, we could just add some special space to the first page, or maybe to every N'th page. Adding space to every N'th page would be the best solution to problem #4 (watermarking), and adding even a small amount of state to the first page would be enough for problems #1 and #2. However, I don't think it would work for problem #3 (reducing inode consumption), because even if the special space is pretty big, you won't really be able to mix tuples and visibility map information (for example) on the same page without complicating the buffer locking regimen unbearably. The dance we have to do to make the visibility map crash-safe is already a lot hairier than I'd really prefer. Also, I think we really need a lot of this info for both tables and indexes, and I think it will be simpler to decide that everything has a metapage rather than to decide that some things have a metapage and some things just have a little extra stuff crammed into the special space.

- I considered the idea of designing a crash-safe persistent hash table, that would be sort of like a table but really more like a key-value store with keys and values being C structs. This would be similar to the pg_class/pg_class_nt split idea, except that pg_class_nt would be one of these new crash-safe persistent hash table objects, rather than a normal table; and there's a decent possibility we'd find other applications for such a beast. However, it wouldn't help with problem #3 or problem #4; and Tom seemed to be gravitating toward the design in my OP rather than this idea.

One point that was raised is that btree and hash indexes already have a metapage, so sticking a little more data into it doesn't really cost anything; and heap relations are pretty much going to end up nailing the visibility map and free space map pages in cache, so it's not clear that this is any less cache-efficient in those cases either.

For all that, I kind of like the idea of a persistent hash table object, which I suspect could be used to solve some problems not on the list in my OP as well as some of the ones that are there, but I don't feel too bad laying that idea aside for now. If it's really a good idea, it'll come up again.

> What springs immediately to mind is why this would not be just another fork.

This was pretty much the first thing I considered, but it makes problem #3 worse, and I really don't want to do that. I think 3 inodes per table is already too many, and I expect the problem to get worse. I feel like every third crazy feature idea I come up with involves creating yet another relation fork, and I'm pretty sure I won't be the last person to think about such things, and so we're probably headed that way, but I think we'd better try to hold the line as much as is reasonably possible.

One random idea would be to have pg_upgrade create a special one-block relation fork for the heap metapage that would get folded into the main fork the first time the table gets rewritten. So we'd add another fork, but only as a hack to facilitate in-place upgrade.

> This is important. I like the idea of breaking down the barriers between databases to allow it to be an option for one backend to access tables in multiple databases. The current mechanism doesn't actually prevent looking at data from other databases using internal APIs, so full security doesn't exist. It's a very common user requirement to wish to join tables stored in different databases, which ought to be possible more cleanly with correct privileges.

As Stephen says, this would require a lot more than just making pg_class_shared/pg_attribute_shared work, and I don't particularly believe it's a good idea anyway. That having been said, if we decided we wanted to go this way in some future release, having done this first couldn't but help.

> I thought there was a patch that put that info in a separate table 1:1 with pg_class.
>
> Not very sure why a metapage is better than a catalog table.

Mostly because there's no chance of the startup process accessing a catalog table during recovery, but it can read a metapage.

> We would still want a view that allows us to access that data as if it were a catalog table.

Agreed. Tom said the same.

> Again, there are other ways to optimise the FSM for small tables.

True, but that doesn't make this a bad one.

>> 4. Every once in a while, somebody's database ends up in pieces in lost+found. We could make this a bit easier to recover from by including the database OID, relfilenode, and table OID in the metapage. This wouldn't be perfect, since a relation over one GB would still only have one metapage, so additional relation segments would still be a problem. But it would still be a huge improvement over the status quo: some very large percentage of the work of putting everything back where it goes could probably be done by a Perl script that read all the metapages, and if you needed to know, say, which file contained pg_class, that would be a whole lot easier, too.
>
> That sounds like the requirement that is driving this idea.

No, I listed it fourth because I think it's the least interesting benefit. It IS a benefit, but if this were the primary goal it would be a LOT simpler to shove a few bytes into every N'th heap special space. I coded up a patch for that on my other laptop, and then reformatted the hard drive without saving the patch (brilliant!), so I no longer have working code for this. But it's not that hard. I am much more interested in benefit #2, the ability to maintain non-transactional state that can be read by the startup process during recovery, than I am in this goal. Unfortunately that's harder, but I think it's worth the effort.

> You don't have to rewrite the table, you just need to update the rows so they migrate to another block.

True.

> That seems easy enough, but still not sure why you wouldn't just use another fork. Or another idea would be to have the first page have a non-zero pd_special.

See above for a discussion of these points.

> I know you were recording what was discussed as an initial starting point. Looks like a good set of problems to solve.

Thanks.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, May 21, 2012 at 3:15 PM, Stephen Frost <sfrost@snowman.net> wrote:
> * Robert Haas (robertmhaas@gmail.com) wrote:
>> The FSM and VM are small enough that interleaving them with the actual data probably wouldn't slow down seq scans materially.
>
> Wouldn't that end up potentially causing lots of random i/o if you need to look at many parts of the FSM or VM..?

I doubt it. They probably stay in core anyway.

> Also, wouldn't having it at the start of the heap reduce the changes needed to the SM? Along with making such things easier to find, when talking about forensics?

The metapage, surely yes. If we wanted to fold the FSM and VM into the main fork in their entirety, probably not. But I don't have much desire to do that. I think it's fine for a BIG relation to eat a couple of inodes. I just don't want a little one to do that.

> Of course, the real challenge here is dealing with such an on-disk format change... If we were starting from scratch, I doubt there would be much resistance, but figuring out how to do this and still support pg_upgrade could be quite ugly.

That does seem to be the ten million dollar question, but already we've batted around a few solutions on this thread, so I suspect we'll find a way to make it work. I think my next step is going to be to spend some more time studying what the various index AMs already have in terms of metapages.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 22 May 2012 02:50, Robert Haas <robertmhaas@gmail.com> wrote:
>> Not very sure why a metapage is better than a catalog table.
>
> Mostly because there's no chance of the startup process accessing a catalog table during recovery, but it can read a metapage.

OK, sounds reasonable.

Based upon all you've said, I'd suggest that we make a new kind of fork, in a separate file, for this: .meta. But we also optimise the VM and FSM in the way you suggest, so that we can replace .vm and .fsm with just .meta in most cases. Big tables would get a .vm and .fsm appearing when they get big enough, but that won't challenge the inode limits. When .vm and .fsm do appear, we remove that info from the metapage - that means we keep all code as it is currently, except for an optimisation of .vm and .fsm when those are small enough to do so.

We can watermark data files using special space on block zero, using some code to sneak that in when the page is next written, but that is regarded as optional, rather than an essential aspect of an upgrade/normal operation.

Having pg_upgrade touch data files is both dangerous and difficult to back out in case of mistake, so I am wary of putting the metapage at block 0. Doing it the way I suggest means the .meta files would be wholly new and can be deleted as a back-out. We can also clean away any unnecessary .vm/.fsm files as a later step.

--
Simon Riggs
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Tue, May 22, 2012 at 4:52 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> Based upon all you've said, I'd suggest that we make a new kind of fork, in a separate file, for this: .meta. But we also optimise the VM and FSM in the way you suggest, so that we can replace .vm and .fsm with just .meta in most cases. Big tables would get a .vm and .fsm appearing when they get big enough, but that won't challenge the inode limits. When .vm and .fsm do appear, we remove that info from the metapage - that means we keep all code as it is currently, except for an optimisation of .vm and .fsm when those are small enough to do so.

Well, let's see. That would mean that a small heap relation has 2 forks instead of 3, and a large relation has 4 forks instead of 3. In my proposal, a small relation has 1 fork instead of 3, and a large relation still has 3 forks. So I like mine better.

Also, I think that we need a good chunk of the metadata here for both tables and indexes. For example, if we use the metapage to store information about whether a relation is logged, unlogged, being converted from logged to unlogged, or being converted from unlogged to logged, we need that information both for tables and for indexes. Now, there's no absolute reason why those cases have to be handled symmetrically, but I think things will be a lot simpler if they are. If we settle on the rule that block 0 of every relation contains a certain chunk of metadata at a certain byte offset, then the code to retrieve that data when needed is pretty darn simple (see the sketch at the end of this mail). If tables put it in a separate fork and indexes put it in the main fork inside the metablock somewhere, then things are not so simple. And I sure don't want to add a separate fork for every index just to hold the metadata: that would be a huge hit in terms of total inode consumption.

> We can watermark data files using special space on block zero, using some code to sneak that in when the page is next written, but that is regarded as optional, rather than an essential aspect of an upgrade/normal operation.
>
> Having pg_upgrade touch data files is both dangerous and difficult to back out in case of mistake, so I am wary of putting the metapage at block 0. Doing it the way I suggest means the .meta files would be wholly new and can be deleted as a back-out. We can also clean away any unnecessary .vm/.fsm files as a later step.

It seems pretty clear to me that making pg_upgrade responsible for emptying block zero is a non-starter. But I don't think that's a reason to throw out the design; I think it's a problem we can work around.
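Here's the retrieval sketch I promised. HeapMetaState is the invented struct from my earlier mail, and this glosses over caching and any WAL considerations; it's an illustration, not a patch:

#include "postgres.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"
#include "utils/rel.h"

#define METAPAGE_BLKNO  0       /* metadata always lives in block 0 */

static HeapMetaState
relation_get_meta(Relation rel)
{
    Buffer          buf;
    HeapMetaState   meta;

    /* Read block 0 and copy the fixed-size metadata out of it. */
    buf = ReadBuffer(rel, METAPAGE_BLKNO);
    LockBuffer(buf, BUFFER_LOCK_SHARE);
    memcpy(&meta, PageGetContents(BufferGetPage(buf)), sizeof(meta));
    UnlockReleaseBuffer(buf);

    return meta;
}

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company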
On 22 May 2012 13:52, Robert Haas <robertmhaas@gmail.com> wrote:
> It seems pretty clear to me that making pg_upgrade responsible for emptying block zero is a non-starter. But I don't think that's a reason to throw out the design; I think it's a problem we can work around.

I like your design better as well *if* you can explain how we can get to it. My proposal was a practical alternative that would allow the idea to proceed.

--
Simon Riggs
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Tue, May 22, 2012 at 09:52:30AM +0100, Simon Riggs wrote:
> Having pg_upgrade touch data files is both dangerous and difficult to back out in case of mistake, so I am wary of putting the metapage at block 0. Doing it the way I suggest means the .meta files would be wholly new and can be deleted as a back-out. We can also clean away any unnecessary .vm/.fsm files as a later step.

Pg_upgrade never modifies the old cluster, except to lock it in link mode, so there is never anything to back out.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +
On 24 May 2012 23:02, Bruce Momjian <bruce@momjian.us> wrote:
> On Tue, May 22, 2012 at 09:52:30AM +0100, Simon Riggs wrote:
>> Having pg_upgrade touch data files is both dangerous and difficult to back out in case of mistake, so I am wary of putting the metapage at block 0. Doing it the way I suggest means the .meta files would be wholly new and can be deleted as a back-out. We can also clean away any unnecessary .vm/.fsm files as a later step.
>
> Pg_upgrade never modifies the old cluster, except to lock it in link mode, so there is never anything to back out.

Agreed. Robert's proposal was to make pg_upgrade modify the cluster, which I was observing wasn't a good plan.

--
Simon Riggs
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 5/22/12 12:09 PM, Simon Riggs wrote:
> On 22 May 2012 13:52, Robert Haas <robertmhaas@gmail.com> wrote:
>> It seems pretty clear to me that making pg_upgrade responsible for emptying block zero is a non-starter. But I don't think that's a reason to throw out the design; I think it's a problem we can work around.
>
> I like your design better as well *if* you can explain how we can get to it. My proposal was a practical alternative that would allow the idea to proceed.

It occurred to me that having a metapage with information useful to recovery operations in *every segment* would be useful; it certainly seems worth the extra block. It then occurred to me that we've basically been stuck with 2 places to store relation data; either at the relation level in pg_class or on each page. Sometimes neither one is a good fit.

ISTM that a lot of problems we've faced in the past few years are because there's not a good abstraction between a (mostly) linear tuplespace and the physical storage that goes underneath it.

- pg_upgrade progress is blocked because we can't deal with a new page that's > BLKSZ
- There's no good way to deal with table (or worse, index) bloat
- There's no good way to add the concept of a heap metapage
- Forks are being used to store data that might not belong there only because there's no other choice (visibility info)

Would it make sense to take a step back and think about ways to abstract between logical tuplespace and physical storage? What if 1GB segments had their own metadata? Or groups of segments? Could certain operations that currently have to rewrite an entire table be changed so that they slowly moved pages from one group of segments to another, with a means of marking old pages as having been moved?

Einstein said that "problems cannot be solved by the same level of thinking that created them." Perhaps we're at the point where we need to take a step back from our current storage organization and look for a bigger picture?

--
Jim C. Nasby, Database Architect jim@nasby.net
512.569.9461 (cell) http://jim.nasby.net
On Fri, May 25, 2012 at 5:57 PM, Jim Nasby <jim@nasby.net> wrote:
> It occurred to me that having a metapage with information useful to recovery operations in *every segment* would be useful; it certainly seems worth the extra block. It then occurred to me that we've basically been stuck with 2 places to store relation data; either at the relation level in pg_class or on each page. Sometimes neither one is a good fit.

AFAICS, having metadata in every segment is mostly only helpful for recovering from the situation where files have become disassociated from their filenames, i.e. database -> lost+found. From the viewpoint of virtually the entire server, the block number space is just a continuous sequence that starts at 0 and counts up forever (or, anyway, until 2^32-1). While it wouldn't be impossible to allow that knowledge to percolate up to other parts of the server, it would basically involve drilling a fairly arbitrary hole through an abstraction boundary that has been intact for a very long time, and it's not clear that there's anything magical about 1GB. Notwithstanding the foregoing...

> ISTM that a lot of problems we've faced in the past few years are because there's not a good abstraction between a (mostly) linear tuplespace and the physical storage that goes underneath it.

...I agree with this. I'm not sure exactly what the replacement model would look like, but it's definitely worth some thought - e.g. perhaps there ought to be another mapping layer between logical block numbers and files on disk, so that we can effectively delete blocks out of the middle of a relation without requiring any special OS support, and so that we can multiplex many small relation forks onto a single physical file to minimize inode consumption. Purely by way of illustration, such a layer might look something like the sketch below.
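Every name here is invented, and nothing like this exists today; it's just to make the shape of the idea visible:

#include "postgres.h"
#include "storage/block.h"
#include "storage/relfilenode.h"    /* ForkNumber */

/*
 * Callers keep using (fork, logical block) addresses; a per-relation
 * map translates those to a slot in some physical file.  That would
 * let us free blocks out of the middle of a relation, and let many
 * small forks share one physical file.
 */
typedef struct PhysicalLoc
{
    int         pl_fileno;      /* which physical file */
    BlockNumber pl_blkno;       /* block within that file */
} PhysicalLoc;

typedef struct BlockMap BlockMap;   /* opaque: could be a simple array
                                     * persisted near the metapage, or
                                     * something b-tree-ish for big
                                     * relations */

extern PhysicalLoc blockmap_lookup(BlockMap *map, ForkNumber forknum,
                                   BlockNumber logblkno);

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company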