Thread: Rewriting Free Space Map

Rewriting Free Space Map

From
"Heikki Linnakangas"
Date:
I've started working on revamping Free Space Map, using the approach 
where we store a map of heap pages on every nth heap page. What we need 
now is discussion on the details of how exactly it should work.

Here's my rough plan on the implementation. It's long, sorry. I'm fairly 
confident with it, but please let me know if you see any problems or 
have any suggestions or better ideas.

Heap FSM
--------

The FSM is stored in the special area of every nth heap page. When 
extending the relation, the heapam checks whether the block number of the 
new page is one that belongs to the FSM. If it is, it lets the FSM 
initialize it by calling InitFSMPage() on it, and extends the relation 
again to get another, normal heap page.
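
Just to make that check concrete, here's the kind of thing I have in mind 
(a rough sketch only; the chunk-size constant and the helper name are made 
up, and the real interval depends on how many heap pages one FSM page can 
describe):

#include <stdbool.h>

typedef unsigned int BlockNumber;

/*
 * Sketch: assume every FSM_PAGES_PER_CHUNK'th block is an FSM page.  With
 * one byte per heap page, an 8 kB FSM page can describe roughly 8000 heap
 * pages, so the real interval would be in that ballpark.
 */
#define FSM_PAGES_PER_CHUNK 8192

static bool
BlockIsFSM(BlockNumber blkno)
{
    return (blkno % FSM_PAGES_PER_CHUNK) == 0;
}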

I chose the "every nth page is an FSM page" approach, rather than using 
a separate relfilenode, which I also considered. The separate 
relfilenode approach has some advantages, like being able to scan all 
FSM pages in a sequential fashion, but it involves a fair amount of 
catalog and buffer manager changes.

It's convenient that the FSM uses up the whole page, leaving no room for 
heap tuples. It simplifies the locking, as the FSM doesn't need to worry 
about the possibility that the caller is already holding a lock on the 
same page.

In an FSM page, there's one byte for each of the next N heap pages, 
starting from the FSM page. That one byte stores the amount of free 
space on the corresponding heap page, in BLCKSZ/256 byte precision (32 
bytes with default block size).

The mapping of free space to these 256 "buckets" wouldn't necessarily 
have to be linear; we could, for example, have a single bucket for 
pages with more than BLCKSZ/2 bytes of free space and divide the rest 
linearly into 16-byte buckets, but let's keep it simple for now. Of 
course, we could also just use 2 bytes per page and store the free space 
exactly, but 32-byte precision seems like enough to me.
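
To make the linear mapping concrete, the encoding would be roughly this 
(a sketch assuming default BLCKSZ; the rounding directions are what I'd 
expect, not final):

#include <stdint.h>

#define BLCKSZ          8192
#define FSM_BUCKET_SIZE (BLCKSZ / 256)      /* 32 bytes per bucket */

/* Storing: round down, so we never advertise more space than is really there. */
static uint8_t
freespace_to_bucket(int avail_bytes)
{
    return (uint8_t) (avail_bytes / FSM_BUCKET_SIZE);
}

/* Searching: round up, so any page in this bucket or above is guaranteed
 * to satisfy the request. */
static uint8_t
needed_bucket(int request_bytes)
{
    int bucket = (request_bytes + FSM_BUCKET_SIZE - 1) / FSM_BUCKET_SIZE;

    return (uint8_t) (bucket > 255 ? 255 : bucket);
}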


Index FSM
---------

Indexes use a similar scheme, but since we only need to keep track of 
whether a page is used or not, we only need one bit per page. If the 
amount of free space on pages becomes interesting for an indexam in the 
future, it can use the heap FSM implementation instead. Or no FSM at 
all, like the hash indexam.

To use the index FSM, the indexam needs to leave every nth page alone, 
like in the heap. The B-tree assumes that the metapage is at block 0, 
but that's also where we would like to store the first index FSM page. 
To overcome that, we can make the index FSM special area a little bit 
smaller, so that the B-tree metadata fits on the same page as the FSM 
information. That will be wasted space on other FSM pages than block 0, 
but we're only talking about 24 bytes per FSM page, and we only need one 
FSM page per ~512 MB of index pages (with default BLCKSZ).
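
The bitmap manipulation itself is the usual thing, something like this 
(sketch only; "slot" is the index page's position within the range the 
FSM page covers):

#include <stdbool.h>
#include <stdint.h>

static void
index_fsm_set_free(uint8_t *bitmap, unsigned int slot, bool is_free)
{
    if (is_free)
        bitmap[slot / 8] |= (uint8_t) (1 << (slot % 8));
    else
        bitmap[slot / 8] &= (uint8_t) ~(1 << (slot % 8));
}

static bool
index_fsm_is_free(const uint8_t *bitmap, unsigned int slot)
{
    return (bitmap[slot / 8] & (1 << (slot % 8))) != 0;
}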


Performance
-----------

The patch I'm working on currently uses a naive way to find a page in 
the FSM. To find a page with X bytes of free space, it scans the FSM 
pages until one is found. And it always starts the scan from the 
beginning, which is far from optimal. And when there's no page with 
enough free space, it still needs to scan all the FSM pages just to find 
out that we need to extend the relation.

To speed things up, we're going to need some mechanism to avoid that. 
First of all, we need to somehow cache the information that "there's no 
page with >= X bytes left", to avoid fruitless scanning. To speed up the 
case when there's only a few pages with enough free space, we can keep a 
limited size list of such pages in addition to the map.

This information needs to be in shared memory, either on heap pages 
like the FSM pages, managed by the buffer manager, or in a separate 
shmem block. I would like to go with normal bufmgr-managed pages, as 
fixed-size memory blocks have their problems, and the cached 
information should be stored to disk as well.

Let's have one special page in the heap file, called the Free Space List 
(FSL) page, in addition to the normal FSM pages. It has the following 
structure:

struct
{
    bit anypages[256];
    struct
    {
        BlockNumber blockno;
        uint8       freespace;
    } freespacelist[/* as large as fits on page */];
}

Remember that we track the free space on each page using one byte, IOW, 
each page falls into one of 256 buckets of free space. In the anypages 
bitmap, we have one bit per bucket indicating "are there any pages with 
this much free space". When we look for a page with X bytes, we check 
the bits at and above the bucket corresponding to X bytes, and if none 
of them are set we know not to bother scanning the FSM pages.
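
In code, the test would look something along these lines (a sketch, 
assuming the linear 32-byte buckets; the function name is made up):

#include <stdbool.h>
#include <stdint.h>

/*
 * anypages is the 256-bit bitmap from the FSL page, one bit per bucket.
 * Returns false if no page in the table claims enough free space, so the
 * caller can skip the FSM scan and extend the relation instead.
 */
static bool
fsl_worth_scanning(const uint8_t anypages[32], int request_bytes)
{
    int bucket = (request_bytes + 31) / 32;     /* smallest bucket that satisfies the request */

    for (; bucket < 256; bucket++)
    {
        if (anypages[bucket / 8] & (1 << (bucket % 8)))
            return true;
    }
    return false;
}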

To speed up the scan when there is space, we keep a simple list of 
pages with free space. This list is actually like the current FSM, but 
here we only use it as a small cache of the FSM pages. VACUUM and any 
other operations that update the FSM can put pages on the list when 
there are free slots.

We can store the FSL page at a magic location, say block #100. For 
relations smaller than that, there's no need for the FSL and we might as 
well scan the FSM page. I'm not sure if we should have more than one FSL 
page for large tables.

I'm not sure yet how the locking of FSL and FSM pages should work. It 
shouldn't be too hard, though, as the FSM/FSL information is just a hint 
anyway. We do need to take care that we don't permanently lose track of 
any significant amount of free space: once we can do partial vacuums 
using the visibility map, VACUUM might not visit one part of a table 
even if other parts of it are frequently updated.


Page allocation algorithm
-------------------------

There are many different ways we can hand out pages from the FSM. 
Possible strategies, some of which are at odds with each other, include:

1. Try to spread out inserts of different backends, to avoid contention
2. Prefer low-numbered pages, to increase the chance of being able to 
truncate in VACUUM.
3. Reserve pages with lots of free space for big allocations, prefer 
almost full pages for small allocations. To use all space more efficiently.
4. If the table is clustered, try to use pages close to those with 
similar values.
5. On UPDATE, try to use pages close to the old tuple.
6. Prefer pages that are currently in cache, to avoid I/O.

The current FSM tries to achieve only 1, and there haven't been many 
complaints, so I wouldn't worry too much about the other goals. 4 and 6 
would need major changes to the indexam and buffer manager interfaces, 
respectively, and 3 could lead to more I/O when you do a lot of small 
inserts.

We can spread inserts of different backends in the Free Space List by 
moving a page to the end of the list when it's handed out, or removing 
it from there altogether. When the FSL is empty, we can vary the place 
where we start to scan the FSM.

To prefer low-numbered pages, and thus increase the chance of being able 
to truncate the relation later, we can favor lower-numbered blocks in 
VACUUM when we decide which blocks to put into the FSL. And we can also 
bias the starting point of the FSM scan when the FSL is empty.
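
For example, the starting point could be picked along these lines (purely 
illustrative, not a tuned policy; squaring a uniform random number just 
skews the choice toward the start of the relation):

#include <stdlib.h>

typedef unsigned int BlockNumber;

/*
 * Pick an FSM page to start scanning from.  Randomizing spreads concurrent
 * inserters across the table; skewing toward zero prefers low-numbered
 * blocks so the tail of the relation can eventually be truncated.
 */
static BlockNumber
choose_fsm_scan_start(BlockNumber num_fsm_pages)
{
    double r = (double) rand() / ((double) RAND_MAX + 1.0);

    return (BlockNumber) (r * r * num_fsm_pages);
}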


Visibility Map
--------------

So far I've only talked about the FSM, but it's important to consider 
how the Visibility Map fits into the scheme. My current thinking is that 
there will be one bit per heap page in the visibility map. The exact 
semantics of that one bit are still not totally clear, but that's not 
important right now.

There's a few alternatives:
1. Mix the visibility map with the FSM, stealing one bit of every FSM 
byte. There would then be 7 bits for storing how much space there is on 
each page, and 1 bit indicating the visibility.

2. Allocate part of each FSM page for the visibility map. For example, 
the first 1/9 of the page for the VM, and 8/9 for the FSM. The VM part 
would be a straight bitmap with one bit per page, and the FSM part would 
use one byte per page.

3. Use different pages for the VM, for example every nth page for VM, 
and every (n+1)th page for FSM.

I'm leaning towards 2 at the moment. 3 is intriguing as well, though,
because it would help with potential lock contention on the VM/FSM pages.
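
To put some rough numbers on alternative 2 (back-of-the-envelope only, 
ignoring the page header and other overhead): each heap page needs 1 VM 
bit plus 1 FSM byte, i.e. 9 bits of map, so with the default 8 kB block 
size one map page covers about 7281 heap pages:

#define BLCKSZ              8192
#define HEAP_PAGES_PER_MAP  ((BLCKSZ * 8) / 9)              /* ~7281 heap pages per map page */
#define VM_AREA_BYTES       ((HEAP_PAGES_PER_MAP + 7) / 8)  /* ~911 bytes of visibility bitmap */

/* "slot" is a heap page's position within the chunk this map page covers. */

static int
vm_bit_number(int slot)
{
    return slot;                    /* bit offset into the VM bitmap at the start of the page */
}

static int
fsm_byte_offset(int slot)
{
    return VM_AREA_BYTES + slot;    /* the FSM byte array follows the VM bitmap */
}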


Per-chunk relfrozenxid
----------------------
I'm imagining that we would set a bit in the VM if a page has no dead 
tuples at all, making it possible to use it for index-only scans. Such a 
page is also uninteresting for VACUUM. However, we would still need to 
scan the whole table to freeze tuples.

To alleviate that, we could allocate some space in the FSM pages, 
indicating the smallest Xid in a range of heap pages. IOW, like 
relfrozenxid, but with a granularity of N pages rather than the whole 
relation. You would still have to scan that range of pages to update it, 
but that's much better than scanning the whole relation.
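
Something like one entry per fixed range of pages, roughly (names and 
sizes are illustrative only):

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

#define PAGES_PER_FREEZE_CHUNK 1024     /* granularity of the per-chunk xid; made-up value */

typedef struct ChunkFreezeInfo
{
    TransactionId oldest_xid;   /* oldest unfrozen xmin/xmax in this range of heap pages */
} ChunkFreezeInfo;

/*
 * A freeze pass would only need to visit chunks whose entry is older than
 * the freeze limit.  Real code would use TransactionIdPrecedes() to cope
 * with xid wraparound; plain comparison is just for the sketch.
 */
static bool
chunk_needs_freeze(const ChunkFreezeInfo *chunk, TransactionId freeze_limit)
{
    return chunk->oldest_xid < freeze_limit;
}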

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Rewriting Free Space Map

From
Tom Lane
Date:
"Heikki Linnakangas" <heikki@enterprisedb.com> writes:
> I've started working on revamping Free Space Map, using the approach 
> where we store a map of heap pages on every nth heap page. What we need 
> now is discussion on the details of how exactly it should work.

You're cavalierly waving away a whole boatload of problems that will
arise as soon as you start trying to make the index AMs play along
with this :-(.  Hash for instance has very narrow-minded ideas about
page allocation within its indexes.

Also, I don't think that "use the special space" will scale to handle
other kinds of maps such as the proposed dead space map.  (This is
exactly why I said the other day that we need a design roadmap for all
these ideas.)

The idea that's becoming attractive to me while contemplating the
multiple-maps problem is that we should adopt something similar to
the old Mac OS idea of multiple "forks" in a relation.  In addition
to the main data fork which contains the same info as now, there could
be one or more map forks which are separate files in the filesystem.
They are named by relfilenode plus an extension, for instance a relation
with relfilenode NNN would have a data fork in file NNN (plus perhaps
NNN.1, NNN.2, etc) and a map fork named something like NNN.map (plus
NNN.map.1 etc as needed).  We'd have to add one more field to buffer
lookup keys (BufferTag) to disambiguate which fork the referenced page
is in.  Having bitten that bullet, though, the idea trivially scales to
any number of map forks with potentially different space requirements
and different locking and WAL-logging requirements.
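
To illustrate the naming scheme only (the helper and the fork names here 
are hypothetical, not a proposed API):

#include <stdio.h>

typedef enum ForkNumber
{
    MAIN_FORKNUM = 0,
    MAP_FORKNUM
} ForkNumber;

/* Build "NNN", "NNN.1", "NNN.map", "NNN.map.1", and so on. */
static void
fork_path(char *buf, size_t buflen, unsigned int relfilenode,
          ForkNumber forknum, int segno)
{
    const char *ext = (forknum == MAP_FORKNUM) ? ".map" : "";

    if (segno == 0)
        snprintf(buf, buflen, "%u%s", relfilenode, ext);
    else
        snprintf(buf, buflen, "%u%s.%d", relfilenode, ext, segno);
}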

Another possible advantage is that a new map fork could be added to an
existing table without much trouble.  Which is certainly something we'd
need if we ever hope to get update-in-place working.

The main disadvantage I can see is that for very small tables, the
percentage overhead from multiple map forks of one page apiece is
annoyingly high.  However, most of the point of a map disappears if
the table is small, so we might finesse that by not creating any maps
until the table has reached some minimum size.
        regards, tom lane


Re: Rewriting Free Space Map

From
Alvaro Herrera
Date:
Tom Lane wrote:

> The idea that's becoming attractive to me while contemplating the
> multiple-maps problem is that we should adopt something similar to
> the old Mac OS idea of multiple "forks" in a relation.  In addition
> to the main data fork which contains the same info as now, there could
> be one or more map forks which are separate files in the filesystem.

I think something similar could be used to store tuple visibility bits
separately from heap tuple data itself, so +1 to this idea.

(The rough idea in my head was that you can do an indexscan and look
up visibility bits without having to pull the whole heap along; and
visibility updates are also cheaper, whether they come from indexscans
or heap scans.  Of course, the implicit cost is that a seqscan needs to
fetch the visibility pages, too; and the locking is more complex.)

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: Rewriting Free Space Map

From
Hannu Krosing
Date:
On Sun, 2008-03-16 at 21:33 -0300, Alvaro Herrera wrote:
> Tom Lane wrote:
> 
> > The idea that's becoming attractive to me while contemplating the
> > multiple-maps problem is that we should adopt something similar to
> > the old Mac OS idea of multiple "forks" in a relation.  In addition
> > to the main data fork which contains the same info as now, there could
> > be one or more map forks which are separate files in the filesystem.

Aren't we in a way doing this for indexes?

> I think something similar could be used to store tuple visibility bits
> separately from heap tuple data itself, so +1 to this idea.

Not just "bits", but whole visibility info (xmin,xmax,tmin,tmax, plus
bits) should be stored separately.

A separate "fork" for visibility should be organized as a b-tree index
(as we already have well-honed mechanisms for dealing with those
effectively) but visibility fork is stored in a compressed form by
storing ranges of all-visible or all-deleted tuples as two endpoints
only and also the tree is reorganized when possible similar to what we
currently do for HOT updates.

This will keep the visibility index really small for cases with few 
updates, most likely one or two pages regardless of table size.

One important difference from indexes is that visibility info should be
stored first, before writing data to heap and creating ordinary index
entries.

> (The rough idea in my head was that you can do an indexscan and look
> up visibility bits without having to pull the whole heap along; and
> visibility updates are also cheaper, whether they come from indexscans
> or heap scans.  Of course, the implicit cost is that a seqscan needs to
> fetch the visibility pages, too; and the locking is more complex.)

Another cost is heavy inserting/updating, where there will probably be 
more lock contention, as visibility info for new tuples will more often 
land on the same visibility pages due to the visibility info being 
generally smaller.

Of course, with visibility info in a separate fork, very narrow tables
will have the ratios reversed - for one byte wide table visibility info
will be a few times bigger than actual data, at least initially before
compression has kicked in.

--------------------
Hannu


Re: Rewriting Free Space Map

From
Tom Lane
Date:
Hannu Krosing <hannu@krosing.net> writes:
> On Sun, 2008-03-16 at 21:33 -0300, Alvaro Herrera wrote:
>> Tom Lane wrote:
>>> The idea that's becoming attractive to me while contemplating the
>>> multiple-maps problem is that we should adopt something similar to
>>> the old Mac OS idea of multiple "forks" in a relation.

> Aren't we in a way doing this for indexes?

Not really --- indexes are closer to being independent entities, since
they have their own relfilenode values, own pg_class entries, etc.  What
I'm imagining here is something that's so tightly tied to the core heap
that there's no value in managing it as a distinct entity, thus the idea
of same relfilenode with a different extension.  The existence of
multiple forks in a relation wouldn't be exposed at all at the SQL
level.

>> I think something similar could be used to store tuple visibility bits
>> separately from heap tuple data itself, so +1 to this idea.

> Not just "bits", but whole visibility info (xmin,xmax,tmin,tmax, plus
> bits) should be stored separately.

I'm entirely un-sold on this idea, but yeah it would be something that
would be possible to experiment with once we have a multi-fork
infrastructure.
        regards, tom lane


Re: Rewriting Free Space Map

From
"Heikki Linnakangas"
Date:
Tom Lane wrote:
> "Heikki Linnakangas" <heikki@enterprisedb.com> writes:
>> I've started working on revamping Free Space Map, using the approach 
>> where we store a map of heap pages on every nth heap page. What we need 
>> now is discussion on the details of how exactly it should work.
> 
> You're cavalierly waving away a whole boatload of problems that will
> arise as soon as you start trying to make the index AMs play along
> with this :-(.  

It doesn't seem very hard. An indexam wanting to use FSM needs a little 
bit of code where the relation is extended, to let the FSM initialize 
FSM pages. And then there's the B-tree metapage issue I mentioned. But 
that's all, AFAICS.

> Hash for instance has very narrow-minded ideas about
> page allocation within its indexes.

Hash doesn't use FSM at all.

> Also, I don't think that "use the special space" will scale to handle
> other kinds of maps such as the proposed dead space map.  (This is
> exactly why I said the other day that we need a design roadmap for all
> these ideas.)

It works for anything that scales linearly with the relation itself. The 
proposed FSM and visibility map both fall into that category.

A separate file is certainly more flexible. I was leaning towards that 
option originally 
(http://archives.postgresql.org/pgsql-hackers/2007-11/msg00142.php) for 
that reason.

> The idea that's becoming attractive to me while contemplating the
> multiple-maps problem is that we should adopt something similar to
> the old Mac OS idea of multiple "forks" in a relation.  In addition
> to the main data fork which contains the same info as now, there could
> be one or more map forks which are separate files in the filesystem.
> They are named by relfilenode plus an extension, for instance a relation
> with relfilenode NNN would have a data fork in file NNN (plus perhaps
> NNN.1, NNN.2, etc) and a map fork named something like NNN.map (plus
> NNN.map.1 etc as needed).  We'd have to add one more field to buffer
> lookup keys (BufferTag) to disambiguate which fork the referenced page
> is in.  Having bitten that bullet, though, the idea trivially scales to
> any number of map forks with potentially different space requirements
> and different locking and WAL-logging requirements.

Hmm. You also need to teach at least xlog.c and xlogutils.c about the 
map forks, for full page images and the invalid page tracking. I also 
wonder what the performance impact of extending BufferTag is.

My original thought was to have a separate RelFileNode for each of the 
maps. That would require no smgr or xlog changes, and not very many 
changes in the buffer manager, though I guess you'd need more catalog 
changes. You had doubts about that on the previous thread 
(http://archives.postgresql.org/pgsql-hackers/2007-11/msg00204.php), but 
the "map forks" idea certainly seems much more invasive than that.

I like the "map forks" idea; it groups the maps nicely at the filesystem 
level, and I can see it being useful for all kinds of things in the 
future. The question is, is it really worth the extra code churn? If you 
think it is, I can try that approach.

> Another possible advantage is that a new map fork could be added to an
> existing table without much trouble.  Which is certainly something we'd
> need if we ever hope to get update-in-place working.

Yep.

> The main disadvantage I can see is that for very small tables, the
> percentage overhead from multiple map forks of one page apiece is
> annoyingly high.  However, most of the point of a map disappears if
> the table is small, so we might finesse that by not creating any maps
> until the table has reached some minimum size.

Yeah, the map fork idea is actually better than the "every nth heap 
page" approach from that point of view.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Rewriting Free Space Map

From
Tom Lane
Date:
"Heikki Linnakangas" <heikki@enterprisedb.com> writes:
> Tom Lane wrote:
>> You're cavalierly waving away a whole boatload of problems that will
>> arise as soon as you start trying to make the index AMs play along
>> with this :-(.  

> It doesn't seem very hard.

The problem is that the index AMs are no longer in control of what goes
where within their indexes, which has always been their prerogative to
determine.  The fact that you think you can kluge btree to still work
doesn't mean that it will work for other AMs.

>> Also, I don't think that "use the special space" will scale to handle
>> other kinds of maps such as the proposed dead space map.  (This is
>> exactly why I said the other day that we need a design roadmap for all
>> these ideas.)

> It works for anything that scales linearly with the relation itself. The 
> proposed FSM and visibility map both fall into that category.

It can work only with prearrangement among all the maps that are trying
to coexist in the same special space.  Every time a new one comes along,
we'd have to reconsider the space allocation, re-optimize tradeoffs,
and force an initdb that could not possibly be implemented in-place.

If we had a short list of maps in mind with no real prospect of
changing, then I think this would be acceptable; but that doesn't seem
to be the case.  It's certainly foolish to start detail design on
something like this when we don't even have the "roadmap" I asked
for about what sorts of maps we are agreed we want to have.

>> The idea that's becoming attractive to me while contemplating the
>> multiple-maps problem is that we should adopt something similar to
>> the old Mac OS idea of multiple "forks" in a relation.

> Hmm. You also need to teach at least xlog.c and xlogutils.c about the 
> map forks, for full page images and the invalid page tracking.

Well, you'd have to teach them something anyway, for any incarnation
of maps that they might need to update.

> I also wonder what the performance impact of extending BufferTag is.

That's a fair objection, and obviously something we'd need to check.
But I don't recall seeing hash_any so high on any profile that I think
it'd be a big problem.

> My original thought was to have a separate RelFileNode for each of the 
> maps. That would require no smgr or xlog changes, and not very many 
> changes in the buffer manager, though I guess you'd need more catalog 
> changes. You had doubts about that on the previous thread 
> (http://archives.postgresql.org/pgsql-hackers/2007-11/msg00204.php), but 
> the "map forks" idea certainly seems much more invasive than that.

The main problems with that are (a) the need to expose every type of map
in pg_class and (b) the need to pass all those relfilenode numbers down
to pretty low levels of the system.  The nice thing about the fork idea
is that you don't need any added info to uniquely identify what relation
you're working on.  The fork numbers would be hard-wired into whatever
code needed to know about particular forks.  (Of course, these same
advantages apply to using special space in an existing file.  I'm
just suggesting that we can keep these advantages without buying into
the restrictions that special space would have.)
        regards, tom lane


Re: Rewriting Free Space Map

From
Lars-Erik Bjørk
Date:
On Mon, 2008-03-17 at 09:29 -0400, Tom Lane wrote:
> Hannu Krosing <hannu@krosing.net> writes:
> > On Sun, 2008-03-16 at 21:33 -0300, Alvaro Herrera wrote:
> >> Tom Lane wrote:
> >>> The idea that's becoming attractive to me while contemplating the
> >>> multiple-maps problem is that we should adopt something similar to
> >>> the old Mac OS idea of multiple "forks" in a relation.
> 
> > Aren't we in a way doing this for indexes?
> 
> Not really --- indexes are closer to being independent entities, since
> they have their own relfilenode values, own pg_class entries, etc.  What
> I'm imagining here is something that's so tightly tied to the core heap
> that there's no value in managing it as a distinct entity, thus the idea
> of same relfilenode with a different extension.  The existence of
> multiple forks in a relation wouldn't be exposed at all at the SQL
> level.
> 
> >> I think something similar could be used to store tuple visibility bits
> >> separately from heap tuple data itself, so +1 to this idea.
> 
> > Not just "bits", but whole visibility info (xmin,xmax,tmin,tmax, plus
> > bits) should be stored separately.
> 
> I'm entirely un-sold on this idea, but yeah it would be something that
> would be possible to experiment with once we have a multi-fork
> infrastructure.
> 
>             regards, tom lane
> 



Re: Rewriting Free Space Map

From
Simon Riggs
Date:
On Sun, 2008-03-16 at 21:33 -0300, Alvaro Herrera wrote:
> Tom Lane wrote:
> 
> > The idea that's becoming attractive to me while contemplating the
> > multiple-maps problem is that we should adopt something similar to
> > the old Mac OS idea of multiple "forks" in a relation.  In addition
> > to the main data fork which contains the same info as now, there could
> > be one or more map forks which are separate files in the filesystem.
> 
> I think something similar could be used to store tuple visibility bits
> separately from heap tuple data itself, so +1 to this idea.
> 
> (The rough idea in my head was that you can do an indexscan and look
> up visibility bits without having to pull the whole heap along; and
> visibility updates are also cheaper, whether they come from indexscans
> or heap scans.  Of course, the implicit cost is that a seqscan needs to
> fetch the visibility pages, too; and the locking is more complex.)

I very much like the idea of a generic method for including additional
bulk metadata for a relation (heap or index). That neatly provides a
general infrastructure for lots of clever things such as dead space,
visibility or other properties, while at the same time maintaining
modularity. 

Can we call them "maps" or "metadata maps"? "forks" sounds weird.

We don't need to assume anything about the maps themselves at this
stage, so we might imagine tightly coupled maps that are always updated
as a relation changes, or loosely coupled maps that are lazily updated
by background processes. Autovacuum then becomes the vehicle by which we
execute "map maintenance" procedures, defined according to which AMs are
installed and what relation options are set. So we have a completely
generalised data/metadata storage infrastructure.

Sensibly arranged this could provide an entry point for powerful new
features within existing and future index AMs. It also sounds like it
might avoid a whole class of bugs and special cases that I regrettably
foresee would be unavoidable in Heikki's proposal.

--  Simon Riggs 2ndQuadrant  http://www.2ndQuadrant.com 
 PostgreSQL UK 2008 Conference: http://www.postgresql.org.uk



Re: Rewriting Free Space Map

From
Gregory Stark
Date:
"Tom Lane" <tgl@sss.pgh.pa.us> writes:

> "Heikki Linnakangas" <heikki@enterprisedb.com> writes:
>> My original thought was to have a separate RelFileNode for each of the 
>> maps. That would require no smgr or xlog changes, and not very many 
>> changes in the buffer manager, though I guess you'd need more catalog 
>> changes. You had doubts about that on the previous thread 
>> (http://archives.postgresql.org/pgsql-hackers/2007-11/msg00204.php), but 
>> the "map forks" idea certainly seems much more invasive than that.
>
> The main problems with that are (a) the need to expose every type of map
> in pg_class and (b) the need to pass all those relfilenode numbers down
> to pretty low levels of the system.  The nice thing about the fork idea
> is that you don't need any added info to uniquely identify what relation
> you're working on.  The fork numbers would be hard-wired into whatever
> code needed to know about particular forks.  (Of course, these same
> advantages apply to using special space in an existing file.  I'm
> just suggesting that we can keep these advantages without buying into
> the restrictions that special space would have.)

One advantage of using separate relfilenodes would be that if we need to
regenerate a map we could do it in a new relfilenode and swap it in like we do
with heap rewrites.

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com Ask me about EnterpriseDB's PostGIS support!


Re: Rewriting Free Space Map

From
Andrew Dunstan
Date:

Gregory Stark wrote:
> "Tom Lane" <tgl@sss.pgh.pa.us> writes:
>
>   
>> "Heikki Linnakangas" <heikki@enterprisedb.com> writes:
>>     
>>> My original thought was to have a separate RelFileNode for each of the 
>>> maps. That would require no smgr or xlog changes, and not very many 
>>> changes in the buffer manager, though I guess you'd need more catalog 
>>> changes. You had doubts about that on the previous thread 
>>> (http://archives.postgresql.org/pgsql-hackers/2007-11/msg00204.php), but 
>>> the "map forks" idea certainly seems much more invasive than that.
>>>       
>> The main problems with that are (a) the need to expose every type of map
>> in pg_class and (b) the need to pass all those relfilenode numbers down
>> to pretty low levels of the system.  The nice thing about the fork idea
>> is that you don't need any added info to uniquely identify what relation
>> you're working on.  The fork numbers would be hard-wired into whatever
>> code needed to know about particular forks.  (Of course, these same
>> advantages apply to using special space in an existing file.  I'm
>> just suggesting that we can keep these advantages without buying into
>> the restrictions that special space would have.)
>>     
>
> One advantage of using separate relfilenodes would be that if we need to
> regenerate a map we could do it in a new relfilenode and swap it in like we do
> with heap rewrites.
>
>   

Why can't you just do that with a different extension and file rename? 
You'd need an exclusive lock while swapping in the new map, but you need 
that anyway, IIRC, and this way you don't even need a catalog change.

cheers

andrew


Re: Rewriting Free Space Map

From
Tom Lane
Date:
Gregory Stark <stark@enterprisedb.com> writes:
> One advantage of using separate relfilenodes would be that if we need to
> regenerate a map we could do it in a new relfilenode and swap it in like we do
> with heap rewrites.

You could probably do that using a temporary fork number, if the
situation ever came up.
        regards, tom lane


Re: Rewriting Free Space Map

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
>> Tom Lane wrote:
>>> The idea that's becoming attractive to me while contemplating the
>>> multiple-maps problem is that we should adopt something similar to
>>> the old Mac OS idea of multiple "forks" in a relation.

> Can we call them "maps" or "metadata maps"? "forks" sounds weird.

I'm not wedded to "forks", that's just the name that was used in the
only previous example I've seen.  Classic Mac had a "resource fork"
and a "data fork" within each file.

Don't think I like "maps" though, as (a) that prejudges what the
alternate forks might be used for, and (b) the name fails to be
inclusive of the data fork.  Other suggestions anyone?

BTW, thinking about the Mac precedent a little more, I believe
the way they grafted that Classic concept onto BSD was that
applications (which the user thinks of as single objects) are
now directories with multiple files inside them.  Probably it'd be
overkill to think of turning each relation into a subdirectory, but
then again maybe not?
        regards, tom lane


Re: Rewriting Free Space Map

From
David Fetter
Date:
On Mon, Mar 17, 2008 at 01:23:46PM -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> >> Tom Lane wrote:
> >>> The idea that's becoming attractive to me while contemplating
> >>> the multiple-maps problem is that we should adopt something
> >>> similar to the old Mac OS idea of multiple "forks" in a
> >>> relation.
> 
> > Can we call them "maps" or "metadata maps"? "forks" sounds weird.
> 
> I'm not wedded to "forks", that's just the name that was used in the
> only previous example I've seen.  Classic Mac had a "resource fork"
> and a "data fork" within each file.
> 
> Don't think I like "maps" though, as (a) that prejudges what the
> alternate forks might be used for, and (b) the name fails to be
> inclusive of the data fork.  Other suggestions anyone?

Segment?  Section?  Module?

Cheers,
David.
-- 
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david.fetter@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


Re: Rewriting Free Space Map

From
Simon Riggs
Date:
On Mon, 2008-03-17 at 13:23 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> >> Tom Lane wrote:
> >>> The idea that's becoming attractive to me while contemplating the
> >>> multiple-maps problem is that we should adopt something similar to
> >>> the old Mac OS idea of multiple "forks" in a relation.
> 
> > Can we call them "maps" or "metadata maps"? "forks" sounds weird.
> 
> I'm not wedded to "forks", that's just the name that was used in the
> only previous example I've seen.  Classic Mac had a "resource fork"
> and a "data fork" within each file.

Layer? Slab? Sheet? Strata/um? Overlay?

Layer makes sense to me because of the way GIS and CAD systems work.

--  Simon Riggs 2ndQuadrant  http://www.2ndQuadrant.com 
 PostgreSQL UK 2008 Conference: http://www.postgresql.org.uk



Re: Rewriting Free Space Map

From
"Jochem van Dieten"
Date:
On Mon, Mar 17, 2008 at 6:23 PM, Tom Lane wrote:
> Simon Riggs writes:
>>
>> Can we call them "maps" or "metadata maps"? "forks" sounds weird.
>
>  I'm not wedded to "forks", that's just the name that was used in the
>  only previous example I've seen.  Classic Mac had a "resource fork"
>  and a "data fork" within each file.

Microsoft / NTFS calls them Data Streams:
http://www.wikistc.org/wiki/Alternate_data_streams

Jochem


Re: Rewriting Free Space Map

From
"Mark Cave-Ayland"
Date:
On Mon, 2008-03-17 at 13:23 -0400, Tom Lane wrote:

> I'm not wedded to "forks", that's just the name that was used in the
> only previous example I've seen.  Classic Mac had a "resource fork"
> and a "data fork" within each file.
>
> Don't think I like "maps" though, as (a) that prejudges what the
> alternate forks might be used for, and (b) the name fails to be
> inclusive of the data fork.  Other suggestions anyone?

I believe that in the world of NTFS the concept is called "streams":
http://support.microsoft.com/kb/105763.


HTH,

Mark.

-- 
Mark Cave-Ayland
Sirius Corporation - The Open Source Experts
http://www.siriusit.co.uk
T: +44 870 608 0063



Re: Rewriting Free Space Map

From
Gregory Stark
Date:
"Tom Lane" <tgl@sss.pgh.pa.us> writes:

> I'm not wedded to "forks", that's just the name that was used in the
> only previous example I've seen.  Classic Mac had a "resource fork"
> and a "data fork" within each file.

fwiw forks are not unique to MacOS, c.f.:
http://en.wikipedia.org/wiki/Fork_%28filesystem%29

However I'm not sure reusing any of these terms is such a hot idea. All it's
going to do is confuse someone into thinking we're actually talking about HFS
forks or NTFS data streams or whatever. Better to pick a term that isn't
already being used for such things so people don't get misled.

> BTW, thinking about the Mac precedent a little more, I believe
> the way they grafted that Classic concept onto BSD was that
> applications (which the user thinks of as single objects) are
> now directories with multiple files inside them.  Probably it'd be
> overkill to think of turning each relation into a subdirectory, but
> then again maybe not?

Well there are upsides and downsides. Many OSes have difficulties when you
have many files in a single directory. This would tend to reduce that. On the
other hand it would drastically increase the number of directory files the OS
has to keep track of and the total number of inodes being referenced.

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com Ask me about EnterpriseDB's PostGIS support!


Re: Rewriting Free Space Map

From
"Heikki Linnakangas"
Date:
Tom Lane wrote:
> "Heikki Linnakangas" <heikki@enterprisedb.com> writes:
>> Tom Lane wrote:
>>> You're cavalierly waving away a whole boatload of problems that will
>>> arise as soon as you start trying to make the index AMs play along
>>> with this :-(.  
> 
>> It doesn't seem very hard.
> 
> The problem is that the index AMs are no longer in control of what goes
> where within their indexes, which has always been their prerogative to
> determine.  The fact that you think you can kluge btree to still work
> doesn't mean that it will work for other AMs.

Well, it does work with all the existing AMs AFAICS. I do agree with the 
general point; it'd certainly be cleaner, more modular and more flexible 
if the AMs didn't need to know about the existence of the maps.

>>> The idea that's becoming attractive to me while contemplating the
>>> multiple-maps problem is that we should adopt something similar to
>>> the old Mac OS idea of multiple "forks" in a relation.
> 
>> Hmm. You also need to teach at least xlog.c and xlogutils.c about the 
>> map forks, for full page images and the invalid page tracking.
> 
> Well, you'd have to teach them something anyway, for any incarnation
> of maps that they might need to update.

Umm, the WAL code doesn't care where the pages it operates on came from. 
Sure, we'll need rmgr-specific code that knows what to do with the maps, 
but the full page image code would work without changes with the 
multiple-RelFileNode approach.

The essential change with the map fork idea is that a RelFileNode no 
longer uniquely identifies a file on disk (ignoring the segmentation 
which is handled in smgr for now). Anything that operates on 
RelFileNodes, without any higher level information of what it is, needs 
to be modified to use RelFileNode+forkid instead. That includes at least 
the buffer manager, smgr, and the full page image code in xlog.c.

It's probably a pretty mechanical change, even though it affects a lot 
of code. We'd probably want to have a new struct, let's call it 
PhysFileId for now, for RelFileNode+forkid, and basically replace all 
occurrences of RelFileNode with PhysFileId in smgr, bufmgr and xlog code.
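
Roughly like this (the fork ids are placeholders; RelFileNode's fields are 
as they are today):

#include <stdint.h>

typedef uint32_t Oid;

typedef struct RelFileNode
{
    Oid spcNode;    /* tablespace */
    Oid dbNode;     /* database */
    Oid relNode;    /* relation */
} RelFileNode;

typedef enum ForkId
{
    DATA_FORK = 0,      /* the main data; what a bare RelFileNode means today */
    FSM_FORK,           /* free space map */
    VISIBILITY_FORK     /* visibility map */
} ForkId;

typedef struct PhysFileId
{
    RelFileNode node;
    ForkId      forkid;
} PhysFileId;

/* BufferTag would then carry a PhysFileId plus a block number, instead of
 * a RelFileNode plus a block number. */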

>> I also wonder what the performance impact of extending BufferTag is.
> 
> That's a fair objection, and obviously something we'd need to check.
> But I don't recall seeing hash_any so high on any profile that I think
> it'd be a big problem.

I do remember seeing hash_any in some oprofile runs. But that's fairly 
easy to test: we don't need to actually implement any of the stuff, 
other than add a field to BufferTag, and run pgbench.

>> My original thought was to have a separate RelFileNode for each of the 
>> maps. That would require no smgr or xlog changes, and not very many 
>> changes in the buffer manager, though I guess you'd need more catalog 
>> changes. You had doubts about that on the previous thread 
>> (http://archives.postgresql.org/pgsql-hackers/2007-11/msg00204.php), but 
>> the "map forks" idea certainly seems much more invasive than that.
> 
> The main problems with that are (a) the need to expose every type of map
> in pg_class and (b) the need to pass all those relfilenode numbers down
> to pretty low levels of the system. 

(a) is certainly a valid point. Regarding (b), I don't think the low 
level stuff (I assume you mean smgr, bufmgr, bgwriter, xlog by that) 
would need to be passed any additional relfilenode numbers. Or rather, 
they already work with relfilenodes, and they don't need to know whether 
the relfilenode is for an index, a heap, or an FSM attached to something 
else. The relfilenodes would be in RelationData, and we already have 
that around whenever we do anything that needs to differentiate between 
those.

Another consideration is which approach is easiest to debug. The "map 
fork" approach seems better on that front, as you can immediately see 
from the PhysFileId if a page is coming from an auxiliary map or the 
main data portion. That might turn out to be handy in the buffer manager 
or bgwriter as well; they don't currently have any knowledge of what a 
page contains.

> The nice thing about the fork idea
> is that you don't need any added info to uniquely identify what relation
> you're working on.  The fork numbers would be hard-wired into whatever
> code needed to know about particular forks.  (Of course, these same
> advantages apply to using special space in an existing file.  I'm
> just suggesting that we can keep these advantages without buying into
> the restrictions that special space would have.)

I don't see that advantage. All the higher-level code that cares which 
relation you're working on already has the Relation around. All the 
lower-level stuff doesn't care.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Rewriting Free Space Map

From
"Dawid Kuroczko"
Date:
On Mon, Mar 17, 2008 at 6:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>  I'm not wedded to "forks", that's just the name that was used in the
>  only previous example I've seen.  Classic Mac had a "resource fork"
>  and a "data fork" within each file.
>
>  Don't think I like "maps" though, as (a) that prejudges what the
>  alternate forks might be used for, and (b) the name fails to be
>  inclusive of the data fork.  Other suggestions anyone?

Shadow?  As each err, fork trails each relfilenode? (Or perhaps "shade").

Hints?  As something more generic than "map"?
  Regards,    Dawid


Re: Rewriting Free Space Map

From
tomas@tuxteam.de
Date:

On Mon, Mar 17, 2008 at 01:23:46PM -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> >> Tom Lane wrote:
> >>> The idea that's becoming attractive to me while contemplating the
> >>> multiple-maps problem is that we should adopt something similar to
> >>> the old Mac OS idea of multiple "forks" in a relation.
> 
> > Can we call them "maps" or "metadata maps"? "forks" sounds weird.

Actually, I do like "forks", but to add a little bit of diversity:

facets? aspects?

FWIW, the idea of mapping a relation to a directory is quite compelling.

Regards
-- tomás


Re: Rewriting Free Space Map

From
Tom Lane
Date:
"Heikki Linnakangas" <heikki@enterprisedb.com> writes:
> More precisely, on CVS HEAD it takes between 7.1-7.2%. After extending 
> BufferTag with one uint32, it takes 7.4-7.5%. So the effect is 
> measurable if you try hard enough, but not anything to get worried about.

And if we adopt the allegedly-faster hash_any that's in the patch
queue, the difference should get smaller yet.
        regards, tom lane


Re: Rewriting Free Space Map

From
"Heikki Linnakangas"
Date:
Heikki Linnakangas wrote:
> Tom Lane wrote:
>> "Heikki Linnakangas" <heikki@enterprisedb.com> writes:
>>> I also wonder what the performance impact of extending BufferTag is.
>>
>> That's a fair objection, and obviously something we'd need to check.
>> But I don't recall seeing hash_any so high on any profile that I think
>> it'd be a big problem.
> 
> I do remember seeing hash_any in some oprofile runs. But that's fairly 
> easy to test: we don't need to actually implement any of the stuff, 
> other than add a field to BufferTag, and run pgbench.

I tried that. hash_any wasn't significant on pgbench, but I was able to 
construct a test case where it is. It goes like this:

BEGIN;
CREATE TEMPORARY TABLE foo (id int4);
INSERT INTO foo SELECT a FROM generate_series(1, 10000) a;
INSERT INTO foo SELECT * FROM foo;
INSERT INTO foo SELECT * FROM foo;
INSERT INTO foo SELECT * FROM foo;
... (repeat multiple times)

oprofile says that hash_any consumes ~7 % of CPU time on that test on my 
laptop.

More precisely, on CVS HEAD it takes between 7.1-7.2%. After extending 
BufferTag with one uint32, it takes 7.4-7.5%. So the effect is 
measurable if you try hard enough, but not anything to get worried about.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com