Relation forks & FSM rewrite patches

From
"Heikki Linnakangas"
Date:
Here's an updated version of the "relation forks" patch, and an
incremental FSM rewrite patch on top of that. The relation forks patch
is ready for review. The FSM implementation is more work-in-progress
still, but I'd like to get some review on that as well, with the goal of
doing more performance testing and committing it after the commit fest.

The one part that I'm not totally satisfied with in the relation forks
patch is the smgrcreate() function. The question is: which piece of
code decides which forks to create for a relation, and when to create
them? I settled on the concept that all forks that a relation will need
are created at once, in one smgrcreate() call. There's no function to
create additional forks later on. Likewise, there's no function to
unlink individual forks; you have to unlink the whole relation.

Currently, heap_create() decides which forks to create. That's fine at
the moment, but will become problematic in the future, as it's
indexam-specific which forks an index requires. That decision should
really be done in the indexam. One possibility would be to only create
the main fork in heap_create(), and let the indexam create any
additional forks it needs in its ambuild function.
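
To illustrate the concept, a rough sketch (not the exact code in the
patch; the FSM fork name is just for illustration):

    /* Each relation now consists of one or more physical files, "forks".
     * The main fork holds the data; the FSM rewrite patch keeps the free
     * space map in a fork of its own. */
    typedef enum ForkNumber
    {
        MAIN_FORKNUM = 0,
        FSM_FORKNUM                 /* free space map (FSM rewrite patch) */
    } ForkNumber;

    /* On disk each fork is a separate file; in the current patch the fork
     * number is appended to the filename, e.g. "12345_0" for the main
     * fork of relfilenode 12345 and "12345_1" for its FSM fork. */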

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com


Re: Relation forks & FSM rewrite patches

From
Mark Kirkwood
Date:
Heikki Linnakangas wrote:
> Here's an updated version of the "relation forks" patch, and an
> incremental FSM rewrite patch on top of that. The relation forks patch
> is ready for review. The FSM implementation is more work-in-progress
> still, but I'd like to get some review on that as well, with the goal
> of doing more performance testing and committing it after the commit
> fest.
>
> The one part that I'm not totally satisfied with in the relation forks
> patch is the smgrcreate() function. The question is: which piece of
> code decides which forks to create for a relation, and when to create
> them? I settled on the concept that all forks that a relation will
> need are created at once, in one smgrcreate() call. There's no
> function to create additional forks later on. Likewise, there's no
> function to unlink individual forks; you have to unlink the whole
> relation.
>
> Currently, heap_create() decides which forks to create. That's fine at
> the moment, but will become problematic in the future, as it's
> indexam-specific which forks an index requires. That decision should
> really be done in the indexam. One possibility would be to only create
> the main fork in heap_create(), and let indexam create any additional
> forks it needs in ambuild function.
>

I've had a bit of a play with this, looks pretty nice. One point that
comes to mind is that we are increasing the number of required file
descriptors by a factor of 2 (and possibly 3 if we use a fork for the
visibility map). Is this a problem do you think?

Cheers

Mark

Re: Relation forks & FSM rewrite patches

From
Simon Riggs
Date:
On Fri, 2008-06-27 at 19:36 +0300, Heikki Linnakangas wrote:

> Here's an updated version of the "relation forks" patch, and an
> incremental FSM rewrite patch on top of that. The relation forks patch
> is ready for review. The FSM implementation is more work-in-progress
> still, but I'd like to get some review on that as well, with the goal of
> doing more performance testing and committing it after the commit fest.

Hmmm, a 6000 line patch with no visible documentation, readme, nor any
discussion on -hackers that I'm aware of.

Could you write something up please? And/or post links to the relevant
discussions wherever they took place.

I'm interested in the FSM and how it might change.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


Re: Relation forks & FSM rewrite patches

From
"Heikki Linnakangas"
Date:
Simon Riggs wrote:
> On Fri, 2008-06-27 at 19:36 +0300, Heikki Linnakangas wrote:
>
>> Here's an updated version of the "relation forks" patch, and an
>> incremental FSM rewrite patch on top of that. The relation forks patch
>> is ready for review. The FSM implementation is more work-in-progress
>> still, but I'd like to get some review on that as well, with the goal of
>> doing more performance testing and committing it after the commit fest.
>
> Hmmm, a 6000 line patch with no visible documentation, readme, nor any
> discussion on -hackers that I'm aware of.

There is a readme in the patch, and there certainly has been discussion
on -hackers, see:

http://archives.postgresql.org/pgsql-hackers/2008-04/msg00415.php

where the current design is discussed for the first time.

User documentation is still to be done. There isn't anything to tune, or
anything that requires manual maintenance, so there isn't much to
document, though, except perhaps a chapter in the "Internals" part.

Here's an updated version of the patch. Changes:
- one bug has been fixed (I assumed that it's always safe to call
rightchild(x) on a parent node, but that was not always true for the
rightmost branch of the tree)
- there's some new debugging code.
- if we can't find a page with enough free space in search_avail after
starting from scratch 1000 times, give up. That shouldn't happen, but
see below.

There is still a nasty bug somewhere, probably related to locking :-(. I
ran DBT-2 with this, and after about 1h a FSM lookup goes into an
endless loop, hogging all CPU. I suspect that there's a bug somewhere so
that a change to the root node of a lower level FSM page isn't
propagated to the upper level FSM page for some reason. oprofile shows
that the endless loop happens in search_avail. This is why I added the
code to give up after 1000 tries, hoping to get some debugging output
the next time that happens.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Relation forks & FSM rewrite patches

From
Simon Riggs
Date:
On Fri, 2008-07-04 at 12:27 +0300, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > On Fri, 2008-06-27 at 19:36 +0300, Heikki Linnakangas wrote:
> >
> >> Here's an updated version of the "relation forks" patch, and an
> >> incremental FSM rewrite patch on top of that. The relation forks patch
> >> is ready for review. The FSM implementation is more work-in-progress
> >> still, but I'd like to get some review on that as well, with the goal of
> >> doing more performance testing and committing it after the commit fest.
> >
> > Hmmm, a 6000 line patch with no visible documentation, readme, nor any
> > discussion on -hackers that I'm aware of.
>
> There is a readme in the patch, and there certainly has been discussion
> on -hackers, see:
>
> http://archives.postgresql.org/pgsql-hackers/2008-04/msg00415.php
>
> where the current design is discussed for the first time.

OK, I see the readme now. Thanks.

Minor point, but the readme needs to draw a clear distinction between
FSM pages and data pages, which confused me when I read it the first time.

Few thoughts:

Contention on FSM pages concerns me. Every change has the potential to
bubble up to the top, which would cause a significant problem. I'd like
to find a way to minimise the number of bubble-up operations, otherwise
there will be no material difference between using a single whole-FSM
lock like we do now and this new scheme. Note that in this new scheme
the path length to the bottom of the tree is considerably longer and can
stall waiting on I/O - so contention in the FSM is a big issue. (It
always has been in databases as long as I can remember).

The FSM tree as proposed has exact values. What if we bubbled up based
upon only significant tree changes, rather than exact changes? After
all, we only really care about whether we can fit one tuple in, so
changes smaller than the avg row length have no real benefit, just
contention. So perhaps we can perform bubble-up only when the change in
free space is greater than the avg row size, or the remaining space has
dropped to less than 2*avg row size, where exact values begin to matter
more. That way fewer changes bubble up and fewer write locks are needed
on higher level pages. We probably need to track row size variation to
make this work effectively in some cases.
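
Something like this, as a sketch (tracking the avg row size is the hard
part, and is hand-waved away here):

    #include <stdbool.h>

    /* Only propagate a leaf change up the FSM tree when it is big enough
     * to matter for placing a typical row. */
    static bool
    bubble_up_needed(unsigned old_avail, unsigned new_avail,
                     unsigned avg_row_size)
    {
        unsigned change = (old_avail > new_avail) ? old_avail - new_avail
                                                  : new_avail - old_avail;

        return change > avg_row_size || new_avail < 2 * avg_row_size;
    }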

Also, as FSM map becomes empty we will see more and more bubble up
operations reaching top parts of the tree. We need a way to avoid
contention from growing over time.

I'm not happy about the FSM being WAL logged for minor changes (new
pages, yes). The high number of changes will cause extra traffic where
we don't want it. This will accentuate the locks in key areas and will
introduce additional delays into the code paths that use FSM, but don't
currently write WAL. Writing WAL will further exacerbate the expected
contention around the FSM root page.

We will write dirty FSM pages at checkpoint, so just allow that rather
than logging every change. If we crash, going back to the last
checkpoint is probably OK, since all mistakes will cause corrections in
the FSM anyway and it will soon correct itself. If the values are only
approximate this will make the post-crash imprecision of the FSM less
important anyway. If you're worried about sending the FSM to standby
servers, then we can copy it to WAL once every few checkpoints.

The tree structure seems regular. Can we jump straight to the bottom of
the tree when it's clear that there's tons of space available in a
certain part of the table?

One of the current functions of the FSM is to hand out a different
target block to each backend. This naturally reduces contention during
heavy writes. The current design uses random(), but I fear that may not
be very useful when the number of useful routes is reduced. Is there a
more deterministic and uniformly better way of separating out backends?
Using the bits of the pid to send the search different ways?
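
Something along these lines, perhaps (sketch only, names invented):

    #include <unistd.h>

    /* When more than one child of an FSM node has enough free space, use
     * successive bits of the backend's pid to choose between them, so
     * different backends tend to walk different routes down the tree.
     * ncandidates must be > 0. */
    static int
    pick_route(int ncandidates, int level)
    {
        unsigned int bits = (unsigned int) getpid() >> level;

        return (int) (bits % (unsigned int) ncandidates);
    }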

(On a humorous note, it surprises me that you have no time to work on
index enhancements, so the first thing you do is write a custom index
for the FSM :-)

> User documentation is still to be done. There isn't anything to tune, or
> anything that requires manual maintenance, so there isn't much to
> document, though, except perhaps a chapter in the "Internals" part.

Yeh, can't see anything I'd want a parameter for.

> Here's an updated version of the patch. Changes:
> - one bug has been fixed (I assumed that it's always safe to call
> rightchild(x) on a parent node, but that was not always true for the
> rightmost branch of the tree)
> - there's some new debugging code.
> - if we can't find a page with enough free space in search_avail after
> starting from scratch 1000 times, give up. That shouldn't happen, but
> see below.

Such a heuristic seems useful. I'd suggest that FSM page locks be
usually taken conditionally after the root level. So if one route is
busy, try another, unless the caller prefers to wait to allow keeping
the heap in order. Presumably the FSM can page to disk, so some waits
may be much longer than anybody cares to wait across.

> There is still a nasty bug somewhere, probably related to locking :-(.

Thats a generally true statement :-)

> I
> ran DBT-2 with this, and after about 1h a FSM lookup goes into an
> endless loop, hogging all CPU. I suspect that there's a bug somewhere so
> that a change to the root node of a lower level FSM page isn't
> propagated to the upper level FSM page for some reason. oprofile shows
> that the endless loop happens in search_avail. This is why I added the
> code to give up after 1000 tries, hoping to get some debugging output
> the next time that happens.

Maybe re-initialising the level value when jumping between pages, so it
restarts at level zero. Maybe call them level 1+ so you can spot that
more easily.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


Re: Relation forks & FSM rewrite patches

From
"Heikki Linnakangas"
Date:
Simon Riggs wrote:
> On Fri, 2008-07-04 at 12:27 +0300, Heikki Linnakangas wrote:
>> Simon Riggs wrote:
>>> On Fri, 2008-06-27 at 19:36 +0300, Heikki Linnakangas wrote:
>>>
>>>> Here's an updated version of the "relation forks" patch, and an
>>>> incremental FSM rewrite patch on top of that. The relation forks patch
>>>> is ready for review. The FSM implementation is more work-in-progress
>>>> still, but I'd like to get some review on that as well, with the goal of
>>>> doing more performance testing and committing it after the commit fest.
>>> Hmmm, a 6000 line patch with no visible documentation, readme, nor any
>>> discussion on -hackers that I'm aware of.
>> There is a readme in the patch, and there certainly has been discussion
>> on -hackers, see:
>>
>> http://archives.postgresql.org/pgsql-hackers/2008-04/msg00415.php
>>
>> where the current design is discussed for the first time.
>
> OK, I see the readme now. Thanks.
>
> Minor point but the readme needs to draw a clear distinction between FSM
> pages and data pages, which confused me when I read it first time.

Ok, thanks.

I had trouble finding distinctive terms for the tree within a page, and
the tree of FSM blocks, with root at block 0.

> Contention on FSM pages concerns me. Every change has the potential to
> bubble up to the top, which would cause a significant problem. I'd like
> to find a way to minimise the number of bubble-up operations, otherwise
> there will be no material difference between using a single whole-FSM
> lock like we do now and this new scheme. Note that in this new scheme
> the path length to the bottom of the tree is considerably longer and can
> stall waiting on I/O - so contention in the FSM is a big issue. (It
> always has been in databases as long as I can remember).

There are already some mitigating factors:

1. You only need to bubble up to an upper level FSM page if the value
at the top of the leaf FSM page changes. Keep in mind that one FSM page
holds FSM information on ~4000 heap pages, so you don't need to bubble
up if there's any page within that 4000 page range that has as much or
more space than the page you're updating. (Some rough numbers on that
below.)

2. We only update the FSM when we try to insert a tuple and find that it
doesn't fit. That reduces the amount of FSM updates dramatically when
you're doing bulk inserts. (This is the same as in the current
implementation, though I'm not sure it's optimal anymore.)
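
To put rough numbers on point 1 (back-of-the-envelope, assuming the
defaults and one byte per node):

    #define BLCKSZ 8192

    /* A binary tree with N leaves has about 2*N-1 one-byte nodes, so one
     * BLCKSZ-sized FSM page has room for roughly BLCKSZ/2 leaves, i.e.
     * ~4000 heap pages.  With three levels of FSM pages (bottom, middle,
     * root) that covers about 4000^3 = 6.4e10 heap pages, far more than
     * the 2^32 block numbers a relation can have anyway. */
    #define LEAVES_PER_FSM_PAGE  (BLCKSZ / 2)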

I haven't been able to come up with a simple test case that shows any
meaningful performance difference between the current and this new
implementation. Got any ideas? I fear that we're going overboard trying
to avoid contention that simply isn't there, but it's hard to argue
without a concrete test case to analyze...

> The FSM tree as proposed has exact values.

Not quite. Free space is tracked in BLCKSZ/256 (=32 with default BLCKSZ)
increments, so that we only need one byte per heap page.
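
In code the mapping is roughly this (sketch, not the exact function in
the patch):

    #include <stdint.h>

    #define BLCKSZ 8192

    /* One byte per heap page: the stored category is the free space in
     * BLCKSZ/256 = 32 byte increments.  avail is always less than BLCKSZ,
     * so the result fits in one byte. */
    static uint8_t
    fsm_space_to_category(unsigned avail)
    {
        return (uint8_t) (avail / (BLCKSZ / 256));
    }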

> What if we bubbled up based
> upon only significant tree changes, rather than exact changes?

Hmm. So an update would only ever update the lowest-level FSM page, and
leave a mismatch between the value at the root node of a lower level FSM
page, and the corresponding value at the leaf node of its parent. The
mismatch would then need to be fixed by the next search that traverses
down that path and finds that there's not enough space there after all.

That works when the amount of free space on a page is decremented.
VACUUM, which increments it, would still need to bubble up the change,
because if the value at the upper level is not fixed, no search might
ever traverse down that path, and the value would never be fixed.

That would solve the "I/O while holding lock" issue. (not that I'm too
worried about it, though)

> So perhaps we can perform bubble-up only when the change in
> free space is greater than the avg row size, or the remaining space has
> dropped to less than 2*avg row size, where exact values begin to matter
> more.

One idea is to make the mapping from "amount of free space in bytes" to
the 1-byte "FSM category" non-linear. For example, use just one category
for > 2000 bytes free, on the assumption that anything larger than that
is toasted anyway, and divide the 255 categories evenly across the
remaining 2000 bytes.
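
Roughly like this (sketch; the 2000-byte threshold is just the example
figure from above):

    #include <stdint.h>

    /* Non-linear variant: one category for anything >= 2000 bytes free,
     * and categories 0..254 spread evenly over 0..2000 bytes, giving
     * finer granularity where it matters for fitting a tuple. */
    static uint8_t
    fsm_space_to_category_nonlinear(unsigned avail)
    {
        if (avail >= 2000)
            return 255;
        return (uint8_t) ((avail * 255) / 2000);
    }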

I would like to move away from using the average row width in the
calculation if possible. We use it in the current implementation, but if
we want to keep doing it, we'll need to track it within the FSM like the
current implementation does, adding complexity and contention issues of
its own. Or reach into the statistics from the FSM, but I don't like
that idea much either.

> Also, as FSM map becomes empty we will see more and more bubble up
> operations reaching top parts of the tree. We need a way to avoid
> contention from growing over time.

Yeah, that's something to watch out for.

> I'm not happy about the FSM being WAL logged for minor changes (new
> pages, yes). The high number of changes will cause extra traffic where
> we don't want it. This will accentuate the locks in key areas and will
> introduce additional delays into the code paths that use FSM, but don't
> currently write WAL. Writing WAL will further exacerbate the expected
> contention around the FSM root page.

There are no such code paths AFAICS. The FSM is only updated on INSERTs,
UPDATEs and VACUUMs, and they're already WAL-logged. We will be writing
an extra WAL record, though, in addition to the ones we already write.

At first, I tried to piggy-back on the WAL replay routines of heap
inserts and updates, but it turned out to be quite complicated because
of the bubbling up. If we don't bubble up at update time, per your idea,
perhaps it would be possible after all.

> We will write dirty FSM pages at checkpoint, so just allow that rather
> than logging every change. If we crash, going back to the last
> checkpoint is probably OK, since all mistakes will cause corrections in
> the FSM anyway and it will soon correct itself. If the values are only
> approximate this will make the post-crash imprecision of the FSM less
> important anyway.

If we go down that path, we really need to make sure that the FSM is
self-correcting. The current implementation isn't: the bubble-up
operations need to be replayed from WAL, or you end up with an
incoherent FSM.

Changing the logic so that the upper nodes are fixed at searches, rather
than at updates, helps, but vacuums that increase the amount of free
space on a page would still need to be WAL-logged. Mixing WAL-logged and
non-WAL-logged operations sounds like a sure way to get confused.

> If you're worried about sending the FSM to standby
> servers, then we can copy it to WAL once every few checkpoints.

That seems hard. You'd need to somehow keep track of which FSM pages
have been modified since the last such operation, and trigger the copy
at the right moment.

We do need to handle standby servers somehow. That's one of the major
drawbacks of the current implementation.

> The tree structure seems regular. Can we jump straight to the bottom of
> the tree when it's clear that there's tons of space available in a
> certain part of the table?

Yes.

> One of the current functions of the FSM is to hand out a different
> target block to each backend. This naturally reduces contention during
> heavy writes. The current design uses random(), but I fear that may not
> be very useful when the number of useful routes is reduced. Is there a
> more deterministic and uniformly better way of separating out backends?
> Using the bits of the pid to send the search different ways?

I can't be too excited about that. If there are only a few useful
routes, you're just about to run out of free space anyway.

> I'd suggest that FSM page locks be
> usually taken conditionally after the root level. So if one route is
> busy, try another, unless the caller prefers to wait to allow keeping
> the heap in order.

We're only talking about lightweight locks here. Doing another
ReadBuffer, possibly causing I/O, and trying to lock another buffer
instead is surely more expensive than just waiting on the first lock.

>> I
>> ran DBT-2 with this, and after about 1h a FSM lookup goes into an
>> endless loop, hogging all CPU. I suspect that there's a bug somewhere so
>> that a change to the root node of a lower level FSM page isn't
>> propagated to the upper level FSM page for some reason. oprofile shows
>> that the endless loop happens in search_avail. This is why I added the
>> code to give up after 1000 tries, hoping to get some debugging output
>> the next time that happens.
>
> Maybe re-initialising the level value when jumping between pages, so it
> restarts at level zero. Maybe call them level 1+ so you can spot that
> more easily.

Now you lost me. 'level' is re-initialized in GetPageWithFreeSpace when
we start over.

BTW, level zero is the *bottom* level, and 1 is the "middle" level, and
2 is the root page.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Relation forks & FSM rewrite patches

From
Simon Riggs
Date:
On Fri, 2008-07-04 at 15:20 +0300, Heikki Linnakangas wrote:

> 2. We only update the FSM when we try to insert a tuple and find that it
> doesn't fit. That reduces the amount of FSM updates dramatically when
> you're doing bulk inserts. (This is the same as in the current
> implementation, though I'm not sure it's optimal anymore.)

OK, so the ideal search algorithm would be to search the FSM page you
just updated, rather than starting right from the top again. That way,
we'd only need to do one page access to do a page-full-need-another-page
operation. That's O(1) whatever the table size.

> I haven't been able to come up with a simple test case that shows any
> meaningful performance difference between the current and this new
> implementation. Got any ideas? I fear that we're going overboard trying
> to avoid contention that simply isn't there, but it's hard to argue
> without a concrete test case to analyze...

We do very poorly at measuring contention everywhere else too... yet we
know it exists and that we have a long way to go yet.

My concern is to avoid contention theoretically. Almost all the cases
of contention we've seen have been pulsing or occasional traffic-jam
cases, not nice steady contention. As a result we've repeatedly
underestimated the knock-on effects through the system.

Right now this scheme introduces new sources of wait:
* access to an FSM page while waiting for a buffer partition lock
* access to an FSM page while I/O occurs
* access to the FSM blocked by an exclusive lock while waiting for
WALInsertLock

All of those are significantly longer than any wait on the existing FSM
lwlock.

> > The FSM tree as proposed has exact values.
>
> Not quite. Free space is tracked in BLCKSZ/256 (=32 with default BLCKSZ)
> increments, so that we only need one byte per heap page.

OK, so that probably addresses my concerns then. The imprecision avoids
changes rippling across pages when only a few bytes differ between
values higher up the FSM tree. I like that.

> One idea is to make the mapping from "amount of free space in bytes" to
> the 1-byte "FSM category" non-linear. For example, use just one category
> for > 2000 bytes free, on the assumption that anything larger than that
> is toasted anyway, and divide the 255 categories evenly across the
> remaining 2000 bytes.

Hmm. Why not use 4 bits per page and store a log2 value? That still
gives nice big chunks, without too much difference.

i.e.
0-31 bytes free = 0
32-63 bytes free = 1
64-127 bytes free = 2
128-255 bytes free = 3
256-511 bytes free = 4
512-1023 bytes free = 5
1024-2047 bytes free = 6
etc.

That works for BLCKSZ > 8192, which the 1 byte linear scheme doesn't.

It would also give you a fan out of 8 rather than just 4.

It also leaves a free bit per page to record other information, when
BLCKSZ = 8192.
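
Expressed as code, the encoding above is simply (sketch):

    #include <stdint.h>

    /* 4-bit log2 encoding: 0-31 bytes -> 0, 32-63 -> 1, 64-127 -> 2, ...
     * doubling at each step; 15 categories cover any reasonable BLCKSZ. */
    static uint8_t
    fsm_space_to_log2_category(unsigned avail)
    {
        uint8_t cat = 0;

        avail /= 32;            /* anything below 32 bytes is category 0 */
        while (avail > 0 && cat < 15)
        {
            avail /= 2;
            cat++;
        }
        return cat;
    }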

> > I'm not happy about the FSM being WAL logged for minor changes (new
> > pages, yes). The high number of changes will cause extra traffic where
> > we don't want it. This will accentuate the locks in key areas and will
> > introduce additional delays into the code paths that use FSM, but don't
> > currently write WAL. Writing WAL will further exacerbate the expected
> > contention around the FSM root page.
>
> There are no such code paths AFAICS. The FSM is only updated on INSERTs,
> UPDATEs and VACUUMs, and they're already WAL-logged. We will be writing
> an extra WAL record, though, in addition to the ones we already write.

Right. So we'll have to queue for the damn lock twice.

> At first, I tried to piggy-back on the WAL replay routines of heap
> inserts and updates, but it turned out to be quite complicated because
> of the bubbling up. If we don't bubble up at update time, per your idea,
> perhaps it would be possible after all.
>
> > We will write dirty FSM pages at checkpoint, so just allow that rather
> > than logging every change. If we crash, going back to the last
> > checkpoint is probably OK, since all mistakes will cause corrections in
> > the FSM anyway and it will soon correct itself. If the values are only
> > approximate this will make the post-crash imprecision of the FSM less
> > important anyway.
>
> If we go down that path, we really need to make sure that the FSM is
> self-correcting. The current implementation isn't, the bubble up
> operations need to be replayed from WAL, or you end up with an
> incoherent FSM.

Well, make changes across pages WAL-logged, so the FSM always links up
correctly, but might be wrong on some of the internal nodes. We don't
want the structure to be incoherent, but the values themselves aren't
very exciting.

> Changing the logic so that the upper nodes are fixed at searches, rather
> than the updates, helps, but vacuums that increase the amount of free
> space on page would still need to be WAL-logged. Mixing WAL-logged and
> non-WAL-logged operations sounds like a sure way to get confused.
>

> > I'd suggest that FSM page locks be
> > usually taken conditionally after the root level. So if one route is
> > busy, try another, unless the caller prefers to wait to allow keeping
> > the heap in order.
>
> We're only talking about lightweight locks here. Doing another
> ReadBuffer, possibly causing I/O, and trying to lock another buffer
> instead is surely more expensive than just waiting on the first lock.

Then you should do it with a conditional ReadBuffer, so we don't do I/O
if we don't need to.

> > Maybe re-initialising the level value when jumping between pages, so it
> > restarts at level zero. Maybe call them level 1+ so you can spot that
> > more easily.
>
> Now you lost me. 'level' is re-initialized in GetPageWithFreeSpace when
> we start over.
>
> BTW, level zero is the *bottom* level, and 1 is the "middle" level, and
> 2 is the root page.

Yes, so reinitialising the level variable would cause the search to
start again at the bottom of the tree, hence looping.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


Re: Relation forks & FSM rewrite patches

From
Alvaro Herrera
Date:
Heikki Linnakangas wrote:
> Simon Riggs wrote:

>> What if we bubbled up based
>> upon only significant tree changes, rather than exact changes?
>
> Hmm. So an update would only ever update the lowest-level FSM page, and
> leave a mismatch between the value at the root node of a lower level FSM
> page, and the corresponding value at the leaf node of its parent. The
> mismatch would then need to be fixed by the next search that traverses
> down that path and finds that there's not enough space there after all.
>
> That works when the amount of free space on a page is decremented.
> VACUUM, which increments it, would still need to bubble up the change,
> because if the value at the upper level is not fixed, no search might
> ever traverse down that path, and the value would never be fixed.

I wonder if we could instead make vacuum write a new FSM tree from
scratch, rather than updating it piecemeal.  The problem that arises is
a concurrent updater.  Maybe we can just ignore those (i.e. someone that
consumes free space while vacuum is running does not update the FSM),
and let the bogus values (which can be higher, but not lower, than the
actual free space) be updated later, when someone tries to use it and
finds that it's not correct.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Re: Relation forks & FSM rewrite patches

From
"Heikki Linnakangas"
Date:
Alvaro Herrera wrote:
> I wonder if we could instead make vacuum write a new FSM tree from
> scratch, rather than updating it piecemeal.

I'd like to move away from that, as we'll hopefully get some sort of
partial vacuum capability soon.

We might want to have some sort of bulk update operation, though, so
that you could refresh FSM information for say 100 pages, or one FSM
page, at a time. That would reduce the WAL traffic of a VACUUM
significantly, though I'm not sure how significant that is to begin with.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Relation forks & FSM rewrite patches

From
Tom Lane
Date:
"Heikki Linnakangas" <heikki@enterprisedb.com> writes:
> Here's an updated version of the "relation forks" patch, and an
> incremental FSM rewrite patch on top of that. The relation forks patch
> is ready for review. The FSM implementation is more work-in-progress
> still, but I'd like to get some review on that as well, with the goal of
> doing more performance testing and committing it after the commit fest.

I looked through the "relation forks" patch, and think it's close to
committable, but I have a few minor gripes.

> The one part that I'm not totally satisfied with in the relation forks
> patch is the smgrcreate() function. The question is: which piece of
> code decides which forks to create for a relation, and when to create
> them? I settled on the concept that all forks that a relation will need
> are created at once, in one smgrcreate() call. There's no function to
> create additional forks later on.

I think that's an extremely bad idea.  You need look no further than
Zdenek's hopes of in-place-upgrade to see a need for adding a fork
to an existing relation; but even without that I can see possibly
wanting to not create a fork right away.  I think smgrcreate() should
just create one fork as it's told (ie, same API you gave mdcreate).

> Currently, heap_create() decides which forks to create. That's fine at
> the moment, but will become problematic in the future, as it's
> indexam-specific which forks an index requires. That decision should
> really be done in the indexam. One possibility would be to only create
> the main fork in heap_create(), and let indexam create any additional
> forks it needs in ambuild function.

Problem goes away if you redo smgrcreate API as above.
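
That is, something like this (sketch; I'm going from memory of the
patch's mdcreate, so the exact types may differ):

    /* create a single named fork of the relation */
    extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);

and then an index AM that wants an extra fork just creates it from its
ambuild routine, e.g. (hypothetical fork number)

    smgrcreate(index->rd_smgr, SOME_EXTRA_FORKNUM, false);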

Other nits:

relpath() is now in danger of buffer overrun: you need to increase
the palloc() sizes.  Seems like a #define for the max number of digits
in a ForkNumber wouldn't be out of place (though I sure hope it never
gets past 1 ...).

Also, I strongly suggest *not* appending _0 to the name of the main fork's
file.  This'd break on-disk compatibility and people's expectations,
for no particular reason.

Don't like the name NUM_FORKS; it seems to suggest that's the actual
number of forks in existence.  MAX_NUM_FORKS would be better.

I think that setting InvalidForkNumber to -1 is unportable: there is
nothing compelling the compiler to represent enum ForkNumber as a signed
type, so the places where you assume a ForkNumber variable can hold
InvalidForkNumber all look like trouble.  One solution is to include
InvalidForkNumber in the enum:

    enum ForkNumber {
        InvalidForkNumber = -1,
        MAIN_FORKNUM = 0
    };

This would be a bit annoying if someone wrote a switch statement listing
different possible fork numbers, as the compiler would complain if
there's no case for InvalidForkNumber; but I can't see a reason for code
like that to be common.

Another possibility is to set InvalidForkNumber = 0, MAIN_FORKNUM = 1;
which is still a bit of a type cheat if InvalidForkNumber isn't listed
in the enum, but one that's very unlikely to cause trouble.  A bigger
problem is it would mess up a lot of your loops.
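
For instance, the patch's loops over all forks look more or less like

    ForkNumber forknum;

    for (forknum = 0; forknum < NUM_FORKS; forknum++)
    {
        /* ... operate on this fork of the relation ... */
    }

and every one of them would have to become 1-based if 0 meant "invalid".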

BTW, don't put a comma at the end of enum ForkNumber; I think some
compilers will barf on that.

Last, don't forget to update the sgml docs chapter on database physical
storage.

            regards, tom lane

Re: Relation forks & FSM rewrite patches

From
"Heikki Linnakangas"
Date:
Attached is a new version of the relation forks and FSM patches,
updated per Tom's comments (details below). I think the relation forks
patch is ready for commit, except that the docs on physical storage need
to be updated. Barring any objections, I will commit the relation forks
patch in a few days, and submit a patch for the docs.

The FSM patch has been updated so that it applies over the new relation
forks patch.

Tom Lane wrote:
> "Heikki Linnakangas" <heikki@enterprisedb.com> writes:
>> The one part that I'm not totally satisfied with in the relation forks
>> patch is the smgrcreate() function. The question is: which piece of
>> code decides which forks to create for a relation, and when to create
>> them? I settled on the concept that all forks that a relation will need
>> are created at once, in one smgrcreate() call. There's no function to
>> create additional forks later on.
>
> I think that's an extremely bad idea.  You need look no further than
> Zdenek's hopes of in-place-upgrade to see a need for adding a fork
> to an existing relation; but even without that I can see possibly
> wanting to not create a fork right away.  I think smgrcreate() should
> just create one fork as it's told (ie, same API you gave mdcreate).

Yeah, that's better.

> Other nits:
>
> relpath() is now in danger of buffer overrun: you need to increase
> the palloc() sizes.

Oops, fixed.

>  Seems like a #define for the max number of digits
> in a ForkNumber wouldn't be out of place (though I sure hope it never
> gets past 1 ...).

Done. It makes it more obvious how the buffer length is calculated than
using plain numbers.

> Also, I strongly suggest *not* appending _0 to the name of the main fork's
> file.  This'd break on-disk compatibility and people's expectations,
> for no particular reason.

Done.

> Don't like the name NUM_FORKS; it seems to suggest that's the actual
> number of forks in existence.  MAX_NUM_FORKS would be better.

I don't like MAX_NUM_FORKS. I renamed it to MAX_FORKNUM (and changed its
semantics accordingly).

> I think that setting InvalidForkNumber to -1 is unportable: there is
> nothing compelling the compiler to represent enum ForkNumber as a signed
> type, so the places where you assume a ForkNumber variable can hold
> InvalidForkNumber all look like trouble.  One solution is to include
> InvalidForkNumber in the enum:
>
>     enum ForkNumber {
>         InvalidForkNumber = -1,
>         MAIN_FORKNUM = 0
>     };
>
> This would be a bit annoying if someone wrote a switch statement listing
> different possible fork numbers, as the compiler would complain if
> there's no case for InvalidForkNumber; but I can't see a reason for code
> like that to be common.

I chose this approach. There are no switch constructs like that at the
moment, and I don't see any immediate need for one. In fact, at the
moment InvalidForkNumber is only used in bgwriter database fsync
requests, where the fork number field doesn't mean anything.

> BTW, don't put a comma at end of enum ForkNumber, I think some compilers
> will barf on that.

Done.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com
