Re: Relation forks & FSM rewrite patches - Mailing list pgsql-patches

From Heikki Linnakangas
Subject Re: Relation forks & FSM rewrite patches
Date
Msg-id 486E159D.2050405@enterprisedb.com
Whole thread Raw
In response to Re: Relation forks & FSM rewrite patches  (Simon Riggs <simon@2ndquadrant.com>)
Responses Re: Relation forks & FSM rewrite patches
Re: Relation forks & FSM rewrite patches
List pgsql-patches
Simon Riggs wrote:
> On Fri, 2008-07-04 at 12:27 +0300, Heikki Linnakangas wrote:
>> Simon Riggs wrote:
>>> On Fri, 2008-06-27 at 19:36 +0300, Heikki Linnakangas wrote:
>>>
>>>> Here's an updated version of the "relation forks" patch, and an
>>>> incremental FSM rewrite patch on top of that. The relation forks patch
>>>> is ready for review. The FSM implementation is more work-in-progress
>>>> still, but I'd like to get some review on that as well, with the goal of
>>>> doing more performance testing and committing it after the commit fest.
>>> Hmmm, a 6000 line page with no visible documentation, readme, nor any
>>> discussion on -hackers that I'm aware of.
>> There is a readme in the patch, and there certainly has been discussion
>> on -hackers, see:
>>
>> http://archives.postgresql.org/pgsql-hackers/2008-04/msg00415.php
>>
>> where the current design is discussed for the first time.
>
> OK, I see the readme now. Thanks.
>
> Minor point but the readme needs to draw a clear distinction between FSM
> pages and data pages, which confused me when I read it first time.

Ok, thanks.

I had trouble finding distinctive terms for the tree within a page, and
the tree of FSM blocks, with root at block 0.

> Contention on FSM pages concerns me. Every change has the potential to
> bubble up to the top, which would cause a significant problem. I'd like
> to find a way to minimise the number of bubble-up operations, otherwise
> there will be no material difference between using a single whole-FSM
> lock like we do now and this new scheme. Note that in this new scheme
> the path length to the bottom of the tree is considerably longer and can
> stall waiting on I/O - so contention in the FSM is a big issue. (It
> always has been in databases as long as I can remember).

There's already some mitigating factors:

1. You only need to bubble up to an upper level FSM page if the amount
at the top of the leaf FSM page changes. Keep in mind that one FSM page
holds FSM information on ~4000 heap pages, so you don't need to bubble
up if there's any page within that 4000 page range that has as much or
more space than the page you're updating.

2. We only update the FSM when we try to insert a tuple and find that it
doesn't fit. That reduces the amount of FSM updates dramatically when
you're doing bulk inserts. (This is the same as in the current
implementation, though I'm not sure it's optimal anymore.)

I haven't been able to come up with a simple test case that shows any
meaningful performance difference between the current and this new
implementation. Got any ideas? I fear that we're going overboard trying
to avoid contention that simple isn't there, but it's hard to argue
without a concrete test case to analyze..

> The FSM tree as proposed has exact values.

Not quite. Free space is tracked in BLCKSZ/256 (=32 with default BLCKSZ)
increments, so that we only need one byte per heap page.

> What if we bubbled up based
> upon only significant tree changes, rather than exact changes?

Hmm. So an update would only ever update the lowest-level FSM page, and
leave a mismatch between the value at the root node of a lower level FSM
page, and the corresponding value at the lead node of its parent. The
mismatch would then need to be fixed by the next search that traverses
down that path, and finds that there's not enough space there after all.

That works when the amount of free space on page is decremented. VACUUM,
that increments it, would still need to bubble up the change, because if
the value at the upper level is not fixed, no search might ever traverse
down that path, and the value would never be fixed.

That would solve the "I/O while holding lock" issue. (not that I'm too
worried about it, though)

> So perhaps we can perform bubble-up only when the change in
> free space in greater than avg row size or the remaining space has
> dropped to less than 2*avg row size, where exact values begin to matter
> more.

One idea is to make the mapping from "amount of free space in bytes" to
the 1-byte "FSM category" non-linear. For example, use just one category
for > 2000 bytes free, on the assumption that anything larger than that
is toasted anyway, and divide the 255 categories evenly across the
remaining 2000 bytes.

I would like to move away from using the average row width in the
calculation if possible. We use it in the current implementation, but if
we won't to keep doing it, we'll need to track it within the FSM like
the current implementation does, adding complexity and contention issues
of its own. Or reach into the statistics from the FSM, but I don't like
that idea much either.

> Also, as FSM map becomes empty we will see more and more bubble up
> operations reaching top parts of the tree. We need a way to avoid
> contention from growing over time.

Yeah, that's something to watch out for.

> I'm not happy about the FSM being WAL logged for minor changes (new
> pages, yes). The high number of changes will cause extra traffic where
> we don't want it. This will accentuate the locks in key areas and will
> introduce additional delays into the code paths that use FSM, but don't
> currently write WAL. Writing WAL will further exacerbate the expected
 > contention around the FSM root page.

There's no such code paths AFAICS. The FSM is only updated on INSERTS,
UPDATEs and VACUUMs, and they're already WAL-logged. We will be writing
an extra WAL record, though, in addition to the ones we already write.

At first, I tried to piggy-back on the WAL replay routines of heap
inserts and updates, but it turned out to be quite complicated because
of the bubbling up. If we don't bubble up at update time, per your idea,
perhaps it would be possible after all.

> We will write dirty FSM pages at checkpoint, so just allow that rather
> than logging every change. If we crash, going back to the last
> checkpoint is probably OK, since all mistakes will cause corrections in
> the FSM anyway and it will soon correct itself. If the values are only
> approximate this will make the post-crash imprecision of the FSM less
> important anyway.

If we go down that path, we really need to make sure that the FSM is
self-correcting. The current implementation isn't, the bubble up
operations need to be replayed from WAL, or you end up with an
incoherent FSM.

Changing the logic so that the upper nodes are fixed at searches, rather
than the updates, helps, but vacuums that increase the amount of free
space on page would still need to be WAL-logged. Mixing WAL-logged and
non-WAL-logged operations sounds like a sure way to get confused.

> If you're worried about sending the FSM to standby
> servers, then we can copy it to WAL once every few checkpoints.

That seems hard. You'd need to somehow keep track of which FSM pages has
been modified since last such operation, and trigger the copy at the
right moment.

We do need to handle standby servers somehow. That's one of the major
drawbacks of the current implementation.

> The tree structure seems regular. Can we jump straight to bottom of tree
> when its clear that there's tons of space available in certain part of
> the table?

Yes.

> One of the current functions of the FSM is to hand out a different
> target block to each backend. This naturally reduces contention during
> heavy writes. The current design uses random(), but I fear that may not
> be very useful when the number of useful routes is reduced. Is there a
> more deterministic and uniformly better way of separating out backends?
> Using the bits of the pid to send the search different ways?

I can't be too excited about that. If there's only few useful routes,
you're just about to run out of free space anyway.

> I'd suggest that FSM page locks be
> usually taken conditionally after the root level. So if one route is
> busy, try another, unless the caller prefers to wait to allow keeping
> the heap in order.

We're only talking about lightweight locks here. Doing another
ReadBuffer, possibly causing I/O, and trying to lock another buffer
instead is surely more expensive than just waiting on the first lock.

>> I
>> ran DBT-2 with this, and after about 1h a FSM lookup goes into an
>> endless loop, hogging all CPU. I suspect that there's a bug somewhere so
>> that a change to the root node of a lower level FSM page isn't
>> propagated to the upper level FSM page for some reason. oprofile shows
>> that the endless loop happens in search_avail. This is why I added the
>> code to give up after 1000 tries, hoping to get some debugging output
>> the next time that happens.
>
> Maybe re-initialising the level value when jumping between pages, so it
> restarts at level zero. Maybe call them level 1+ so you can spot that
> more easily.

Now you lost me. 'level' is re-initialized in GetPageWithFreeSpace when
we start over.

BTW, level zero is the *bottom* level, and 1 is the "middle" level, and
2 is the root page.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

pgsql-patches by date:

Previous
From: Zdenek Kotala
Date:
Subject: Re: page macros cleanup (ver 04)
Next
From: Simon Riggs
Date:
Subject: Re: Relation forks & FSM rewrite patches