
From: Peter Geoghegan
Subject: Re: The Free Space Map: Problems and Opportunities
Date:
Msg-id: CAH2-Wzk2dU7xnKrdfs0O_eg=uXdBBC9XbwBM4yk5jXLRht4xyQ@mail.gmail.com
In response to: Re: The Free Space Map: Problems and Opportunities (Bruce Momjian <bruce@momjian.us>)
List: pgsql-hackers
On Mon, Aug 16, 2021 at 3:49 PM Bruce Momjian <bruce@momjian.us> wrote:
> OK, here is my feedback.  First, I understand the space
> reservation/soccer idea, but I am also concerned that adding space
> reservation could improve some things and make others worse.

That is definitely a legitimate concern.

There is a trade-off here: if we do too much preallocation, there may
be cases where the preallocated space that we expected to be used
quickly is never used by anybody. But that doesn't seem like a
problem, provided we don't lose track of the fact that it happened.
Then the worst that can happen is that we have empty pages that nobody
will ever use, because nobody inserts into the table ever again (not
the backend with the leaked space lease, not anybody else). That seems
manageable. We can add some kind of ramp-up behavior that gets more
and more aggressive as demand for new space from backends is steady or
increasing.
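
Purely to illustrate the kind of ramp-up policy I mean (none of these
names or constants exist today; this is just a sketch):

    #include "postgres.h"

    /*
     * Hypothetical ramp-up policy sketch: hand out a single block at
     * first, then double the size of each subsequent bulk allocation
     * for the same relation while demand stays steady, up to a cap.
     * Back off to the initial size as soon as demand dries up.  All
     * names and numbers here are made up for illustration only.
     */
    #define INITIAL_EXTEND_BLOCKS   1
    #define MAX_EXTEND_BLOCKS       64

    static int
    next_extend_size(int prev_extend_size, bool demand_was_steady)
    {
        if (!demand_was_steady)
            return INITIAL_EXTEND_BLOCKS;   /* demand dried up: back off */

        return Min(prev_extend_size * 2, MAX_EXTEND_BLOCKS);
    }

The point is only that aggressiveness is a function of observed
demand, so over-allocation is self-limiting.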

> Second, let's look at a concrete example, and see how it can be improved
> more simply.  As I understand it, there are three operations that need
> free space:
>
>         1.  INSERT/COPY
>         2.  UPDATE where there is insufficient space on the page of the
>             old row
>         3.  UPDATE where there is sufficient page space

Right.

> The third option already can use the existing page for a HOT update, so
> the FSM doesn't matter.

I agree.

All good questions. Thank you for diving in on this.

> For 1 & 2, suppose you have a table with 10 8k
> pages, all 80% full.  As I understand it, the first page will go from
> 80% to 81%, then the next row addition will take the second page from
> 80% to 81%, until all pages are 81%, and then it starts over again.  Is
> that accurate?

No. Generally, backends have their own target block (a simple
BlockNumber) that they cache in the relcache, one per recently
modified table. The heapam code uses RelationGetTargetBlock() to
access this local cache. So there is "backend affinity" for particular
individual heap pages, even today. However, that still has many
problems.
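
To make that concrete, here is a heavily simplified sketch of that
pattern. This is not the real hio.c code: choose_target_block() is a
made-up helper, and "len" stands for the space the incoming tuple
needs, fillfactor reservation included.

    #include "postgres.h"
    #include "storage/block.h"
    #include "storage/freespace.h"
    #include "utils/rel.h"

    /*
     * Heavily simplified sketch of today's behavior: try the per-backend
     * target block cached in the relcache first, and only fall back to
     * the shared FSM when no block is cached.  The real logic lives in
     * hio.c's RelationGetBufferForTuple(); this helper is made up.
     */
    static BlockNumber
    choose_target_block(Relation relation, Size len)
    {
        BlockNumber targetBlock = RelationGetTargetBlock(relation);

        if (targetBlock == InvalidBlockNumber)
        {
            /* No cached target block for this backend: ask the shared FSM */
            targetBlock = GetPageWithFreeSpace(relation, len);
        }

        /*
         * The real code then pins and locks the page, re-checks its free
         * space with PageGetHeapFreeSpace(), and loops through
         * RecordAndGetPageWithFreeSpace() if the tuple no longer fits.
         * On success the block is remembered for next time:
         */
        RelationSetTargetBlock(relation, targetBlock);

        return targetBlock;
    }

The cached BlockNumber is private to each backend, but the page it
points to is not, which is where the trouble starts.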

It doesn't make sense to have a local cache for a shared resource --
that's the problem. You actually need some kind of basic locking or
lease system, so that 10 backends don't all decide at the same time
that one particular heap block is fully empty, and therefore a good
target block for that backend alone. It's as if the backends believe
that they're special little snowflakes, and that no other backend
could possibly be thinking the same thing at the same time about the
same heap page. And so when TPC-C does its initial bulk insert,
distinct orders are already shuffled together in a hodge-podge, just
because concurrent bulk inserters all insert on the same heap pages.

If we could safely give out 10 or 50 pages directly to each backend,
then clearly the related data would be stored together in this
scenario -- each backend would be able to work through its lease of
contiguous heap pages for quite a while before revisiting the
FSM/relation extension. The trick is to teach the implementation to do
that without allowing the backend to permanently leak entire leases
with maybe dozens of free pages.
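
As a strawman, a per-backend lease could be as simple as this
(entirely hypothetical; nothing like it exists in PostgreSQL today):

    #include "postgres.h"
    #include "storage/block.h"

    /*
     * Entirely hypothetical sketch of a per-backend free space lease: a
     * contiguous run of heap blocks handed out in bulk, with enough
     * bookkeeping that a leaked lease can be reclaimed once its owner
     * is no longer running.  None of these names exist today.
     */
    typedef struct FreeSpaceLease
    {
        Oid             relid;          /* table the lease is for */
        BlockNumber     firstBlock;     /* first block in the lease */
        int             nblocks;        /* total blocks handed out */
        int             nextOffset;     /* next unused block within the lease */
        TransactionId   owner;          /* xid that requested the space */
    } FreeSpaceLease;

The important part is the owner field: the FSM (or something like it)
would know which transaction asked for the space, so that the
remainder can be taken back rather than leaking forever.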

Systems like DB2 and Oracle probably can't have this problem. Even the
bulk lease case is just an extension of something they had to do
anyway. You see, some kind of FSM lease system that knows about
transaction lifetime for any given piece of free space is strictly
necessary with UNDO-based rollback. Without that, transaction rollback
breaks in certain edge cases involving concurrent inserts into the
same page, right before a rollback that needs to put an updated
version back in place. If the updated version from undo happens to be
physically larger than the now-aborted successor version, and if there
is little or no space left to cover that, what can rollback do about
it? Nothing. It's a nasty bug. They use heavyweight locks to avoid
this.

> Is that the non-temporal memory issue you are concerned
> about?

When I talk about memory or time, what I'm usually referring to is the
ability to manage larger allocations of multiple blocks for a while
after they're originally requested, so that the requesting
transaction/backend gets a real opportunity to use the space it asked
for. It shouldn't be completely trusted, though. We should trust but
verify.

> Doesn't spreading the new rows out increase the ability to do
> HOT updates?

It's true that the problem scenario with TPC-C does involve updates,
and it's true that that's when things really go badly. But I
deliberately haven't gone into the role that the updates play there
too much. Not just yet.

It's important to note that the problems really do start when we
insert, even if that's not obvious -- that's when the locality is
lost, right when the original order transaction comes in. We lose
locality just because we have concurrent inserters into the same
table, where each transaction inserts multiple related rows. We fail
to group related rows together right from the start, setting us up for
failure later on.

This is actually quite simple -- the chaos that follows is where it
gets truly complicated (and not necessarily worth getting into just
yet). The update of a given order and its order line entries takes
place hours later, within a delivery transaction.

Another concern is knock-on effects *after* any initial problematic
updates -- that's certainly not where it ends. Every action has
consequences, and those consequences themselves have consequences. By
migrating a row from an earlier block into a later block (both
physically and temporally earlier/later), we're not just changing
things for that original order row or rows (the order non-HOT-updated
by the delivery transaction). The old/original version left behind by
the delivery transaction will itself get vacuumed eventually, at which
point its space goes into the FSM. And so that same block will receive
some entirely unrelated version (it could come from a new insert or an
update). It's systematic.

We dare not waste even a tiny amount of free space. Surely we should cut
our losses at some point and accept that we're "wasting space", but
there is never any natural backpressure. Nothing to break the cycle of
fragmentation. (This is another thing involving time, not quite
related to my central point about free space leases.)

--
Peter Geoghegan


