Thread: Problems with the FSM, heap fillfactor, and temporal locality

Problems with the FSM, heap fillfactor, and temporal locality

From
Peter Geoghegan
Date:
I'm concerned about how the FSM gives out pages to heapam. Disabling
the FSM entirely helps TPC-C/BenchmarkSQL, which uses non-default heap
fillfactors for most tables [1]. Testing has shown that this actually
increases throughput for the benchmark (as measured in TPM) by 5% -
9%, even though my approach is totally naive and stupid. My approach
makes one or two small tables much bigger, but that doesn't have that
much downside for the workload in practice. My approach helps by
accidentally improving temporal locality -- related records are more
consistently located on the same block, which in practice reduces the
number of pages dirtied and the number of FPIs generated. TPC-C has a
tendency to insert a set of related tuples together (e.g., order lines
from an order), while later updating all of those records together.

Interestingly, the problems start before we even begin the benchmark
proper, and can be observed directly using simple ad-hoc queries (I
developed some simple SQL queries involving ctid for this).
BenchmarkSQL's initial bulk loading is performed by concurrent workers
that insert related groups of tuples into tables, so that we start out
with a realistic amount of old orders to refer back to, etc. I can
clearly observe that various natural groupings (e.g., grouping order
lines by order number, grouping customers by district + warehouse)
actually get initially inserted in a way that leaves tuples in a
grouping spread around an excessive number of heap blocks. For
example, while most order lines do fit on one block, there is a
significant minority of orderlines that span two or more blocks for
the master branch case. Whereas with the FSM artificially disabled,
the heap relation looks more "pristine" in that related tuples are
located on the same blocks (or at least on neighboring blocks). It's
possible that one orderline will span two neighboring blocks here, but
it will never span three or more blocks. Each order has 5 - 15 order
lines, and so I was surprised to see that a small minority of order
line tuples end up occupying as many as 5 or 7 heap pages on the
master branch (i.e. with the current FSM intact during bulk loading).

The underlying cause of this "bulk inserts are surprisingly
indifferent to locality" issue seems to be that heap am likes to
remember small amounts of space from earlier pages when the backend
couldn't fit one last tuple on an earlier target page (before
allocating a new page that became the new relcache target page in
turn). This is penny wise and pound foolish, because we're eagerly
using a little bit more space in a case where we are applying a heap
fill factor anyway. I think that we also have a number of related
problems.
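
To make that concrete, the logic in hio.c's RelationGetBufferForTuple()
goes roughly like this -- heavily paraphrased from memory, so don't take
the details too literally:

/* Paraphrased sketch of the current behavior -- not the actual code */
targetBlock = RelationGetTargetBlock(relation);
if (targetBlock == InvalidBlockNumber && use_fsm)
{
    /* ask the FSM for *any* page with enough space, anywhere in the rel */
    targetBlock = GetPageWithFreeSpace(relation, len + saveFreeSpace);
}

while (targetBlock != InvalidBlockNumber)
{
    /* read and lock targetBlock, then: */
    pageFreeSpace = PageGetHeapFreeSpace(page);
    if (len + saveFreeSpace <= pageFreeSpace)
    {
        /* room for the tuple plus the fillfactor reserve -- use it */
        RelationSetTargetBlock(relation, targetBlock);
        return buffer;
    }

    /*
     * The tuple doesn't fit.  The leftover space on the old target page
     * gets recorded in the FSM, where any unrelated backend may be handed
     * it later on, and we ask for another page "with enough space", with
     * no regard for where that page actually is.
     */
    targetBlock = RecordAndGetPageWithFreeSpace(relation, targetBlock,
                                                pageFreeSpace,
                                                len + saveFreeSpace);
}

/* ...otherwise extend the relation, and make that the new target block */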

It seems to me that we don't make enough effort to store related heap
tuples together -- both during bulk inserts like this, but also during
subsequent updates that cannot fit successor tuples on the same heap
page. The current design of the FSM seems to assume that it doesn't
matter where the free space comes from, as long as you get it from
somewhere and as long as fill factor isn't violated -- it cares about
the letter of the fill factor law without caring about its spirit or
intent.

If the FSM tried to get free space from a close-by block, then we
might at least see related updates that cannot fit a successor tuple
on the same block behave in a coordinated fashion. We might at least
have both updates relocate the successor tuple to the same
mostly-empty block -- they both notice the same nearby free block, so
both sets of successor tuples end up going on the same mostly-empty
block. The two updating backends don't actually coordinate, of course
-- this just happens as a consequence of looking for nearby free
space.

The FSM should probably be taught to treat pages as free space
candidates (candidates to give out free space from) based on more
sophisticated, locality-aware criteria. The FSM should care about the
*rate of change* for a block over time. Suppose you have heap fill
factor set to 90. Once a heap block reaches fill factor% full, it
ought to not be used to insert new tuples unless and until the used
space on the block shrinks *significantly* -- the free space is now
supposed to be reserved. It should not be enough for the space used on
the page to shrink by just 1% (to 89%). Maybe it should have to reach
as low as 50% or less before we flip it back to "free to take space
from for new unrelated tuples". The idea is that fill factor space is
truly reserved for updates -- that should be "sticky" for all of this
to work well.
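
As a rough sketch of the rule I have in mind (invented names, nothing
like this exists today, and the hard part -- tracking whether the page
ever reached fillfactor -- is ignored here):

#include <stdbool.h>
#include <stddef.h>

/*
 * Sketch only: should this heap page's free space be advertised to
 * backends inserting new, unrelated tuples?
 */
static bool
page_space_is_givable(size_t usedBytes, size_t usableBytes,
                      int fillfactor, bool reachedFillfactor)
{
    size_t  ffLimit = usableBytes * fillfactor / 100;
    size_t  releaseLimit = usableBytes / 2; /* 50% -- could be a reloption */

    if (!reachedFillfactor)
        return usedBytes < ffLimit;     /* page is still filling up normally */

    /*
     * The page reached fillfactor at some point, so its remaining space
     * is "sticky": reserved for updates of tuples already on the page,
     * until usage drains all the way down to the release threshold.
     */
    return usedBytes <= releaseLimit;
}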

What's the point in having the concept of a heap fill factor at all if
we don't make any real effort to enforce that the extra free space
left behind is used as intended, for updates of tuples located on the
same heap page?

Thoughts?

[1] https://github.com/petergeoghegan/benchmarksql/commit/3ef4fe71077b40f56b91286d4b874a15835c241e
--
Peter Geoghegan



Re: Problems with the FSM, heap fillfactor, and temporal locality

From
John Naylor
Date:
On Fri, Aug 21, 2020 at 2:48 AM Peter Geoghegan <pg@bowt.ie> wrote:
>
> I'm concerned about how the FSM gives out pages to heapam. Disabling
> the FSM entirely helps TPC-C/BenchmarkSQL, which uses non-default heap
> fillfactors for most tables [1].

Hi Peter,

Interesting stuff. Is lower-than-default fillfactor important for the
behavior you see?

> located on the same blocks (or at least on neighboring blocks). It's
> possible that one orderline will span two neighboring blocks here, but
> it will never span three or more blocks. Each order has 5 - 15 order
> lines, and so I was surprised to see that a small minority of order
> line tuples end up occupying as many as 5 or 7 heap pages on the
> master branch (i.e. with the current FSM intact during bulk loading).

Hmm. You focus on FSM, but I'm thinking also multiple VM bits
potentially got cleared (maybe not this benchmark but other
scenarios).

> If the FSM tried to get free space from a close-by block, then we
> might at least see related updates that cannot fit a successor tuple
> on the same block behave in a coordinated fashion. We might at least
> have both updates relocate the successor tuple to the same
> mostly-empty block -- they both notice the same nearby free block, so
> both sets of successor tuples end up going on the same mostly-empty
> block. The two updating backends don't actually coordinate, of course
> -- this just happens as a consequence of looking for nearby free
> space.

I'm not sure I follow the "close-by" criterion. It doesn't seem
immediately relevant to what I understand as the main problem you
found, of free space. In other words, if we currently write to 5
blocks, but with smarter FSM logic we can find a single block, it
seems that should be preferred over close-by blocks each with less
space? Or are you envisioning scoring by both free space and distance
from the original block?

> supposed to be reserved. It should not be enough for the space used on
> the page to shrink by just 1% (to 89%). Maybe it should have to reach
> as low as 50% or less before we flip it back to "free to take space
> from for new unrelated tuples". The idea is that fill factor space is
> truly reserved for updates -- that should be "sticky" for all of this
> to work well.

Makes sense. If we go this route, I wonder if we should keep the
current behavior and use any free space if the fillfactor is 100%,
since that's in line with the intention. Also, the 50% (or whatever)
figure could be scaled depending on fillfactor. As in, check if
freespace > (100% - fillfactor * 0.6), or something.
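(With fillfactor = 90, for instance, that works out to waiting until
more than 46% of the page is free before giving out its space again.)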

I'm not sure how to distinguish blocks that have never reached
fillfactor vs. ones that did but haven't gained back enough free space
to be considered again. Naively, we could start by assuming the last
block can always be filled up to fillfactor, but earlier blocks must
use the stricter rule. That's easy since we already try the last block
anyway before extending the relation.

The flip side of this is: Why should vacuum be in a hurry to dirty a
page, emit WAL, and update the FSM if it only removes one dead tuple?
This presentation [1] (pages 35-43) from Masahiko Sawada had the idea
of a "garbage map", which keeps track of which parts of the heap have
the most dead space, and focuses I/O efforts on those blocks first. It
may or may not be worth the extra complexity by itself, but it seems
it would work synergistically with your proposal: Updating related
tuples would concentrate dead tuples on fewer pages, and vacuums would
more quickly free up space that can actually be used to store multiple
new tuples.

[1] https://www.slideshare.net/masahikosawada98/vacuum-more-efficient-than-ever-99916671

-- 
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Problems with the FSM, heap fillfactor, and temporal locality

From
Peter Geoghegan
Date:
Hi John,

On Fri, Aug 21, 2020 at 3:10 AM John Naylor <john.naylor@2ndquadrant.com> wrote:
> Interesting stuff. Is lower-than-default fillfactor important for the
> behavior you see?

It's hard to say. It's definitely not important as far as the initial
bulk loading behavior is concerned (the behavior where related tuples
get inserted onto multiple disparate blocks all too often). That will
happen with any fill factor, including the default/100.

I'm concerned about heap fill factor in particular because I suspect
that that doesn't really work sensibly.

To give you some concrete idea of the benefits, I present a
pg_stat_database from the master branch after 6 hours of BenchmarkSQL
with a rate limit:

-[ RECORD 1 ]---------+------------------------------
datid                 | 13,619
datname               | postgres
numbackends           | 3
xact_commit           | 45,902,031
xact_rollback         | 200,131
blks_read             | 662,677,079
blks_hit              | 24,790,989,538
tup_returned          | 30,054,930,608
tup_fetched           | 13,722,542,025
tup_inserted          | 859,165,629
tup_updated           | 520,611,959
tup_deleted           | 20,074,897
conflicts             | 0
temp_files            | 88
temp_bytes            | 18,849,890,304
deadlocks             | 0
checksum_failures     |
checksum_last_failure |
blk_read_time         | 124,233,831.492
blk_write_time        | 8,588,876.871
stats_reset           | 2020-08-20 13:51:08.351036-07

Here is equivalent output for my simple patch that naively disables the FSM:

-[ RECORD 1 ]---------+------------------------------
datid                 | 13,619
datname               | postgres
numbackends           | 3
xact_commit           | 46,369,235
xact_rollback         | 201,919
blks_read             | 658,267,665
blks_hit              | 19,980,524,578
tup_returned          | 30,804,856,896
tup_fetched           | 11,839,865,936
tup_inserted          | 861,798,911
tup_updated           | 525,895,435
tup_deleted           | 20,277,618
conflicts             | 0
temp_files            | 88
temp_bytes            | 18,849,439,744
deadlocks             | 0
checksum_failures     |
checksum_last_failure |
blk_read_time         | 117,167,612.616
blk_write_time        | 7,873,922.175
stats_reset           | 2020-08-20 13:50:51.72056-07

Note that there is a ~20% reduction in blks_hit here, even though the
patch does ~1% more transactions (the rate limiting doesn't work
perfectly). There is also a ~5.5% reduction in aggregate
blk_read_time, and a ~9% reduction in blk_write_time. I'd say that
that's pretty significant.

I also see small but significant improvements in transaction latency,
particularly with 90th percentile latency.

> Hmm. You focus on FSM, but I'm thinking also multiple VM bits
> potentially got cleared (maybe not this benchmark but other
> scenarios).

I'm focussed on how heapam interacts with the FSM, and its effects on
locality. It's definitely true that this could go on to affect how the
visibility map gets set -- we could set fewer bits unnecessarily. And
it probably has a number of other undesirable consequences that are
hard to quantify. Clearly there are many reasons why making the
physical database layout closer to that of the logical database is a
good thing. I probably have missed a few.

> I'm not sure I follow the "close-by" criterion. It doesn't seem
> immediately relevant to what I understand as the main problem you
> found, of free space. In other words, if we currently write to 5
> blocks, but with smarter FSM logic we can find a single block, it
> seems that should be preferred over close-by blocks each with less
> space? Or are you envisioning scoring by both free space and distance
> from the original block?

I am thinking of doing both at the same time.

Think of a table with highly skewed access. There is a relatively
small number of hot spots -- parts of the primary key's key space that
are consistently affected by many skewed updates (with strong heap
correlation to primary key order). We simply cannot ever hope to avoid
migrating heavily updated rows to new pages -- too much contention in
the hot spots for that. But, by 1) Not considering pages
free-space-reclaimable until the free space reaches ~50%, and 2)
preferring close-by free pages, we avoid (or at least delay)
destroying locality of access. There is a much better chance that rows
from logically related hot spots will all migrate to the same few
blocks again and again, with whole blocks becoming free in a
relatively predictable, phased fashion. (I'm speculating here, but my
guess is that this combination will help real world workloads by quite
a bit.)

Preferring close-by blocks in the FSM is something that there was
discussion of when the current FSM implementation went in back in 8.4.
I am almost certain that just doing that will not help. If it helps at
all then it will help by preserving locality as tuple turnover takes
place over time, and I think that you need to be clever about "reuse
granularity" in order for that to happen. We're optimizing the entire
"lifecycle" of logically related tuples whose relatedness is embodied
by their initial physical position following an initial insert (before
many subsequent updates take place that risk destroying locality).

> Makes sense. If we go this route, I wonder if we should keep the
> current behavior and use any free space if the fillfactor is 100%,
> since that's in line with the intention. Also, the 50% (or whatever)
> figure could be scaled depending on fillfactor. As in, check if
> freespace > (100% - fillfactor * 0.6), or something.

Right. Or it could be another reloption.

> I'm not sure how to distinguish blocks that have never reached
> fillfactor vs. ones that did but haven't gained back enough free space
> to be considered again. Naively, we could start by assuming the last
> block can always be filled up to fillfactor, but earlier blocks must
> use the stricter rule. That's easy since we already try the last block
> anyway before extending the relation.

I was thinking of explicitly marking blocks as "freeable", meaning
that the FSM will advertise their free space. This isn't self-evident
from the amount of free space on the page alone, since we need to
distinguish between at least two cases: the case where a page has yet
to apply fill factor for the first time (which may still be close to
fillfactor% full) versus the case where the page did reach fillfactor,
but then had a small amount of space freed. I think that the FSM ought
to give out space in the former case, but not in the latter case. Even
though an identical amount of free space might be present in either
case.

> The flip side of this is: Why should vacuum be in a hurry to dirty a
> page, emit WAL, and update the FSM if it only removes one dead tuple?
> This presentation [1] (pages 35-43) from Masahiko Sawada had the idea
> of a "garbage map", which keeps track of which parts of the heap have
> the most dead space, and focuses I/O efforts on those blocks first. It
> may or may not be worth the extra complexity by itself, but it seems
> it would work synergistically with your proposal: Updating related
> tuples would concentrate dead tuples on fewer pages, and vacuums would
> more quickly free up space that can actually be used to store multiple
> new tuples.

I agree that that seems kind of related. I'm trying to concentrate
garbage from skewed updates in fewer blocks. (Same with the skewed
successor tuples.)

--
Peter Geoghegan



Re: Problems with the FSM, heap fillfactor, and temporal locality

From
John Naylor
Date:
On Fri, Aug 21, 2020 at 8:53 PM Peter Geoghegan <pg@bowt.ie> wrote:

> Note that there is a ~20% reduction in blks_hit here, even though the
> patch does ~1% more transactions (the rate limiting doesn't work
> perfectly). There is also a ~5.5% reduction in aggregate
> blk_read_time, and a ~9% reduction in blk_write_time. I'd say that
> that's pretty significant.

Indeed.

> Preferring close-by blocks in the FSM is something that there was
> discussion of when the current FSM implementation went in back in 8.4.

Right, I found a pretty long one here:

https://www.postgresql.org/message-id/flat/1253201179.9666.174.camel%40ebony.2ndQuadrant

> > I'm not sure how to distinguish blocks that have never reached
> > fillfactor vs. ones that did but haven't gained back enough free space
> > to be considered again. Naively, we could start by assuming the last
> > block can always be filled up to fillfactor, but earlier blocks must
> > use the stricter rule. That's easy since we already try the last block
> > anyway before extending the relation.
>
> I was thinking of explicitly marking blocks as "freeable", meaning
> that the FSM will advertise their free space. This isn't self-evident
> from the amount of free space on the page alone, since we need to
> distinguish between at least two cases: the case where a page has yet
> to apply fill factor for the first time (which may still be close to
> fillfactor% full) versus the case where the page did reach fillfactor,
> but then had a small amount of space freed. I think that the FSM ought
> to give out space in the former case, but not in the latter case. Even
> though an identical amount of free space might be present in either
> case.

I imagine you're suggesting to make this change in the FSM data? I'm
thinking we could change the category byte to a signed integer, and
reduce FSM_CATEGORIES to 128. (That gives us 64-byte granularity which
doesn't sound bad, especially if we're considering ignoring free space
for inserts until we get a couple kilobytes back.) The absolute value
represents the actual space. A negative number would always compare as
less, so use the negative range to mark the page as not usable. A
fresh page will have positive-numbered categories until fill_factor is
reached, at which point we flip the sign to negative. When the used
space gets below "min_fill_factor", flip the sign to mark it usable.
Updates would have to be taught to read the absolute value. Managing
the math might be a little tricky, but maybe that could be contained.
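
To spell out the encoding I mean, a sketch (invented helper names, and
ignoring the FSM page layout and where the open/close transitions get
driven from):

#include <stdint.h>
#include <stdbool.h>

#define FSM_CATEGORIES 128      /* 8192 / 128 = 64-byte granularity */

typedef int8_t fsm_cat;         /* today this is effectively a plain uint8 */

/* True amount-of-space category, whether or not the page is "open". */
static inline uint8_t
fsm_cat_space(fsm_cat c)
{
    return (uint8_t) (c < 0 ? -c : c);
}

/* Is the page currently advertising its space for new, unrelated tuples? */
static inline bool
fsm_cat_usable(fsm_cat c)
{
    return c >= 0;
}

/* Page just filled to fillfactor: hide its space by flipping the sign. */
static inline fsm_cat
fsm_cat_close(fsm_cat c)
{
    return (fsm_cat) (c > 0 ? -c : c);
}

/* Used space fell below min_fill_factor: advertise the space again. */
static inline fsm_cat
fsm_cat_reopen(fsm_cat c)
{
    return (fsm_cat) (c < 0 ? -c : c);
}

Category 0 never needs closing (there's nothing to give out), which
conveniently sidesteps the fact that 0 has no negative counterpart.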

One possible disadvantage is that negative values would bubble up to
higher levels in the opposite way that positive ones do, but maybe we
don't care since the pages are marked unusable anyway. All we care about is
that all negative numbers give the correct binary comparison when we
search for available space. We could also reserve the high bit as a
flag (1 = usable), and use the lower bits for the value, but I'm not
sure that's better.

We could also preserve the 255 categories as is, but add a second byte
for the flag. If we could imagine other uses for a new byte, this
might be good, but would make the FSM much bigger, which doesn't sound
attractive at all.

Any change of the FSM file would require pg_upgrade to rewrite the
FSM, but it still doesn't seem like a lot of code.

Other ideas?

-- 
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Problems with the FSM, heap fillfactor, and temporal locality

From
Peter Geoghegan
Date:
On Mon, Aug 24, 2020 at 6:38 AM John Naylor <john.naylor@2ndquadrant.com> wrote:
> On Fri, Aug 21, 2020 at 8:53 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > Note that there is a ~20% reduction in blks_hit here, even though the
> > patch does ~1% more transactions (the rate limiting doesn't work
> > perfectly). There is also a ~5.5% reduction in aggregate
> > blk_read_time, and a ~9% reduction in blk_write_time. I'd say that
> > that's pretty significant.
>
> Indeed.

Most of this seems to come from the new_orders table, which has heap
pages that are continually inserted into and then deleted a little
later on. new_orders is a bit like a queue that never gets too big. It
is probably the component of TPC-C where we have the most room for
improvement, fragmentation-wise. OTOH, despite all the churn the high
watermark size of the new_orders table isn't all that high -- maybe
~50MB with a 1TB database. So it's not like we'll save very many
shared_buffers misses there.

> > Preferring close-by blocks in the FSM is something that there was
> > discussion of when the current FSM implementation went in back in 8.4.
>
> Right, I found a pretty long one here:
>
> https://www.postgresql.org/message-id/flat/1253201179.9666.174.camel%40ebony.2ndQuadrant

Thanks for finding that.

> I imagine you're suggesting to make this change in the FSM data? I'm
> thinking we could change the category byte to a signed integer, and
> reduce FSM_CATEGORIES to 128.

Yeah, something like that. I don't think that we need very many
distinct FSM_CATEGORIES. Even 128 seems like way more than could ever
be needed.

> Any change of the FSM file would require pg_upgrade to rewrite the
> FSM, but it still doesn't seem like a lot of code.

I think that the sloppy approach to locking for the
fsmpage->fp_next_slot field in functions like fsm_search_avail() (i.e.
not using real atomic ops, even though we could) is one source of
problems here. That might end up necessitating fixing the on-disk
format, just to get the FSM to behave sensibly -- assuming that the
value won't be too stale in practice is extremely dubious.

This fp_next_slot business interacts poorly with the "extend a
relation by multiple blocks" logic added by commit 719c84c1be5 --
concurrently inserting backends are liable to get the same heap block
from the FSM, causing "collisions". That almost seems like a bug IMV.
We really shouldn't be giving out the same block twice, but that's what
my own custom instrumentation shows happens here. With atomic ops, it
isn't a big deal to restart using a compare-and-swap at the end (when
we set/reset fp_next_slot for other backends).
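
Something like this is what I mean by "not a big deal" -- a sketch that
assumes fp_next_slot were turned into a pg_atomic_uint32 (today it's a
plain int), and that ignores the actual slot-search logic:

#include "port/atomics.h"

/*
 * Sketch only: claim the next slot such that no two backends can walk
 * away with the same one.
 */
static int
fsm_claim_next_slot(pg_atomic_uint32 *fp_next_slot, int nslots)
{
    uint32      oldval = pg_atomic_read_u32(fp_next_slot);

    for (;;)
    {
        uint32      newval = (oldval + 1) % nslots;

        /* Only one backend can win this CAS for a given oldval */
        if (pg_atomic_compare_exchange_u32(fp_next_slot, &oldval, newval))
            return (int) oldval;

        /* on failure, oldval now holds the current value; just retry */
    }
}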

> Other ideas?

I've been experimenting with changing the way that we enforce heap
fill factor with calls to heap_insert() (and even heap_update()) that
happen to occur at a "natural temporal boundary". This works by
remembering an XID alongside the target block in the relcache when the
target block is set. When we have an existing target block whose XID
does not match our backend's current XID (i.e. it's an old XID for the
backend), then that means we're at one of these boundaries. We require
that the page has a little more free space before we'll insert on it
when at a boundary. If we barely have enough space to insert the
incoming heap tuple, and it's the first of a few tuples the
transaction will ultimately insert, then we should start early on a
new page instead of using the last little bit of space (note that the
"last little bit" of space does not include the space left behind by
fill factor). The overall effect is that groups of related tuples are
much less likely to span a heap page boundary unless and until we have
lots of updates -- though maybe not even then. I think that it's very
common for transactions to insert a group of 2 - 15 logically related
tuples into a table at a time.
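
In sketch form, the extra check looks something like this (rd_targxid
and the headroom constant are invented names -- the real prototype is
messier than this):

/*
 * Sketch only: is the existing relcache target block still acceptable
 * for this insert?  "relation->rd_targxid" is imagined to be stashed
 * alongside the target block whenever the target block is set.
 */
static bool
target_block_still_ok(Relation relation, Page page, Size len)
{
    const Size  boundary_headroom = 128;    /* arbitrary, for the sketch */
    Size        needed = len;   /* stand-in for the usual space requirement */

    if (!TransactionIdEquals(relation->rd_targxid,
                             GetCurrentTransactionId()))
    {
        /*
         * The target block was set by an earlier transaction in this
         * backend, so we're at a natural temporal boundary: demand a bit
         * of extra headroom before squeezing one more tuple onto the page.
         */
        needed += boundary_headroom;
    }

    return PageGetHeapFreeSpace(page) >= needed;
}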

Roughly speaking, you can think of this as the heapam equivalent of
the nbtree page split choice logic added by commit fab25024. We ought
to go to at least a little bit of effort to minimize the number of
distinct XIDs that are present on each heap page (in the tuple
headers). We can probably figure out heuristics that result in
respecting heap fill factor on average, while giving inserts (and even
non-HOT updates) a little wiggle room when it comes to heap page
boundaries.

By applying both of these techniques together (locality/page split
thing and real atomic ops for fp_next_slot) the prototype patch I'm
working on mostly restores the system's current ability to reuse space
(as measured by the final size of relations when everything is done),
while maintaining most of the performance benefits of not using the
FSM at all. The code is still pretty rough, though.

I haven't decided how far to pursue this. It's not as if there are
that many ways left to make TPC-C go 5%+ faster; it's very
write-heavy, and stresses many different parts of the system all at
once. I'm sure that anything like my current prototype patch will be
controversial, though. Maybe it will be acceptable if we only change
the behavior for people that explicitly set heap fillfactor.

-- 
Peter Geoghegan



Re: Problems with the FSM, heap fillfactor, and temporal locality

From
Stephen Frost
Date:
Greetings,

* Peter Geoghegan (pg@bowt.ie) wrote:
> On Mon, Aug 24, 2020 at 6:38 AM John Naylor <john.naylor@2ndquadrant.com> wrote:
> > Other ideas?
>
> I've been experimenting with changing the way that we enforce heap
> fill factor with calls to heap_insert() (and even heap_update()) that
> happen to occur at a "natural temporal boundary". This works by
> remembering an XID alongside the target block in the relcache when the
> target block is set. When we have an existing target block whose XID
> does not match our backend's current XID (i.e. it's an old XID for the
> backend), then that means we're at one of these boundaries. We require
> that the page has a little more free space before we'll insert on it
> when at a boundary. If we barely have enough space to insert the
> incoming heap tuple, and it's the first of a few tuples the
> transaction will ultimately insert, then we should start early on a
> new page instead of using the last little bit of space (note that the
> "last little bit" of space does not include the space left behind by
> fill factor). The overall effect is that groups of related tuples are
> much less likely to span a heap page boundary unless and until we have
> lots of updates -- though maybe not even then. I think that it's very
> common for transactions to insert a group of 2 - 15 logically related
> tuples into a table at a time.

This all definitely sounds quite interesting and the idea to look at the
XID to see if we're in the same transaction and therefore likely
inserting a related tuple certainly makes some sense.  While I get that
it might not specifically work with TPC-C, I'm wondering if we
could figure out how to make a multi-tuple INSERT use
heap/table_multi_insert (which seems to only be used by COPY currently,
and internally thanks to the recent work to use it for some catalog
tables) and then consider the size of the entire set of tuples being
INSERT'd when working to find a page, or deciding if we should extend
the relation.

> Roughly speaking, you can think of this as the heapam equivalent of
> the nbtree page split choice logic added by commit fab25024. We ought
> to go to at least a little bit of effort to minimize the number of
> distinct XIDs that are present on each heap page (in the tuple
> headers). We can probably figure out heuristics that result in
> respecting heap fill factor on average, while giving inserts (and even
> non-HOT updates) a little wiggle room when it comes to heap page
> boundaries.

Agreed.

> By applying both of these techniques together (locality/page split
> thing and real atomic ops for fp_next_slot) the prototype patch I'm
> working on mostly restores the system's current ability to reuse space
> (as measured by the final size of relations when everything is done),
> while maintaining most of the performance benefits of not using the
> FSM at all. The code is still pretty rough, though.
>
> I haven't decided how far to pursue this. It's not as if there are
> that many ways left to make TPC-C go 5%+ faster; it's very
> write-heavy, and stresses many different parts of the system all at
> once. I'm sure that anything like my current prototype patch will be
> controversial, though. Maybe it will be acceptable if we only change
> the behavior for people that explicitly set heap fillfactor.

Getting a 5% improvement is pretty exciting, very cool and seems worth
spending effort on.

Thanks,

Stephen


Re: Problems with the FSM, heap fillfactor, and temporal locality

From
Peter Geoghegan
Date:
On Tue, Aug 25, 2020 at 6:21 AM Stephen Frost <sfrost@snowman.net> wrote:
> This all definitely sounds quite interesting and the idea to look at the
> XID to see if we're in the same transaction and therefore likely
> inserting a related tuple certainly makes some sense.  While I get that
> it might not specifically work with TPC-C, I'm wondering if we
> could figure out how to make a multi-tuple INSERT use
> heap/table_multi_insert (which seems to only be used by COPY currently,
> and internally thanks to the recent work to use it for some catalog
> tables) and then consider the size of the entire set of tuples being
> INSERT'd when working to find a page, or deciding if we should extend
> the relation.

There are probably quite a variety of ways in which we can capture
locality, and I'm sure that I'm only scratching the surface right now.
I agree that table_multi_insert could definitely be one of them.

John said something about concentrating garbage in certain pages
up-thread. I also wonder if there is some visibility + freeze map
angle on this.

What I see with the benchmark is that the order_line table (the
largest table by quite some margin, and one that grows indefinitely)
does not make much use of the visibility map during VACUUM -- even
though it's the kind of table + workload that you'd hope and expect
would make good use of it if you were looking at it in a real world
situation. Each tuple is only inserted once and later updated once, so
it's something we really ought to do better on. The logs show that
VACUUM/autovacuum dirties lots of pages, probably due to fragmentation
from free space management (though there are undoubtedly other
factors).

The best "locality orientated" reference guide to TPC-C that I've been
able to find is "A Modeling Study of the TPC-C Benchmark", which was
published in 1993 by NASA (shortly after the introduction of TPC-C).
You can get it from:

https://apps.dtic.mil/dtic/tr/fulltext/u2/a264793.pdf  (Unfortunately
this reproduction is a low quality photocopy -- ACM members can get a
clear copy.)

If you think about the TPC-C workload at a high level, and Postgres
internals stuff at a low level, and then run the benchmark, you'll
find various ways in which we don't live up to our potential. The big
picture is that the "physical database" is not close enough to the
"logical database", especially over time and after a lot of churn.
This causes problems all over the place that look like nothing in
particular in profiles.

It's not that TPC-C is unsympathetic to Postgres in any of the usual
ways -- there are very few UPDATEs that affect indexed columns, which
are not really a factor at all. There are also no transactions that
run longer than 2 seconds (any more than ~50ms per execution is
exceptional, in fact). We already do a significant amount of the
necessary garbage collection opportunistically (by pruning) --
probably the vast majority, in fact. In particular, HOT pruning works
well, since fill factor has been tuned. It just doesn't work as well
as you'd hope, in that it cannot stem the tide of fragmentation. And
not just because of heapam's use of the FSM.

If we implemented a simple differential heap tuple compression scheme
within HOT chains (though not across HOT chains/unrelated tuples),
then we'd probably do much better -- we could keep the same logical
tuple on the same heap page much more often, maybe always. For
example, "stock" table is a major source of FPIs, and I bet that this
is greatly amplified by our failure to keep versions of the same
frequently updated tuple together. We can only fit ~25 stock tuples on
each heap page (with heap fill factor at 95, the BenchmarkSQL
default), so individual tuples are ~320 bytes (including tuple
header). If we found a way to store the changed columns for successor
tuples within a HOT chain, then we would do much better -- without
changing the HOT invariant (or anything else that's truly scary). If
our scheme worked by abusing the representation that we use for NULL
values in the successor/HOT tuples (which is not terribly space
efficient), then we could still store about 6 more versions of each
stock tuple on the same page -- the actual changed columns are
typically much much smaller than the unchanged columns. Our 23/24 byte
tuple header is usually small potatoes compared to storing unchanged
values several times.

As I said, the HOT optimization (and opportunistic pruning) already
work well with this benchmark. But I think that they'd work a lot
better if we could just temporarily absorb a few extra versions on the
heap page, so we have enough breathing room to prune before the page
"truly fills to capacity". It could help in roughly the same way that
deduplication now helps in nbtree indexes with "version churn".

I'm also reminded of the nbtree optimization I prototyped recently,
which more or less prevented all unique index bloat provided there is
no long running transaction:

https://postgr.es/m/CAH2-Wzm+maE3apHB8NOtmM=p-DO65j2V5GzAWCOEEuy3JZgb2g@mail.gmail.com

(Yes, "preventing all unique index bloat provided there is no long
running transaction" is no exaggeration -- it really prevents all
bloat related nbtree page splits, even with hundreds of clients, skew,
etc.)

It seems pretty obvious to me that buying another (say) 2 seconds to
let opportunistic pruning run "before the real damage is done" can be
extremely valuable -- we only need to be able to delay a page split
(which is similar to the case where we cannot continue to store heap
tuples on the same heap page indefinitely) for a couple of seconds at
a time. We only need to "stay one step ahead" of the need to split the
page (or to migrate a logical heap tuple to a new heap page when it
comes to the heap) at any given time -- that alone will totally arrest
the problem.

This is a very important point -- the same set of principles that
helped in nbtree can also be effectively applied to heap pages that
are subject to version churn. (Assuming no long running xacts.)

> Getting a 5% improvement is pretty exciting, very cool and seems worth
> spending effort on.

I'm already at 5% - 7% now. I bet the differential compression of
tuples on a HOT chain could buy a lot more than that. The biggest
emphasis should be placed on stable performance over time, and total
I/O over time -- that's where we're weakest right now.

--
Peter Geoghegan



Re: Problems with the FSM, heap fillfactor, and temporal locality

From
John Naylor
Date:
On Tue, Aug 25, 2020 at 5:17 AM Peter Geoghegan <pg@bowt.ie> wrote:
>
> I think that the sloppy approach to locking for the
> fsmpage->fp_next_slot field in functions like fsm_search_avail() (i.e.
> not using real atomic ops, even though we could) is one source of
> problems here. That might end up necessitating fixing the on-disk
> format, just to get the FSM to behave sensibly -- assuming that the
> value won't be too stale in practice is extremely dubious.
>
> This fp_next_slot business interacts poorly with the "extend a
> relation by multiple blocks" logic added by commit 719c84c1be5 --
> concurrently inserting backends are liable to get the same heap block
> from the FSM, causing "collisions". That almost seems like a bug IMV.
> We really shouldn't be giving out the same block twice, but that's what
> my own custom instrumentation shows happens here. With atomic ops, it
> isn't a big deal to restart using a compare-and-swap at the end (when
> we set/reset fp_next_slot for other backends).

The fact that that logic extends by 20 * numwaiters to get optimal
performance is a red flag that resources aren't being allocated
efficiently. I have an idea to ignore fp_next_slot entirely if we have
extended by multiple blocks: The backend that does the extension
stores in the FSM root page 1) the number of blocks added and 2) the
end-most block number. Any request for space will look for a valid
value here first before doing the usual search. If there is, then the
block to try is based on a hash of the xid. Something like:

candidate-block = prev-end-of-relation + 1 + (xid % (num-new-blocks))

To guard against collisions, peek in the FSM at that slot, and if
it's not completely empty, search the FSM using a "look-nearby" API
and increment a counter every time we collide. When the counter gets
to some-value, clear the special area in the root page so that future
backends use the usual search.
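
In sketch form (invented names, and hand-waving about exactly where the
hint lives and how it's locked):

/* Sketch only: "extension hint" left behind by a multi-block extension */
typedef struct FsmExtensionHint
{
    BlockNumber prev_end;       /* last block before the bulk extension */
    uint32      num_new_blocks; /* 0 means the hint has been retired */
    uint32      collisions;     /* bumped when a candidate isn't empty */
} FsmExtensionHint;

static BlockNumber
fsm_candidate_from_hint(FsmExtensionHint *hint, TransactionId xid)
{
    if (hint->num_new_blocks == 0)
        return InvalidBlockNumber;  /* fall back to the usual FSM search */

    /* spread concurrent inserters across the newly added block range */
    return hint->prev_end + 1 + (xid % hint->num_new_blocks);
}

The collision counter and the "look-nearby" fallback aren't shown, but
they'd hang off the same struct.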

I think this would work well with your idea to be more picky if the
xid stored with the relcache target block doesn't match the current
one.

Also num-new-blocks above could be scaled down from the actual number
of blocks added, just to make sure writes aren't happening all over
the place.

There might be holes in this idea, but it may be worth trying to be
better in this area without adding stricter locking.

-- 
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Problems with the FSM, heap fillfactor, and temporal locality

From
Peter Geoghegan
Date:
On Wed, Aug 26, 2020 at 1:46 AM John Naylor <john.naylor@2ndquadrant.com> wrote:
> The fact that that logic extends by 20 * numwaiters to get optimal
> performance is a red flag that resources aren't being allocated
> efficiently.

I agree that that's pretty suspicious.

> I have an idea to ignore fp_next_slot entirely if we have
> extended by multiple blocks: The backend that does the extension
> stores in the FSM root page 1) the number of blocks added and 2) the
> end-most block number. Any request for space will look for a valid
> value here first before doing the usual search. If there is then the
> block to try is based on a hash of the xid. Something like:
>
> candidate-block = prev-end-of-relation + 1 + (xid % (num-new-blocks))

I was thinking of doing something in shared memory, and not using the
FSM here at all. If we're really giving 20 pages out to each backend,
we will probably benefit from explicitly assigning contiguous ranges
of pages to each backend, and making some effort to respect that they
own the blocks in some general sense. Hopefully without also losing
access to the free space in corner cases (e.g. one of the backends
has an error shortly after receiving its contiguous range of blocks).

> To guard against collisions, peek in the FSM at that slot, and if
> it's not completely empty, search the FSM using a "look-nearby" API
> and increment a counter every time we collide. When the counter gets
> to some-value, clear the special area in the root page so that future
> backends use the usual search.

The backends already use a look nearby API, sort of --
RecordAndGetPageWithFreeSpace() already behaves that way. I'm not sure
exactly how well it works in practice, but it definitely works to some
degree.

> Also num-new-blocks above could be scaled down from the actual number
> of blocks added, just to make sure writes aren't happening all over
> the place.

Or scaled up, perhaps.

-- 
Peter Geoghegan



Re: Problems with the FSM, heap fillfactor, and temporal locality

From
John Naylor
Date:
On Wed, Sep 2, 2020 at 1:57 AM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Wed, Aug 26, 2020 at 1:46 AM John Naylor <john.naylor@2ndquadrant.com> wrote:
> > The fact that that logic extends by 20 * numwaiters to get optimal
> > performance is a red flag that resources aren't being allocated
> > efficiently.
>
> I agree that that's pretty suspicious.

Per Simon off-list some time ago (if I understood him correctly),
counting the lock waiters makes the lock contention worse. I haven't
tried to measure this, but I just had an idea instead to keep track of
how many times the table has previously been extended by multiple
blocks, and extend by a number calculated from that. Something like
(pow2(2 + num-times-ext-mult-blocks)), with some ceiling perhaps much
smaller than 512. Maybe a bit off topic, but the general problem you
brought up has many moving parts, as you've mentioned.
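
Sketched out (invented names, and the ceiling is picked arbitrarily):

/* Sketch only: grow the extension size with repeated bulk extensions */
static int
blocks_to_extend_by(int times_extended_by_multiple_blocks)
{
    /* 4, 8, 16, ... capped well below the current 512-block maximum */
    int     shift = Min(2 + times_extended_by_multiple_blocks, 6);

    return 1 << shift;          /* i.e. a ceiling of 64 blocks here */
}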

> > I have an idea to ignore fp_next_slot entirely if we have
> > extended by multiple blocks: The backend that does the extension
> > stores in the FSM root page 1) the number of blocks added and 2) the
> > end-most block number. Any request for space will look for a valid
> > value here first before doing the usual search. If there is then the
> > block to try is based on a hash of the xid. Something like:
> >
> > candidate-block = prev-end-of-relation + 1 + (xid % (num-new-blocks))
>
> I was thinking of doing something in shared memory, and not using the
> FSM here at all. If we're really giving 20 pages out to each backend,
> we will probably benefit from explicitly assigning contiguous ranges
> of pages to each backend, and making some effort to respect that they
> own the blocks in some general sense.

That would give us flexibility and precise control. I suspect it would
also have more cognitive and maintenance cost, by having more than one
source of info.

> Hopefully without also losing
> access to the free space in corner cases (e.g. one of the backends
> has an error shortly after receiving its contiguous range of blocks).

Right, you'd need some way of resetting or retiring the shared memory
info when it is no longer useful. That was my thinking with the
collision counter -- go back to using the FSM when we no longer have a
reasonable chance of getting a fresh block.

> > Also num-new-blocks above could be scaled down from the actual number
> > of blocks added, just to make sure writes aren't happening all over
> > the place.
>
> Or scaled up, perhaps.

I don't think I explained this part well, so here's a concrete
example. Let's say 128 blocks were added at once. Then xid % 128 would
give a number to be added to the previous last block in the relation,
so new target block allocations could be anywhere in this 128. By
"scale down", I mean compute (say) xid % 32. That would limit deviance
of spatial locality for those backends that were waiting on extension.
Doing the opposite, like xid % 256, would give you blocks past the end
of the relation. Further thinking out loud, after detecting enough
collisions in the first 32, we could iterate through the other ranges
and finally revert to conventional FSM search.

-- 
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services