Re: Problems with the FSM, heap fillfactor, and temporal locality - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: Problems with the FSM, heap fillfactor, and temporal locality
Msg-id CAH2-Wz=Kk6RGn_XaV4JYgCyT=hcascsA835TYksF=6LRVRve0A@mail.gmail.com
In response to Re: Problems with the FSM, heap fillfactor, and temporal locality  (John Naylor <john.naylor@2ndquadrant.com>)
Responses Re: Problems with the FSM, heap fillfactor, and temporal locality
List pgsql-hackers
Hi John,

On Fri, Aug 21, 2020 at 3:10 AM John Naylor <john.naylor@2ndquadrant.com> wrote:
> Interesting stuff. Is lower-than-default fillfactor important for the
> behavior you see?

It's hard to say. It's definitely not important as far as the initial
bulk loading behavior is concerned (the behavior where related tuples
all too often get inserted onto disparate pages). That will happen
with any fillfactor, including the default of 100.

I'm concerned about heap fillfactor in particular because I suspect
that it doesn't really work sensibly.

To give you some concrete idea of the benefits, here is
pg_stat_database output from the master branch after 6 hours of
BenchmarkSQL with a rate limit:

-[ RECORD 1 ]---------+------------------------------
datid                 | 13,619
datname               | postgres
numbackends           | 3
xact_commit           | 45,902,031
xact_rollback         | 200,131
blks_read             | 662,677,079
blks_hit              | 24,790,989,538
tup_returned          | 30,054,930,608
tup_fetched           | 13,722,542,025
tup_inserted          | 859,165,629
tup_updated           | 520,611,959
tup_deleted           | 20,074,897
conflicts             | 0
temp_files            | 88
temp_bytes            | 18,849,890,304
deadlocks             | 0
checksum_failures     |
checksum_last_failure |
blk_read_time         | 124,233,831.492
blk_write_time        | 8,588,876.871
stats_reset           | 2020-08-20 13:51:08.351036-07

Here is equivalent output for my simple patch that naively disables the FSM:

-[ RECORD 1 ]---------+------------------------------
datid                 | 13,619
datname               | postgres
numbackends           | 3
xact_commit           | 46,369,235
xact_rollback         | 201,919
blks_read             | 658,267,665
blks_hit              | 19,980,524,578
tup_returned          | 30,804,856,896
tup_fetched           | 11,839,865,936
tup_inserted          | 861,798,911
tup_updated           | 525,895,435
tup_deleted           | 20,277,618
conflicts             | 0
temp_files            | 88
temp_bytes            | 18,849,439,744
deadlocks             | 0
checksum_failures     |
checksum_last_failure |
blk_read_time         | 117,167,612.616
blk_write_time        | 7,873,922.175
stats_reset           | 2020-08-20 13:50:51.72056-07

Note that there is a ~20% reduction in blks_hit here, even though the
patch does ~1% more transactions (the rate limiting doesn't work
perfectly). There is also a ~5.5% reduction in aggregate
blk_read_time, and a ~9% reduction in blk_write_time. I'd say that
that's pretty significant.

I also see small but significant improvements in transaction latency,
particularly with 90th percentile latency.
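
To make the locality aspect easier to picture, here is a tiny
standalone toy model -- emphatically not the actual patch, and both
placement policies are gross simplifications -- that inserts a batch
of logically related tuples into a table that has a little free space
left on many pages:

#include <stdio.h>
#include <string.h>

#define PAGE_SIZE 8192
#define MAX_PAGES 4096
#define TUPLE_LEN 200

typedef struct toyrel
{
    int     nfree[MAX_PAGES];   /* free bytes per "page" */
    int     npages;
} toyrel;

static int
extend(toyrel *rel)
{
    rel->nfree[rel->npages] = PAGE_SIZE;
    return rel->npages++;
}

/* "FSM enabled" (stand-in for an FSM lookup): lowest page with room */
static int
place_with_fsm(toyrel *rel)
{
    for (int blk = 0; blk < rel->npages; blk++)
        if (rel->nfree[blk] >= TUPLE_LEN)
            return blk;
    return extend(rel);
}

/* "FSM disabled": the last page, or a brand new one */
static int
place_without_fsm(toyrel *rel)
{
    if (rel->npages > 0 && rel->nfree[rel->npages - 1] >= TUPLE_LEN)
        return rel->npages - 1;
    return extend(rel);
}

/* 1000 "full" pages, each with room for exactly one more tuple */
static void
setup(toyrel *rel)
{
    memset(rel, 0, sizeof(*rel));
    for (int i = 0; i < 1000; i++)
    {
        int     blk = extend(rel);

        rel->nfree[blk] = TUPLE_LEN;
    }
}

/* insert n related tuples, count the distinct pages they land on */
static int
insert_batch(toyrel *rel, int n, int (*place) (toyrel *))
{
    int     used[MAX_PAGES] = {0};
    int     distinct = 0;

    for (int i = 0; i < n; i++)
    {
        int     blk = place(rel);

        rel->nfree[blk] -= TUPLE_LEN;
        if (!used[blk])
        {
            used[blk] = 1;
            distinct++;
        }
    }
    return distinct;
}

int
main(void)
{
    toyrel  rel;

    setup(&rel);
    printf("with FSM:    100 related tuples spread over %d pages\n",
           insert_batch(&rel, 100, place_with_fsm));
    setup(&rel);
    printf("without FSM: 100 related tuples spread over %d pages\n",
           insert_batch(&rel, 100, place_without_fsm));
    return 0;
}

The first strategy spreads the 100 related tuples over 100 pages; the
second uses 4. The real FSM search is of course nothing like a
first-fit scan, and the numbers are contrived -- the toy is only meant
to show the shape of the problem.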

> Hmm. You focus on FSM, but I'm thinking also multiple VM bits
> potentially got cleared (maybe not this benchmark but other
> scenarios).

I'm focussed on how heapam interacts with the FSM, and its effects on
locality. It's definitely true that this could go on to affect how the
visibility map gets set -- fewer bits would get cleared unnecessarily. And
it probably has a number of other undesirable consequences that are
hard to quantify. Clearly there are many reasons why making the
physical database layout closer to that of the logical database is a
good thing. I probably have missed a few.

> I'm not sure I follow the "close-by" criterion. It doesn't seem
> immediately relevant to what I understand as the main problem you
> found, of free space. In other words, if we currently write to 5
> blocks, but with smarter FSM logic we can find a single block, it
> seems that should be preferred over close-by blocks each with less
> space? Or are you envisioning scoring by both free space and distance
> from the original block?

I am thinking of doing both at the same time.

Think of a table with highly skewed access. There is a relatively
small number of hot spots -- parts of the primary key's key space that
are consistently affected by many skewed updates (with strong heap
correlation to primary key order). We simply cannot ever hope to avoid
migrating heavily updated rows to new pages -- too much contention in
the hot spots for that. But, by 1) not considering pages
free-space-reclaimable until the free space reaches ~50%, and 2)
preferring close-by free pages, we avoid (or at least delay)
destroying locality of access. There is a much better chance that rows
from logically related hot spots will all migrate to the same few
blocks again and again, with whole blocks becoming free in a
relatively predictable, phased fashion. (I'm speculating here, but my
guess is that this combination will help real world workloads by quite
a bit.)
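
To be a bit more concrete about what doing both at the same time might
look like, here is a sketch of just the selection logic (the 50%
constant is made up, and a real version would obviously not scan every
block):

#include <stdio.h>
#include <stdlib.h>

#define RECLAIM_FRACTION 0.50   /* made-up "mostly empty" threshold */

/*
 * Pick a block for a migrating row: only pages that are at least
 * RECLAIM_FRACTION free are candidates at all, and among those we take
 * the one closest to the row's current block.  Returns -1 for "no
 * candidate, extend the relation".
 */
static int
choose_target_block(const double *free_frac, int nblocks, int orig_blk)
{
    int     best = -1;
    int     best_dist = 0;

    for (int blk = 0; blk < nblocks; blk++)
    {
        int     dist;

        if (free_frac[blk] < RECLAIM_FRACTION)
            continue;           /* not considered reclaimable yet */
        dist = abs(blk - orig_blk);
        if (best == -1 || dist < best_dist)
        {
            best = blk;
            best_dist = dist;
        }
    }
    return best;
}

int
main(void)
{
    /* fraction of each page that is currently free */
    double  free_frac[] = {0.10, 0.55, 0.05, 0.80, 0.60};

    /* a hot row on block 2 migrates: block 1 wins (reclaimable, closest) */
    printf("target block = %d\n", choose_target_block(free_frac, 5, 2));
    return 0;
}

In a real implementation the free space fractions would come from the
FSM itself, and the proximity preference would presumably have to be
bounded to nearby FSM pages rather than a whole-relation scan.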

Preferring close-by blocks in the FSM is something that was discussed
when the current FSM implementation went in, back in 8.4. I am almost
certain that just doing that on its own will not help. If it helps at
all then it will help by preserving locality as tuple turnover takes
place over time, and I think that you need to be clever about "reuse
granularity" in order for that to happen. We're optimizing the entire
"lifecycle" of logically related tuples whose relatedness is embodied
by their initial physical position following an initial insert (before
many subsequent updates take place that risk destroying locality).

> Makes sense. If we go this route, I wonder if we should keep the
> current behavior and use any free space if the fillfactor is 100%,
> since that's in line with the intention. Also, the 50% (or whatever)
> figure could be scaled depending on fillfactor. As in, check if
> freespace > (100% - fillfactor * 0.6), or something.

Right. Or it could be another reloption.
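
For what it's worth, the scaled threshold could be as simple as
something like this (the constants are entirely up for debate; the
fillfactor = 100 branch preserves the current use-any-free-space
behavior you mentioned):

#include <stdio.h>

/*
 * Straw-man scaling of the "reclaimable" threshold by fillfactor.
 * Returns the minimum percentage of the page that must be free before
 * the FSM advertises it again.
 */
static int
reclaim_threshold(int fillfactor)
{
    if (fillfactor >= 100)
        return 1;               /* fillfactor 100: use any free space */
    return 100 - (int) (fillfactor * 0.6);
}

int
main(void)
{
    for (int ff = 70; ff <= 100; ff += 10)
        printf("fillfactor %3d -> page must be >= %2d%% free\n",
               ff, reclaim_threshold(ff));
    return 0;
}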

> I'm not sure how to distinguish blocks that have never reached
> fillfactor vs. ones that did but haven't gained back enough free space
> to be considered again. Naively, we could start by assuming the last
> block can always be filled up to fillfactor, but earlier blocks must
> use the stricter rule. That's easy since we already try the last block
> anyway before extending the relation.

I was thinking of explicitly marking blocks as "freeable", meaning
that the FSM will advertise their free space. This isn't self-evident
from the amount of free space on the page alone, since we need to
distinguish between at least two cases: the case where a page has yet
to apply fill factor for the first time (which may still be close to
fillfactor% full) versus the case where the page did reach fillfactor,
but then had a small amount of space freed. I think that the FSM ought
to give out space in the former case, but not in the latter case, even
though an identical amount of free space might be present in either case.
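
Roughly like this, in toy form (where the flag would actually live --
on the heap page, in the FSM, somewhere else -- is an open question):

#include <stdbool.h>
#include <stdio.h>

typedef struct toypage
{
    int     free_pct;           /* current free space, as a percentage */
    bool    reached_fillfactor; /* ever been filled up to fillfactor? */
} toypage;

/*
 * Should the FSM advertise this page's free space?  A page that hasn't
 * finished its initial fill keeps accepting tuples until fillfactor is
 * reached; a page that already got there once stays "closed" until it
 * crosses the much larger reclaim threshold.
 */
static bool
fsm_should_advertise(const toypage *page, int fillfactor, int reclaim_pct)
{
    if (!page->reached_fillfactor)
        return page->free_pct > (100 - fillfactor);
    return page->free_pct >= reclaim_pct;
}

int
main(void)
{
    toypage still_filling = {40, false};    /* never hit fillfactor */
    toypage churned = {15, true};           /* hit fillfactor, freed a bit */

    printf("still filling, 40%% free: %d\n",
           fsm_should_advertise(&still_filling, 90, 50));
    printf("churned, 15%% free:      %d\n",
           fsm_should_advertise(&churned, 90, 50));
    return 0;
}

The point is just that the decision needs one extra bit of history
about the page, not only its current amount of free space.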

> The flip side of this is: Why should vacuum be in a hurry to dirty a
> page, emit WAL, and update the FSM if it only removes one dead tuple?
> This presentation [1] (pages 35-43) from Masahiko Sawada had the idea
> of a "garbage map", which keeps track of which parts of the heap have
> the most dead space, and focus I/O efforts on those blocks first. It
> may or may not be worth the extra complexity by itself, but it seems
> it would work synergistically with your proposal: Updating related
> tuples would concentrate dead tuples on fewer pages, and vacuums would
> more quickly free up space that can actually be used to store multiple
> new tuples.

I agree that that seems kind of related. I'm trying to concentrate the
garbage from skewed updates in fewer blocks (and the same goes for the
successor tuples of those skewed updates).

--
Peter Geoghegan


