Re: [PoC] Improve dead tuple storage for lazy vacuum - Mailing list pgsql-hackers

From: Peter Geoghegan
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Msg-id: CAH2-WznoVAMRodYT+-ixZ=sEPQkXTY-UJ+fqoEb58aEo8BHqfg@mail.gmail.com
In response to: Re: [PoC] Improve dead tuple storage for lazy vacuum (Masahiko Sawada <sawada.mshk@gmail.com>)
List: pgsql-hackers
On Thu, Jul 8, 2021 at 1:47 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> As I wrote in the first email, I think there are two important factors
> in index vacuuming performance: the cost of checking whether the heap
> TID that an index tuple points to is dead, and the number of times we
> perform index bulk-deletion. The flame graph I attached in the first
> mail shows the CPU spending much of its time in lazy_tid_reaped(), but
> vacuum is a disk-intensive operation in practice.

Maybe. But I recently bought an NVMe SSD that can read at over
6GB/second. So "disk-intensive" is not what it used to be -- at least
not for reads. In general it's not good if we do multiple scans of an
index -- no question. But there is a danger in paying a little too
much attention to what is true in general -- we should not ignore what
might be true in specific cases either. Maybe we can solve some
problems by spilling the TID data structure to disk -- if we trade
sequential I/O for random I/O, we may be able to do only one pass over
the index (especially when we have *almost* enough memory to fit all
TIDs, but not quite enough).
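To make the hot path concrete: lazy_tid_reaped() is essentially a binary search over the sorted dead-TID array, once per index tuple. Here is a minimal sketch, using a flat uint64 TID key purely for illustration (the real code bsearches an array of ItemPointerData):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flat TID encoding: block number and offset packed into
 * one 64-bit key, so plain integer comparison gives TID order.  (The
 * real lazy_tid_reaped() compares ItemPointerData fields instead.) */
typedef uint64_t tid_key;

static inline tid_key
tid_encode(uint32_t block, uint16_t offset)
{
    return ((tid_key) block << 16) | offset;
}

/* Binary search over the sorted dead-TID array -- O(log n) probes per
 * index tuple, most of them cache misses when the array is large. */
static bool
tid_reaped(const tid_key *dead, size_t ndead, tid_key key)
{
    size_t  lo = 0,
            hi = ndead;

    while (lo < hi)
    {
        size_t  mid = lo + (hi - lo) / 2;

        if (dead[mid] < key)
            lo = mid + 1;
        else if (dead[mid] > key)
            hi = mid;
        else
            return true;
    }
    return false;
}
```

With hundreds of millions of index tuples, those cache-missing probes are exactly where the flame graph time goes.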

The big problem with multiple passes over the index is not the extra
read bandwidth -- it's the extra page dirtying (writes), especially
with things like indexes on UUID columns. We want to dirty each leaf
page in each index at most once per VACUUM, and we should be willing
to pay some cost elsewhere to get that larger benefit. After all,
writes are much more expensive than reads on modern flash devices --
if we have to do more random read I/O to spill the TIDs, then that
might actually be 100% worth it. And we don't need much memory for
something that works well as a negative cache, either -- so maybe the
extra random read I/O needed to spill the TIDs will be very limited
anyway.
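To sketch the negative-cache idea: a small in-memory Bloom-filter-style summary can answer "definitely not dead" without touching the spilled structure, so only probable hits pay the random read. This is purely illustrative -- the names, sizes, and hash function are all made up, not PostgreSQL code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical negative cache: 1M bits = 128kB, regardless of how many
 * TIDs were spilled to disk. */
#define FILTER_BITS (1u << 20)

typedef struct { uint8_t bits[FILTER_BITS / 8]; } tid_filter;

/* splitmix64-style mix; any decent 64-bit hash works here. */
static uint32_t
tid_hash(uint64_t key, uint32_t seed)
{
    key += seed + UINT64_C(0x9e3779b97f4a7c15);
    key = (key ^ (key >> 30)) * UINT64_C(0xbf58476d1ce4e5b9);
    key = (key ^ (key >> 27)) * UINT64_C(0x94d049bb133111eb);
    return (uint32_t) (key ^ (key >> 31)) % FILTER_BITS;
}

/* Set 3 bits per dead TID as it is spilled. */
static void
tid_filter_add(tid_filter *f, uint64_t key)
{
    for (uint32_t seed = 0; seed < 3; seed++)
    {
        uint32_t bit = tid_hash(key, seed);

        f->bits[bit / 8] |= 1u << (bit % 8);
    }
}

/* false => TID is definitely not dead, skip the spilled structure;
 * true  => possibly dead, pay the random read to check. */
static bool
tid_filter_maybe(const tid_filter *f, uint64_t key)
{
    for (uint32_t seed = 0; seed < 3; seed++)
    {
        uint32_t bit = tid_hash(key, seed);

        if (!(f->bits[bit / 8] & (1u << (bit % 8))))
            return false;
    }
    return true;
}
```

Since most index tuples point to live heap tuples in the common case, most probes get the cheap "definitely not dead" answer and never touch disk.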

There are many possibilities. You can probably think of other
trade-offs yourself. We could maybe use a cost model for all this --
it is a little like a hash join IMV. This is just something to think
about while refining the design.

> Interesting idea. I remember you mentioned this idea somewhere and
> I considered it too while implementing the PoC code. It's definitely
> worth trying. Maybe we can write this as a separate patch? It will
> change the index AM and could also improve the current bulk-deletion.
> We can consider a better data structure on top of this idea.

I'm happy to write it as a separate patch, either by leaving it to you
or by collaborating directly. It's not necessary to tie it to the
first patch. But at the same time it is highly related to what you're
already doing.

As I said, I am totally prepared to be wrong here. But it seems worth
trying. In Postgres 14, the _bt_delitems_vacuum() function (which
actually carries out VACUUM's physical page modifications to a leaf
page) is almost identical to _bt_delitems_delete(). And
_bt_delitems_delete() was already built with these kinds of problems
in mind -- it batches work to get the most out of synchronizing with
distant state describing which tuples to delete. It's not exactly the
same situation, but it's *kinda* similar. More importantly, it's a
relatively cheap and easy experiment to run, since we already have
most of what we need (we can take it from
_bt_delitems_delete_check()).
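The batching idea could look something like this: instead of one lazy_tid_reaped() probe per index tuple, gather the leaf page's heap TIDs and do a single merge pass against the sorted dead-TID array. Illustrative sketch only (flat uint64 keys again, not the real _bt_delitems_delete_check() interface):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical batched check: leaf is the sorted array of heap TID
 * keys for one leaf page, dead is the sorted dead-TID array.  One
 * linear merge pass replaces nleaf separate binary searches, and the
 * deletable offsets come out ready to apply in a single page pass. */
static size_t
collect_deletable(const uint64_t *leaf, size_t nleaf,
                  const uint64_t *dead, size_t ndead,
                  size_t *deletable)      /* out: indexes into leaf[] */
{
    size_t  i = 0,
            j = 0,
            n = 0;

    while (i < nleaf && j < ndead)
    {
        if (leaf[i] < dead[j])
            i++;                   /* leaf tuple not dead, keep it */
        else if (leaf[i] > dead[j])
            j++;                   /* dead TID not on this page */
        else
            deletable[n++] = i++;  /* match: mark for deletion */
    }
    return n;
}
```

The merge touches each dead-TID entry at most once per leaf page, with sequential access patterns that prefetchers handle well -- which is much of why the existing _bt_delitems_delete() batching pays off.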

Usually this kind of micro optimization is not very valuable -- 99.9%+
of all code just isn't that sensitive to having the right
optimizations. But this is one of the rare important cases where we
really should look at the raw machine code, and do some kind of
microarchitectural level analysis through careful profiling, using
tools like perf. The laws of physics (or electronic engineering) make
it inevitable that searching for TIDs to match is going to be kind of
slow. But we should at least make sure that we use every trick
available to us to reduce the bottleneck, since it really does matter
a lot to users. Users should be able to expect that this code will at
least be as fast as the hardware that they paid for can allow (or
close to it). There is a great deal of microarchitectural
sophistication in modern CPUs, much of which is designed to make
problems like this one less bad [1].

[1] https://www.agner.org/optimize/microarchitecture.pdf
-- 
Peter Geoghegan


