From: Peter Geoghegan
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Date:
Msg-id: CAH2-WznoVAMRodYT+-ixZ=sEPQkXTY-UJ+fqoEb58aEo8BHqfg@mail.gmail.com
In response to: Re: [PoC] Improve dead tuple storage for lazy vacuum (Masahiko Sawada <sawada.mshk@gmail.com>)
List: pgsql-hackers
On Thu, Jul 8, 2021 at 1:47 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> As I wrote in the first email, I think there are two important factors
> in index vacuuming performance: the performance to check if heap TID
> that an index tuple points to is dead, and the number of times to
> perform index bulk-deletion. The flame graph I attached in the first
> mail shows CPU spent much time on lazy_tid_reaped() but vacuum is a
> disk-intensive operation in practice.

Maybe. But I recently bought an NVMe SSD that can read at over
6GB/second. So "disk-intensive" is not what it used to be -- at least
not for reads.

In general it's not good if we do multiple scans of an index -- no
question. But there is a danger in paying a little too much attention
to what is true in general -- we should not ignore what might be true
in specific cases either. Maybe we can solve some problems by spilling
the TID data structure to disk -- if we trade sequential I/O for
random I/O, we may be able to do only one pass over the index
(especially when we have *almost* enough memory to fit all TIDs, but
not quite enough).

The big problem with multiple passes over the index is not the extra
read bandwidth -- it's the extra page dirtying (writes), especially
with things like indexes on UUID columns. We want to dirty each leaf
page in each index at most once per VACUUM, and we should be willing
to pay some cost in reads in order to get a larger benefit with page
dirtying. After all, writes are much more expensive on modern flash
devices -- if we have to do more random read I/O to spill the TIDs,
then that might actually be 100% worth it.

And we don't need much memory for something that works well as a
negative cache, either -- so maybe the extra random read I/O needed to
spill the TIDs will be very limited anyway. There are many
possibilities. You can probably think of other trade-offs yourself. We
could maybe use a cost model for all this -- it is a little like a
hash join IMV. This is just something to think about while refining
the design.

> Interesting idea. I remember you mentioned this idea somewhere and
> I've considered this idea too while implementing the PoC code. It's
> definitely worth trying. Maybe we can write a patch for this as a
> separate patch? It will change index AM and could improve also the
> current bulk-deletion. We can consider a better data structure on top
> of this idea.

I'm happy to write it as a separate patch, either by leaving it to you
or by collaborating directly. It's not necessary to tie it to the
first patch. But at the same time it is highly related to what you're
already doing.

As I said, I am totally prepared to be wrong here. But it seems worth
trying. In Postgres 14, the _bt_delitems_vacuum() function (which
actually carries out VACUUM's physical page modifications to a leaf
page) is almost identical to _bt_delitems_delete(). And
_bt_delitems_delete() was already built with these kinds of problems
in mind -- it batches work to get the most out of synchronizing with
distant state describing which tuples to delete. It's not exactly the
same situation, but it's *kinda* similar. More importantly, it's a
relatively cheap and easy experiment to run, since we already have
most of what we need (we can take it from _bt_delitems_delete_check()).
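To make the batching idea concrete, here is a minimal standalone
sketch -- not the actual _bt_delitems_delete() code, and every name in
it is hypothetical. Rather than probing the dead-TID array once per
index tuple through a callback, we sort the index tuples' heap TIDs
and merge the two sorted streams in a single pass:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for ItemPointerData */
typedef struct TidSketch
{
    uint32_t    block;      /* heap block number */
    uint16_t    offset;     /* line pointer offset within the block */
} TidSketch;

static int
tid_cmp(const TidSketch *a, const TidSketch *b)
{
    if (a->block != b->block)
        return (a->block < b->block) ? -1 : 1;
    if (a->offset != b->offset)
        return (a->offset < b->offset) ? -1 : 1;
    return 0;
}

/*
 * Set deletable[i] for every index TID that also appears in the
 * dead-TID array.  Both inputs must be sorted, and the caller must
 * have zeroed deletable[].  The merge visits each element once, so the
 * cost is O(nindex + ndead), rather than O(nindex * log(ndead)) for
 * repeated independent binary searches.
 */
static void
mark_deletable(const TidSketch *index_tids, size_t nindex,
               const TidSketch *dead_tids, size_t ndead,
               bool *deletable)
{
    size_t      i = 0;
    size_t      j = 0;

    while (i < nindex && j < ndead)
    {
        int         cmp = tid_cmp(&index_tids[i], &dead_tids[j]);

        if (cmp == 0)
            deletable[i++] = true;
        else if (cmp < 0)
            i++;
        else
            j++;
    }
}

Doing it this way also tends to be much friendlier to caches and
branch predictors than millions of independent lookups -- though
that's exactly the kind of claim that has to be verified by profiling.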
Usually this kind of micro optimization is not very valuable -- 99.9%+
of all code just isn't that sensitive to having the right
optimizations. But this is one of the rare important cases where we
really should look at the raw machine code, and do some kind of
analysis at the microarchitectural level through careful profiling,
using tools like perf.

The laws of physics (or electronic engineering) make it inevitable
that searching for TIDs to match is going to be kind of slow. But we
should at least make sure that we use every trick available to us to
reduce the bottleneck, since it really does matter a lot to users.
Users should be able to expect that this code will at least be as fast
as the hardware they paid for allows (or close to it). There is a
great deal of microarchitectural sophistication in modern CPUs, much
of which is designed to make problems like this one less bad [1].
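As one illustration of the sort of trick I mean (a hypothetical
sketch, not something from the patch): if each TID is packed into a
single integer, the hot comparison in the search loop becomes one
machine-word compare, which is exactly the sort of detail that only
shows up when you look at the raw machine code:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Pack block number and offset into one uint64 (a hypothetical
 * encoding).  Putting the block in the high bits keeps the encoding
 * order-preserving, so a plain integer compare sorts TIDs correctly.
 */
static inline uint64_t
tid_encode(uint32_t block, uint16_t offset)
{
    return ((uint64_t) block << 16) | offset;
}

/* Ordinary binary search over a sorted array of encoded TIDs. */
static bool
tid_is_dead(const uint64_t *dead, size_t ndead, uint64_t key)
{
    size_t      lo = 0;
    size_t      hi = ndead;

    while (lo < hi)
    {
        size_t      mid = lo + (hi - lo) / 2;

        if (dead[mid] < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo < ndead && dead[lo] == key;
}

Whether something like this actually wins in practice is precisely
what perf and careful profiling would have to tell us.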
[1] https://www.agner.org/optimize/microarchitecture.pdf

--
Peter Geoghegan