Re: Lock-free compaction. Why not? - Mailing list pgsql-hackers
| From | Robert Haas |
|---|---|
| Subject | Re: Lock-free compaction. Why not? |
| Date | |
| Msg-id | CA+TgmoZ=5PELaaQ1g-Xj5PFQYkgqHLDqk0hnWQvHC7sB1M14kQ@mail.gmail.com |
| In response to | Re: Lock-free compaction. Why not? (Tom Lane <tgl@sss.pgh.pa.us>) |
| Responses | Re: Lock-free compaction. Why not? |
| List | pgsql-hackers |
On Sat, Jul 20, 2024 at 10:13 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The actually tricky part about that is that you have to ensure that
> any concurrent scan will see one of the two copies --- not both,
> and not neither. This is fairly hard when the concurrent query
> might be using any of several scan methods, and might or might not
> have visited the original tuple before you commenced the move.
> You can solve it by treating the move more like an UPDATE, that
> is the new tuple gets a new XID, but that has its own downsides;
> notably that it must block/be blocked by concurrent real UPDATEs.

Yeah, this is always the part that has stumped me. I think there's
probably some way to find bit space to indicate whether an update is a
"real" update or a "move-the-tuple" update, although it might take a
bunch of engineering to get the job done, or otherwise be kind of
awkward in some way. But then what? You're still left with writing a
bunch of new index entries for the tuple, which will bloat the indexes
as you try to reorganize the heap, so I imagine you have to move
relatively small batches of tuples, then vacuum, then repeat -- which
seems like it will take forever on a big table.

If you imagine a hypothetical world in which the block number in the
index entry is a logical block number rather than a physical block
number, then you could shuffle the physical positions of the blocks
around to put the non-empty logical blocks at the beginning of the
relation and throw away the empty ones. You might still want to
migrate some rows from partially-filled blocks to empty ones, but that
seems like it would often require reindexing vastly fewer rows.
Imagine, for example, a relation with ten half-empty blocks at the
beginning followed by a million completely full blocks: only the rows
in those ten blocks would need to be reindexed.

But now you've also turned sequential scans into random I/O. Probably
this works better if the logical->physical mapping operates on
multi-megabyte chunks rather than block by block, but that seems like
an awful lot of engineering to do as foundational work, especially
given that even after you do all of that and build some kind of tuple
relocator on top of it, you still need a way to move around individual
tuples when relocating chunks isn't good enough.

What the extensions that are out there seem to do is, as I understand
it, an online table rewrite with concurrent change capture, and then
you apply the changes to the output table afterward. That has the
problem that if the changes are happening faster than you can apply
them, the operation does not terminate. But enough people seem to be
happy with this kind of solution that we should perhaps look harder at
doing something along these lines in core.

--
Robert Haas
EDB: http://www.enterprisedb.com
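[Editor's illustration] A minimal standalone sketch of the chunk-granularity logical-to-physical mapping discussed above. This is plain C, not PostgreSQL source; the chunk size, the in-memory array, and all names (chunk_map, logical_to_physical, relocate_chunk) are invented for illustration. The point it shows: if index entries carried logical block numbers, relocating a whole chunk would only update the map, not any index entries.

```c
/*
 * Standalone sketch (not PostgreSQL code) of a chunk-granularity
 * logical-to-physical block map.  Index entries would store logical
 * block numbers; moving a chunk of blocks only changes the map.
 * All names and sizes here are hypothetical.
 */
#include <stdio.h>
#include <stdint.h>

#define CHUNK_BLOCKS    128         /* ~1 MB chunks: 128 x 8 kB blocks */
#define MAX_CHUNKS      1024        /* enough for a ~1 GB toy relation */

typedef uint32_t BlockNumber;

/* map[logical chunk] = physical chunk; identity until chunks are moved */
static uint32_t chunk_map[MAX_CHUNKS];

static void
chunk_map_init(void)
{
    for (uint32_t i = 0; i < MAX_CHUNKS; i++)
        chunk_map[i] = i;
}

/* Translate the logical block number an index entry would carry
 * into the physical block to actually read. */
static BlockNumber
logical_to_physical(BlockNumber logical_blkno)
{
    uint32_t chunk = logical_blkno / CHUNK_BLOCKS;
    uint32_t offset = logical_blkno % CHUNK_BLOCKS;

    return chunk_map[chunk] * CHUNK_BLOCKS + offset;
}

/* Record that a logical chunk now lives at a new physical location
 * (the data copy itself is not modeled here). */
static void
relocate_chunk(uint32_t logical_chunk, uint32_t new_physical_chunk)
{
    chunk_map[logical_chunk] = new_physical_chunk;
}

int
main(void)
{
    chunk_map_init();

    /* Pretend the tail chunk 900 was copied into a hole at physical chunk 3. */
    relocate_chunk(900, 3);

    BlockNumber logical = 900 * CHUNK_BLOCKS + 17;
    printf("logical block %u -> physical block %u\n",
           logical, logical_to_physical(logical));
    return 0;
}
```

Within a chunk the translation is a fixed offset, so a sequential scan stays sequential inside each chunk and only chunk boundaries introduce seeks -- which is the motivation for mapping multi-megabyte chunks rather than individual blocks.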
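[Editor's illustration] A toy model, not code from any real extension, of the termination problem with the rewrite-plus-change-capture approach: the backlog of captured changes only drains if it can be applied faster than new changes arrive. The rates and threshold below are invented numbers.

```c
/*
 * Toy model of online table rewrite with concurrent change capture:
 * the operation converges only when apply_rate > change_rate.
 * All figures are hypothetical.
 */
#include <stdio.h>

int
main(void)
{
    double backlog = 1000000.0;   /* captured changes still to apply */
    double apply_rate = 50000.0;  /* changes applied per pass */
    double change_rate = 20000.0; /* new changes captured per pass */
    int    passes = 0;

    /* Keep draining the change log until it is small enough that a brief
     * exclusive lock suffices for the final catch-up and relation swap. */
    while (backlog > 1000.0 && passes < 1000)
    {
        backlog += change_rate;   /* concurrent DML keeps adding work */
        backlog -= apply_rate;    /* one pass over the captured changes */
        if (backlog < 0.0)
            backlog = 0.0;
        passes++;
    }

    if (backlog <= 1000.0)
        printf("caught up after %d passes; take brief lock and swap\n", passes);
    else
        printf("never converged: changes arrive faster than they are applied\n");

    return 0;
}
```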