Thread: Lock-free compaction. Why not?
Anyway, I wrote last week about lock-free vacuum. Since then I have almost finished a plugin that works across all JetBrains products. It hooks into the built-in database tool and displays the internals of the database along with some statistics. I plan to use it for my next phase: autovacuum with compaction.
The thing is, after reading the code many times, I still don't understand why lock-free (or minimal-locking) compaction is such a big problem! Is it really that hard to lazily move tuples from one page to another after a page has been defragmented?
I even wrote a very simple equation to estimate how much disk space could be reclaimed, and put together a list of the advantages, including maintaining clustered indexes. This would of course be followed by benchmarks.
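Just to make that concrete: the estimate is nothing more than back-of-the-envelope arithmetic. Here is a minimal C sketch of it (the function and parameter names are made up for illustration; the real inputs would come from pg_class / pgstattuple-style statistics, and this is not PostgreSQL code):

#include <math.h>
#include <stdio.h>

/*
 * Illustrative only: estimate how many heap pages could be reclaimed by
 * compaction, given the current heap size and the total size of live
 * tuples.  All names are made up; the inputs would come from pg_class /
 * pgstattuple-style statistics.
 */
static double
reclaimable_pages(double total_pages,      /* current size of the heap */
                  double live_tuple_bytes, /* total size of live tuples */
                  double page_size,        /* usually 8192 */
                  double fillfactor)       /* target fill fraction, 0..1 */
{
    double usable_per_page = page_size * fillfactor;
    double pages_needed = ceil(live_tuple_bytes / usable_per_page);

    return total_pages > pages_needed ? total_pages - pages_needed : 0.0;
}

int
main(void)
{
    /* e.g. a 1,000,000-page heap whose live tuples add up to ~4.8 GB */
    printf("reclaimable pages: %.0f\n",
           reclaimable_pages(1000000.0, 4.8e9, 8192.0, 0.9));
    return 0;
}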
Here is the WIP plugin, by the way: https://github.com/ahmedyarub/database_internals_plugin/tree/2.0.0. It should be released this week.
I don't put much trust in ChatGPT, but I kept trying to find reasons not to do this and couldn't find any.
It seems like a small change that could bring a huge advantage. I just hope somebody can point out a reason why it wouldn't work.
Really appreciate your patience,
Ahmed
On Tue, 9 Jul 2024 at 16:58, Ahmed Yarub Hani Al Nuaimi <ahmedyarubhani@gmail.com> wrote:
> The thing is, after reading the code many times, I still don't understand why lock-free (or minimal-locking) compaction is such a big problem! Is it really that hard to lazily move tuples from one page to another after a page has been defragmented?

I think there are a few things to think about. You may have thought of some of these already.

1. Moving rows could cause deadlocking issues. Users might struggle to accept that some background process is causing their transaction to roll back.
2. Transaction size: how large to make the batches of tuples to move at once? One big transaction sounds much more prone to deadlocking.
3. Xid consumption: doing lots of small transactions to move tuples could consume lots of xids.
4. Moving tuples around to other pages needs indexes to be updated and could cause index bloat.

For #1, maybe there's something that can be done to ensure it's always vacuum that's the deadlock victim.

You might be interested in [1]. There's also an older discussion in [2] that you might find interesting.

David

[1] https://www.postgresql.org/message-id/flat/CAFj8pRDNDOrg90hLMmbo_hiWpgBm%2B73gmWMRUHRkTKwrGnvYJQ%40mail.gmail.com#cc4f8d730d2c5203f53c50260053fec5
[2] https://www.postgresql.org/message-id/flat/CANTTaev-LdgYj4uZoy67catS5SF5u_X-dTHiLH7OKwU6Gv3MFA%40mail.gmail.com
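To make points 2 and 3 above concrete, here is a purely illustrative sketch (not PostgreSQL code; move_one_batch() is a made-up stand-in for "move up to batch_size tuples in one transaction"): each batch is its own transaction, so the xid cost grows with the number of batches, while bigger batches hold locks longer and are more exposed to deadlocks.

#include <stdio.h>

/*
 * Hypothetical sketch of the batch-size / xid trade-off.  Not PostgreSQL
 * code: move_one_batch() only stands in for "move up to batch_size tuples
 * toward the front of the heap in one transaction".
 */
static long
move_one_batch(long tuples_left, long batch_size)
{
    return tuples_left < batch_size ? tuples_left : batch_size;
}

int
main(void)
{
    long tuples_to_move = 10 * 1000 * 1000;  /* out-of-place tuples */
    long batch_size = 1000;                  /* tuples per transaction */
    long moved = 0;
    long xids_consumed = 0;

    while (moved < tuples_to_move)
    {
        /* BEGIN; ... move up to batch_size tuples ...; COMMIT; */
        moved += move_one_batch(tuples_to_move - moved, batch_size);
        xids_consumed++;                     /* one xid per transaction */
    }

    printf("moved %ld tuples in %ld transactions (xids consumed)\n",
           moved, xids_consumed);
    return 0;
}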
On 7/10/24 11:49, David Rowley wrote:
> On Tue, 9 Jul 2024 at 16:58, Ahmed Yarub Hani Al Nuaimi
> <ahmedyarubhani@gmail.com> wrote:
>> The thing is, after reading the code many times, I still don't understand why lock-free (or minimal-locking) compaction is such a big problem! Is it really that hard to lazily move tuples from one page to another after a page has been defragmented?
>
> I think there are a few things to think about. You may have thought of
> some of these already.
>
> 1. Moving rows could cause deadlocking issues. Users might struggle to
> accept that some background process is causing their transaction to
> roll back.
> 2. Transaction size: how large to make the batches of tuples to move
> at once? One big transaction sounds much more prone to deadlocking.
> 3. Xid consumption: doing lots of small transactions to move tuples
> could consume lots of xids.
> 4. Moving tuples around to other pages needs indexes to be updated and
> could cause index bloat.
>
> For #1, maybe there's something that can be done to ensure it's always
> vacuum that's the deadlock victim.
>
> You might be interested in [1]. There's also an older discussion in
> [2] that you might find interesting.

IIRC long time ago VACUUM FULL actually worked in a similar way, i.e. by moving rows around. I'm not sure if it did the lock-free thing as proposed here (probably not), but I guess at least some of the reasons why it was replaced by CLUSTER would still apply to this new thing.

Maybe it's a good trade-off for some use cases (after all, people do that using pg_repack/pg_squeeze/... so it clearly has value for them), but it'd be a bit unfortunate to rediscover those old issues later.

The cluster vacuum was introduced by commit 946cf229e89 in 2010, and then the "inplace" variant was removed by 0a469c87692 shortly after. I haven't looked for the threads discussing those changes, but I guess it should not be hard to find in the archives.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, 10 Jul 2024 at 22:58, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> IIRC long time ago VACUUM FULL actually worked in a similar way, i.e. by
> moving rows around. I'm not sure if it did the lock-free thing as
> proposed here (probably not), but I guess at least some of the reasons
> why it was replaced by CLUSTER would still apply to this new thing.

Yeah, that changed in 9.0. The old version still obtained an AEL on the table.

I think the primary issue with the old way was index bloat wasn't fixed. The release notes for 9.0 do claim the CLUSTER method "is substantially faster in most cases", however, I imagine there are plenty of cases where it wouldn't be. e.g., it's hard to imagine rewriting the entire 1TB table and indexes is cheaper than moving 1 out-of-place row.

David
On Thu, Jul 18, 2024 at 7:08 AM David Rowley <dgrowleyml@gmail.com> wrote:
> On Wed, 10 Jul 2024 at 22:58, Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
> > IIRC long time ago VACUUM FULL actually worked in a similar way, i.e. by
> > moving rows around. I'm not sure if it did the lock-free thing as
> > proposed here (probably not), but I guess at least some of the reasons
> > why it was replaced by CLUSTER would still apply to this new thing.
>
> Yeah, that changed in 9.0. The old version still obtained an AEL on the table.
>
> I think the primary issue with the old way was index bloat wasn't
> fixed. The release notes for 9.0 do claim the CLUSTER method "is
> substantially faster in most cases", however, I imagine there are
> plenty of cases where it wouldn't be. e.g., it's hard to imagine
> rewriting the entire 1TB table and indexes is cheaper than moving 1
> out-of-place row.

The other thing I remember besides index bloat is that it was crushingly slow. My memory is pretty fuzzy after this long, but I feel like it was on the order of minutes to do VACUUM FULL when you could have done CLUSTER in seconds -- and then on top of the long wait you often ended up using more disk space at the end than you had at the beginning due to the index bloat. I remember being surprised by the decision to remove it entirely, but it sure was painful to use.

--
Robert Haas
EDB: http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Jul 18, 2024 at 7:08 AM David Rowley <dgrowleyml@gmail.com> wrote:
>> I think the primary issue with the old way was index bloat wasn't
>> fixed. The release notes for 9.0 do claim the CLUSTER method "is
>> substantially faster in most cases", however, I imagine there are
>> plenty of cases where it wouldn't be. e.g., it's hard to imagine
>> rewriting the entire 1TB table and indexes is cheaper than moving 1
>> out-of-place row.

> The other thing I remember besides index bloat is that it was
> crushingly slow.

Yeah. The old way was great if there really were just a few tuples needing to be moved ... but by the time you decide you need VACUUM FULL rather than plain VACUUM, that's unlikely to be the case.

regards, tom lane
Wow, I was busy for a couple of days and now I'm again fully committed to this initiative. These ideas are extremely useful to me. I'll first start by reading the old in-place implementation, but meanwhile I have the following questions:
1- I'm thinking of adding only one simple step to autovacuum. This means there would be neither excessive locking nor excessive resource utilization. I guess my question is: does that simple step make the current lazy autovacuum much worse?
2- Can you point me to a resource explaining why this might lead to index bloating?
Em qui., 18 de jul. de 2024 às 15:21, Tom Lane <tgl@sss.pgh.pa.us> escreveu:
Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Jul 18, 2024 at 7:08 AM David Rowley <dgrowleyml@gmail.com> wrote:
>> I think the primary issue with the old way was index bloat wasn't
>> fixed. The release notes for 9.0 do claim the CLUSTER method "is
>> substantially faster in most cases", however, I imagine there are
>> plenty of cases where it wouldn't be. e.g, it's hard to imagine
>> rewriting the entire 1TB table and indexes is cheaper than moving 1
>> out-of-place row.
> The other thing I remember besides index bloat is that it was
> crushingly slow.
Yeah. The old way was great if there really were just a few tuples
needing to be moved ... but by the time you decide you need VACUUM
FULL rather than plain VACUUM, that's unlikely to be the case.
regards, tom lane
On Sun, 21 Jul 2024 at 04:00, Ahmed Yarub Hani Al Nuaimi <ahmedyarubhani@gmail.com> wrote:
> 2- Can you point me to a resource explaining why this might lead to index bloating?

No resource links, but if you move a tuple to another page then you must also adjust the index. If you have no exclusive lock on the table, then you must assume older transactions still need the old tuple version, so you need to create another index entry rather than re-pointing the existing index entry's ctid to the new tuple version. It's not hard to imagine that would cause the index to become larger if you had to move some decent portion of the tuples to other pages.

FWIW, I think it would be good if we had some easier way to compact tables without blocking concurrent users. My primary interest in TID Range Scans was to allow easier identification of tuples near the end of the heap that could be manually UPDATEd after a vacuum to allow the heap to be shrunk during the next vacuum.

David
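To illustrate the shape of the index-bloat problem David describes, here is a purely hypothetical sketch (not PostgreSQL code; the structures are made up): without an exclusive lock, the moved tuple gets a brand-new index entry while the old entry merely gets marked dead for older snapshots, so the index temporarily carries two entries per moved row until a later vacuum removes the dead ones.

#include <stdbool.h>
#include <stdio.h>

/*
 * Hypothetical sketch, not PostgreSQL code: a concurrent-safe tuple move
 * adds an index entry instead of re-pointing the existing one, because
 * older snapshots may still need the old heap version.
 */
#define MAX_ENTRIES 16

typedef struct IndexEntry
{
    long key;
    int  heap_block;   /* where this version of the row lives */
    bool dead;         /* removable once no snapshot can see it */
} IndexEntry;

static IndexEntry index_entries[MAX_ENTRIES];
static int        nentries = 0;

static void
move_tuple(long key, int old_block, int new_block)
{
    /* the moved copy gets a brand-new index entry ... */
    index_entries[nentries++] = (IndexEntry){key, new_block, false};

    /* ... while the old entry is only marked dead, not re-pointed */
    for (int i = 0; i < nentries; i++)
        if (index_entries[i].key == key &&
            index_entries[i].heap_block == old_block)
            index_entries[i].dead = true;
}

int
main(void)
{
    index_entries[nentries++] = (IndexEntry){42, 100, false};

    move_tuple(42, 100, 3);   /* compact: move row 42 toward the front */

    printf("index entries for one logical row: %d\n", nentries);   /* prints 2 */
    return 0;
}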
David Rowley <dgrowleyml@gmail.com> writes:
> No resource links, but if you move a tuple to another page then you
> must also adjust the index. If you have no exclusive lock on the
> table, then you must assume older transactions still need the old
> tuple version, so you need to create another index entry rather than
> re-pointing the existing index entry's ctid to the new tuple version.

The actually tricky part about that is that you have to ensure that any concurrent scan will see one of the two copies --- not both, and not neither. This is fairly hard when the concurrent query might be using any of several scan methods, and might or might not have visited the original tuple before you commenced the move. You can solve it by treating the move more like an UPDATE, that is the new tuple gets a new XID, but that has its own downsides; notably that it must block/be blocked by concurrent real UPDATEs.

regards, tom lane
On 7/21/24 04:13, Tom Lane wrote:
> David Rowley <dgrowleyml@gmail.com> writes:
>> No resource links, but if you move a tuple to another page then you
>> must also adjust the index. If you have no exclusive lock on the
>> table, then you must assume older transactions still need the old
>> tuple version, so you need to create another index entry rather than
>> re-pointing the existing index entry's ctid to the new tuple version.
>
> The actually tricky part about that is that you have to ensure that
> any concurrent scan will see one of the two copies --- not both,
> and not neither. This is fairly hard when the concurrent query
> might be using any of several scan methods, and might or might not
> have visited the original tuple before you commenced the move.
> You can solve it by treating the move more like an UPDATE, that
> is the new tuple gets a new XID, but that has its own downsides;
> notably that it must block/be blocked by concurrent real UPDATEs.

True, but the UPDATE approach probably comes with its own set of issues. For example, it likely breaks tracking of commit timestamps, and if an application depends on that e.g. for conflict resolution ...

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
That clearly explains the problem. But this got me thinking: what if we do both index and heap optimization at the same time?
Meaning that moving a heap tuple to compact/defragment heap pages would be followed by moving the corresponding index tuple: creating a new index tuple at the right place in the index file (one that has had its dead tuples removed and been internally defragmented, i.e. vacuumed) and then deleting the old one. Deleting the old index entry could be done immediately after moving the heap tuple. I think this could both solve the bloat problem and keep the table and indexes in optimal shape, with all of it done lazily so that these operations only run when the server is not overwhelmed (or just using whatever logic lazy vacuuming already uses). What do you think?
On Sat, Jul 20, 2024 at 10:52 PM David Rowley <dgrowleyml@gmail.com> wrote:
On Sun, 21 Jul 2024 at 04:00, Ahmed Yarub Hani Al Nuaimi
<ahmedyarubhani@gmail.com> wrote:
> 2- Can you point me to a resource explaining why this might lead to index bloating?
No resource links, but if you move a tuple to another page then you
must also adjust the index. If you have no exclusive lock on the
table, then you must assume older transactions still need the old
tuple version, so you need to create another index entry rather than
re-pointing the existing index entry's ctid to the new tuple version.
It's not hard to imagine that would cause the index to become larger
if you had to move some decent portion of the tuples to other pages.
FWIW, I think it would be good if we had some easier way to compact
tables without blocking concurrent users. My primary interest in TID
Range Scans was to allow easier identification of tuples near the end
of the heap that could be manually UPDATEd after a vacuum to allow the
heap to be shrunk during the next vacuum.
David
Please don't top-post ...

On 7/21/24 16:42, Ahmed Yarub Hani Al Nuaimi wrote:
> That clearly explains the problem. But this got me thinking: what if we do
> both index and heap optimization at the same time?
> Meaning that moving a heap tuple to compact/defragment heap pages would be
> followed by moving the corresponding index tuple: creating a new index
> tuple at the right place in the index file (one that has had its dead
> tuples removed and been internally defragmented, i.e. vacuumed) and then
> deleting the old one. Deleting the old index entry could be done
> immediately after moving the heap tuple. I think this could both solve the
> bloat problem and keep the table and indexes in optimal shape, with all of
> it done lazily so that these operations only run when the server is not
> overwhelmed (or just using whatever logic lazy vacuuming already uses).
> What do you think?

I think this would run directly into the problems mentioned by Tom [1]. You say "immediately", but what does that mean? You need to explain how you would ensure a scan (of arbitrary type) sees *exactly* one of the heap/index tuples.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, Jul 20, 2024 at 10:13 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The actually tricky part about that is that you have to ensure that
> any concurrent scan will see one of the two copies --- not both,
> and not neither. This is fairly hard when the concurrent query
> might be using any of several scan methods, and might or might not
> have visited the original tuple before you commenced the move.
> You can solve it by treating the move more like an UPDATE, that
> is the new tuple gets a new XID, but that has its own downsides;
> notably that it must block/be blocked by concurrent real UPDATEs.

Yeah, this is always the part that has stumped me. I think there's probably some way to find bit space to indicate whether an update is a "real" update or a "move-the-tuple" update, although it might take a bunch of engineering to get the job done, or otherwise be kind of awkward in some way. But then what? You're still left with writing a bunch of new index entries for the tuple which will lead to bloating the indexes as you try to organize the heap, so I imagine that you have to move relatively small batches of tuples and then vacuum and then repeat, which seems like it will take forever on a big table.

If you imagine a hypothetical world in which the block number in the index entry is a logical block number rather than a physical block number, then you could imagine shuffling the physical position of the blocks around to put the non-empty logical blocks at the beginning of the relation and then throw away the empty ones. You might still want to migrate some rows from partially-filled blocks to empty ones, but that seems like it would often require reindexing vastly fewer rows. Imagine for example a relation with ten half-empty blocks at the beginning followed by a million completely full blocks. But now you've also turned sequential scans into random I/O. Probably it works better if the logical->physical mapping works in multi-megabyte chunks rather than block by block, but that seems like an awful lot of engineering to do as foundational work, especially given that even after you do all of that and build some kind of tuple relocator on top of it, you still need a way to move around individual tuples when relocating chunks isn't good enough.

What the extensions that are out there seem to do is, as I understand it, an online table rewrite with concurrent change capture, and then you apply the changes to the output table afterward. That has the problem that if the changes are happening faster than you can apply them, the operation does not terminate. But, enough people seem to be happy with this kind of solution that we should perhaps look harder at doing something along these lines in core.

--
Robert Haas
EDB: http://www.enterprisedb.com
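As a purely hypothetical sketch of the logical-to-physical chunk mapping described above (nothing like this exists in core; the names and the 8 MB chunk size are made up), index entries would carry logical block numbers and a small per-relation map would translate whole chunks, so compaction could relocate or drop chunks without rewriting index entries:

#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical sketch, not PostgreSQL code: translate a logical block
 * number to a physical one through a per-relation chunk map, so whole
 * chunks can be moved or dropped without touching the indexes.
 */
#define CHUNK_BLOCKS 1024              /* 1024 x 8 kB pages = 8 MB chunks */

typedef uint32_t BlockNumber;

typedef struct ChunkMap
{
    uint32_t     nchunks;
    BlockNumber *physical_chunk;       /* indexed by logical chunk number */
} ChunkMap;

static BlockNumber
logical_to_physical(const ChunkMap *map, BlockNumber logical_block)
{
    uint32_t chunk  = logical_block / CHUNK_BLOCKS;
    uint32_t offset = logical_block % CHUNK_BLOCKS;

    return map->physical_chunk[chunk] * CHUNK_BLOCKS + offset;
}

int
main(void)
{
    /* two logical chunks, stored physically in swapped order */
    BlockNumber phys[] = {1, 0};
    ChunkMap map = {2, phys};

    printf("logical block 5    -> physical block %u\n",
           logical_to_physical(&map, 5));      /* 1029 */
    printf("logical block 1500 -> physical block %u\n",
           logical_to_physical(&map, 1500));   /* 476 */
    return 0;
}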
Hi,

On Mon, Jul 22, 2024 at 08:39:23AM -0400, Robert Haas wrote:
> What the extensions that are out there seem to do is, as I understand
> it, an online table rewrite with concurrent change capture, and then
> you apply the changes to the output table afterward. That has the
> problem that if the changes are happening faster than you can apply
> them, the operation does not terminate. But, enough people seem to be
> happy with this kind of solution that we should perhaps look harder at
> doing something along these lines in core.

I believe this is being discussed here:

https://commitfest.postgresql.org/49/5117/
https://www.postgresql.org/message-id/5186.1706694913%40antos

Michael
That is a very useful thread and I'll keep on following it but it is not exactly what I'm trying to achieve here.
You see, there is a great difference between VACUUM FULL CONCURRENTLY and adding compaction to lazy vacuuming. The main factor here is resource utilization: a lot of companies have enough data that would need days to be vacuumed concurrently. Is the implementation discussed there pausable or at least cancellable? Does it take into account periods of high resource utilization by user-generated queries?
On Mon, Jul 22, 2024 at 9:42 AM Michael Banck <mbanck@gmx.net> wrote:
Hi,
On Mon, Jul 22, 2024 at 08:39:23AM -0400, Robert Haas wrote:
> What the extensions that are out there seem to do is, as I understand
> it, an online table rewrite with concurrent change capture, and then
> you apply the changes to the output table afterward. That has the
> problem that if the changes are happening faster than you can apply
> them, the operation does not terminate. But, enough people seem to be
> happy with this kind of solution that we should perhaps look harder at
> doing something along these lines in core.
I believe this is being discussed here:
https://commitfest.postgresql.org/49/5117/
https://www.postgresql.org/message-id/5186.1706694913%40antos
Michael
On Mon, Jul 22, 2024 at 1:00 PM Ahmed Yarub Hani Al Nuaimi <ahmedyarubhani@gmail.com> wrote:
> That is a very useful thread and I'll keep on following it but it is not exactly what I'm trying to achieve here.
> You see, there is a great difference between VACUUM FULL CONCURRENTLY and adding compaction to lazy vacuuming. The main factor here is resource utilization: a lot of companies have enough data that would need days to be vacuumed concurrently. Is the implementation discussed there pausable or at least cancellable? Does it take into account periods of high resource utilization by user-generated queries?

If you want to discuss the patch on the other thread, you should go read that thread and perhaps reply there, rather than replying to this message. It's important to keep all of the discussion of a certain patch together, which doesn't happen if you reply like this.

Also, you've already been asked not to top-post and you just did it again, so I'm guessing that you don't know what is meant by the term. So please read this:

https://web.archive.org/web/20230608210806/idallen.com/topposting.html

If you're going to post to this mailing list, it is important to understand the conventions and expectations that people have here. If you insist on doing things differently than what everyone else does, you're going to annoy a lot of people.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Mon, Jul 22, 2024 at 2:20 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Jul 22, 2024 at 1:00 PM Ahmed Yarub Hani Al Nuaimi
<ahmedyarubhani@gmail.com> wrote:
> That is a very useful thread and I'll keep on following it but it is not exactly what I'm trying to achieve here.
> You see, there is a great difference between VACUUM FULL CONCURRENTLY and adding compaction to lazy vacuuming. The main factor here is resource utilization: a lot of companies have enough data that would need days to be vacuumed concurrently. Is the implementation discussed there pausable or at least cancellable? Does it take into account periods of high resource utilization by user-generated queries?
If you want to discuss the patch on the other thread, you should go
read that thread and perhaps reply there, rather than replying to this
message. It's important to keep all of the discussion of a certain
patch together, which doesn't happen if you reply like this.
Also, you've already been asked not to top-post and you just did it
again, so I'm guessing that you don't know what is meant by the term.
So please read this:
https://web.archive.org/web/20230608210806/idallen.com/topposting.html
If you're going to post to this mailing list, it is important to
understand the conventions and expectations that people have here. If
you insist on doing things differently than what everyone else does,
you're going to annoy a lot of people.
--
Robert Haas
EDB: http://www.enterprisedb.com
Oh I'm so sorry for the top-posting. I didn't even notice the warning before. I'm not discussing exactly what is in that thread but rather an alternative implementation. That being said, I'll do my own research, try to get a working implementation and then come back to this thread.
Sorry again :)