Thread: Lock-free compaction. Why not?

Lock-free compaction. Why not?

From: Ahmed Yarub Hani Al Nuaimi
So anyway, I talked last week about lock-free vacuum. Since then I have almost finished a plugin that works in all JetBrains products. It hooks into the internal database tool and displays the internals of the database along with some stats. I will use this for my next phase: autovacuum with compaction.
The thing is, after reading the code a million times, I still don't understand why lock-free (or minimal locking) is such a big problem! Is it that hard to lazily move tuples from one page to another after defragmenting the page?
I even wrote a super simple equation to calculate how much disk space could be reclaimed, and drew up a full list of the advantages, including maintaining clustered indexes. This would of course be followed by benchmarks.
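(To be concrete, the raw inputs for that kind of estimate are easy to get; here is a rough sketch using the pgstattuple contrib module. The table name is a placeholder, and this is only an approximation of such an estimate, not the exact equation:)

-- Rough reclaimable-space estimate: free space plus dead-tuple bytes.
-- This is an upper bound; fillfactor reserve and page overhead are ignored.
CREATE EXTENSION IF NOT EXISTS pgstattuple;

SELECT table_len,
       free_space + dead_tuple_len AS est_reclaimable_bytes,
       round(100.0 * (free_space + dead_tuple_len) / table_len, 1)
           AS est_reclaimable_pct
FROM pgstattuple('some_table');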
Here is the WIP plugin, by the way (it should be released this week): https://github.com/ahmedyarub/database_internals_plugin/tree/2.0.0
I don't trust ChatGPT at all, but I kept trying to find reasons not to do this, and I couldn't find any!
It seems like such a trivial change for such a huge advantage. I just hope somebody can give me a reason why it wouldn't work.

Really appreciate your patience,
Ahmed

Re: Lock-free compaction. Why not?

From: David Rowley
On Tue, 9 Jul 2024 at 16:58, Ahmed Yarub Hani Al Nuaimi
<ahmedyarubhani@gmail.com> wrote:
> The thing is, after reading the code a million times, I still don't understand why lock-free (or minimal locking)
> is such a big problem! Is it that hard to lazily move tuples from one page to another after defragmenting the page?

I think there are a few things to think about. You may have thought of
some of these already.

1. moving rows could cause deadlocking issues. Users might struggle to
accept that some background process is causing their transaction to
rollback.
2. transaction size: How large to make the batches of tuples to move
at once? One transaction sounds much more prone to deadlocking.
3. xid consumption. Doing lots of small transactions to move tuples
could consume lots of xids.
4. moving tuples around to other pages needs indexes to be updated and
could cause index bloat.

For #1, maybe there's something that can be done to ensure it's always
vacuum that's the deadlock victim.
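(As a hedged sketch of what that could look like if the mover ran ordinary transactions; the timeout values are purely illustrative:)

-- Sketch: bias lock conflicts against the background mover.
BEGIN;
-- Give up quickly on locks rather than stalling user transactions.
SET LOCAL lock_timeout = '50ms';
-- The backend whose deadlock_timeout fires first runs the deadlock
-- detector and aborts its own transaction, so a short value here makes
-- the mover the likely victim (this setting is superuser-only).
SET LOCAL deadlock_timeout = '10ms';
-- ... lock and move one small batch of tuples here ...
COMMIT;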

You might be interested in [1].  There's also an older discussion in
[2] that you might find interesting.

David

[1] https://www.postgresql.org/message-id/flat/CAFj8pRDNDOrg90hLMmbo_hiWpgBm%2B73gmWMRUHRkTKwrGnvYJQ%40mail.gmail.com#cc4f8d730d2c5203f53c50260053fec5
[2] https://www.postgresql.org/message-id/flat/CANTTaev-LdgYj4uZoy67catS5SF5u_X-dTHiLH7OKwU6Gv3MFA%40mail.gmail.com



Re: Lock-free compaction. Why not?

From: Tomas Vondra
On 7/10/24 11:49, David Rowley wrote:
> On Tue, 9 Jul 2024 at 16:58, Ahmed Yarub Hani Al Nuaimi
> <ahmedyarubhani@gmail.com> wrote:
>> The thing is, after reading the code a million times, I still don't understand why lock-free (or minimal locking)
>> is such a big problem! Is it that hard to lazily move tuples from one page to another after defragmenting the page?
> 
> I think there are a few things to think about. You may have thought of
> some of these already.
> 
> 1. moving rows could cause deadlocking issues. Users might struggle to
> accept that some background process is causing their transaction to
> rollback.
> 2. transaction size: How large to make the batches of tuples to move
> at once? One transaction sounds much more prone to deadlocking.
> 3. xid consumption. Doing lots of small transactions to move tuples
> could consume lots of xids.
> 4. moving tuples around to other pages needs indexes to be updated and
> could cause index bloat.
> 
> For #1, maybe there's something that can be done to ensure it's always
> vacuum that's the deadlock victim.
> 
> You might be interested in [1].  There's also an older discussion in
> [2] that you might find interesting.
> 

IIRC a long time ago VACUUM FULL actually worked in a similar way, i.e. by
moving rows around. I'm not sure if it did the lock-free thing as
proposed here (probably not), but I guess at least some of the reasons
why it was replaced by CLUSTER would still apply to this new thing.

Maybe it's a good trade off for some use cases (after all, people do
that using pg_repack/pg_squeeze/... so it clearly has value for them),
but it'd be a bit unfortunate to rediscover those old issues later.

The cluster vacuum was introduced by commit 946cf229e89 in 2010, and
then the "inplace" variant was removed by 0a469c87692 shortly after. I
haven't looked for the threads discussing those changes, but I guess it
should not be hard to find in the archives.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Lock-free compaction. Why not?

From: David Rowley
On Wed, 10 Jul 2024 at 22:58, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> IIRC a long time ago VACUUM FULL actually worked in a similar way, i.e. by
> moving rows around. I'm not sure if it did the lock-free thing as
> proposed here (probably not), but I guess at least some of the reasons
> why it was replaced by CLUSTER would still apply to this new thing.

Yeah, that changed in 9.0.  The old version still obtained an AEL on the table.

I think the primary issue with the old way was that index bloat wasn't
fixed. The release notes for 9.0 do claim the CLUSTER method "is
substantially faster in most cases"; however, I imagine there are
plenty of cases where it wouldn't be. E.g., it's hard to imagine that
rewriting an entire 1TB table and its indexes is cheaper than moving
one out-of-place row.

David



Re: Lock-free compaction. Why not?

From: Robert Haas
On Thu, Jul 18, 2024 at 7:08 AM David Rowley <dgrowleyml@gmail.com> wrote:
> On Wed, 10 Jul 2024 at 22:58, Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
> > IIRC a long time ago VACUUM FULL actually worked in a similar way, i.e. by
> > moving rows around. I'm not sure if it did the lock-free thing as
> > proposed here (probably not), but I guess at least some of the reasons
> > why it was replaced by CLUSTER would still apply to this new thing.
>
> Yeah, that changed in 9.0.  The old version still obtained an AEL on the table.
>
> I think the primary issue with the old way was that index bloat wasn't
> fixed. The release notes for 9.0 do claim the CLUSTER method "is
> substantially faster in most cases"; however, I imagine there are
> plenty of cases where it wouldn't be. E.g., it's hard to imagine that
> rewriting an entire 1TB table and its indexes is cheaper than moving
> one out-of-place row.

The other thing I remember besides index bloat is that it was
crushingly slow. My memory is pretty fuzzy after this long, but I feel
like it was on the order of minutes to do VACUUM FULL when you could
have done CLUSTER in seconds -- and then on top of the long wait you
often ended up using more disk space at the end than you had at the
beginning due to the index bloat. I remember being surprised by the
decision to remove it entirely, but it sure was painful to use.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Lock-free compaction. Why not?

From: Tom Lane
Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Jul 18, 2024 at 7:08 AM David Rowley <dgrowleyml@gmail.com> wrote:
>> I think the primary issue with the old way was that index bloat wasn't
>> fixed. The release notes for 9.0 do claim the CLUSTER method "is
>> substantially faster in most cases"; however, I imagine there are
>> plenty of cases where it wouldn't be. E.g., it's hard to imagine that
>> rewriting an entire 1TB table and its indexes is cheaper than moving
>> one out-of-place row.

> The other thing I remember besides index bloat is that it was
> crushingly slow.

Yeah.  The old way was great if there really were just a few tuples
needing to be moved ... but by the time you decide you need VACUUM
FULL rather than plain VACUUM, that's unlikely to be the case.  

            regards, tom lane



Re: Lock-free compaction. Why not?

From: Ahmed Yarub Hani Al Nuaimi
Wow, I was busy for a couple of days and now I'm again fully committed to this initiative. These ideas are extremely useful to me. I'll start by reading the old in-place implementation, but meanwhile I have the following questions:
1- I'm thinking of adding only one simple step to autovacuum. This means that there would be neither excessive locking nor excessive resource utilization. I guess my question is: does that simple step make the current lazy autovacuum much worse?
2- Can you point me to a resource explaining why this might lead to index bloating?

On Thu, Jul 18, 2024 at 3:21 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Jul 18, 2024 at 7:08 AM David Rowley <dgrowleyml@gmail.com> wrote:
>> I think the primary issue with the old way was that index bloat wasn't
>> fixed. The release notes for 9.0 do claim the CLUSTER method "is
>> substantially faster in most cases"; however, I imagine there are
>> plenty of cases where it wouldn't be. E.g., it's hard to imagine that
>> rewriting an entire 1TB table and its indexes is cheaper than moving
>> one out-of-place row.

> The other thing I remember besides index bloat is that it was
> crushingly slow.

Yeah.  The old way was great if there really were just a few tuples
needing to be moved ... but by the time you decide you need VACUUM
FULL rather than plain VACUUM, that's unlikely to be the case. 

                        regards, tom lane

Re: Lock-free compaction. Why not?

From: David Rowley
On Sun, 21 Jul 2024 at 04:00, Ahmed Yarub Hani Al Nuaimi
<ahmedyarubhani@gmail.com> wrote:
> 2- Can you point me to a resource explaining why this might lead to index bloating?

No resource links, but if you move a tuple to another page then you
must also adjust the index.  If you have no exclusive lock on the
table, then you must assume older transactions still need the old
tuple version, so you need to create another index entry rather than
re-pointing the existing index entry's ctid to the new tuple version.
It's not hard to imagine that would cause the index to become larger
if you had to move some decent portion of the tuples to other pages.

FWIW, I think it would be good if we had some easier way to compact
tables without blocking concurrent users.  My primary interest in TID
Range Scans was to allow easier identification of tuples near the end
of the heap that could be manually UPDATEd after a vacuum to allow the
heap to be shrunk during the next vacuum.
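(Roughly like this, as a manual sketch; the table, column, and page boundary are placeholders, and note that the new row version may stay on the same page as a HOT update if that page still has free space:)

-- Sketch: after a plain VACUUM has freed space in earlier pages,
-- rewrite the rows that live beyond some target page so the next
-- VACUUM can truncate the emptied tail.  TID Range Scans
-- (PostgreSQL 14+) make the ctid qual cheap to evaluate.
UPDATE some_table
SET    some_col = some_col           -- no-op change forces a new row version
WHERE  ctid >= '(10000,0)'::tid;     -- rows on pages past the target size

VACUUM some_table;                   -- may now truncate the emptied tail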

David



Re: Lock-free compaction. Why not?

From: Tom Lane
David Rowley <dgrowleyml@gmail.com> writes:
> No resource links, but if you move a tuple to another page then you
> must also adjust the index.  If you have no exclusive lock on the
> table, then you must assume older transactions still need the old
> tuple version, so you need to create another index entry rather than
> re-pointing the existing index entry's ctid to the new tuple version.

The actually tricky part about that is that you have to ensure that
any concurrent scan will see one of the two copies --- not both,
and not neither.  This is fairly hard when the concurrent query
might be using any of several scan methods, and might or might not
have visited the original tuple before you commenced the move.
You can solve it by treating the move more like an UPDATE, that
is the new tuple gets a new XID, but that has its own downsides;
notably that it must block/be blocked by concurrent real UPDATEs.
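(That UPDATE-like behaviour is easy to observe from SQL; a tiny illustration:)

-- Even a no-op UPDATE creates a new tuple version with a new position
-- (ctid) and a new inserting transaction ID (xmin), which is exactly
-- what ties such a move into the normal UPDATE visibility and locking
-- rules.
CREATE TEMP TABLE t (x int);
INSERT INTO t VALUES (1);
SELECT ctid, xmin FROM t;   -- e.g. (0,1) with some xid
UPDATE t SET x = x;
SELECT ctid, xmin FROM t;   -- e.g. (0,2) with a newer xid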

            regards, tom lane



Re: Lock-free compaction. Why not?

From: Tomas Vondra
On 7/21/24 04:13, Tom Lane wrote:
> David Rowley <dgrowleyml@gmail.com> writes:
>> No resource links, but if you move a tuple to another page then you
>> must also adjust the index.  If you have no exclusive lock on the
>> table, then you must assume older transactions still need the old
>> tuple version, so you need to create another index entry rather than
>> re-pointing the existing index entry's ctid to the new tuple version.
> 
> The actually tricky part about that is that you have to ensure that
> any concurrent scan will see one of the two copies --- not both,
> and not neither.  This is fairly hard when the concurrent query
> might be using any of several scan methods, and might or might not
> have visited the original tuple before you commenced the move.
> You can solve it by treating the move more like an UPDATE, that
> is the new tuple gets a new XID, but that has its own downsides;
> notably that it must block/be blocked by concurrent real UPDATEs.
> 

True, but the UPDATE approach probably comes with its own set of
issues. For example, it likely breaks tracking of commit timestamps, and
if an application depends on that e.g. for conflict resolution ...


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Lock-free compaction. Why not?

From: Ahmed Yarub Hani Al Nuaimi
That clearly explains the problem. But this got me thinking: what if we do both index and heap optimization at the same time?
That is, moving a heap tuple to compact/defragment heap pages would be followed by moving the corresponding index tuple: creating a new index entry at the right place in the (already vacuumed and internally defragmented) index, and deleting the old entry immediately after the heap tuple has been moved. I think this could both solve the bloating problem and keep the table and its indexes in optimal shape, with all of it done lazily so that these operations only run when the server is not overwhelmed (or just using whatever logic our lazy vacuuming already uses). What do you think?

On Sat, Jul 20, 2024 at 10:52 PM David Rowley <dgrowleyml@gmail.com> wrote:
On Sun, 21 Jul 2024 at 04:00, Ahmed Yarub Hani Al Nuaimi
<ahmedyarubhani@gmail.com> wrote:
> 2- Can you point me to a resource explaining why this might lead to index bloating?

No resource links, but if you move a tuple to another page then you
must also adjust the index.  If you have no exclusive lock on the
table, then you must assume older transactions still need the old
tuple version, so you need to create another index entry rather than
re-pointing the existing index entry's ctid to the new tuple version.
It's not hard to imagine that would cause the index to become larger
if you had to move some decent portion of the tuples to other pages.

FWIW, I think it would be good if we had some easier way to compact
tables without blocking concurrent users.  My primary interest in TID
Range Scans was to allow easier identification of tuples near the end
of the heap that could be manually UPDATEd after a vacuum to allow the
heap to be shrunk during the next vacuum.

David

Re: Lock-free compaction. Why not?

From: Tomas Vondra
Please don't top-post ...

On 7/21/24 16:42, Ahmed Yarub Hani Al Nuaimi wrote:
> That clearly explains the problem. But this got me thinking: what if we do
> both index and heap optimization at the same time?
> That is, moving a heap tuple to compact/defragment heap pages would be
> followed by moving the corresponding index tuple: creating a new index
> entry at the right place in the (already vacuumed and internally
> defragmented) index, and deleting the old entry immediately after the heap
> tuple has been moved. I think this could both solve the bloating problem
> and keep the table and its indexes in optimal shape, with all of it done
> lazily so that these operations only run when the server is not
> overwhelmed (or just using whatever logic our lazy vacuuming already
> uses). What do you think?
> 

I think this would run directly into the problems mentioned by Tom [1].
You say "immediately", but what does that mean? You need to explain how
you would ensure that a scan (of arbitrary type) sees *exactly* one of
the heap/index tuples.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Lock-free compaction. Why not?

From: Robert Haas
On Sat, Jul 20, 2024 at 10:13 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The actually tricky part about that is that you have to ensure that
> any concurrent scan will see one of the two copies --- not both,
> and not neither.  This is fairly hard when the concurrent query
> might be using any of several scan methods, and might or might not
> have visited the original tuple before you commenced the move.
> You can solve it by treating the move more like an UPDATE, that
> is the new tuple gets a new XID, but that has its own downsides;
> notably that it must block/be blocked by concurrent real UPDATEs.

Yeah, this is always the part that has stumped me. I think there's
probably some way to find bit space to indicate whether an update is a
"real" update or a "move-the-tuple" update, although it might take a
bunch of engineering to get the job done, or otherwise be kind of
awkward in some way. But then what? You're still left with writing a
bunch of new index entries for the tuple which will lead to bloating
the indexes as you try to organize the heap, so I imagine that you
have to move relatively small batches of tuples and then vacuum and
then repeat, which seems like it will take forever on a big table.

If you imagine a hypothetical world in which the block number in the
index entry is a logical block number rather than a physical block
number, then you could imagine shuffling the physical position of the
blocks around to put the non-empty logical blocks at the beginning of
the relation and then throw away the empty ones. You might still want
to migrate some rows from partially-filled blocks to empty ones, but
that seems like it would often require reindexing vastly fewer rows.
Imagine for example a relation with ten half-empty blocks at the
beginning followed by a million completely full blocks. But now you've
also turned sequential scans into random I/O. Probably it works better
if the logical->physical mapping works in multi-megabyte chunks rather
than block by block, but that seems like an awful lot of engineering
to do as foundational work, especially given that even after you do
all of that and build some kind of tuple relocator on top of it, you
still need a way to move around individual tuples when relocating
chunks isn't good enough.

What the extensions that are out there seem to do is, as I understand
it, an online table rewrite with concurrent change capture, and then
you apply the changes to the output table afterward. That has the
problem that if the changes are happening faster than you can apply
them, the operation does not terminate. But, enough people seem to be
happy with this kind of solution that we should perhaps look harder at
doing something along these lines in core.
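(A heavily simplified sketch of that pattern follows; all object names are hypothetical, and a real tool also loops the replay step, handles DDL and TOAST, and performs the final swap under a brief exclusive lock:)

-- 1. Capture concurrent changes into a log table while the copy runs.
CREATE TABLE some_table_new (LIKE some_table INCLUDING ALL);
CREATE TABLE some_table_log (seq bigserial, op text, row_data jsonb);

CREATE FUNCTION capture_change() RETURNS trigger AS $$
BEGIN
    INSERT INTO some_table_log (op, row_data)
    VALUES (TG_OP, to_jsonb(COALESCE(NEW, OLD)));
    RETURN NULL;
END $$ LANGUAGE plpgsql;

CREATE TRIGGER some_table_capture
AFTER INSERT OR UPDATE OR DELETE ON some_table
FOR EACH ROW EXECUTE FUNCTION capture_change();

-- 2. Bulk-copy the live rows; this is what compacts heap and indexes.
INSERT INTO some_table_new SELECT * FROM some_table;

-- 3. Replay some_table_log against some_table_new, then swap the two
--    under a short exclusive lock.  If changes arrive faster than they
--    replay, this step never converges: the non-termination risk above.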

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Lock-free compaction. Why not?

From: Michael Banck
Hi,

On Mon, Jul 22, 2024 at 08:39:23AM -0400, Robert Haas wrote:
> What the extensions that are out there seem to do is, as I understand
> it, an online table rewrite with concurrent change capture, and then
> you apply the changes to the output table afterward. That has the
> problem that if the changes are happening faster than you can apply
> them, the operation does not terminate. But, enough people seem to be
> happy with this kind of solution that we should perhaps look harder at
> doing something along these lines in core.

I believe this is being discussed here:

https://commitfest.postgresql.org/49/5117/
https://www.postgresql.org/message-id/5186.1706694913%40antos


Michael



Re: Lock-free compaction. Why not?

From: Ahmed Yarub Hani Al Nuaimi
That is a very useful thread and I'll keep following it, but it is not exactly what I'm trying to achieve here.
You see, there is a great difference between VACUUM FULL CONCURRENTLY and adding compaction to lazy vacuuming. The main factor here is resource utilization: a lot of companies have enough data that it would take days to vacuum concurrently. Is the implementation discussed there pausable, or at least cancellable? Does it take into account periods of high resource utilization by user-generated queries?
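(For reference, the throttling logic that lazy vacuum already uses is the cost-based delay machinery, which is the kind of thing I mean; a compacting step could in principle reuse the same knobs:)

-- Autovacuum already paces itself with cost-based delays.
SHOW autovacuum_vacuum_cost_delay;   -- sleep time per cost quantum
SHOW autovacuum_vacuum_cost_limit;   -- accumulated cost before sleeping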

On Mon, Jul 22, 2024 at 9:42 AM Michael Banck <mbanck@gmx.net> wrote:
Hi,

On Mon, Jul 22, 2024 at 08:39:23AM -0400, Robert Haas wrote:
> What the extensions that are out there seem to do is, as I understand
> it, an online table rewrite with concurrent change capture, and then
> you apply the changes to the output table afterward. That has the
> problem that if the changes are happening faster than you can apply
> them, the operation does not terminate. But, enough people seem to be
> happy with this kind of solution that we should perhaps look harder at
> doing something along these lines in core.

I believe this is being discussed here:

https://commitfest.postgresql.org/49/5117/
https://www.postgresql.org/message-id/5186.1706694913%40antos


Michael

Re: Lock-free compaction. Why not?

From: Robert Haas
On Mon, Jul 22, 2024 at 1:00 PM Ahmed Yarub Hani Al Nuaimi
<ahmedyarubhani@gmail.com> wrote:
> That is a very useful thread and I'll keep following it, but it is not exactly what I'm trying to achieve here.
> You see, there is a great difference between VACUUM FULL CONCURRENTLY and adding compaction to lazy vacuuming.
> The main factor here is resource utilization: a lot of companies have enough data that it would take days to
> vacuum concurrently. Is the implementation discussed there pausable, or at least cancellable? Does it take into
> account periods of high resource utilization by user-generated queries?

If you want to discuss the patch on the other thread, you should go
read that thread and perhaps reply there, rather than replying to this
message. It's important to keep all of the discussion of a certain
patch together, which doesn't happen if you reply like this.

Also, you've already been asked not to top-post and you just did it
again, so I'm guessing that you don't know what is meant by the term.
So please read this:

https://web.archive.org/web/20230608210806/idallen.com/topposting.html

If you're going to post to this mailing list, it is important to
understand the conventions and expectations that people have here. If
you insist on doing things differently than what everyone else does,
you're going to annoy a lot of people.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Lock-free compaction. Why not?

From: Ahmed Yarub Hani Al Nuaimi


On Mon, Jul 22, 2024 at 2:20 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Jul 22, 2024 at 1:00 PM Ahmed Yarub Hani Al Nuaimi
<ahmedyarubhani@gmail.com> wrote:
> That is a very useful thread and I'll keep following it, but it is not exactly what I'm trying to achieve here.
> You see, there is a great difference between VACUUM FULL CONCURRENTLY and adding compaction to lazy vacuuming. The main factor here is resource utilization: a lot of companies have enough data that it would take days to vacuum concurrently. Is the implementation discussed there pausable, or at least cancellable? Does it take into account periods of high resource utilization by user-generated queries?

If you want to discuss the patch on the other thread, you should go
read that thread and perhaps reply there, rather than replying to this
message. It's important to keep all of the discussion of a certain
patch together, which doesn't happen if you reply like this.

Also, you've already been asked not to top-post and you just did it
again, so I'm guessing that you don't know what is meant by the term.
So please read this:

https://web.archive.org/web/20230608210806/idallen.com/topposting.html

If you're going to post to this mailing list, it is important to
understand the conventions and expectations that people have here. If
you insist on doing things differently than what everyone else does,
you're going to annoy a lot of people.

--
Robert Haas
EDB: http://www.enterprisedb.com

Oh, I'm so sorry for the top-posting; I didn't notice the warning before. I'm not discussing exactly what is in that thread, but rather an alternative implementation. That being said, I'll do my own research, try to get a working implementation, and then come back to this thread.
Sorry again :)