Thread: clustering without locking
An implementation of clustering without locking would start by comparing the index to the table from the beginning to find the first mismatch. Rows before the mismatch are fine, and can be left alone. From here on, go through the index and rewrite each row in order. This will put the rows at the end of the table in cluster order. When done, vacuum the table. This will result in a clustered table without any locking needed. Those few records that were updated while clustering was happening will be out of order, but that should only be a few.

So, could this work? I could really use clustering without locking.
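For illustration, the "compare the index to the table" step can be sketched at the SQL level by walking the table in index order and watching the physical position (ctid): the heap matches the index for exactly as long as ctid keeps increasing. The table mytable and its cluster key k are hypothetical names:

    -- rows are physically in cluster order for as long as ctid
    -- increases monotonically down this output; the first decrease
    -- marks the first mismatch
    SELECT ctid, k
    FROM mytable
    ORDER BY k;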
On Thu, May 01, 2008 at 05:12:52PM -0700, fschmidt wrote:
> An implementation of clustering without locking would start by comparing the
> index to the table from the beginning to find the first mismatch. Rows
> before the mismatch are fine, and can be left alone. From here on, go
> through the index and rewrite each row in order. This will put the rows at
> the end of the table in cluster order. When done, vacuum the table. This
> will result in a clustered table without any locking needed. Those few
> records that were updated while clustering was happening will be out of
> order, but that should only be a few.

Huh? If I'm understanding you correctly you'll end up with rows in order, but with a really big hole in the middle of the table. I'm not sure if that qualifies as "clustered".

> So, could this work? I could really use clustering without locking.

Nice idea, but I don't think it's going to work.

Have a nice day,
-- 
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.
On Fri, May 2, 2008 at 9:12 AM, fschmidt <fschmidt@gmail.com> wrote:
> An implementation of clustering without locking would start by comparing the
> index to the table from the beginning to find the first mismatch. Rows
> before the mismatch are fine, and can be left alone. From here on, go
> through the index and rewrite each row in order. This will put the rows at
> the end of the table in cluster order. When done, vacuum the table. This
> will result in a clustered table without any locking needed. Those few
> records that were updated while clustering was happening will be out of
> order, but that should only be a few.

In your proposal, a large number of dead tuples can be generated in both the table and the index. This is a serious problem that spoils the effect of clustering.

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
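The dead-tuple cost described here is at least observable from the statistics views; a quick check, with a hypothetical table name:

    -- dead row versions accumulated by the rewrite pass
    SELECT n_live_tup, n_dead_tup
    FROM pg_stat_user_tables
    WHERE relname = 'mytable';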
> Huh? If I'm understanding you correctly you'll end up with rows in
> order, but with a really big hole in the middle of the table. I'm not
> sure if that qualifies as "clustered".

That's why he said vacuum when done. Anyway, I'm not sure that a big *contiguous* hole in the middle of the table would matter as much for queries, because most rows would still be close to each other--most queries would pull from one side or the other of the hole, and even for those that didn't, it would be one seek across the hole, not seeking all over the place?

-- 
Scott Ribe
scott_ribe@killerbytes.com
http://www.killerbytes.com/
(303) 722-0567 voice
Scott Ribe wrote:
>> Huh? If I'm understanding you correctly you'll end up with rows in
>> order, but with a really big hole in the middle of the table. I'm not
>> sure if that qualifies as "clustered".
>
> That's why he said vacuum when done. Anyway, I'm not sure that a big
> *contiguous* hole in the middle of the table would matter as much for
> queries, because most rows would still be close to each other--most queries
> would pull from one side or the other of the hole, and even for those that
> didn't, it would be one seek across the hole, not seeking all over the
> place?

Wouldn't new / updated tuples just get put in the hole, fairly rapidly un-clustering the table again?

I guess you could also have a fillfactor to pad out the newly clustered data and just accept huge disk space use. When you ran the lockless cluster again it could also partially fill the hole.

-- 
Craig Ringer
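The fillfactor mentioned here is an existing storage parameter; a minimal sketch, with a hypothetical table name:

    -- leave 30% of every newly written heap page free; only pages
    -- written after this (e.g. by a rewrite) honour the setting
    ALTER TABLE mytable SET (fillfactor = 70);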
> Wouldn't new / updated tuples just get put in the hole, fairly rapidly
> un-clustering the table again?

How is that different than putting them in newly-allocated space at the end of the table?

-- 
Scott Ribe
scott_ribe@killerbytes.com
http://www.killerbytes.com/
(303) 722-0567 voice
Scott Ribe <scott_ribe@killerbytes.com> writes:
>> Huh? If I'm understanding you correctly you'll end up with rows in
>> order, but with a really big hole in the middle of the table. I'm not
>> sure if that qualifies as "clustered".

> That's why he said vacuum when done.

Huh? A plain vacuum wouldn't fix that; a vacuum full would close up the hole, but (a) it'd not preserve the row ordering, and (b) it'd take an exclusive lock.

			regards, tom lane
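To spell out the distinction being drawn here, with a hypothetical table name:

    VACUUM mytable;       -- marks dead row versions reusable, but does not
                          -- move live rows or close up holes in the file
    VACUUM FULL mytable;  -- compacts the table, but (as Tom notes) takes an
                          -- exclusive lock and does not preserve row order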
> Huh? A plain vacuum wouldn't fix that; a vacuum full would close up the
> hole, but (a) it'd not preserve the row ordering, and (b) it'd take an
> exclusive lock.

OK.

-- 
Scott Ribe
scott_ribe@killerbytes.com
http://www.killerbytes.com/
(303) 722-0567 voice
Scott Ribe wrote:
>> Wouldn't new / updated tuples just get put in the hole, fairly rapidly
>> un-clustering the table again?
>
> How is that different than putting them in newly-allocated space at the end
> of the table?

It isn't, I just wasn't thinking straight.

This is probably a stupid idea as I'm fairly clueless about Pg's innards, but I find myself wondering about doing a non-exclusive cluster in small progressive chunks in a series of short transactions.

In each transaction, tuples from a smallish contiguous chunk of pages (beginning at the start of the table and moving onward as the cluster progresses) are copied to free or newly allocated space later in the table and given the xid of the transaction. The old versions are marked as having been obsoleted by this transaction. The transaction then commits. This step is much like a standalone UPDATE that's not actually changing the field values and that has some weird tuple selection criteria.

Progressive clustering then waits until the last transaction that could see the old copies of the tuples that were just relocated finishes. When the space that was occupied by the moved tuples becomes reclaimable (i.e. vacuum could mark it as free if it were to run) progressive clustering resumes.

A new transaction is begun. It looks up tuples via the index being clustered on, in order, and reads enough of them into RAM to fill the space just freed. It then writes those tuples into the freed space (without ever actually marking it as free, so no other transactions could ever write to it), giving them the xid of the writing transaction. The old copies are marked as obsoleted by this transaction and the transaction commits. Again, it's like an UPDATE, but combined with a small selective vacuum that never bothers to update the free space map because it instantly reuses the space.

Now a small chunk of the table, at the start, is in order. The process can now begin again with the next chunk of unordered pages. It never needs to create a huge hole in the table or expand the table massively. It can even space the data out if a fillfactor has been set, by leaving gaps as it writes the replacement data into the freed chunks and updating the free space map.

The progressive cluster would also need to be able to run a periodic vacuum (or do the equivalent internally) with a restriction that prevented it from examining pages in the range the progressive cluster was currently trying to free. Otherwise all the space freed by old versions of rows that're now neatly ordered in the table wouldn't be freed and couldn't be used as scratch space.

In my ignorance of Pg's innards it seems like all this should work, though it'd basically never finish if there were long running transactions touching the table being clustered. The other potential problem I wonder about is bloating the indexes, since every record in the table is being moved twice over the course of the progressive cluster operation.

Oh: there's actually only a need for one transaction per step:

  begin transaction
  move ordered tuples into just-freed hole
  move tuples from next block of pages to free space later in table
  commit
  wait for all transactions that can see the old versions to complete
  repeat

So ... is this crazy? Concurrently clustering the table by moving each record *twice*, in batches, with pauses to allow old versions to cease being visible by any live transaction? Or can it actually work?

-- 
Craig Ringer
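A crude user-level caricature of one such step, for concreteness. It cannot control where the new tuple versions land (which is the whole point of the real proposal), but it shows the move / commit / wait / vacuum rhythm; the table, column, and key range are hypothetical:

    BEGIN;
    -- "move" one chunk: a no-op update writes new row versions elsewhere
    -- in the heap and marks the old ones obsoleted by this transaction
    UPDATE mytable SET k = k WHERE k BETWEEN 0 AND 999;
    COMMIT;

    -- ... once no transaction can still see the old versions ...
    VACUUM mytable;  -- the old versions become reusable free space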
Craig Ringer <craig@postnewspapers.com.au> writes:
> So ... is this crazy? Concurrently clustering the table by moving each
> record *twice*, in batches, with pauses to allow old versions to cease
> being visible by any live transaction? Or can it actually work?

It seems to me you'd have a pretty horrid bloat problem: at completion, the table and all its indexes must be at least twice the minimum size, and it's unlikely that you'd be able to compact them much afterward. (If any other insertions were in progress they'd have probably ended up near the end of the enlarged table, so just truncating is unlikely to work for the table --- and for the indexes it's a lost cause entirely.)

Now admittedly this is no worse than what you get after a full-table UPDATE, but for a maintenance operation that is supposed to be improving efficiency, it doesn't seem like a pleasant property.

			regards, tom lane
Tom Lane wrote:
> Craig Ringer <craig@postnewspapers.com.au> writes:
>> So ... is this crazy? Concurrently clustering the table by moving each
>> record *twice*, in batches, with pauses to allow old versions to cease
>> being visible by any live transaction? Or can it actually work?
>
> It seems to me you'd have a pretty horrid bloat problem: at completion,
> the table and all its indexes must be at least twice the minimum size,

I was hoping that wouldn't be the case for the table itself, though I expected the indexes would be pretty messed up and need their own cleanup process. I'll try to explain my thinking, as I'm curious about what I've misunderstood.

The documentation on VACUUM (well, on pg_freespacemap actually) suggests that vacuum can recover the free space within partially filled pages, not just wholly empty ones. Is that right?

In this theoretical progressive cluster, tuples are moved in chunks from their original locations to free space later in the table (probably newly allocated at the end). However, if vacuum can recover partial pages, shouldn't it be marking some of the scratch space and originally allocated space as free (thus permitting its reuse as scratch space) as tuples are picked out and moved back to the start of the table in order?

Guesswork: if the initial order of tuples in the table before clustering starts is completely random, then early on the table would expand by a full progressive-cluster chunk size each step, because the pages being moved back to the start would be from all over the place and the space being freed would be very scattered and hard to reclaim. Later on, though, less new space would have to be allocated, because more and more of the space allocated earlier to hold moved tuples would be being freed up in useful chunks that could be reused. That'd also permit inserts and updates unrelated to the ongoing progressive clustering process to be written inside already allocated space rather than being appended to the table after all the progressive clustering scratch space.

So, if I understand vacuum's reclaiming correctly, then even starting off with a completely random record order it should expand the table somewhat for scratch space, but to less than double its original size. How much less, and how much could be truncated at the end, would depend on how good vacuum is at finding small holes to shove new/moved tuples into, and how similar the tuple sizes are. Right?

That assumes that the initial ordering of tuples is in fact random. If you're re-clustering a table it's probably far from random, and many of the tuples will already be in roughly the right areas. That should permit a much smaller allocation of scratch space, since much more of the data from the scratch area will be copied back and marked as free for reuse (for new unrelated inserts or for more moved tuples) within the next few steps of the progressive cluster. Especially if there's a non-100% fillfactor, it should also be possible to truncate most of the newly allocated space at the end, as new inserts can be put in sensible places in the table rather than at the end.

Now, even if that's right, the indexes will presumably be in an awful state. I've noticed that PostgreSQL has `CREATE INDEX CONCURRENTLY' but not `REINDEX CONCURRENTLY'. That's not too surprising, as nobody's trying to use an index while you're initially creating it. If there's no way to clean up the indexes after an operation like this then it's probably not worth it ...
so, is there any way to clean up / optimize an index that doesn't require a long exclusive lock? A REINDEX CONCURRENTLY equivalent?

-- 
Craig Ringer
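On the REINDEX CONCURRENTLY question: a commonly used workaround where it isn't available is to build a replacement index without blocking writes and then swap it in. Names here are hypothetical, and this doesn't work as-is for indexes backing UNIQUE or PRIMARY KEY constraints:

    CREATE INDEX CONCURRENTLY mytable_k_new ON mytable (k);
    DROP INDEX mytable_k_idx;   -- takes only a brief exclusive lock
    ALTER INDEX mytable_k_new RENAME TO mytable_k_idx;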
Craig Ringer <craig@postnewspapers.com.au> writes:
> Later on, though, less new space would have to be allocated because more
> and more of the space allocated earlier to hold moved tuples would be
> being freed up in useful chunks that could be reused.

I don't see how that works. If the minimum size of the table is X pages, ISTM that the first pass has to push everything up to pages above X. You can't put any temporary copies in pages <= X because you might need that space when it comes time to make the clustering happen. So the table is going to bloat to (at least) 2X pages. The extra pages will be *mostly* empty when you're done, but probably not *entirely* empty if there have been concurrent insertions --- and you'll never be able to clean them out without taking exclusive lock.

If you could accurately predict a tuple's final position, you could maybe get away with putting it temporarily in a page above that one but still less than X. I don't see how you do that though, especially not in the face of concurrent insertions. (In fact, given concurrent insertions I don't even see how to know what X is.)

			regards, tom lane
Tom Lane wrote:
> Craig Ringer <craig@postnewspapers.com.au> writes:
>> Later on, though, less new space would have to be allocated because more
>> and more of the space allocated earlier to hold moved tuples would be
>> being freed up in useful chunks that could be reused.
>
> I don't see how that works. If the minimum size of the table is X
> pages, ISTM that the first pass has to push everything up to pages above
> X.

What I was suggesting was to essentially cluster the table in small parts. Rather than two huge passes (move all tuples to free / newly allocated space at end of table; move back into old locations in order) it'd be done in a series of smaller moves. After moving a chunk out of the way and into free/new space at the end of the table, data would be copied from later in the table into the freed space. That space could then be re-used to hold data from the next chunk that needs to be moved out of the way.

I'm just trying to understand if it can actually work. Sorry if my attempted explanations are unclear; I'm probably doing a terrible job, and it's probably actually a stupid idea anyway (if nothing else it might just be too slow). Nonetheless I'm curious.

Maybe I can explain another way. Visually:

`0' to `9' : tuples; desired eventual cluster order is face value
`.'        : dead/obsoleted tuple not yet marked reusable by VACUUM
` '        : free space

Initial state:

---------
584736120
---------

Begin a transaction and free the first chunk (2 tuples in this case, but obviously many more in a real case):

-----------
..473612058
-----------

Use that freed space to store the first ordered tuples:

-----------
014736.2.58
-----------

Commit, and when the last transaction to which the "." tuples above are still visible completes, mark them as free with VACUUM or similar:

-----------
014736 2 58
-----------

Repeat, but now use the freed space to store the next set of tuples that must be moved, rather than extending the table:

-----------
01..3642758
-----------

-----------
0123.64.758
-----------

-----------
0123 64 758
-----------

During the next pass someone inserts `9' after tuples have been moved to make a hole and others have been moved into the hole, but before the old locations of the moved tuples are marked as free:

-----------
0123 .46758
-----------

-----------
012345.67.8
-----------

------------
012345.67.89  <- INSERT 9
------------

------------
012345 67 89
------------

You'd never land up with this sort of convenient ordering half way through in a real case with realistic numbers of tuples, so it'd keep going, doing small chunks each time, until the whole table had been processed. So, the table does grow, and its final state does contain dead space at the end, but not *too* much of it:

------------
0123456789
------------

If "low" values are inserted late in the progressive cluster they can just stay at the end of the table. They'll get moved if the user runs a progressive cluster operation again later. However, since you're ideally doing this with a non-100% fillfactor, they should land up in roughly the right place on initial insert rather than at the end of the table, avoiding the issue (and helping avoid having newly inserted tuples prevent table truncation by vacuum when the progressive clustering finishes).
Say a table containing the range 1-9 was being clustered with a 50% fillfactor and was about half way through the process:

-------------
1 2 3 47986
-------------

and someone inserts `0'; it should ideally land up in the right place anyway (as a bit of reading suggests that insert tries to respect cluster order):

-------------
01 2 3 47986
-------------

If you're re-clustering a table with a fillfactor set you may not need to extend it at all, because you can use already allocated free space to store tuples temporarily while moving them around.

> You can't put any temporary copies in pages <= X because you might
> need that space when it comes time to make the clustering happen. So
> the table is going to bloat to (at least) 2X pages.

Yes, if you perform the whole process in a single huge chunk. What I was hoping was that it was possible to do it in a series of smaller operations instead, avoiding the need for such a huge amount of bloat at the end of the table.

Say you're clustering in 10 steps. You need X*0.1 worth of scratch space, created by extending the table if there's not already appropriate free space late in the table. Assuming you can reclaim all the space you've used for temporary copies of tuples after each pass, your ideal final table size is X*1.1 + (space used by new inserts), of which X*0.1 is free space at the end of the table. That free space probably has scattered rows from concurrent inserts, but if you're using a non-100% fillfactor it might not.

It seems like it should work, *if* space used for temporary storage can be efficiently reclaimed for reuse (or new inserts) after each pass. Even if it can work, though, I don't know if it'd actually be useful in the face of the effect on indexes and because of how slow it might land up being.

-- 
Craig Ringer
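Whether the scratch space really is being reclaimed and reused between passes could be watched with the pg_freespacemap contrib module; this sketch uses the pg_freespace() interface from later releases, with a hypothetical table name:

    CREATE EXTENSION pg_freespacemap;
    -- nonzero "avail" on pages inside the table body means vacuum has
    -- made the freed holes available for reuse
    SELECT blkno, avail FROM pg_freespace('mytable') WHERE avail > 0;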
Craig Ringer <craig@postnewspapers.com.au> writes:
> Begin a transaction and free the first chunk (2 tuples in this case, but
> obviously many more in a real case):
> -----------
> ..473612058
> -----------

> Use that freed space to store the first ordered tuples:
> -----------
> 014736.2.58
> -----------

> Commit, and when the last transaction to which the "." tuples above are
> still visible completes, mark them as free with VACUUM or similar:
> -----------
> 014736 2 58
> -----------

Oh, the problem is that you're misexplaining this. You can't do it like that: you can't overwrite the moved-up "." tuples until after they aren't visible to any other transaction. So you need two transactions to do the above. I'm not sure if you need two "wait for all others" or just one --- it's not clear to me what's the urgency of clearing out the moved-down tuples after they're moved down. (You *would* want to vacuum every so often, but this is to reclaim index space, not so much heap space, because you'll reuse that anyway.)

Anyway I think the main practical problem would be with deadlocks against other transactions trying to update/delete tuples at the same times you need to move them. Dealing with uncommitted insertions would be tricky too --- I think you'd need to wait out the inserting transaction, which would add more possibilities of deadlock.

			regards, tom lane
Tom Lane wrote:
> Anyway I think the main practical problem would be with deadlocks
> against other transactions trying to update/delete tuples at the same
> times you need to move them. Dealing with uncommitted insertions would
> be tricky too --- I think you'd need to wait out the inserting
> transaction, which would add more possibilities of deadlock.

I really appreciate your taking the time to think about and explain this. It's very helpful, as I'm trying to understand some of the basics of PostgreSQL's underlying operation.

I'd completely missed thinking about uncommitted inserts - I never normally need to think about them, so they just didn't cross my mind. I guess it'd either have to do the equivalent of a SELECT FOR UPDATE NOWAIT on all tuples in the pages to be freed before doing anything else, or would have to take out an EXCLUSIVE table lock while freeing a chunk of pages.

I can also vaguely see how problems would arise with concurrent multi-tuple updates grabbing locks in a different order to the progressive cluster and deadlocking, and again I hadn't even thought about that. I guess it might be OK if the progressive cluster attempted to get row exclusive locks on all tuples in the contiguous range of pages to be freed, and if it failed to get even one, released them all and retried that whole step. It sounds like it could be slow and inefficient, though, possibly so much so as to defeat the point of the clustering operation in the first place.

Thanks again for taking the time to go over that - it's extremely helpful and much appreciated.

-- 
Craig Ringer
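A sketch of that lock-or-retry idea, using row locks that fail fast instead of queueing; names are hypothetical, and the key range stands in for "all tuples in the pages to be freed":

    BEGIN;
    -- NOWAIT raises an error immediately if any row is already locked,
    -- so this transaction can be rolled back and retried rather than
    -- sitting in a lock queue and risking deadlock
    SELECT 1 FROM mytable WHERE k BETWEEN 0 AND 999 FOR UPDATE NOWAIT;
    -- ... move the tuples in the chunk here ...
    COMMIT;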