Thread: clustering without locking
An implementation of clustering without locking would start by comparing the index to the table from the beginning to find the first mismatch. Rows before the mismatch are fine, and can be left alone. From here on, go through the index and rewrite each row in order. This will put the rows at the end of the table in cluster order. When done, vacuum the table. This will result in a clustered table without any locking needed. Those few records that were updated while clustering was happening will be out of order, but that should only be a few.

So, could this work? I could really use clustering without locking.
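For illustration, the "compare the index to the table" step can be sketched at the SQL level by walking the table in index order and watching the physical position (ctid): the heap matches the index for exactly as long as ctid keeps increasing. The table mytable and its cluster key k are hypothetical names:

    -- rows are physically in cluster order for as long as ctid
    -- increases monotonically down this output; the first decrease
    -- marks the first mismatch
    SELECT ctid, k
    FROM mytable
    ORDER BY k;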
On Thu, May 01, 2008 at 05:12:52PM -0700, fschmidt wrote:
> An implementation of clustering without locking would start by comparing the
> index to the table from the beginning to find the first mismatch. Rows
> before the mismatch are fine, and can be left alone. From here on, go
> through the index and rewrite each row in order. This will put the rows at
> the end of the table in cluster order. When done, vacuum the table. This
> will result in a clustered table without any locking needed. Those few
> records that were updated while clustering was happening will be out of
> order, but that should only be a few.

Huh? If I'm understanding you correctly you'll end up with rows in order, but with a really big hole in the middle of the table. I'm not sure if that qualifies as "clustered".

> So, could this work? I could really use clustering without locking.

Nice idea, but I don't think it's going to work.

Have a nice day,
-- 
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.
On Fri, May 2, 2008 at 9:12 AM, fschmidt <fschmidt@gmail.com> wrote:
> An implementation of clustering without locking would start by comparing the
> index to the table from the beginning to find the first mismatch. Rows
> before the mismatch are fine, and can be left alone. From here on, go
> through the index and rewrite each row in order. This will put the rows at
> the end of the table in cluster order. When done, vacuum the table. This
> will result in a clustered table without any locking needed. Those few
> records that were updated while clustering was happening will be out of
> order, but that should only be a few.

In your proposal, a large number of dead tuples can be generated in both the table and the index. This is a serious problem that spoils the effect of clustering.

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
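The dead-tuple cost described here is at least observable from the statistics views; a quick check, with a hypothetical table name:

    -- dead row versions accumulated by the rewrite pass
    SELECT n_live_tup, n_dead_tup
    FROM pg_stat_user_tables
    WHERE relname = 'mytable';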
> Huh? If I'm understanding you correctly you'll end up with rows in
> order, but with a really big hole in the middle of the table. I'm not
> sure if that qualifies as "clustered".

That's why he said vacuum when done. Anyway, I'm not sure that a big *contiguous* hole in the middle of the table would matter as much for queries, because most rows would still be close to each other--most queries would pull from one side or the other of the hole, and even for those that didn't, it would be one seek across the hole, not seeking all over the place?

-- 
Scott Ribe
scott_ribe@killerbytes.com
http://www.killerbytes.com/
(303) 722-0567 voice
Scott Ribe wrote:
>> Huh? If I'm understanding you correctly you'll end up with rows in
>> order, but with a really big hole in the middle of the table. I'm not
>> sure if that qualifies as "clustered".
>
> That's why he said vacuum when done. Anyway, I'm not sure that a big
> *contiguous* hole in the middle of the table would matter as much for
> queries, because most rows would still be close to each other--most queries
> would pull from one side or the other of the hole, and even for those that
> didn't, it would be one seek across the hole, not seeking all over the
> place?

Wouldn't new / updated tuples just get put in the hole, fairly rapidly un-clustering the table again?

I guess you could also have a fillfactor to pad out the newly clustered data and just accept huge disk space use. When you ran the lockless cluster again it could also partially fill the hole.

-- 
Craig Ringer
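The fillfactor mentioned here is an existing storage parameter; a minimal sketch, with a hypothetical table name:

    -- leave 30% of every newly written heap page free; only pages
    -- written after this (e.g. by a rewrite) honour the setting
    ALTER TABLE mytable SET (fillfactor = 70);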
> Wouldn't new / updated tuples just get put in the hole, fairly rapidly
> un-clustering the table again?

How is that different than putting them in newly-allocated space at the end of the table?

-- 
Scott Ribe
scott_ribe@killerbytes.com
http://www.killerbytes.com/
(303) 722-0567 voice
Scott Ribe <scott_ribe@killerbytes.com> writes:
>> Huh? If I'm understanding you correctly you'll end up with rows in
>> order, but with a really big hole in the middle of the table. I'm not
>> sure if that qualifies as "clustered".

> That's why he said vacuum when done.

Huh? A plain vacuum wouldn't fix that; a vacuum full would close up the hole, but (a) it'd not preserve the row ordering, and (b) it'd take an exclusive lock.

			regards, tom lane
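To spell out the distinction being drawn here, with a hypothetical table name:

    VACUUM mytable;       -- marks dead row versions reusable, but does not
                          -- move live rows or close up holes in the file
    VACUUM FULL mytable;  -- compacts the table, but (as Tom notes) takes an
                          -- exclusive lock and does not preserve row order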
> Huh? A plain vacuum wouldn't fix that; a vacuum full would close up the
> hole, but (a) it'd not preserve the row ordering, and (b) it'd take an
> exclusive lock.

OK.

-- 
Scott Ribe
scott_ribe@killerbytes.com
http://www.killerbytes.com/
(303) 722-0567 voice
Scott Ribe wrote:
>> Wouldn't new / updated tuples just get put in the hole, fairly rapidly
>> un-clustering the table again?
>
> How is that different than putting them in newly-allocated space at the end
> of the table?

It isn't, I just wasn't thinking straight.

This is probably a stupid idea as I'm fairly clueless about Pg's innards, but I find myself wondering about doing a non-exclusive cluster in small progressive chunks in a series of short transactions.

In each transaction, tuples from a smallish contiguous chunk of pages (beginning at the start of the table and moving onward as the cluster progresses) are copied to free or newly allocated space later in the table and given the xid of the transaction. The old versions are marked as having been obsoleted by this transaction. The transaction then commits. This step is much like a standalone UPDATE that's not actually changing the field values and that has some weird tuple selection criteria.

Progressive clustering then waits until the last transaction that could see the old copies of the tuples that were just relocated finishes. When the space that was occupied by the moved tuples becomes reclaimable (i.e. vacuum could mark it as free if it were to run) progressive clustering resumes.

A new transaction is begun. It looks up tuples via the index being clustered on, in order, and reads enough of them into RAM to fill the space just freed. It then writes those tuples into the freed space (without ever actually marking it as free, so no other transactions could ever write to it), giving them the xid of the writing transaction. The old copies are marked as obsoleted by this transaction and the transaction commits. Again, it's like an UPDATE, but combined with a small selective vacuum that never bothers to update the free space map because it instantly reuses the space.

Now a small chunk of the table, at the start, is in order. The process can now begin again with the next chunk of unordered pages. It never needs to create a huge hole in the table or expand the table massively. It can even space the data out if a fillfactor has been set, by leaving gaps as it writes the replacement data into the freed chunks and updating the free space map.

The progressive cluster would also need to be able to run a periodic vacuum (or do the equivalent internally) with a restriction that prevented it from examining pages in the range the progressive cluster was currently trying to free. Otherwise all the space freed by old versions of rows that're now neatly ordered in the table wouldn't be freed and couldn't be used as scratch space.

In my ignorance of Pg's innards it seems like all this should work, though it'd basically never finish if there were long running transactions touching the table being clustered. The other potential problem I wonder about is bloating the indexes, since every record in the table is being moved twice over the course of the progressive cluster operation.

Oh: there's actually only a need for one transaction per step:

  begin transaction
  move ordered tuples into just-freed hole
  move tuples from next block of pages to free space later in table
  commit
  wait for all transactions that can see the old versions to complete
  repeat

So ... is this crazy? Concurrently clustering the table by moving each record *twice*, in batches, with pauses to allow old versions to cease being visible by any live transaction? Or can it actually work?

-- 
Craig Ringer
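A crude user-level caricature of one such step, for concreteness. It cannot control where the new tuple versions land (which is the whole point of the real proposal), but it shows the move / commit / wait / vacuum rhythm; the table, column, and key range are hypothetical:

    BEGIN;
    -- "move" one chunk: a no-op update writes new row versions elsewhere
    -- in the heap and marks the old ones obsoleted by this transaction
    UPDATE mytable SET k = k WHERE k BETWEEN 0 AND 999;
    COMMIT;

    -- ... once no transaction can still see the old versions ...
    VACUUM mytable;  -- the old versions become reusable free space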
Craig Ringer <craig@postnewspapers.com.au> writes:
> So ... is this crazy? Concurrently clustering the table by moving each
> record *twice*, in batches, with pauses to allow old versions to cease
> being visible by any live transaction? Or can it actually work?

It seems to me you'd have a pretty horrid bloat problem: at completion, the table and all its indexes must be at least twice the minimum size, and it's unlikely that you'd be able to compact them much afterward. (If any other insertions were in progress they'd have probably ended up near the end of the enlarged table, so just truncating is unlikely to work for the table --- and for the indexes it's a lost cause entirely.)

Now admittedly this is no worse than what you get after a full-table UPDATE, but for a maintenance operation that is supposed to be improving efficiency, it doesn't seem like a pleasant property.

			regards, tom lane
Tom Lane wrote:
> Craig Ringer <craig@postnewspapers.com.au> writes:
>> So ... is this crazy? Concurrently clustering the table by moving each
>> record *twice*, in batches, with pauses to allow old versions to cease
>> being visible by any live transaction? Or can it actually work?
>
> It seems to me you'd have a pretty horrid bloat problem: at completion,
> the table and all its indexes must be at least twice the minimum size,

I was hoping that wouldn't be the case for the table itself, though I expected the indexes would be pretty messed up and need their own cleanup process. I'll try to explain my thinking, as I'm curious about what I've misunderstood.

The documentation on VACUUM (well, on pg_freespacemap actually) suggests that vacuum can recover the free space within partially filled pages, not just wholly empty ones. Is that right?

In this theoretical progressive cluster, tuples are moved in chunks from their original locations to free space later in the table (probably newly allocated at the end). However, if vacuum can recover partial pages, shouldn't it be marking some of the scratch space and originally allocated space as free (thus permitting its reuse as scratch space) as tuples are picked out and moved back to the start of the table in order?

Guesswork: if the initial order of tuples in the table before clustering starts is completely random, then early on the table would expand by a full progressive-cluster chunk size each step, because the pages being moved back to the start would be from all over the place and the space being freed would be very scattered and hard to reclaim. Later on, though, less new space would have to be allocated, because more and more of the space allocated earlier to hold moved tuples would be being freed up in useful chunks that could be reused. That'd also permit inserts and updates unrelated to the ongoing progressive clustering process to be written inside already allocated space rather than being appended to the table after all the progressive clustering scratch space.

So, if I understand vacuum's reclaiming correctly, then even starting off with a completely random record order it should expand the table somewhat for scratch space, but to less than double its original size. How much less, and how much could be truncated at the end, would depend on how good vacuum is at finding small holes to shove new/moved tuples into, and how similar the tuple sizes are. Right?

That assumes that the initial ordering of tuples is in fact random. If you're re-clustering a table it's probably far from random, and many of the tuples will already be in roughly the right areas. That should permit a much smaller allocation of scratch space, since much more of the data from the scratch area will be copied back and marked as free for reuse (for new unrelated inserts or for more moved tuples) within the next few steps of the progressive cluster. Especially if there's a non-100% fillfactor, it should also be possible to truncate most of the newly allocated space at the end, as new inserts can be put in sensible places in the table rather than at the end.

Now, even if that's right, the indexes will presumably be in an awful state. I've noticed that PostgreSQL has `CREATE INDEX CONCURRENTLY' but not `REINDEX CONCURRENTLY'. That's not too surprising, as nobody's trying to use an index while you're initially creating it. If there's no way to clean up the indexes after an operation like this then it's probably not worth it ...
so, is there any way to clean up / optimize an index that doesn't require a long exclusive lock? A REINDEX CONCURRENTLY equivalent?

-- 
Craig Ringer
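On the REINDEX CONCURRENTLY question: a commonly used workaround where it isn't available is to build a replacement index without blocking writes and then swap it in. Names here are hypothetical, and this doesn't work as-is for indexes backing UNIQUE or PRIMARY KEY constraints:

    CREATE INDEX CONCURRENTLY mytable_k_new ON mytable (k);
    DROP INDEX mytable_k_idx;   -- takes only a brief exclusive lock
    ALTER INDEX mytable_k_new RENAME TO mytable_k_idx;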
Craig Ringer <craig@postnewspapers.com.au> writes:
> Later on, though, less new space would have to be allocated because more
> and more of the space allocated earlier to hold moved tuples would be
> being freed up in useful chunks that could be reused.

I don't see how that works. If the minimum size of the table is X pages, ISTM that the first pass has to push everything up to pages above X. You can't put any temporary copies in pages <= X because you might need that space when it comes time to make the clustering happen. So the table is going to bloat to (at least) 2X pages. The extra pages will be *mostly* empty when you're done, but probably not *entirely* empty if there have been concurrent insertions --- and you'll never be able to clean them out without taking exclusive lock.

If you could accurately predict a tuple's final position, you could maybe get away with putting it temporarily in a page above that one but still less than X. I don't see how you do that though, especially not in the face of concurrent insertions. (In fact, given concurrent insertions I don't even see how to know what X is.)

			regards, tom lane
Tom Lane wrote:
> Craig Ringer <craig@postnewspapers.com.au> writes:
>> Later on, though, less new space would have to be allocated because more
>> and more of the space allocated earlier to hold moved tuples would be
>> being freed up in useful chunks that could be reused.
>
> I don't see how that works. If the minimum size of the table is X
> pages, ISTM that the first pass has to push everything up to pages above
> X.

What I was suggesting was to essentially cluster the table in small parts. Rather than two huge passes (move all tuples to free / newly allocated space at end of table; move back into old locations in order) it'd be done in a series of smaller moves. After moving a chunk out of the way and into free/new space at the end of the table, data would be copied from later in the table into the freed space. That space could then be re-used to hold data from the next chunk that needs to be moved out of the way.

I'm just trying to understand if it can actually work. Sorry if my attempted explanations are unclear; I'm probably doing a terrible job, and it's probably actually a stupid idea anyway (if nothing else it might just be too slow). Nonetheless I'm curious.

Maybe I can explain another way. Visually:

`0' to `9' : tuples; desired eventual cluster order is face value
`.'        : dead/obsoleted tuple not yet marked reusable by VACUUM
` '        : free space

Initial state:

---------
584736120
---------

Begin a transaction and free the first chunk (2 tuples in this case, but obviously many more in a real case):

-----------
..473612058
-----------

Use that freed space to store the first ordered tuples:

-----------
014736.2.58
-----------

Commit, and when the last transaction to which the "." tuples above are still visible completes, mark them as free with VACUUM or similar:

-----------
014736 2 58
-----------

Repeat, but now use the freed space to store the next set of tuples that must be moved, rather than extending the table:

-----------
01..3642758
-----------

-----------
0123.64.758
-----------

-----------
0123 64 758
-----------

During the next pass someone inserts `9' after tuples have been moved to make a hole and others have been moved into the hole, but before the old locations of the moved tuples are marked as free:

-----------
0123 .46758
-----------

-----------
012345.67.8
-----------

------------
012345.67.89  <- INSERT 9
------------

------------
012345 67 89
------------

You'd never land up with this sort of convenient ordering half way through in a real case with realistic numbers of tuples, so it'd keep going, doing small chunks each time, until the whole table had been processed. So, the table does grow, and its final state does contain dead space at the end, but not *too* much of it:

------------
0123456789
------------

If "low" values are inserted late in the progressive cluster they can just stay at the end of the table. They'll get moved if the user runs a progressive cluster operation again later. However, since you're ideally doing this with a non-100% fillfactor, they should land up in roughly the right place on initial insert rather than at the end of the table, avoiding the issue (and helping avoid having newly inserted tuples prevent table truncation by vacuum when the progressive clustering finishes).
Say a table containing the range 1-9 was being clustered with a 50% fillfactor and was about half way through the process:

-------------
1 2 3 47986
-------------

and someone inserts `0'; it should ideally land up in the right place anyway (as a bit of reading suggests that insert tries to respect cluster order):

-------------
01 2 3 47986
-------------

If you're re-clustering a table with a fillfactor set you may not need to extend it at all, because you can use already allocated free space to store tuples temporarily while moving them around.

> You can't put any temporary copies in pages <= X because you might
> need that space when it comes time to make the clustering happen. So
> the table is going to bloat to (at least) 2X pages.

Yes, if you perform the whole process in a single huge chunk. What I was hoping was that it was possible to do it in a series of smaller operations instead, avoiding the need for such a huge amount of bloat at the end of the table.

Say you're clustering in 10 steps. You need X*0.1 worth of scratch space, created by extending the table if there's not already appropriate free space late in the table. Assuming you can reclaim all the space you've used for temporary copies of tuples after each pass, your ideal final table size is X*1.1 + (space used by new inserts), of which X*0.1 is free space at the end of the table. That free space probably has scattered rows from concurrent inserts, but if you're using a non-100% fillfactor it might not.

It seems like it should work, *if* space used for temporary storage can be efficiently reclaimed for reuse (or new inserts) after each pass. Even if it can work, though, I don't know if it'd actually be useful in the face of the effect on indexes and because of how slow it might land up being.

-- 
Craig Ringer
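Whether the scratch space really is being reclaimed and reused between passes could be watched with the pg_freespacemap contrib module; this sketch uses the pg_freespace() interface from later releases, with a hypothetical table name:

    CREATE EXTENSION pg_freespacemap;
    -- nonzero "avail" on pages inside the table body means vacuum has
    -- made the freed holes available for reuse
    SELECT blkno, avail FROM pg_freespace('mytable') WHERE avail > 0;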
Craig Ringer <craig@postnewspapers.com.au> writes:
> Begin a transaction and free the first chunk (2 tuples in this case, but
> obviously many more in a real case):
> -----------
> ..473612058
> -----------

> Use that freed space to store the first ordered tuples:
> -----------
> 014736.2.58
> -----------

> Commit, and when the last transaction to which the "." tuples above are
> still visible completes, mark them as free with VACUUM or similar:
> -----------
> 014736 2 58
> -----------

Oh, the problem is that you're misexplaining this. You can't do it like that: you can't overwrite the moved-up "." tuples until after they aren't visible to any other transaction. So you need two transactions to do the above. I'm not sure if you need two "wait for all others" or just one --- it's not clear to me what's the urgency of clearing out the moved-down tuples after they're moved down. (You *would* want to vacuum every so often, but this is to reclaim index space, not so much heap space, because you'll reuse that anyway.)

Anyway I think the main practical problem would be with deadlocks against other transactions trying to update/delete tuples at the same times you need to move them. Dealing with uncommitted insertions would be tricky too --- I think you'd need to wait out the inserting transaction, which would add more possibilities of deadlock.

			regards, tom lane
Tom Lane wrote:
> Anyway I think the main practical problem would be with deadlocks
> against other transactions trying to update/delete tuples at the same
> times you need to move them. Dealing with uncommitted insertions would
> be tricky too --- I think you'd need to wait out the inserting
> transaction, which would add more possibilities of deadlock.

I really appreciate your taking the time to think about and explain this. It's very helpful, as I'm trying to understand some of the basics of PostgreSQL's underlying operation.

I'd completely missed thinking about uncommitted inserts - I never normally need to think about them, so they just didn't cross my mind. I guess it'd either have to do the equivalent of a SELECT FOR UPDATE NOWAIT on all tuples in the pages to be freed before doing anything else, or would have to take out an EXCLUSIVE table lock while freeing a chunk of pages.

I can also vaguely see how problems would arise with concurrent multi-tuple updates grabbing locks in a different order to the progressive cluster and deadlocking, and again I hadn't even thought about that. I guess it might be OK if the progressive cluster attempted to get row exclusive locks on all tuples in the contiguous range of pages to be freed, and if it failed to get even one, released them all and retried that whole step. It sounds like it could be slow and inefficient, though, possibly so much so as to defeat the point of the clustering operation in the first place.

Thanks again for taking the time to go over that - it's extremely helpful and much appreciated.

-- 
Craig Ringer
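A sketch of that lock-or-retry idea, using row locks that fail fast instead of queueing; names are hypothetical, and the key range stands in for "all tuples in the pages to be freed":

    BEGIN;
    -- NOWAIT raises an error immediately if any row is already locked,
    -- so this transaction can be rolled back and retried rather than
    -- sitting in a lock queue and risking deadlock
    SELECT 1 FROM mytable WHERE k BETWEEN 0 AND 999 FOR UPDATE NOWAIT;
    -- ... move the tuples in the chunk here ...
    COMMIT;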