Thread: Tricky bugs in concurrent index build
I see fairly nasty problems in the concurrent-index patch when it's trying to
build a unique index.  I think it's solvable but want some more eyeballs on my
reasoning.

Look at the code in IndexBuildHeapScan where we are deciding whether or not to
include a tuple in the index ("indexIt") and also whether or not to include it
in the uniqueness check ("tupleIsAlive").  In a normal non-concurrent build, we
have to include recently-dead tuples in the index because transactions started
before ours might try to use the index after we finish it.  (Remember system
catalogs generally operate on SnapshotNow, so a query could use a newly created
index even though it will be run with a serializable snapshot much older than
the index.)  So we have to put tuples into the index if any active transaction
might wish to see those tuples.  OTOH, we should exclude dead tuples from the
uniqueness check: the uniqueness constraint ought to be across currently-valid
tuples only.  In particular, for tuples previously created or deleted by our
own transaction, we certainly must include created ones and not include
deleted ones in the uniqueness check.

In the past, the only way we could see HEAPTUPLE_INSERT_IN_PROGRESS or
HEAPTUPLE_DELETE_IN_PROGRESS was for tuples created/deleted by our own
transaction, and so the actions taken by IndexBuildHeapScan are to include in
the index in both cases, but exclude DELETE_IN_PROGRESS tuples from the
uniqueness check.

This does not work for a concurrent build, though, because if the in-progress
delete is from another transaction, it could roll back after we look.  In that
case we have an entry that is in the index and has escaped the uniqueness
check.  If it conflicts with another tuple also entered into the index in the
first pass, we'll never notice that.

I think we can solve this by having IndexBuildHeapScan not index
DELETE_IN_PROGRESS tuples if it's doing a concurrent build.  The problem of
old transactions trying to use the index does not exist, because we'll wait
'em out before marking the index valid, so we need not worry about preserving
validity for old snapshots.  And if the deletion does in fact roll back, we'll
insert the tuple during the second pass, and catch any uniqueness violation at
that point.

But wait, there's more: in the patch as it stands, the second pass over the
table ignores DELETE_IN_PROGRESS tuples, which is wrong.  It's entirely
possible for a tuple that is RECENTLY_DEAD or DELETE_IN_PROGRESS to have no
entry in the index, if it was inserted during the first pass, and then the
deletion occurred after the first pass (and either has or hasn't committed
yet).  If we ignore DELETE_IN_PROGRESS and then the deleter rolls back, the
tuple never gets into the index at all.

Furthermore, the patch also tries to insert RECENTLY_DEAD tuples, which is
good for MVCC coverage, but wrong for uniqueness checking --- keep in mind
that in the second pass, we are just doing normal index insertions, and so
anything we insert into the index will be uniqueness-checked as though still
alive.  We could get a uniqueness failure that should not occur, eg from
trying to insert the old version of a recently-updated row.

What I think we can do about this is to include DELETE_IN_PROGRESS tuples in
the set of candidate tuples to insert in the second pass.  During the merge
step that verifies whether the tuple is already in the index, if we find that
it's not, then we must wait for the deleter to commit or roll back.  If the
deleter commits then we ignore the tuple.  If the deleter rolls back then we
have to insert the tuple in the index.  (I think we have to actually take a
FOR UPDATE or possibly FOR SHARE lock on the tuple while we do this, else we
have race conditions against someone else starting a new deletion attempt on
the tuple.)  In the commit case we've essentially waited long enough to
transform DELETE_IN_PROGRESS into RECENTLY_DEAD, and for both of those
statuses we are not going to insert into the index for fear of causing a
false uniqueness violation.

What that means is that after we finish the second pass, we need *another*
wait-for-everyone-to-die before we can mark the index valid.  Otherwise we
have the same MVCC problem that someone might try to use the index for a
query with a snapshot old enough that it should be able to see the
not-inserted tuple.

Have I missed anything?  This is tricky stuff.

In any case it's clear that the patch still needs major work.  Greg, do you
have cycles to spare now?

			regards, tom lane
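The indexIt / tupleIsAlive decision being described, including the proposed
concurrent-build change for DELETE_IN_PROGRESS, boils down to roughly the
following.  This is a simplified standalone sketch: the enum, flag, and
function names are invented for illustration, and it is not the actual
IndexBuildHeapScan code; the "concurrent" branch is the change proposed above.

#include <stdbool.h>

typedef enum TupleStatus            /* stands in for HTSV_Result */
{
	TUPLE_DEAD,
	TUPLE_LIVE,
	TUPLE_RECENTLY_DEAD,
	TUPLE_INSERT_IN_PROGRESS,
	TUPLE_DELETE_IN_PROGRESS
} TupleStatus;

static void
decide_indexing(TupleStatus status, bool concurrent,
				bool *indexIt, bool *tupleIsAlive)
{
	switch (status)
	{
		case TUPLE_DEAD:
			*indexIt = false;		/* nobody can see it any more */
			*tupleIsAlive = false;
			break;
		case TUPLE_LIVE:
			*indexIt = true;
			*tupleIsAlive = true;
			break;
		case TUPLE_RECENTLY_DEAD:
			*indexIt = true;		/* old snapshots may still need it */
			*tupleIsAlive = false;	/* but keep it out of the unique check */
			break;
		case TUPLE_INSERT_IN_PROGRESS:
			*indexIt = true;		/* historically: our own insertion */
			*tupleIsAlive = true;
			break;
		case TUPLE_DELETE_IN_PROGRESS:
			if (concurrent)
				*indexIt = false;	/* proposed: skip it now; if the delete
									 * rolls back, pass 2 inserts it and
									 * unique-checks it then */
			else
				*indexIt = true;	/* non-concurrent: deleter is ourselves */
			*tupleIsAlive = false;
			break;
	}
}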
Tom Lane <tgl@sss.pgh.pa.us> writes:
> I think we can solve this by having IndexBuildHeapScan not index
> DELETE_IN_PROGRESS tuples if it's doing a concurrent build.

Sure

> It's entirely possible for a tuple that is RECENTLY_DEAD or
> DELETE_IN_PROGRESS to have no entry in the index, if it was inserted
> during the first pass, and then the deletion occurred after the first
> pass (and either has or hasn't committed yet).

Egads. That's nasty indeed.

> Furthermore, the patch also tries to insert RECENTLY_DEAD tuples, which is
> good for MVCC coverage, but wrong for uniqueness checking --- keep in mind
> that in the second pass, we are just doing normal index insertions, and so
> anything we insert into the index will be uniqueness-checked as though still
> alive.  We could get a uniqueness failure that should not occur, eg from
> trying to insert the old version of a recently-updated row.

Hm, I hadn't absorbed the purpose of isAlive and the distinction between live
for uniqueness checks and live for index build purposes.

Is it not possible to brute force this adding an AM method to insert without
the uniqueness check?  That would mean the index build would fail even if the
transaction eventually aborts though.  (or even if it has already aborted?)

[ extended description of complex footwork involving more waiting while
  holding locks ]

> Have I missed anything?  This is tricky stuff.

Wow, that seems pretty unsatisfactory, all the waiting and locking sounds
awful.  If you have a lot of update transactions starting continuously you
could keep bumping into this situation and repeatedly have to wait for new
transactions to end.  It also seems like a lot of code :(

> In any case it's clear that the patch still needs major work.
> Greg, do you have cycles to spare now?

I do.  But I'll have to spend some time just rereading the code and your
comments to convince myself that all this waiting and locking is the best
solution.

--
greg
Greg Stark <gsstark@mit.edu> writes:
> Is it not possible to brute force this adding an AM method to insert without
> the uniqueness check?

Hm.  Actually there already is a feature of aminsert to allow suppressing the
unique check, but I'm not sure whether using it for RECENTLY_DEAD tuples
helps.  Seems like we have to wait to see whether DELETE_IN_PROGRESS deleters
commit in any case.

>> Have I missed anything?  This is tricky stuff.

> Wow, that seems pretty unsatisfactory, all the waiting and locking sounds
> awful.

Yeah, I'm very unhappy.  The whole idea may be going down in flames :-(
It's fairly clear that we could support concurrent builds of nonunique
indexes, but is that enough of a use-case to justify it?

			regards, tom lane
>> Wow, that seems pretty unsatisfactory, all the waiting and locking sounds
>> awful.
>
> Yeah, I'm very unhappy.  The whole idea may be going down in flames :-(
> It's fairly clear that we could support concurrent builds of nonunique
> indexes, but is that enough of a use-case to justify it?

I believe there would be.  Most PostgreSQL users I run into, develop in
production, which means being able to add an index they forgot when doing
query analysis.  Most of the time (I would say >95%) this is not a unique
index.

Sincerely,

Joshua D. Drake

--
=== The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive PostgreSQL solutions since 1997
http://www.commandprompt.com/
"Joshua D. Drake" <jd@commandprompt.com> writes: >> It's fairly clear that we could support concurrent builds of nonunique >> indexes, but is that enough of a use-case to justify it? > I believe there would be. Most PostgreSQL users I run into, develop in > production, which means being able to add an index they forgot when > doing query analysis. True, unique constraints are usually something you should get right to start with. But it'll be annoying if we can do everything BUT that :-( regards, tom lane
Tom Lane wrote:
> "Joshua D. Drake" <jd@commandprompt.com> writes:
>>> It's fairly clear that we could support concurrent builds of nonunique
>>> indexes, but is that enough of a use-case to justify it?
>
>> I believe there would be. Most PostgreSQL users I run into, develop in
>> production, which means being able to add an index they forgot when
>> doing query analysis.
>
> True, unique constraints are usually something you should get right to
> start with.  But it'll be annoying if we can do everything BUT that :-(

Agreed, but better than nothing :).

Sincerely,

Joshua D. Drake

--
=== The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive PostgreSQL solutions since 1997
http://www.commandprompt.com/
Tom Lane <tgl@sss.pgh.pa.us> writes:
> Greg Stark <gsstark@mit.edu> writes:
> > Is it not possible to brute force this adding an AM method to insert without
> > the uniqueness check?
>
> Hm.  Actually there already is a feature of aminsert to allow
> suppressing the unique check, but I'm not sure whether using it for
> RECENTLY_DEAD tuples helps.  Seems like we have to wait to see whether
> DELETE_IN_PROGRESS deleters commit in any case.

Hm, actually don't we need both of these to make it work?  We need to see
whether the deleter commits to determine whether to enforce the uniqueness
constraint on the missing tuple.

. If the deleter aborts we need to insert the tuple normally including
  enforcing the constraint.

. If the deleter commits then we still need to insert the tuple but without
  enforcing the constraint in case some old transaction queries the index

What would happen if we just insert DELETE_IN_PROGRESS tuples normally?
Would the only risk be that the index build would fail with a spurious unique
constraint violation?  I suppose it would be pretty common though given how
updates work.

Incidentally does this point out a problem with the planner depending on
unique constraints?  For old (serializable) transactions the unique index
exists and the constraint is enforced but they can still find tuples that
were deleted before the index was built and might violate the unique
constraint.  Even if we had the plan invalidation mechanism that's frequently
mentioned you would still have to check constraints against your snapshot and
not just snapshotnow for planning purposes.

--
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
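In code form, those two cases amount to roughly this.  It is a standalone
sketch with invented helper names -- not an existing executor or AM function:

#include <stdbool.h>

typedef struct Tuple Tuple;		/* opaque heap tuple, for the sketch */

/* assumed helpers, not real PostgreSQL functions */
extern bool wait_for_deleter(Tuple *tup);	/* true if the delete committed */
extern void index_insert(Tuple *tup, bool check_unique);

static void
handle_delete_in_progress(Tuple *tup)
{
	if (wait_for_deleter(tup))
		index_insert(tup, false);	/* deleted: keep it for old snapshots,
									 * but don't enforce uniqueness on it */
	else
		index_insert(tup, true);	/* delete rolled back: tuple is live,
									 * enforce the constraint normally */
}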
Greg Stark <gsstark@mit.edu> writes: > What would happen if we just insert DELETE_IN_PROGRESS tuples normally? Would > the only risk be that the index build would fail with a spurious unique > constraint violation? I suppose it would be pretty common though given how > updates work. Yeah, that's the problem: if we can't support UPDATEs that don't change the to-be-unique column, it ain't much of a feature. > Incidentally does this point out a problem with the planner depending on > unique constraints? Good point. It doesn't depend on them yet, but we've been hoping to make it do so once we have plan invalidation capability. We shall have to think very carefully about timing semantics of all that. regards, tom lane
On Tue, 2006-08-22 at 16:48, Tom Lane wrote:
> "Joshua D. Drake" <jd@commandprompt.com> writes:
> >> It's fairly clear that we could support concurrent builds of nonunique
> >> indexes, but is that enough of a use-case to justify it?
>
> > I believe there would be. Most PostgreSQL users I run into, develop in
> > production, which means being able to add an index they forgot when
> > doing query analysis.
>
> True, unique constraints are usually something you should get right to
> start with.  But it'll be annoying if we can do everything BUT that :-(

Maybe we could find a way to build a non-unique index first and then
convert it to a unique one later, in yet another pass ?

--
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com
On Wed, 2006-08-23 at 11:05, Hannu Krosing wrote:
> On Tue, 2006-08-22 at 16:48, Tom Lane wrote:
> > "Joshua D. Drake" <jd@commandprompt.com> writes:
> > >> It's fairly clear that we could support concurrent builds of nonunique
> > >> indexes, but is that enough of a use-case to justify it?
> >
> > > I believe there would be. Most PostgreSQL users I run into, develop in
> > > production, which means being able to add an index they forgot when
> > > doing query analysis.
> >
> > True, unique constraints are usually something you should get right to
> > start with.  But it'll be annoying if we can do everything BUT that :-(
>
> Maybe we could find a way to build a non-unique index first and then
> convert it to a unique one later, in yet another pass ?

Or even add ALTER INDEX myindex ADD/DROP UNIQUE; command

--
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com
> > Is it not possible to brute force this adding an AM method to insert
> > without the uniqueness check?
>
> Hm.  Actually there already is a feature of aminsert to allow
> suppressing the unique check, but I'm not sure whether using
> it for RECENTLY_DEAD tuples helps.  Seems like we have to
> wait to see whether DELETE_IN_PROGRESS deleters commit in any case.

Um, but if we wait for the DELETE_IN_PROGRESS tuple, after the wait we can
add it either with or without the unique check (depending on commit/abort).
Then at least we don't need to wait in a 3rd pass for readers?

Andreas
Tom Lane <tgl@sss.pgh.pa.us> writes:
> What I think we can do about this is to include DELETE_IN_PROGRESS
> tuples in the set of candidate tuples to insert in the second pass.
> During the merge step that verifies whether the tuple is already
> in the index, if we find that it's not, then we must wait for the
> deleter to commit or roll back.  If the deleter commits then we
> ignore the tuple.  If the deleter rolls back then we have to insert
> the tuple in the index.  (I think we have to actually take a FOR
> UPDATE or possibly FOR SHARE lock on the tuple while we do this,
> else we have race conditions against someone else starting a new
> deletion attempt on the tuple.)

Hm, my first thought was to just try to get a lock on the record which would
inherently wait until the deleter commits or aborts.  But then wouldn't we
have deadlock risks?  If we come across these records in a different order
from someone else (possibly even the deleter) who also wants to lock them?
Or would it be safe to lock and release them one by one so we only ever hold
one lock at a time?

I'm also pondering whether it might be worth saving up all the
DELETE_IN_PROGRESS tuples in a second tuplesort and processing them all in a
third phase.  That seems like it would reduce the amount of waiting that
might be involved.  The fear I have though is that this third phase could
become quite large.

--
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
Tom Lane <tgl@sss.pgh.pa.us> writes:
> I think we can solve this by having IndexBuildHeapScan not index
> DELETE_IN_PROGRESS tuples if it's doing a concurrent build.  The problem
> of old transactions trying to use the index does not exist, because
> we'll wait 'em out before marking the index valid, so we need not
> worry about preserving validity for old snapshots.  And if the deletion
> does in fact roll back, we'll insert the tuple during the second pass,
> and catch any uniqueness violation at that point.

Actually that's a bit of a pain.  This function is called from the AM
functions so it doesn't have any access to whether it's doing a concurrent
build or not.  I would have to stuff a flag in the indexInfo or change the
AM api.

The API is pretty bizarre around this.  The core calls the AM to build the
index, the AM calls back this function in core to do the scan, then this
function calls back into the AM to handle the inserts.  It seems like it
would be simpler to leave the core in charge the whole time.  It would call
an AM method to initialize state, then call an AM method for each tuple that
should be indexed, and lastly call a finalize method.

Also, I think there may be another problem here with INSERT_IN_PROGRESS.
I'm currently testing unique index builds while pgbench is running and I'm
consistently getting unique index violations from phase 1.  I think what's
happening is that inserts that haven't committed yet (and hence ought to be
invisible to us) are hitting unique constraint violations against older
versions that are still alive to us.

--
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
Hannu Krosing <hannu@skype.net> writes:
> On Wed, 2006-08-23 at 11:05, Hannu Krosing wrote:
> >
> > Maybe we could find a way to build a non-unique index first and then
> > convert it to a unique one later, in yet another pass ?
>
> Or even add ALTER INDEX myindex ADD/DROP UNIQUE; command

That would be great.  But note that it suffers from precisely the same
problem.  If you come across a couple of records with the same key and one of
them is DELETE_IN_PROGRESS then you'll have to wait until you can acquire a
sharelock on it before you can determine if there's a constraint violation.

Hmmm.  Or is that true.  The problem may be somewhat easier since at least
you can be sure every tuple in the heap is in the index.  So if you see a
DELETE_IN_PROGRESS either it *was* a constraint violation prior to the delete
and failing is reasonable or it's an update in which case maybe it's possible
to detect that they're part of the same chain?

(Actually there is another corner case: a transaction that inserts a value,
then deletes it in the same transaction, then inserts that same value again.
Now you have an INSERT_IN_PROGRESS and a DELETE_IN_PROGRESS that conflict but
should be allowed since they come from the same transaction.  Hopefully the
ALTER INDEX command would be able to determine they come from the same
transaction.)

In the case of concurrent index builds that's not really safe since you don't
have the other tuples you're conflicting with together at the same time and
even if you did you may or may not have a complete set of them.

Tom's right.  This stuff is tricky.

--
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
Greg Stark <gsstark@mit.edu> writes:
> But then wouldn't we have deadlock risks?  If we come across these records
> in a different order from someone else (possibly even the deleter) who also
> wants to lock them?  Or would it be safe to lock and release them one by one
> so we only ever hold one lock at a time?

AFAICS we could release the lock as soon as we've inserted the index entry.
(Whether there is any infrastructure to do that is another question...)

> I'm also pondering whether it might be worth saving up all the
> DELETE_IN_PROGRESS tuples in a second tuplesort and processing them all in a
> third phase.  That seems like it would reduce the amount of waiting that
> might be involved.  The fear I have though is that this third phase could
> become quite large.

Actually --- a tuple that is live when we do the "second pass" scan could
well be DELETE_IN_PROGRESS (or even RECENTLY_DEAD) by the time we do the
merge and discover that it hasn't got an index entry.  So offhand I'm
thinking that we *must* take a tuple lock on *every* tuple we insert in
stage two.  Ugh.

			regards, tom lane
Greg Stark <gsstark@mit.edu> writes:
> It seems like it would be simpler to leave the core in charge the whole time.
> It would call an AM method to initialize state, then call an AM method for
> each tuple that should be indexed, and lastly call a finalize method.

[ shrug... ]  I'm uninterested in refactoring the AM API right now.  We've
got enough stuff to deal with before beta, not to mention an uncommitted
bitmap AM patch that it would certainly break.

> Also, I think there may be another problem here with INSERT_IN_PROGRESS.
> I'm currently testing unique index builds while pgbench is running and I'm
> consistently getting unique index violations from phase 1.  I think what's
> happening is that inserts that haven't committed yet (and hence ought to be
> invisible to us) are hitting unique constraint violations against older
> versions that are still alive to us.

Hmm ... it's certainly highly likely that we could pick up multiple versions
of a row during pass 1, but the uniqueness checker should notice that some
versions are dead?

Oooh, no there's a problem: the tuple we are inserting could become dead in
the interval between when we pick it up and when we put it into the index.
So we could try to put multiple versions of the same row into the uniqueness
check.

Right at the moment, unique index builds with this mechanism are looking
unfixably broken :-(.  Anyone see any chance at all of making them work?
Maybe we should just cut our losses and go with the nonunique case only.
That one is pretty easy: just stuff everything in the index ...

			regards, tom lane
Greg Stark <gsstark@mit.edu> writes:
> Hmmm. Or is that true. The problem may be somewhat easier since at least you
> can be sure every tuple in the heap is in the index. So if you see a
> DELETE_IN_PROGRESS either it *was* a constraint violation prior to the delete
> and failing is reasonable or it's an update in which case maybe it's possible
> to detect that they're part of the same chain?

Unless we are willing to lock every single tuple while we insert it, this
seems unfixable to me.  Without a lock, the tuple could become
DELETE_IN_PROGRESS immediately after we look at it.

Actually it's worse than that.  We could examine a tuple, see that it's good,
include it in the uniqueness check.  Then someone updates the tuple and puts
the new version near the end of the table.  By the time we reach that
version, it could be committed good.  There is absolutely no way that we
could notice an issue without applying extremely expensive tests to *every*
apparently-good tuple.

[ thinks for a bit... ]  At least, it seems hopeless if we use SnapshotNow.
Does it help if we use a real snapshot?  I'm thinking pass 1 inserts exactly
those tuples that are good according to a snap taken at its beginning, and
then pass 2 considers only tuples that are good according to a snap taken at
*its* beginning.  But having consumed no caffeine yet this morning, I'm not
sure I can spot any flaws that might exist in this idea.

			regards, tom lane
On Wed, 2006-08-23 at 09:01, Tom Lane wrote:
> Greg Stark <gsstark@mit.edu> writes:
> > Hmmm. Or is that true. The problem may be somewhat easier since at least you
> > can be sure every tuple in the heap is in the index. So if you see a
> > DELETE_IN_PROGRESS either it *was* a constraint violation prior to the delete
> > and failing is reasonable or it's an update in which case maybe it's possible
> > to detect that they're part of the same chain?
>
> Unless we are willing to lock every single tuple while we insert it,
> this seems unfixable to me.  Without a lock, the tuple could become
> DELETE_IN_PROGRESS immediately after we look at it.
>
> Actually it's worse than that.  We could examine a tuple, see that
> it's good, include it in the uniqueness check.  Then someone updates
> the tuple and puts the new version near the end of the table.  By
> the time we reach that version, it could be committed good.

Perhaps we should scan the index in index order, and set the unique flag per
index page.  That would a) help us avoid going to heap unless there are
multiple entries for some value and b) enable us, once all index pages
containing pointers to some possibly duplicate value are checked, to release
locks on those tuples.

--
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com
stark <stark@enterprisedb.com> writes:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
>> [ thinks for a bit... ]  At least, it seems hopeless if we use
>> SnapshotNow.  Does it help if we use a real snapshot?  I'm thinking
>> pass 1 inserts exactly those tuples that are good according to a
>> snap taken at its beginning, and then pass 2 considers only tuples
>> that are good according to a snap taken at *its* beginning.  But
>> having consumed no caffeine yet this morning, I'm not sure I can
>> spot any flaws that might exist in this idea.

> What about tuples that are inserted and committed in the window between the
> two phases.  Ie, they're RECENTLY_DEAD but not in phase2's snapshot.

We'd put them in the index but skip uniqueness check.

> Or do you mean we use SatisfiesVacuum to determine what to insert but
> SatisfiesSnapshot to determine whether to check uniqueness?

Right.  The problems seem to all stem from the risk of trying to unique-check
more than one version of a tuple, and using a snap would stop that.  We need
to think through all the cases though and be sure they all work.

			regards, tom lane
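Stated as code, the distinction agreed to here is roughly the following
(a self-contained sketch; the flag and function names are invented): index
whatever the vacuum-style test says some transaction might still need, but
apply the uniqueness check only to what the pass's snapshot actually sees.

#include <stdbool.h>

/*
 * needed_by_someone: result of the HeapTupleSatisfiesVacuum-style test
 *                    (live or recently dead, i.e. some snapshot may need it)
 * visible_in_snap:   visibility under the MVCC snapshot taken at the start
 *                    of the pass
 */
static void
classify_for_build(bool needed_by_someone, bool visible_in_snap,
				   bool *indexIt, bool *checkUnique)
{
	/* index anything some transaction might still want to see */
	*indexIt = needed_by_someone;

	/*
	 * Only one version of any row can be visible in a single snapshot, so
	 * restricting the uniqueness check to snapshot-visible tuples avoids
	 * unique-checking several versions of the same row.
	 */
	*checkUnique = needed_by_someone && visible_in_snap;
}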
Tom Lane <tgl@sss.pgh.pa.us> writes:
>> Or do you mean we use SatisfiesVacuum to determine what to insert but
>> SatisfiesSnapshot to determine whether to check uniqueness?
>
> Right.  The problems seem to all stem from the risk of trying to
> unique-check more than one version of a tuple, and using a snap would
> stop that.  We need to think through all the cases though and be sure
> they all work.

What happens if someone inserts a record that we miss, but it gets deleted by
the time phase 2 starts.  So it's not visible to phase 2 but conflicts with
some other record we find.  I suppose that's ok since the delete would have
to have committed for that to happen.  It just means that having a unique
constraint doesn't guarantee uniqueness if your transaction started before
the index was finished being built.

Or what if there's an insert that occurs before phase 2 starts and hasn't
committed yet.  There's a conflicting record in the heap that's missing in
the index.  I guess the build would have to block when it finds the missing
record until the new insert either commits or aborts, just like inserts do
when a user inserts a potential conflict.  Would I have to handle that myself
or does index_insert handle that automatically?

--
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
stark <stark@enterprisedb.com> writes:
> What happens if someone inserts a record that we miss, but it gets deleted
> by the time phase 2 starts.  So it's not visible to phase 2 but conflicts
> with some other record we find.  I suppose that's ok since the delete would
> have to have committed for that to happen.

I think that's OK, but the whole idea of using an MVCC snap in phase 2
doesn't work on closer inspection.  The problem is still the same one that
you need to take (at least) share lock on each tuple you insert into the
index.  Telling aminsert to check uniqueness implicitly assumes the new tuple
is live, and without any lock on the tuple you can't promise that.  So
there's a significant risk of falsely declaring a uniqueness failure due to
someone else's perfectly innocent UPDATE.  Using an MVCC snap would actually
make this worse, by widening the window for that UPDATE to happen.

Even though we're hoping not to have to process too many tuples here, having
to row-lock each one sounds pretty awful.  Remember that that involves
writing a WAL record these days.

			regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes: > I think that's OK, but the whole idea of using an MVCC snap in phase 2 > doesn't work on closer inspection. The problem is still the same one > that you need to take (at least) share lock on each tuple you insert > into the index. Telling aminsert to check uniqueness implicitly assumes > the new tuple is live, and without any lock on the tuple you can't > promise that. No wait. It's still "live" according to my snapshot. How could it be possible for a single snapshot to see two different versions of the same tuple as live? -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Gregory Stark <stark@enterprisedb.com> writes: > Tom Lane <tgl@sss.pgh.pa.us> writes: >> I think that's OK, but the whole idea of using an MVCC snap in phase 2 >> doesn't work on closer inspection. The problem is still the same one >> that you need to take (at least) share lock on each tuple you insert >> into the index. Telling aminsert to check uniqueness implicitly assumes >> the new tuple is live, and without any lock on the tuple you can't >> promise that. > No wait. It's still "live" according to my snapshot. How could it be possible > for a single snapshot to see two different versions of the same tuple as live? The problem case is that we take a tuple and try to insert it into the index. Meanwhile someone else updates the tuple, and they're faster than us so they get the new version into the index first. Now our aminsert sees a conflicting index entry, and as soon as it commits good aminsert will raise a uniqueness error. There's no backoff for "oh, the tuple I'm inserting stopped being live while I was inserting it". regards, tom lane
I wrote: > The problem case is that we take a tuple and try to insert it into the index. > Meanwhile someone else updates the tuple, and they're faster than us so > they get the new version into the index first. Now our aminsert sees a > conflicting index entry, and as soon as it commits good aminsert will > raise a uniqueness error. There's no backoff for "oh, the tuple I'm > inserting stopped being live while I was inserting it". It's possible that the problem could be solved by introducing such a backoff, ie, make aminsert recheck liveness of the tuple-to-be-inserted before declaring error. Since we're about to fail anyway, performance of this code path probably isn't a huge issue. But I haven't thought through whether it can be made to work with that addition. Unless someone's got a brilliant idea, my recommendation at this point is that we restrict the patch to building only non-unique indexes. Per discussion upthread, that's still a useful feature. We can revisit the problem of doing uniqueness checks correctly in some future release, but time to work on it for 8.2 is running out fast. regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes:
> I wrote:
>> The problem case is that we take a tuple and try to insert it into the index.
>> Meanwhile someone else updates the tuple, and they're faster than us so
>> they get the new version into the index first.  Now our aminsert sees a
>> conflicting index entry, and as soon as it commits good aminsert will
>> raise a uniqueness error.  There's no backoff for "oh, the tuple I'm
>> inserting stopped being live while I was inserting it".
>
> It's possible that the problem could be solved by introducing such a
> backoff, ie, make aminsert recheck liveness of the tuple-to-be-inserted
> before declaring error.  Since we're about to fail anyway, performance
> of this code path probably isn't a huge issue.  But I haven't thought
> through whether it can be made to work with that addition.

Yesterday I considered if I could just catch the error in validate_index and
retest HeapSatisfiesVacuum after the insert but found that didn't work any
better.  I don't remember the problem though and it's possible it would work
if it were inside aminsert.

> Unless someone's got a brilliant idea, my recommendation at this point
> is that we restrict the patch to building only non-unique indexes.
> Per discussion upthread, that's still a useful feature.  We can revisit
> the problem of doing uniqueness checks correctly in some future release,
> but time to work on it for 8.2 is running out fast.

I agree.  There's other functionality in this area that would be nice too
such as REINDEX CONCURRENTLY and deleting the invalid index in case of error.
Once one chunk gets into CVS it makes it easier to extend it without making
for a bigger divergence to merge in one day.

I was also considering going ahead and implementing Hannu's ALTER INDEX SET
UNIQUE too.  We would have the option of making CREATE UNIQUE INDEX
CONCURRENTLY automatically invoke that code afterwards.  It would require a
second waiting phase though and a full index scan so it would be a much
slower option than handling it in the index build.  On the plus side it would
never have to lock anything -- locking things inside a command explicitly
billed as concurrent strikes me as undesirable.

--
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
Gregory Stark <stark@enterprisedb.com> writes: > I was also considering going ahead and implementing Hannu's ALTER INDEX SET > UNIQUE too. Don't waste your time, when we don't have an algorithm that would make it work. It's too late for 8.2 anyway... regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes:
> Gregory Stark <stark@enterprisedb.com> writes:
>> I was also considering going ahead and implementing Hannu's ALTER INDEX SET
>> UNIQUE too.
>
> Don't waste your time, when we don't have an algorithm that would make
> it work.  It's too late for 8.2 anyway...

Oh, I think ALTER INDEX SET UNIQUE is easy, at least algorithm-wise.  We set
the index to be unique, then wait until everyone can see it.  Then we scan to
make sure there's only one live tuple for each key value.  As long as it's
unique in our snapshot and we can be sure that any concurrent changes will
preserve that property then we can be sure it's good.

Hm, I suppose we have to keep the index marked invalid during this operation
so it doesn't appear as if there's a promise that the data is unique.  That's
fine for freshly built concurrent indexes but not so nice for an existing
non-unique index.  We might need a new column induniqueisvalid that indicates
that a unique constraint can't be trusted yet.

I suppose it's 8.3 material.  And maybe not even the most urgent item either.

--
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
Gregory Stark <stark@enterprisedb.com> writes:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
>> Unless someone's got a brilliant idea, my recommendation at this point
>> is that we restrict the patch to building only non-unique indexes.

> I assume you'll add the check?

Yeah, I'll take care of it.

At the moment it may be moot, because I've realized that validate_index
doesn't work anyway.  It is scanning the index and then assuming that any
tuple inserted into the index subsequent to that scan will still be
INSERT_IN_PROGRESS when the heapscan reaches it.  That's completely bogus of
course --- the tuple could be committed live or even already deleted by the
time the heapscan reaches it.  This would result in making duplicate index
entries, which is unacceptable even in a nonunique index (it'd cause
indexscans to return the same row twice).

I'm trying to work out whether we can fix this by taking an MVCC snapshot
before we scan the index, and then inserting all tuples we find in the heap
that are live according to the snap but are not present in the indexscan
data.  There are still race conditions in this but I think sufficient delay
would fix them.  Considerations:

* If a tuple is good according to the snap, it must have been inserted by an
xact that committed before the snap, therefore there is no still-in-progress
index insertion happening for it.  So if it's not in the sorted indexscan
data we know we must insert it.  If it is in the indexscan data, of course we
don't have to.

* If a tuple is not good according to the snap, there are three
possibilities:

** It was never committed good at all.  We need not index it.

** It was inserted by a transaction that committed post-snap.  We assume
that that transaction has or will index it.

** It was deleted by a transaction that committed pre-snap.  We assume we
need not index it because no remaining transaction will care about it.

The trick is that we must wait long enough to ensure that those two
assumptions are good.  We already know about waiting long enough to ensure
that all live transactions are aware of the index, which takes care of the
first assumption as long as we take the snap after that wait.  However, the
second assumption means that we must be sure there are no other backends with
serializable snapshots older than the snap we are using for this.  I think
this means that we wait for all xacts to be aware of our index, take the
reference snap, then wait for all xacts except ours to die --- without
exception --- and then we can get on with the second phase of the work.  It
might be OK to do the indexscan and sort before we do the second wait,
though, so we could get at least something done.

Comments?  Have I missed any case for a tuple to fail the snap?  Does this
analysis change our conclusions about whether we can support unique index
builds?

			regards, tom lane
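The pass-2 rule laid out above reduces, per tuple, to something like this
(a self-contained sketch with invented names, not the real validate_index
code):

#include <stdbool.h>

/*
 * visible_in_snap:  the tuple is visible under the reference snapshot, which
 *                   was taken after all live transactions already knew about
 *                   the index
 * already_indexed:  the tuple's TID appeared in the (sorted) indexscan data
 *                   gathered before the heap scan
 */
static bool
must_insert_in_pass2(bool visible_in_snap, bool already_indexed)
{
	if (!visible_in_snap)
	{
		/*
		 * Never committed, or inserted by a post-snap transaction (which
		 * will index it itself), or deleted pre-snap (no remaining
		 * transaction cares): in all three cases we need not insert it.
		 */
		return false;
	}

	/*
	 * Visible per the snap means its inserter committed before the snap, so
	 * no index insertion for it can still be in flight; if it is missing
	 * from the indexscan data we must insert it ourselves.
	 */
	return !already_indexed;
}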
I wrote:
> I'm trying to work out whether we can fix this by taking an MVCC
> snapshot before we scan the index, and then inserting all tuples we find
> in the heap that are live according to the snap but are not present in
> the indexscan data.  There are still race conditions in this but I think
> sufficient delay would fix them.  Considerations:

> * If a tuple is good according to the snap, it must have been inserted
> by an xact that committed before the snap, therefore there is no
> still-in-progress index insertion happening for it.  So if it's not in
> the sorted indexscan data we know we must insert it.  If it is in the
> indexscan data, of course we don't have to.

> * If a tuple is not good according to the snap, there are three
> possibilities:

> ** It was never committed good at all.  We need not index it.

> ** It was inserted by a transaction that committed post-snap.  We assume
> that that transaction has or will index it.

> ** It was deleted by a transaction that committed pre-snap.  We assume
> we need not index it because no remaining transaction will care about it.

> The trick is that we must wait long enough to ensure that those two
> assumptions are good.  We already know about waiting long enough to
> ensure that all live transactions are aware of the index, which takes
> care of the first assumption as long as we take the snap after that
> wait.  However, the second assumption means that we must be sure there
> are no other backends with serializable snapshots older than the snap
> we are using for this.  I think this means that we wait for all xacts
> to be aware of our index, take the reference snap, then wait for all
> xacts except ours to die --- without exception --- and then we can
> get on with the second phase of the work.

After a couple hours' thought I have not found any flaws in the above
analysis; furthermore I believe that the second wait need not happen until
just before we are ready to mark the index valid and commit.  So that means
that after the indexscan and second heapscan, we wait for any xacts alive
before that to end.  That's not too bad.

> Does this analysis change our conclusions about whether we can
> support unique index builds?

I also believe that we can support unique index builds after all.  There are
two further changes needed besides the above:

* The first pass has to insert tuples that are good according to a snapshot,
rather than using HeapTupleSatisfiesVacuum.  This prevents the problem of
trying to unique-check multiple versions of the same row that all chanced to
be committed live at the instants we passed over them.  If ambuild gets a
failure it means the dataset was not unique at the instant of the snap, so
the user can hardly complain about getting the error.  This will somewhat
reduce the coverage of the initial index build, but that seems like a small
price.  (Perhaps the docs should recommend using a smaller-than-normal
fillfactor during concurrent index creation?)

* We have to add the already-discussed logic to bt_check_unique to recheck
liveness of the to-be-inserted tuple immediately before it raises a
unique-index-violation error.  This is a waste of cycles in the normal case,
but since it's in an error path that doesn't seem like a big problem.
(Is it worth extending the aminsert API so that aminsert knows whether it's
being called by CREATE INDEX CONCURRENTLY or not?  I'm inclined not to
bother, at least not right now with the bitmap AM patch still unmerged.)

* When about to insert a tuple in pass 2, we can make the code check liveness
of the tuple and suppress the unique check if it's already committed dead.
This means the bt_check_unique addition only matters if the tuple goes dead
in the interval between starting the index insert and finishing it.  However
that's an important case to handle the concurrent-UPDATE scenario: if we find
a conflicting entry in the index, we'll wait for its inserter to commit, and
that is the same transaction that will have committed our new entry dead.
So the recheck is necessary and sufficient to cover this case.

Barring objections, I'm off to program this.

			regards, tom lane
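The bt_check_unique addition, reduced to its decision, looks roughly like
this (again a standalone sketch; the helper names are invented and this is
not the actual btree code):

#include <stdbool.h>

typedef struct Tuple Tuple;		/* opaque heap tuple, for the sketch */

/* assumed helpers, not real PostgreSQL functions */
extern bool conflicting_entry_committed_live(void *index, Tuple *newtup);
extern bool tuple_still_live(Tuple *newtup);	/* recheck liveness now */

static bool
should_raise_unique_error(void *index, Tuple *newtup)
{
	if (!conflicting_entry_committed_live(index, newtup))
		return false;			/* no committed conflict: insertion is fine */

	/*
	 * Concurrent-UPDATE case: the transaction that committed the conflicting
	 * entry may be the very one that deleted the tuple we are inserting; if
	 * our tuple is no longer live there is no real uniqueness violation.
	 */
	return tuple_still_live(newtup);
}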
Tom Lane <tgl@sss.pgh.pa.us> writes:
> At the moment it may be moot, because I've realized that validate_index
> doesn't work anyway.  It is scanning the index and then assuming that
> any tuple inserted into the index subsequent to that scan will still be
> INSERT_IN_PROGRESS when the heapscan reaches it.

EGADS.  Boy I feel stupid now.  In fairness I think what happened is that the
original plan was, like your new plan, based on snapshots.  And I only
switched to using HeapSatisfiesVacuum after several iterations.  I guess
there were some assumptions in the original thinking that I never revisited.

Because of the way the AM API works, changing how the initial heap scan works
is a bit of a pain.  It would require either having some global state or
passing the concurrent flag through the AM methods or alternatively having a
whole new AM method.

I'll have to read (and reread) your description again in the morning.

--
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
Tom Lane <tgl@sss.pgh.pa.us> writes:
> Greg Stark <gsstark@mit.edu> writes:
>> Hmmm. Or is that true. The problem may be somewhat easier since at least you
>> can be sure every tuple in the heap is in the index. So if you see a
>> DELETE_IN_PROGRESS either it *was* a constraint violation prior to the delete
>> and failing is reasonable or it's an update in which case maybe it's possible
>> to detect that they're part of the same chain?
>
> Unless we are willing to lock every single tuple while we insert it,
> this seems unfixable to me.  Without a lock, the tuple could become
> DELETE_IN_PROGRESS immediately after we look at it.

I think there's some confusion here.  This above paragraph was taken from
some thoughts about Hannu's suggestion of having a separate ALTER INDEX SET
UNIQUE command.  That command might have an advantage over CREATE INDEX
CONCURRENTLY because it knows the index is already complete; it doesn't have
to worry about potential conflicts with tuples that it will only find later
in the scan.

Effectively this is equivalent to making CREATE UNIQUE INDEX CONCURRENTLY
three phases.  The first two phases would be a regular CREATE INDEX
CONCURRENTLY and the third phase would be what ALTER INDEX SET UNIQUE does,
which is scan the index and verify that it's unique.

ALTER INDEX SET UNIQUE would have to perform a similar two-transaction dance
though.  It would have to set the index unique, wait until everyone has seen
the new constraint.  Then verify that the property is indeed unique, possibly
rolling back the constraint creation if it's not.  That would make the whole
process of creating a unique index quite long.

On the plus side it would be a useful command in itself.  Doing an index scan
might be pretty slow but if the table is mostly clean of dead and recently
dead tuples it won't have to visit the heap much and should still be much
quicker than building a new index.  And it would itself be a concurrent
command.

> Actually it's worse than that.  We could examine a tuple, see that
> it's good, include it in the uniqueness check.  Then someone updates
> the tuple and puts the new version near the end of the table.  By
> the time we reach that version, it could be committed good.  There
> is absolutely no way that we could notice an issue without applying
> extremely expensive tests to *every* apparently-good tuple.

I think ALTER INDEX SET UNIQUE would not have this problem.  It would only
have to look at tuples using its own snapshot and see if there's a violation.
If there isn't a violation as of its own snapshot then it can be sure later
transactions will preserve this property since the index was always complete
and it waited after creating the constraint.

> [ thinks for a bit... ]  At least, it seems hopeless if we use
> SnapshotNow.  Does it help if we use a real snapshot?  I'm thinking
> pass 1 inserts exactly those tuples that are good according to a
> snap taken at its beginning, and then pass 2 considers only tuples
> that are good according to a snap taken at *its* beginning.  But
> having consumed no caffeine yet this morning, I'm not sure I can
> spot any flaws that might exist in this idea.

What about tuples that are inserted and committed in the window between the
two phases.  Ie, they're RECENTLY_DEAD but not in phase2's snapshot.

Or do you mean we use SatisfiesVacuum to determine what to insert but
SatisfiesSnapshot to determine whether to check uniqueness?

--
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
Gregory Stark <stark@enterprisedb.com> writes: > Because of the way the AM API works changing how the initial heap scan works > is a bit of a pain. It would require either having some global state or > passing the concurrent flag through the AM methods or alternatively having a > whole new AM method. Yeah, I dealt with that by adding a 'concurrent build' flag to IndexInfo. A bit grotty but it beats changing a lot of function signatures. regards, tom lane
Do we want something like this?  I just made this error myself so unless I'm
special (pauses for jokes) I imagine others would be prone to it as well.

I would normally be pretty leery of code like this but it seems unlikely
anyone would actually want an index named "concurrently" and the consequences
if you get it wrong in a production environment are pretty dire.  We might
even consider making it an outright error.


--- gram.y	25 Aug 2006 10:14:17 +0100	2.558
+++ gram.y	25 Aug 2006 14:04:54 +0100
@@ -56,6 +56,7 @@
 #include "commands/defrem.h"
 #include "nodes/makefuncs.h"
 #include "parser/gramparse.h"
+#include "parser/scansup.h"
 #include "storage/lmgr.h"
 #include "utils/date.h"
 #include "utils/datetime.h"
@@ -3653,6 +3654,12 @@
 			opt_definition OptTableSpace where_clause
 				{
 					IndexStmt  *n = makeNode(IndexStmt);
+
+					if (!strcmp(downcase_truncate_identifier($4,20,false), "concurrently"))
+						ereport(WARNING,
+								(errcode(ERRCODE_SYNTAX_ERROR),
+								 errmsg("performing non-concurrent index build of index named \"concurrently\"")));
+
 					n->unique = $2;
 					n->concurrent = false;
 					n->idxname = $4;

--
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
Gregory Stark wrote:
> Do we want something like this? I just made this error myself so unless I'm
> special (pauses for jokes) I imagine others would be prone to it as well.
>
> I would normally be pretty leery of code like this but it seems unlikely
> anyone would actually want an index named "concurrently" and the consequences
> if you get it wrong in a production environment are pretty dire. We might even
> consider making it an outright error.
>
> --- gram.y	25 Aug 2006 10:14:17 +0100	2.558
> +++ gram.y	25 Aug 2006 14:04:54 +0100
> @@ -56,6 +56,7 @@
>  #include "commands/defrem.h"
>  #include "nodes/makefuncs.h"
>  #include "parser/gramparse.h"
> +#include "parser/scansup.h"
>  #include "storage/lmgr.h"
>  #include "utils/date.h"
>  #include "utils/datetime.h"
> @@ -3653,6 +3654,12 @@
>  			opt_definition OptTableSpace where_clause
>  				{
>  					IndexStmt  *n = makeNode(IndexStmt);
> +
> +					if (!strcmp(downcase_truncate_identifier($4,20,false), "concurrently"))
> +						ereport(WARNING,
> +								(errcode(ERRCODE_SYNTAX_ERROR),
> +								 errmsg("performing non-concurrent index build of index named \"concurrently\"")));
> +
>  					n->unique = $2;
>  					n->concurrent = false;
>  					n->idxname = $4;

I see we have:

   CREATE index_opt_unique INDEX CONCURRENTLY index_name ...

which explains how this error occurs.  But might it not be better to have
this instead?

   CREATE CONCURRENTLY index_opt_unique INDEX index_name ...

Then ISTM no ambiguity could arise (and it's also closer to grammatical
English, if that matters).

Just a thought

cheers

andrew
Andrew Dunstan <andrew@dunslane.net> writes: > I see we have: > CREATE index_opt_unique INDEX CONCURRENTLY index_name ... > which explains how this error occurs. Maybe to you, but I'm still caffeine-deprived and don't exactly see what it was that Greg mistyped. AFAICS he'd have to type CONCURRENTLY twice to get into a scenario where the proposed warning would fire. > But might it not be better to have this instead? > CREATE CONCURRENTLY index_opt_unique INDEX index_name ... When I was fooling with gram.y I was thinking that actually CREATE [UNIQUE] INDEX indexname [CONCURRENTLY] ... would be the most grammatical thing. But I can live with putting it right after CREATE, too. Or there was the proposal to put it first: [CONCURRENTLY] CREATE [UNIQUE] INDEX indexname ... regards, tom lane
Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>> I see we have:
>> CREATE index_opt_unique INDEX CONCURRENTLY index_name ...
>> which explains how this error occurs.
>
> Maybe to you, but I'm still caffeine-deprived and don't exactly see what
> it was that Greg mistyped.  AFAICS he'd have to type CONCURRENTLY twice
> to get into a scenario where the proposed warning would fire.

AIUI, he left off the index name so the first rule was matched rather than
the second, with "concurrently" being the index name.

cheers

andrew
Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>> I see we have:
>> CREATE index_opt_unique INDEX CONCURRENTLY index_name ...
>> which explains how this error occurs.
>
> Maybe to you, but I'm still caffeine-deprived and don't exactly see what
> it was that Greg mistyped.  AFAICS he'd have to type CONCURRENTLY twice
> to get into a scenario where the proposed warning would fire.

I guess Greg is talking about something like (ie just forgetting to name the
index):

devel=# create index concurrently on foo ( a);
CREATE INDEX
devel=# \d foo
       Table "public.foo"
 Column |  Type   | Modifiers
--------+---------+-----------
 a      | integer |
 b      | text    |
 c      | integer |
 d      | text    |
Indexes:
    "concurrently" btree (a)

Stefan
Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
> > I see we have:
> > CREATE index_opt_unique INDEX CONCURRENTLY index_name ...
> > which explains how this error occurs.
>
> Maybe to you, but I'm still caffeine-deprived and don't exactly see what
> it was that Greg mistyped.  AFAICS he'd have to type CONCURRENTLY twice
> to get into a scenario where the proposed warning would fire.
>
> > But might it not be better to have this instead?
> > CREATE CONCURRENTLY index_opt_unique INDEX index_name ...
>
> When I was fooling with gram.y I was thinking that actually
>
> 	CREATE [UNIQUE] INDEX indexname [CONCURRENTLY] ...
>
> would be the most grammatical thing.  But I can live with putting

The original thinking was to use CONCURRENT, and CREATE CONCURRENT INDEX
sounded like a different type of index, not a different way to build the
index.  I don't think CONCURRENTLY has that problem, so CREATE CONCURRENTLY
INDEX sounds good.  To read in English, it would be read as CREATE
CONCURRENTLY, INDEX ii.

> it right after CREATE, too.  Or there was the proposal to put it
> first:
>
> 	[CONCURRENTLY] CREATE [UNIQUE] INDEX indexname ...

I think this suggested the command was CONCURRENTLY, which isn't good.

--
  Bruce Momjian   bruce@momjian.us
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +
Bruce Momjian <bruce@momjian.us> writes: > The original thinking was to use CONCURRENT, and CREATE CONCURRENT INDEX > sounded like a different type of index, not a different way to build the > index. I don't think CONCURRENTLY has that problem, so CREATE > CONCURRENTLY INDEX sounds good. To read in English, it would be read as > CREATE CONCURRENTLY, INDEX ii. OK, we've got two votes for that, so I'll make it so. regards, tom lane
Bruce Momjian <bruce@momjian.us> writes: > The original thinking was to use CONCURRENT, and CREATE CONCURRENT INDEX > sounded like a different type of index, not a different way to build the > index. I don't think CONCURRENTLY has that problem, so CREATE > CONCURRENTLY INDEX sounds good. To read in English, it would be read as > CREATE CONCURRENTLY, INDEX ii. That doesn't sound like English at all to me. Fwiw, I think the best option was what Tom did. The gotcha I tripped on seems pretty minor to me. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Gregory Stark wrote: > Bruce Momjian <bruce@momjian.us> writes: > > > The original thinking was to use CONCURRENT, and CREATE CONCURRENT INDEX > > sounded like a different type of index, not a different way to build the > > index. I don't think CONCURRENTLY has that problem, so CREATE > > CONCURRENTLY INDEX sounds good. To read in English, it would be read as > > CREATE CONCURRENTLY, INDEX ii. > > That doesn't sound like English at all to me. > > Fwiw, I think the best option was what Tom did. The gotcha I tripped on seems > pretty minor to me. What bothers me about what we have now is that we have optional keywords before and after INDEX, rather than only between CREATE and INDEX. -- Bruce Momjian bruce@momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Bruce Momjian <bruce@momjian.us> writes: > What bothers me about what we have now is that we have optional keywords > before and after INDEX, rather than only between CREATE and INDEX. Yeah, putting them both into that space seems consistent to me, and it will fix the problem of making an omitted index name look like a valid command. I'm not sure I should be opening this can of worms, but do we want to use a different keyword than CONCURRENTLY to make it read better there? regards, tom lane
Tom Lane wrote: > Bruce Momjian <bruce@momjian.us> writes: > > What bothers me about what we have now is that we have optional keywords > > before and after INDEX, rather than only between CREATE and INDEX. > > Yeah, putting them both into that space seems consistent to me, and > it will fix the problem of making an omitted index name look like > a valid command. > > I'm not sure I should be opening this can of worms, but do we want to > use a different keyword than CONCURRENTLY to make it read better there? The problem is that what the qualifier is doing is modifying the operation itself, not the properties of the index to be created, like UNIQUE, which modifies the index. So the qualifier should be something that modifies the CREATE, that is, an adverb (AFAIK). -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
Alvaro Herrera <alvherre@commandprompt.com> writes: > The problem is that what the qualifier is doing is modifying the > operation itself, not the properties of the index to be created, like > UNIQUE, which modifies the index. Right, which was the same point Bruce made earlier. And really you can't respect that difference while putting them into the same place in the word order. So I'm starting to feel like maybe we should leave well enough alone. regards, tom lane
Gregory Stark wrote: > Bruce Momjian <bruce@momjian.us> writes: > > >> The original thinking was to use CONCURRENT, and CREATE CONCURRENT INDEX >> sounded like a different type of index, not a different way to build the >> index. I don't think CONCURRENTLY has that problem, so CREATE >> CONCURRENTLY INDEX sounds good. To read in English, it would be read as >> CREATE CONCURRENTLY, INDEX ii. >> > > That doesn't sound like English at all to me. > > Fwiw, I think the best option was what Tom did. The gotcha I tripped on seems > pretty minor to me. > > It's a form of construction my father (who was a noted orator) loved to use, maybe a little too much. It is arguably slightly archaic, but nevertheless quite grammatical ;-) I agree that these days it is more idiomatic to defer the adverb until after the object of the verb in most cases. cheers andrew
> > What bothers me about what we have now is that we have optional > > keywords before and after INDEX, rather than only between > CREATE and INDEX. > > Yeah, putting them both into that space seems consistent to > me, and it will fix the problem of making an omitted index > name look like a valid command. > > I'm not sure I should be opening this can of worms, but do we > want to use a different keyword than CONCURRENTLY to make it > read better there? Precedent syntax (Oracle, Informix) uses the keyword ONLINE at the end: CREATE INDEX blabla_x0 ON blabla (a,b) ONLINE; I'd stick with that. Andreas
Zeugswetter Andreas DCP SD wrote: > > > > What bothers me about what we have now is that we have optional > > > keywords before and after INDEX, rather than only between > > CREATE and INDEX. > > > > Yeah, putting them both into that space seems consistent to > > me, and it will fix the problem of making an omitted index > > name look like a valid command. > > > > I'm not sure I should be opening this can of worms, but do we > > want to use a different keyword than CONCURRENTLY to make it > > read better there? > > precedent syntax (Oracle, Informix) uses the keyword ONLINE at the end: > CREATE INDEX blabla_x0 ON blabla (a,b) ONLINE; That was what the patch originally used, but it was changed because it made it difficult for psql to auto-complete that. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
"Zeugswetter Andreas DCP SD" <ZeugswetterA@spardat.at> writes: > precedent syntax (Oracle, Informix) uses the keyword ONLINE at the end: > CREATE INDEX blabla_x0 ON blabla (a,b) ONLINE; We rejected that one already ... regards, tom lane
> > precedent syntax (Oracle, Informix) uses the keyword ONLINE > at the end: > > CREATE INDEX blabla_x0 ON blabla (a,b) ONLINE; > > That was what the patch originally used, but it was changed > because it made difficult for psql to auto-complete that. That is imho not enough of a reason to deviate from precedent. Andreas
Alvaro Herrera <alvherre@commandprompt.com> writes: > That was what the patch originally used, but it was changed because it > made difficult for psql to auto-complete that. Psql has to be able to parse it not for auto-completion but because it needs to know that it's not a transactional command. The regular CREATE INDEX can be run from within a transaction but online index builds use two transactions on their own, so psql has to know not to insert a BEGIN and savepoint around it. I'll use this opportunity to plug that feature again. I think most people should use autocommit off with on_error_rollback on for most of their daily use. Being able to double check the results of my ad-hoc updates before committing them saved me more headaches than I can count with Oracle. Autocommit off only became practical for interactive use with postgres when savepoints showed up. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
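To make that parsing requirement concrete, here is a minimal standalone sketch of the kind of check an autocommit-off client needs; the function names, and the assumption that CONCURRENTLY follows INDEX (the placement in the patch at this point of the thread), are purely illustrative and are not psql's actual code.

#include <ctype.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <strings.h>

/* Case-insensitively consume one keyword at the start of *query. */
static bool
next_word_is(const char **query, const char *word)
{
    const char *q = *query;
    size_t      len = strlen(word);

    while (isspace((unsigned char) *q))
        q++;
    if (strncasecmp(q, word, len) != 0)
        return false;
    if (q[len] != '\0' && !isspace((unsigned char) q[len]))
        return false;
    *query = q + len;
    return true;
}

/*
 * True if the statement must not be wrapped in BEGIN/COMMIT (or a
 * savepoint) by the client, because the server runs it in its own
 * transactions.
 */
static bool
command_forbids_begin(const char *query)
{
    if (next_word_is(&query, "VACUUM"))
        return true;
    if (next_word_is(&query, "CREATE"))
    {
        (void) next_word_is(&query, "UNIQUE");      /* optional keyword */
        if (next_word_is(&query, "INDEX") &&
            next_word_is(&query, "CONCURRENTLY"))
            return true;
    }
    return false;
}

int
main(void)
{
    printf("%d\n", command_forbids_begin("create index concurrently i on foo (a)"));   /* 1 */
    printf("%d\n", command_forbids_begin("CREATE UNIQUE INDEX i ON foo (a)"));         /* 0 */
    return 0;
}

The point of the sketch is simply that wherever the option keyword ends up in the grammar, the client-side scanner has to be able to find it without running the real parser.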
"Zeugswetter Andreas DCP SD" <ZeugswetterA@spardat.at> writes: >> That was what the patch originally used, but it was changed >> because it made difficult for psql to auto-complete that. > That is imho not enough of a reason to divert. My recollection is that the principal argument against ONLINE was that it didn't convey the function of the option to anyone who didn't already know Oracle's usage of the term. Also, psql's problem is not with auto-completion, it's with detecting whether the command is allowed inside a transaction block. That's not a function we can just blow off. regards, tom lane
On Fri, Aug 25, 2006 at 06:57:58PM +0100, Gregory Stark wrote: > I'll use this opportunity to plug that feature again. I think most people > should use autocommit off with on_error_rollack on for most of their daily > use. Being able to double check the results of my ad-hoc updates before > committing them saved me more headaches than I can count with Oracle. > Autocommit off only became practical for interactive use with postgres when > savepoints showed up. +1 -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
On Fri, Aug 25, 2006 at 11:25:43AM -0400, Tom Lane wrote: > Alvaro Herrera <alvherre@commandprompt.com> writes: > > The problem is that what the qualifier is doing is modifying the > > operation itself, not the properties of the index to be created, like > > UNIQUE, which modifies the index. > > Right, which was the same point Bruce made earlier. And really you > can't respect that difference while putting them into the same place in > the word order. So I'm starting to feel like maybe we should leave > well enough alone. Since we might eventually have other 'concurrent commands', perhaps CONCURRENT CREATE INDEX ... would be best. BTW, if we started to consider lazy vacuum a concurrent command we could ditch the use of FULL, which is always confusing if you're talking about database-wide vacuums. I know it'd take many versions to fully make that change, but it seems worth it to me to reduce confusion. There's also an issue of newbies thinking they should use vacuum full regularly because it's somehow better than lazyvac. -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
Tom Lane <tgl@sss.pgh.pa.us> writes: > Barring objections, I'm off to program this.

A few concerns:

a) The use of ShareUpdateExclusiveLock is supposed to lock out concurrent vacuums. I just tried it and vacuum seemed to be unaffected. I'm going to retry it with a clean cvs checkout to be sure it isn't something in my local tree that's broken. Do we still need to block concurrent vacuums if we're using snapshots? Obviously we have to block them during phase 1, because vacuum won't have a chance of removing the tuples from our private collection of index tuples that haven't been pushed live yet. But if phase 2 is ignoring tuples too new to be visible in its snapshot, then it shouldn't care if dead tuples are deleted, even if those slots are later reused.

b) You introduced a LockRelationIdForSession() call (I didn't even realize we had this capability when I wrote the patch). Does this introduce the possibility of a deadlock, though? If one of the transactions we're waiting to finish has a shared lock on the relation and is waiting for an exclusive lock on the relation, then it seems we'll wait forever for it to finish and never see either of our conditions for continuing. That would be fine, except that because we're waiting manually the deadlock detection code doesn't have a chance of firing. To solve that we would have to replace the pg_sleep call with an XactLockTableWait. But I'm not clear how to find a transaction id to wait on. What we would want to find is any transaction id that has an xmin older than our xmin. Even that isn't ideal, since it wouldn't give us a chance to test our other way out: if we choose a transaction to wait on that doesn't hold even a share lock on our table, we could end up stuck longer than necessary (i.e., when we would have been able to momentarily acquire the exclusive lock on the table earlier). (A sketch of both waiting strategies follows this message.)

c) It's a shame we don't support multiple concurrent concurrent index builds. We could create a ShareUpdateShareLock that conflicts with the same list of locks that ShareUpdateExclusiveLock conflicts with, but not with itself. From a UI point of view there's no excuse for not doing this, but from an implementation point of view there's a limit of 10 lock types and this would get us up to 9. This is my first time looking at this code, so I'm not sure how hard that limit is. One caveat is that the two jobs would see each other, and that would make it hard for them to proceed to phase 2. I think what would happen is that the first one to finish phase 1 would be able to continue as soon as the other finishes phase 1. The second one would have to wait until the first one's phase 2 finished.

-- Gregory Stark EnterpriseDB http://www.enterprisedb.com
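For reference, a rough backend-side sketch of the two waiting strategies described in point (b) above; SomeOldTransactionRemains() is a made-up placeholder for "an xmin older than ours is still running", and none of this is code from the patch itself.

#include "postgres.h"
#include "miscadmin.h"          /* pg_usleep */
#include "storage/lmgr.h"       /* XactLockTableWait */

/* Hypothetical placeholder, not a real backend function. */
extern bool SomeOldTransactionRemains(void);

/*
 * Strategy in the patch as posted: poll and sleep.  The lock manager
 * never sees this wait, so a transaction that is itself blocked behind
 * our session lock on the table deadlocks with us undetected.
 */
static void
wait_by_polling(void)
{
    while (SomeOldTransactionRemains())
        pg_usleep(1000000L);    /* one second between checks */
}

/*
 * Proposed replacement: block on the other transaction's XID lock.  This
 * wait goes through the regular lock manager, so the deadlock detector
 * can see and break a cycle.  The open question in the mail above is how
 * to pick which xid(s) to wait on.
 */
static void
wait_on_xid(TransactionId xid)
{
    XactLockTableWait(xid);
}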
On Fri, Aug 25, 2006 at 08:02:21PM -0500, Jim C. Nasby wrote: > On Fri, Aug 25, 2006 at 06:57:58PM +0100, Gregory Stark wrote: > > I'll use this opportunity to plug that feature again. I think most > > people should use autocommit off with on_error_rollack on for most > > of their daily use. Being able to double check the results of my > > ad-hoc updates before committing them saved me more headaches than > > I can count with Oracle. Autocommit off only became practical for > > interactive use with postgres when savepoints showed up. > > +1 +1 here, too :) Cheers, D -- David Fetter <david@fetter.org> http://fetter.org/ phone: +1 415 235 3778 AIM: dfetter666 Skype: davidfetter Remember to vote!
Gregory Stark <stark@enterprisedb.com> writes: > A few concerns > a) The use of ShareUpdateExclusiveLock is supposed to lock out concurrent > vacuums. I just tried it and vacuum seemed to be unaffected. Can't reproduce such a problem here. > Do we still need to block concurrent vacuums if we're using snapshots? That's an interesting question. I don't think we need to lock out vacuum in either pass as far as tuple removal goes: we are only interested in tuples that are visible to a live transaction (ie ours) and vacuum shouldn't remove them. However one of the properties of ShareUpdateExclusive is supposed to be that it locks out schema changes ... such as creation of a new index. I wouldn't want to bet that vacuum works correctly if a new index appears partway through its collection of info about the relation. I think we need to keep that lock so that vacuum is safe against create index, rather than vice versa. > b) You introduced a LockRelationIdForSession() call (I even didn't realize we > had this capability when I wrote the patch). Does this introduce the > possibility of a deadlock though? Indeed, I had noticed this while testing your point (a). I thought that ConditionalLockRelation had enough smarts to grant itself a lock if there would be a deadlock possibility otherwise, but it seems not to be noticing. We'll need to look into that. > c) It's a shame we don't support multiple concurrent concurrent index builds. I can't get excited about that. The point of this facility is to build indexes with minimal impact on production queries, and the I/O load imposed by even one build is going to make that a bit questionable, let alone multiple ones. If you want to do some useful improvement on the patch, look into making it throttle its I/O rate like vacuum can. regards, tom lane
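On that last suggestion, a throttle could presumably look much like vacuum's cost-based delay; the function below is a hypothetical sketch that reuses the existing vacuum_cost_* settings and is not something present in the patch.

#include "postgres.h"
#include "miscadmin.h"      /* VacuumCost* globals, pg_usleep */

/*
 * Hypothetical delay point, to be called periodically from the heap-scan
 * and index-insertion loops of the concurrent build, in the same spirit
 * as vacuum's own delay point.
 */
static void
concurrent_build_delay_point(void)
{
    /* vacuum_cost_delay = 0 disables throttling, as it does for VACUUM */
    if (VacuumCostActive && VacuumCostBalance >= VacuumCostLimit)
    {
        pg_usleep(VacuumCostDelay * 1000L);     /* delay is in milliseconds */
        VacuumCostBalance = 0;
    }
}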
I wrote: > Gregory Stark <stark@enterprisedb.com> writes: >> b) You introduced a LockRelationIdForSession() call (I even didn't realize we >> had this capability when I wrote the patch). Does this introduce the >> possibility of a deadlock though? > Indeed, I had noticed this while testing your point (a). I thought that > ConditionalLockRelation had enough smarts to grant itself a lock if > there would be a deadlock possibility otherwise, but it seems not to be > noticing. We'll need to look into that. OK, the code I was remembering is the bit in ProcSleep to grant our proc the lock if we can prove that deadlock detection would do so anyway. It is probably possible to refactor things so that that's done before we fail in the ConditionalLockAcquire case, but it would involve uglifying ProcSleep's API even more. And it doesn't sound like a complete solution anyway --- the ProcSleep test is not capable of detecting deadlocks involving more than one lock, and I'm not sure it even intends to find all cases involving just one lock. It was only ever meant as a simple optimization to avoid a sleep in easy cases. >> To solve that we would have to replace the pg_sleep call with a >> XactLockTableWait. But I'm not clear how to find a transaction id to wait >> on. I'm toying with the idea of adding a lock manager call defined as "give me a list of XIDs that currently hold locks conflicting with lockmode X on object Y" --- but not including XIDs merely waiting for such a lock. Then we could get the list of XIDs currently blocking ExclusiveLock, and do XactLockTableWait on each one. Once they are all gone we are safe; we don't care about later acquirers of locks because they must see the index. This avoids both the arbitrary pg_sleep and the need for a check on "am I the oldest remaining XID"; and if we are in a deadlock situation it will be detected. This looks like it should be easy enough to code (though I haven't actually tried yet) ... do you see any logical flaws? regards, tom lane
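For concreteness, a sketch of the waiting scheme just described; GetConflictingXids() and its signature are invented stand-ins for the proposed lock-manager call, so this shows the shape of the loop rather than the committed API.

#include "postgres.h"
#include "access/transam.h"     /* TransactionIdIsValid */
#include "storage/lmgr.h"       /* XactLockTableWait */
#include "storage/lock.h"       /* LOCKMODE, ExclusiveLock */
#include "utils/rel.h"          /* Relation */

/*
 * Hypothetical new lock-manager call: returns a palloc'd array of the
 * XIDs currently *holding* (not merely awaiting) locks that conflict
 * with "lockmode" on the relation, terminated by InvalidTransactionId.
 */
extern TransactionId *GetConflictingXids(Relation rel, LOCKMODE lockmode);

static void
WaitForConflictingHolders(Relation rel)
{
    TransactionId *xids = GetConflictingXids(rel, ExclusiveLock);
    int            i;

    /*
     * Wait for each current holder to commit or abort.  Transactions that
     * take conflicting locks later don't matter: they must already see
     * the new (not-yet-valid) index in the catalogs.  Because
     * XactLockTableWait() goes through the lock manager, any deadlock
     * involving these XIDs is detected normally, unlike a pg_sleep loop.
     */
    for (i = 0; TransactionIdIsValid(xids[i]); i++)
        XactLockTableWait(xids[i]);

    pfree(xids);
}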
I wrote: > I'm toying with the idea of adding a lock manager call defined as > "give me a list of XIDs that currently hold locks conflicting with > lockmode X on object Y" --- but not including XIDs merely waiting > for such a lock. Then we could get the list of XIDs currently blocking > ExclusiveLock, and do XactLockTableWait on each one. I've committed a patch along these lines. I also reduced the strength of the lock we check for from ExclusiveLock to ShareLock, which seems sufficient --- did you have a reason for selecting ExclusiveLock in the original coding? regards, tom lane