Thread: Reducing relation locking overhead
In looking at the current pgbench results, I notice that one considerable contribution to LWLock contention is access to the heavyweight-lock manager. Almost all of that comes from taking relation-level locks, so we could cut down the contention if we could reduce the number of distinct locks taken. (The lmgr code already avoids touching shared memory again if, say, you request another AccessShareLock on a relation you've already AccessShareLocked in the current transaction.) I see several places where we might possibly make this kind of saving:

1. In an UPDATE or DELETE query, we take RowExclusiveLock on the target relation, and also take AccessShareLock in the scan node that reads the relation. It wouldn't take much extra code to make the scan nodes avoid taking AccessShareLock if the relation is already opened as the query result relation. AFAICS there is no downside to this; any lock that would conflict with AccessShareLock will also conflict with RowExclusiveLock, so there's no real need to hold both lock types. (Moreover, we already make a similar kind of optimization when accessing system catalogs: those routines only take one lock, not two.) Does anyone have an objection to it? (A sketch of this idea appears after this message.)

2. In the same way, an index that is used to scan the target relation will be locked in both RowExclusiveLock and AccessShareLock modes (corresponding to its roles as both source and update target). We could avoid taking both lock types, but this would require some restructuring, because the code responsible for locking the indexes doesn't currently have access to the EState to find out whether an index belongs to a target relation. What I'm thinking here is that maybe there's no point in maintaining the read-versus-write distinction at all for indexes --- we could just take AccessShareLock in both cases. Any thoughts on the pros and cons there?

3. We could also save some trips to the lock manager if we adopt the same position for user indexes as we do for user tables, namely that locks once taken are held till end of transaction. Any thoughts about that?

4. The only reason we need to take relation-level locks on indexes at all is to make the world safe for REINDEX being done concurrently with read-only accesses to the table (that don't use the index being reindexed). If we went back to requiring exclusive lock for reindex we could forget all about both #2 and #3. Particularly for updates of relations with lots of indexes, this could be a pretty significant win. However, we'd definitely be giving up something that was seen as a feature at one point, so I'm not sold on this idea ... unless someone can see a way to reduce the overhead without giving up concurrent REINDEX.

			regards, tom lane
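[Editor's note: a minimal, self-contained C sketch of point 1 — this is not actual lmgr code, all helper names are made up, and treating lock modes as a totally ordered "strength" is a simplification of the real conflict table. It only illustrates the idea of satisfying a weaker lock request locally when the transaction already holds a stronger one.]

```c
/*
 * Sketch of point 1: if the current transaction already holds an
 * equal-or-stronger lock on a relation, a scan node can skip the extra
 * trip to the shared lock manager.  Hypothetical names throughout.
 */
#include <stdio.h>

typedef enum {
    NO_LOCK = 0,
    ACCESS_SHARE,      /* plain reads */
    ROW_EXCLUSIVE,     /* UPDATE/DELETE target */
    ACCESS_EXCLUSIVE   /* e.g. ALTER TABLE */
} LockMode;

#define MAX_RELS 16

/* per-backend cache of locks already taken in this transaction */
static LockMode local_locks[MAX_RELS];

/* stand-in for the expensive, contended shared-memory path */
static void shared_lock_manager_acquire(int relid, LockMode mode)
{
    printf("lmgr: touching shared memory for rel %d, mode %d\n", relid, mode);
}

static void acquire_relation_lock(int relid, LockMode mode)
{
    if (local_locks[relid] >= mode)
    {
        /* already hold this or a stronger lock: no shared-memory work */
        printf("rel %d: reusing existing lock (mode %d >= %d)\n",
               relid, local_locks[relid], mode);
        return;
    }
    shared_lock_manager_acquire(relid, mode);
    local_locks[relid] = mode;
}

int main(void)
{
    /* UPDATE: the executor locks the result relation first ... */
    acquire_relation_lock(1, ROW_EXCLUSIVE);
    /* ... so the scan node's AccessShareLock request on the same
     * relation is satisfied locally and never reaches shared memory */
    acquire_relation_lock(1, ACCESS_SHARE);
    return 0;
}
```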
> 4. The only reason we need to take relation-level locks on indexes at all is to make the world safe for REINDEX being done concurrently with read-only accesses to the table (that don't use the index being reindexed). If we went back to requiring exclusive lock for reindex we could forget all about both #2 and #3. Particularly for updates of relations with lots of indexes, this could be a pretty significant win. However, we'd definitely be giving up something that was seen as a feature at one point, so I'm not sold on this idea ... unless someone can see a way to reduce the overhead without giving up concurrent REINDEX.

Surely in the real world REINDEX is run so rarely compared to all those other operations it'd be a win...

Chris
* Christopher Kings-Lynne (chriskl@familyhealth.com.au) wrote:
> > 4. The only reason we need to take relation-level locks on indexes at all is to make the world safe for REINDEX being done concurrently with read-only accesses to the table (that don't use the index being reindexed). If we went back to requiring exclusive lock for reindex we could forget all about both #2 and #3. Particularly for updates of relations with lots of indexes, this could be a pretty significant win. However, we'd definitely be giving up something that was seen as a feature at one point, so I'm not sold on this idea ... unless someone can see a way to reduce the overhead without giving up concurrent REINDEX.
>
> Surely in the real world REINDEX is run so rarely compared to all those other operations it'd be a win...

Yeah, except that in the real world you don't want to bring everything to a halt while you do a REINDEX. For my use cases it'd be fine, but I could see cases where it wouldn't be. Kinda makes me wish we could give the user the option at runtime somehow, but I'm not sure that could be done...

	Thanks,

		Stephen
Christopher Kings-Lynne <chriskl@familyhealth.com.au> writes:
> Surely in the real world REINDEX is run so rarely compared to all those other operations it'd be a win...

It's not a question of frequency. We're not talking about something like a 10% performance loss. You're talking about whether REINDEX is useful at all. Consider installations where REINDEX will require shutting down business-critical operations for days...

It was a *major* new feature that many people were waiting for when Oracle finally implemented live CREATE INDEX and REINDEX. The ability to create an index without blocking any operations on a table, even updates, was absolutely critical for 24x7 operation.

--
greg
Greg Stark <gsstark@mit.edu> writes:
> It was a *major* new feature that many people were waiting for when Oracle finally implemented live CREATE INDEX and REINDEX. The ability to create an index without blocking any operations on a table, even updates, was absolutely critical for 24x7 operation.

Well, we're still not in *that* ballpark and I haven't seen any serious proposals to make us so. How "absolutely critical" is it really?

Is REINDEX-in-parallel-with-reads-but-not-writes, which is what we actually have at the moment, an "absolutely critical" facility?

			regards, tom lane
On Fri, 2005-12-02 at 02:14 -0500, Tom Lane wrote:
> Greg Stark <gsstark@mit.edu> writes:
> > It was a *major* new feature that many people were waiting for when Oracle finally implemented live CREATE INDEX and REINDEX. The ability to create an index without blocking any operations on a table, even updates, was absolutely critical for 24x7 operation.
>
> Well, we're still not in *that* ballpark and I haven't seen any serious proposals to make us so. How "absolutely critical" is it really? Is REINDEX-in-parallel-with-reads-but-not-writes, which is what we actually have at the moment, an "absolutely critical" facility?

REINDEX isn't run that regularly, so perhaps it might warrant special attention. (I think there are other things we could do to avoid ever needing to run a REINDEX.)

CREATE/DROP INDEX is important however, since we may want to try out new index choices without stopping access altogether. But we also want the locking contention to be reduced...

I know at least one other RDBMS that uses optimistic locking when creating indexes. It checks the table description, builds the index with a read lock, then checks the table description again before attempting to lock the catalog, "create" the index and then complete. There is a risk of getting a "table restructured" error after the build is nearly complete. If we did that, then we wouldn't need to lock the indexes, because you wouldn't be able to see an index until it was built. Doing something similar might allow us to have online CREATEs yet without a locking overhead.

24x7 operation is actually fairly common. Maybe not with a strong SLA for availability, but many websites and embedded apps are out there all the time. The PostgreSQL claim to fame has concurrency at the top of the list, so we should assume that in all we do.

Best Regards, Simon Riggs
Simon Riggs <simon@2ndquadrant.com> writes:
> On Fri, 2005-12-02 at 02:14 -0500, Tom Lane wrote:
> > Greg Stark <gsstark@mit.edu> writes:
> > > It was a *major* new feature that many people were waiting for when Oracle finally implemented live CREATE INDEX and REINDEX. The ability to create an index without blocking any operations on a table, even updates, was absolutely critical for 24x7 operation.
> >
> > Well, we're still not in *that* ballpark and I haven't seen any serious proposals to make us so. How "absolutely critical" is it really? Is REINDEX-in-parallel-with-reads-but-not-writes, which is what we actually have at the moment, an "absolutely critical" facility?

Alright, I'll grant Tom that "absolutely critical" was a bit of hyperbole.

> I know at least one other RDBMS that uses optimistic locking when creating indexes. It checks the table description, builds the index with a read lock, then checks the table description again before attempting to lock the catalog, "create" the index and then complete. There is a risk of getting a "table restructured" error after the build is nearly complete.

I suspect this comes out of a very different storage model from Postgres's.

Postgres would have no trouble building an index of the existing data using only shared locks. The problem is that any newly inserted (or updated) records could be missing from such an index.

To do it you would then have to gather up all those newly inserted records. And of course while you're doing that new records could be inserted. And so on. There's no guarantee it would ever finish, though I suppose you could detect the situation if the size of the new batch wasn't converging to 0 and throw an error.

One optimization would be to have a flag that disabled the use of the FSM, forcing all inserts to extend the table and allocate new tuples at the end. This would at least limit the amount the index build would have to scan. The index build could just do one-by-one insertions for the remaining tuples until it catches up to the head.

At the end of the index build there's also a problem upgrading locks to put in place the new index. That would create a deadlock risk. Perhaps that's where the "table restructured" error comes up in these other databases?

> 24x7 operation is actually fairly common. Maybe not with a strong SLA for availability, but many websites and embedded apps are out there all the time. The PostgreSQL claim to fame has concurrency at the top of the list, so we should assume that in all we do.

Off the top of my head I would put these items on the list of "necessary for 24x7 operation":

. (non-FULL) VACUUM
. Online/PITR backups
. Partitioned tables
. Online index builds

Of which Postgres has 2.5 out of 4. And most of those have come in just the last 12 months or so. Doing pretty damned good.

--
greg
On 02 Dec 2005 15:25:58 -0500, Greg Stark <gsstark@mit.edu> wrote:
> I suspect this comes out of a very different storage model from Postgres's.
>
> Postgres would have no trouble building an index of the existing data using only shared locks. The problem is that any newly inserted (or updated) records could be missing from such an index.
>
> To do it you would then have to gather up all those newly inserted records. And of course while you're doing that new records could be inserted. And so on. There's no guarantee it would ever finish, though I suppose you could detect the situation if the size of the new batch wasn't converging to 0 and throw an error.

After you're mostly caught up, change locking behavior to block further updates while the final catchup happens. This could be driven by a heuristic that says: make up to N attempts to catch up without blocking, and after that just take a lock and finish the job. Presumably the catchup would be short compared to the rest of the work.

Are there environments which could not tolerate even this minimal hit? Probably, which leaves the choice of telling them "don't reindex then" or providing a knob which would tell it to never block (it would just try N times and then give up, failing the reindex).
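[Editor's note: a toy, self-contained C simulation of that heuristic — not PostgreSQL code; the insert rate, thresholds, and pass count are invented purely to show the shape of the loop: repeated non-blocking catch-up passes, an escape hatch if the backlog stops shrinking, and a short final pass with writers blocked.]

```c
/* Toy simulation of the bounded catch-up heuristic described above. */
#include <stdio.h>
#include <stdlib.h>

#define MAX_PASSES 5

/* tuples inserted by other backends while one catch-up pass runs;
 * in reality this depends on write load, here it is just simulated */
static long concurrent_inserts(long pass_len)
{
    return pass_len / 4 + rand() % 10;   /* shrinks as passes get shorter */
}

int main(void)
{
    long backlog = 1000000;              /* tuples present at snapshot time */

    for (int pass = 0; pass <= MAX_PASSES; pass++)
    {
        printf("pass %d: indexing %ld tuples without blocking writers\n",
               pass, backlog);
        long new_backlog = concurrent_inserts(backlog);
        if (new_backlog >= backlog)
        {
            printf("not converging; give up, or escalate to a blocking lock\n");
            return 1;
        }
        backlog = new_backlog;
        if (backlog < 100)               /* small enough to finish under a lock */
            break;
    }

    printf("final pass: block writers briefly to index %ld tuples\n", backlog);
    return 0;
}
```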
Gregory Maxwell wrote:
> On 02 Dec 2005 15:25:58 -0500, Greg Stark <gsstark@mit.edu> wrote:
> > I suspect this comes out of a very different storage model from Postgres's.
> >
> > Postgres would have no trouble building an index of the existing data using only shared locks. The problem is that any newly inserted (or updated) records could be missing from such an index.
> >
> > To do it you would then have to gather up all those newly inserted records. And of course while you're doing that new records could be inserted. And so on. There's no guarantee it would ever finish, though I suppose you could detect the situation if the size of the new batch wasn't converging to 0 and throw an error.
>
> After you're mostly caught up, change locking behavior to block further updates while the final catchup happens. This could be driven by a heuristic that says make up to N attempts to catch up without blocking, after that just take a lock and finish the job. Presumably the catchup would be short compared to the rest of the work.

The problem is that you need to upgrade the lock at the end of the operation. This is very deadlock prone, and likely to abort the whole operation just when it's going to finish. Is this a showstopper? Tom seems to think it is. I'm not sure anyone is going to be happy if they find that their two-day reindex was aborted just when it was going to finish.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Fri, 2005-12-02 at 19:04 -0300, Alvaro Herrera wrote:
> Gregory Maxwell wrote:
> > On 02 Dec 2005 15:25:58 -0500, Greg Stark <gsstark@mit.edu> wrote:
> > > I suspect this comes out of a very different storage model from Postgres's.
> > >
> > > Postgres would have no trouble building an index of the existing data using only shared locks. The problem is that any newly inserted (or updated) records could be missing from such an index.
> > >
> > > To do it you would then have to gather up all those newly inserted records. And of course while you're doing that new records could be inserted. And so on.

CREATE INDEX uses SnapshotAny, so the scan that feeds the build could easily include rows added after the CREATE INDEX started. When the scan was exhausted we could mark that last TID and return to it after the sort/build.

> > > There's no guarantee it would ever finish, though I suppose you could detect the situation if the size of the new batch wasn't converging to 0 and throw an error.
> >
> > After you're mostly caught up, change locking behavior to block further updates while the final catchup happens. This could be driven by a heuristic that says make up to N attempts to catch up without blocking, after that just take a lock and finish the job. Presumably the catchup would be short compared to the rest of the work.
>
> The problem is that you need to upgrade the lock at the end of the operation. This is very deadlock prone, and likely to abort the whole operation just when it's going to finish. Is this a showstopper? Tom seems to think it is. I'm not sure anyone is going to be happy if they find that their two-day reindex was aborted just when it was going to finish.

If that is the only objection against such a seriously useful feature, then we should look at making some exceptions. (I understand the lock upgrade issue.) Greg has come up with an exceptional idea here, so can we look deeper? We already know others have done it.

What types of statement would cause the index build to fail? How else can we prevent them from executing while the index is being built?

Best Regards, Simon Riggs
On 12/2/05, Alvaro Herrera wrote:
> Gregory Maxwell wrote:
> > After you're mostly caught up, change locking behavior to block further updates while the final catchup happens. This could be driven by a heuristic that says make up to N attempts to catch up without blocking, after that just take a lock and finish the job. Presumably the catchup would be short compared to the rest of the work.
>
> The problem is that you need to upgrade the lock at the end of the operation. This is very deadlock prone, and likely to abort the whole operation just when it's going to finish. Is this a showstopper? Tom seems to think it is. I'm not sure anyone is going to be happy if they find that their two-day reindex was aborted just when it was going to finish.

How about the following scenario for building a new index:

- create an empty index
- flag it as incomplete
- commit it so it becomes visible to new transactions
- new transactions will update the index when inserting / updating
- the planner will not use it for queries because it is flagged as incomplete
- wait until the index is visible to all running transactions
- start a new seqscan and insert all records in the index
- commit
- remove the incomplete flag

Wouldn't this overcome the lock upgrade problem?

Jochem
Simon Riggs <simon@2ndquadrant.com> writes:
> CREATE INDEX uses SnapshotAny, so the scan that feeds the build could easily include rows added after the CREATE INDEX started. When the scan was exhausted we could mark that last TID and return to it after the sort/build.

And do what? This has nothing to do with the fundamental problem of never being able to catch up unless you can upgrade your lock to exclude writes. What's worse, once you have excluded writes you have to rescan the entire table to be sure you haven't missed anything. So in the scenarios where this whole thing is actually interesting, ie enormous tables, you're still talking about a fairly long interval with writes locked out. Maybe not as long as a complete REINDEX, but long.

			regards, tom lane
Jochem van Dieten <jochemd@gmail.com> writes:
> How about the following scenario for building a new index:
> - create an empty index
> - flag it as incomplete
> - commit it so it becomes visible to new transactions
> - new transactions will update the index when inserting / updating
> - the planner will not use it for queries because it is flagged as incomplete
> - wait until the index is visible to all running transactions
> - start a new seqscan and insert all records in the index
> - commit
> - remove the incomplete flag
> Wouldn't this overcome the lock upgrade problem?

Doesn't really solve the problem for REINDEX, though. Presumably, the reason that you are REINDEXing is that you would like to defragment the existing index. Well, that requires collecting all the index entries and sorting them. The above method is not going to produce a nicely sorted index; whatever entries get made on-the-fly during the first stage are going to determine the index shape.

This same problem applies to the build-lock-catchup paradigm, although less severely, since you can hope that the entries to be added on at the end are only a small part of the total and will fit in the excess space that you leave in the index leaf pages. If you spend too long catching up, though (as in the multiple-pass ideas that various people were advocating), you'll end up with an index messy enough that it's questionable why you bothered.

			regards, tom lane
On Fri, 2005-12-02 at 17:45 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > CREATE INDEX uses SnapshotAny, so the scan that feeds the build could easily include rows added after the CREATE INDEX started. When the scan was exhausted we could mark that last TID and return to it after the sort/build.
>
> And do what? This has nothing to do with the fundamental problem of never being able to catch up unless you can upgrade your lock to exclude writes. What's worse, once you have excluded writes you have to rescan the entire table to be sure you haven't missed anything. So in the scenarios where this whole thing is actually interesting, ie enormous tables, you're still talking about a fairly long interval with writes locked out. Maybe not as long as a complete REINDEX, but long.

Those are technical difficulties that I believe can be solved. Greg was right: online index builds are very useful (for us).

Best Regards, Simon Riggs
On Friday 02 December 2005 09:53, Simon Riggs wrote:
> On Fri, 2005-12-02 at 02:14 -0500, Tom Lane wrote:
> > Greg Stark <gsstark@mit.edu> writes:
> > > It was a *major* new feature that many people were waiting for when Oracle finally implemented live CREATE INDEX and REINDEX. The ability to create an index without blocking any operations on a table, even updates, was absolutely critical for 24x7 operation.
> >
> > Well, we're still not in *that* ballpark and I haven't seen any serious proposals to make us so. How "absolutely critical" is it really? Is REINDEX-in-parallel-with-reads-but-not-writes, which is what we actually have at the moment, an "absolutely critical" facility?
>
> REINDEX isn't run that regularly, so perhaps it might warrant special attention. (I think there are other things we could do to avoid ever needing to run a REINDEX.)
>
> CREATE/DROP INDEX is important however, since we may want to try out new index choices without stopping access altogether. But we also want the locking contention to be reduced...

Just thought I'd toss in this random data point... I know I still have at least one 7.3 system running where reindexes are a part of the regular routine, and the ability to query against the table simultaneously is certainly approaching "absolutely critical" territory. Hoping to get that system upgraded by the end of the month, at which point the frequency of reindex will surely decrease, but I'm not sure it's going to go away completely. I could probably get by at that point with DROP/CREATE INDEX but it wouldn't be my preferred way to do it.

--
Robert Treat
Build A Brighter Lamp :: Linux Apache {middleware} PostgreSQL
Tom Lane <tgl@sss.pgh.pa.us> writes:
> What's worse, once you have excluded writes you have to rescan the entire table to be sure you haven't missed anything. So in the scenarios where this whole thing is actually interesting, ie enormous tables, you're still talking about a fairly long interval with writes locked out. Maybe not as long as a complete REINDEX, but long.

I was thinking you would set a flag to disable use of the FSM for inserts/updates while the reindex was running. So you would know where to find the new tuples, at the end of the table after the last tuple you read.

--
greg
Alvaro Herrera <alvherre@commandprompt.com> writes:
> The problem is that you need to upgrade the lock at the end of the operation. This is very deadlock prone, and likely to abort the whole operation just when it's going to finish. Is this a showstopper? Tom seems to think it is. I'm not sure anyone is going to be happy if they find that their two-day reindex was aborted just when it was going to finish.

How likely is this really to be a problem in this particular case? Obviously if two people try to reindex the same index they'll have a problem, but that's not really a problem. (Postgres should probably do something to block that up front rather than wait until the end to fail.)

Other than that case is there any other case the reindex could deadlock with? I'm a bit hampered thinking about this because I'm not really sure exactly what locks a reindex needs and what else takes those locks.

--
greg
Greg Stark wrote:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
> > What's worse, once you have excluded writes you have to rescan the entire table to be sure you haven't missed anything. So in the scenarios where this whole thing is actually interesting, ie enormous tables, you're still talking about a fairly long interval with writes locked out. Maybe not as long as a complete REINDEX, but long.
>
> I was thinking you would set a flag to disable use of the FSM for inserts/updates while the reindex was running. So you would know where to find the new tuples, at the end of the table after the last tuple you read.

If REINDEX works by seqscanning the table then the inclusion of new tuples would happen for free if you turn off the FSM before beginning the REINDEX operation -- you're guaranteed to see them last. But that only works if REINDEX behaves this way. Then it's a question of what to do with in-flight updates at the time the REINDEX hits the end of the table.

Even if REINDEX hits the table in non-sequential order, turning off the FSM should still work. REINDEX wouldn't need to acquire any additional locks until after it has scanned the appended area. So the way I (perhaps naively) envision it working is:

1. Acquire read lock on the table
2. Turn off FSM
3. Note the location of the end of the table
4. Release read lock on the table
5. Perform REINDEX operation
6. Read and index the bit of the table starting with the location noted previously.
7. Note new end of table
8. Acquire read lock on the table
9. Scan any entries that have been appended past new end of table.
10. Release read lock on table
11. Turn on FSM

In the above for large relations, the bulk of the REINDEX should happen without any locks being held by the REINDEX operation. For small tables (where the amount of new insert activity can be a large percentage of the total table size), it would almost certainly be more efficient to just take a read lock for the whole operation. So it might be wise to set up some sort of threshold, and to take a read lock during the whole operation if the table size is smaller than the threshold.

The reason the sequence I enumerate above involves taking any locks at all is to avoid the issues that Tom brought up about having to rescan the entire table to make sure nothing gets missed, to avoid possible race conditions between steps 2 and 3, and to allow step 9 to definitively complete, since otherwise in-flight updates would be missed.

In the context of the original discussion (reduction of lock acquisition), REINDEX isn't a common operation even if it is a critical one, so acquisition of more than the usual number of locks here shouldn't be a big deal.

--
Kevin Brown	kevin@sysexperts.com
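[Editor's note: a toy C model of the watermark sequence above — not PostgreSQL internals. The heap is just a growing counter and every helper is invented; the point is only that, with the FSM off, new tuples can appear only past a noted watermark, so writers need to be blocked solely for the final sliver.]

```c
/* Toy model of "note end of table, index the rest later". */
#include <stdio.h>

static long heap_end = 1000;        /* tuples currently in the heap */

static void concurrent_appends(long n) { heap_end += n; }

static void index_range(const char *phase, long from, long to)
{
    printf("%s: indexing tuples [%ld, %ld)\n", phase, from, to);
}

int main(void)
{
    /* steps 1-4: briefly lock, turn off the FSM, note the end of heap */
    long watermark = heap_end;

    /* step 5: the long, unlocked part of the rebuild */
    index_range("bulk build (no lock)", 0, watermark);
    concurrent_appends(200);        /* writers keep appending meanwhile */

    /* steps 6-7: index the appended tail, note the new end */
    long new_watermark = heap_end;
    index_range("catch-up (no lock)", watermark, new_watermark);
    concurrent_appends(5);          /* a little more arrives */

    /* steps 8-10: take the lock only for whatever slipped in last */
    index_range("final sliver (writers blocked)", new_watermark, heap_end);

    /* step 11: turn the FSM back on */
    return 0;
}
```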
Greg Stark <gsstark@mit.edu> writes:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
> > What's worse, once you have excluded writes you have to rescan the entire table to be sure you haven't missed anything.
>
> I was thinking you would set a flag to disable use of the FSM for inserts/updates while the reindex was running. So you would know where to find the new tuples, at the end of the table after the last tuple you read.

There are paths that do not consult the FSM, eg update where the new tuple fits on the same page as the old, or where the backend has already been given a target page by the FSM.

And this "set a flag" idea in itself has enormous costs, because the flag will become a central point of contention: it *must* be touched by every single tuple insertion.

			regards, tom lane
Greg Stark <gsstark@mit.edu> writes:
> Other than that case is there any other case the reindex could deadlock with?

Only SELECT, INSERT, UPDATE, and DELETE.

			regards, tom lane
Kevin Brown <kevin@sysexperts.com> writes:
> In the above for large relations, the bulk of the REINDEX should happen without any locks being held by the REINDEX operation.

As I just pointed out to Greg, the arm-waving notion that you can "turn off the FSM" requires a great deal of low-level locking that is not there now. Even ignoring that, you *still* have a lock upgrade problem in this sketch.

			regards, tom lane
Tom Lane wrote:
> Kevin Brown <kevin@sysexperts.com> writes:
> > In the above for large relations, the bulk of the REINDEX should happen without any locks being held by the REINDEX operation.
>
> As I just pointed out to Greg, the arm-waving notion that you can "turn off the FSM" requires a great deal of low-level locking that is not there now.

Yeah, I thought that the check for use of the FSM was already being done by the lower level operators, and that contention would only occur on the modification of the FSM "enabled" flag. Obviously this doesn't work at all if the FSM is just assumed to be in use at all times, or if the FSM values are read only at transaction start or something...

> Even ignoring that, you *still* have a lock upgrade problem in this sketch.

Hmm, well, I can see a deadlock potential for those operations that have to acquire multiple locks simultaneously, and I suppose that the table + FSM lock would qualify in the sequence here, but the rest of it involves just a single read lock against the table. What am I missing?

--
Kevin Brown	kevin@sysexperts.com
Kevin Brown <kevin@sysexperts.com> writes:
> Tom Lane wrote:
> > Even ignoring that, you *still* have a lock upgrade problem in this sketch.
>
> Hmm, well, I can see a deadlock potential for those operations that have to acquire multiple locks simultaneously, and I suppose that the table + FSM lock would qualify in the sequence here, but the rest of it involves just a single read lock against the table. What am I missing?

At some point you have to lock out writers, else you can never be certain you have all the tuples. I was taking your "read lock" to actually mean "lock out writers"; otherwise the sketch doesn't work at all.

The real situation is that you must hold at least AccessShareLock on the table throughout the entire operation, else you have no defense against (say) someone dropping the index or the entire table out from under you. And when you add onto this lock in order to lock out writers temporarily, you are engaging in a lock upgrade, which can deadlock against any sort of exclusive lock request. The fact that you've been holding the AccessShareLock for quite a long time means that the window for deadlock problems is very wide.

			regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes:
> Greg Stark <gsstark@mit.edu> writes:
> > Other than that case is there any other case the reindex could deadlock with?
>
> Only SELECT, INSERT, UPDATE, and DELETE.

How does that happen? What exclusive locks do these take that reindex would conflict with? I guess I'm missing some basic piece of the puzzle here.

I've also started wondering if the other approach proposed doesn't have more merit than it was granted, namely the plan to create an "incomplete index" that other transactions update appropriately as they make changes and then insert the records one by one.

The main objection given was that the newly created index wouldn't be as cleanly structured as a real reindex. But it seems like that's something that can be overcome. There's plenty of research on how to build trees that withstand sequential inserts, for example. In the worst case Postgres could sort and build the entire index temporarily to decide where the page splits ought to be, and then arrange to insert the new records in the real index such that the page splits end up in those places.

The unstated objection that I think is more severe here is that performance would suck. Repeatedly inserting individual records would be much slower than a real reindex. And worse, it would slow down regular operation since every update and insert would need to contend with this batch job to get their chance to insert their entries.

But I'm not sure performance sucking is really that big of a problem here. There's always the option of an offline REINDEX if you can withstand the downtime. To do an online reindex you'll just need a faster system. Nobody ever said 24x7 operation wouldn't have costs.

--
greg
On Sat, 2005-12-03 at 08:47 -0500, Robert Treat wrote:
> On Friday 02 December 2005 09:53, Simon Riggs wrote:
> > On Fri, 2005-12-02 at 02:14 -0500, Tom Lane wrote:
> > > Greg Stark <gsstark@mit.edu> writes:
> > > > It was a *major* new feature that many people were waiting for when Oracle finally implemented live CREATE INDEX and REINDEX. The ability to create an index without blocking any operations on a table, even updates, was absolutely critical for 24x7 operation.
> > >
> > > Well, we're still not in *that* ballpark and I haven't seen any serious proposals to make us so. How "absolutely critical" is it really? Is REINDEX-in-parallel-with-reads-but-not-writes, which is what we actually have at the moment, an "absolutely critical" facility?
> >
> > REINDEX isn't run that regularly, so perhaps it might warrant special attention. (I think there are other things we could do to avoid ever needing to run a REINDEX.)
> >
> > CREATE/DROP INDEX is important however, since we may want to try out new index choices without stopping access altogether. But we do also want the locking contention to be reduced...
>
> Just thought I'd toss in this random data point... I know I still have at least one 7.3 system running where reindexes are a part of the regular routine and the ability to query against the table simultaneously is certainly approaching "absolutely critical" territory. Hoping to get that system upgraded by the end of the month, at which point the frequency of reindex will surely decrease, but I'm not sure it's going to go away completely. I could probably get by at that point with DROP/CREATE INDEX but it wouldn't be my preferred way to do it.

Understood. At 7.3, REINDEXing is essential, since rows never got deleted and space was not reused. That is not the case now, hence a REINDEX is less often required.

Best Regards, Simon Riggs
Simon Riggs wrote:
> Understood. At 7.3, REINDEXing is essential, since rows never got deleted and space was not reused. That is not the case now, hence a REINDEX is less often required.

But it's still required or at least desirable under some circumstances. If it could be improved, it would be good.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
On 12/3/05, Tom Lane wrote:
> Jochem van Dieten writes:
> > How about the following scenario for building a new index:
> > - create an empty index
> > - flag it as incomplete
> > - commit it so it becomes visible to new transactions
> > - new transactions will update the index when inserting / updating
> > - the planner will not use it for queries because it is flagged as incomplete
> > - wait until the index is visible to all running transactions
> > - start a new seqscan and insert all records in the index
> > - commit
> > - remove the incomplete flag
> > Wouldn't this overcome the lock upgrade problem?
>
> Doesn't really solve the problem for REINDEX, though. Presumably, the reason that you are REINDEXing is that you would like to defragment the existing index. Well, that requires collecting all the index entries and sorting them. The above method is not going to produce a nicely sorted index; whatever entries get made on-the-fly during the first stage are going to determine the index shape.

So how about merging this approach with the catch-up approach?

- take an access share lock so the index doesn't change
- build the index in a non-locking approach
- flag the index as incomplete
- commit it
- from now on new transactions will register their changes in the index
- wait until it is visible to all
- do a table scan with special xmin-xmax comparison
- commit the missing tuples
- remove the incomplete flag

Jochem
On Sat, 2005-12-03 at 17:16 -0300, Alvaro Herrera wrote:
> Simon Riggs wrote:
> > Understood. At 7.3, REINDEXing is essential, since rows never got deleted and space was not reused. That is not the case now, hence a REINDEX is less often required.
>
> But it's still required or at least desirable under some circumstances. If it could be improved, it would be good.

100% Agreed. I just think having an online CREATE INDEX is a much more important thing, FWIW.

Best Regards, Simon Riggs
"Tom Lane" <tgl@sss.pgh.pa.us> wrote > > The real situation is that you must hold at least AccessShareLock on the > table throughout the entire operation, else you have no defense against > (say) someone dropping the index or the entire table out from under you. > And when you add onto this lock in order to lock out writers > temporarily, you are engaging in a lock upgrade, which can deadlock > against any sort of exclusive lock request. The fact that you've been > holding the AccessShareLock for quite a long time means that the window > for deadlock problems is very wide. > Maybe the deadlock problem is solvable, our current deadlock removal mechanism is like this: /* who wakes up first who removes himself -- quite unfair :-( */RemoveFromWaitQueue(MyProc); What if we change to cost-based removal, i.e., remove the one whose cost is smaller. In this case, an two-days-to-be-done reindex should never get killed. Regards, Qingqing
"Qingqing Zhou" <zhouqq@cs.toronto.edu> writes: > What if we change to cost-based removal, i.e., remove the one whose cost is > smaller. In this case, an two-days-to-be-done reindex should never get > killed. Instead, it would kill all useful work in your system :-(. An old transaction would be the one invincible gunslinger in town, mowing down every transaction that dared to go up against it, and getting older and even less vincible with every win. Yet this would still not give it any guarantee of ever being able to commit, as long as there's a constant stream of competition. Certainly it's interesting to think about a smarter deadlock-escape algorithm, but that's no solution compared to not getting into a deadlock in the first place. regards, tom lane
Tom Lane wrote:
> Kevin Brown <kevin@sysexperts.com> writes:
> > Tom Lane wrote:
> > > Even ignoring that, you *still* have a lock upgrade problem in this sketch.
> >
> > Hmm, well, I can see a deadlock potential for those operations that have to acquire multiple locks simultaneously, and I suppose that the table + FSM lock would qualify in the sequence here, but the rest of it involves just a single read lock against the table. What am I missing?
>
> At some point you have to lock out writers, else you can never be certain you have all the tuples. I was taking your "read lock" to actually mean "lock out writers"; otherwise the sketch doesn't work at all.

Right, but the idea is to lock out writers for as brief a time as possible. That not only minimizes the possibility of lock contention but guarantees that REINDEX will get a complete view of the database. That said, it hinges on some sort of efficient way of identifying the new tuples created by other transactions that are/were running during the bulk of the time REINDEX was running. If there's no good way to do that, then there's no way to avoid blocking writers for an extended period of time.

> The real situation is that you must hold at least AccessShareLock on the table throughout the entire operation, else you have no defense against (say) someone dropping the index or the entire table out from under you. And when you add onto this lock in order to lock out writers temporarily, you are engaging in a lock upgrade, which can deadlock against any sort of exclusive lock request.

But won't that depend on the order in which the lock requests appear? If locks A, B, and C are taken in that same order by every transaction, then there's no possibility of deadlock, right?

> The fact that you've been holding the AccessShareLock for quite a long time means that the window for deadlock problems is very wide.

But with respect to deadlocks, that's true only if deadlocks are possible, which is true only if the order of lock acquisition differs between transactions.

I guess the real question here is: is it possible to, in code, guarantee the order of lock acquisition by any given transaction? For REINDEX, the problem is simplified somewhat because it's operating against a single index and a single table, so the locks it has to acquire are somewhat limited in scope compared with a generic transaction.

An endeavor to acquire all locks in the same order throughout the code would not only take care of this REINDEX deadlock problem but would essentially eliminate all possible deadlocks arising from code-ordered lock acquisition in the system, which I expect would be considered a very good thing indeed. But I expect it would be a lot of effort and wouldn't be worth it just to make REINDEX behave differently than it does now.

So what am I missing/overlooking here?

--
Kevin Brown	kevin@sysexperts.com
Kevin Brown <kevin@sysexperts.com> writes:
> I guess the real question here is: is it possible to, in code, guarantee the order of lock acquisition by any given transaction?

Yes, but not in our code :-(. This is largely determined by what the application does.

			regards, tom lane
Tom Lane wrote:
> Kevin Brown <kevin@sysexperts.com> writes:
> > I guess the real question here is: is it possible to, in code, guarantee the order of lock acquisition by any given transaction?
>
> Yes, but not in our code :-(. This is largely determined by what the application does.

Yeah, that's what I figured. But what of the more limited problem of lock acquisition relative to the locks that REINDEX might need to acquire? Since those locks are limited to a single table and a single index, I'd think the problem wouldn't be insurmountable. No?

Suppose the following rules were enforced in the code:

- when requesting a type of lock, one must first acquire all lesser lock types against the object in order of strength. Hence, one must acquire AccessShareLock before acquiring AccessExclusiveLock.

- one must always acquire a given lock type against the table before acquiring it against the index.

The latter rule might be sufficient, if the former rule is already implied by the lock types and how they're acquired. Thus, acquisition of AccessExclusiveLock against the index automatically implies acquisition of AccessShareLock(table), then AccessShareLock(index), then AccessExclusiveLock(table), then AccessExclusiveLock(index).

I could see that possibly causing performance problems (and would be interested in knowing what performance problems it would likely cause), but given the limited scope of the locks that REINDEX needs, wouldn't the above be sufficient to keep REINDEX deadlock-free against other transactions?

--
Kevin Brown	kevin@sysexperts.com
Kevin Brown <kevin@sysexperts.com> writes:
> > The fact that you've been holding the AccessShareLock for quite a long time means that the window for deadlock problems is very wide.
>
> But with respect to deadlocks, that's true only if deadlocks are possible, which is true only if the order of lock acquisition differs between transactions.

Taking locks in different orders is one way to trigger deadlocks. But it's not the only way. If two threads both take a shared lock on a resource, then try to upgrade it to an exclusive lock, they'll both be waiting on the other thread to release its shared lock... And in this case they're acquiring the locks in precisely the same order.

What I don't get from what Tom said was that he implied REINDEX would deadlock against a regular thread acquiring a plain old exclusive lock even when it's not also a lock upgrade. It seems to me that REINDEX should be able to get in and get its exclusive lock and finish its job, and then the other job should be able to get its exclusive lock and complete fine.

--
greg
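[Editor's note: a tiny, self-contained C illustration of the upgrade deadlock described above — not real lock-manager code. Exclusive can only be granted to the sole remaining holder, so two sessions that both hold the shared lock and both ask to upgrade end up each waiting for the other, even though they acquired their locks in exactly the same order.]

```c
/* Two sessions hold a shared lock; both request an upgrade to exclusive. */
#include <stdio.h>
#include <stdbool.h>

typedef struct {
    int  shared_holders;      /* how many sessions hold the shared lock */
    bool held_by[2];          /* which of our two sessions hold it */
} Lock;

/* an upgrade can be granted only if nobody else still holds the shared lock */
static bool can_upgrade(const Lock *l, int session)
{
    return l->held_by[session] && l->shared_holders == 1;
}

int main(void)
{
    Lock l = { .shared_holders = 2, .held_by = { true, true } };

    for (int s = 0; s < 2; s++)
        printf("session %d: upgrade to exclusive %s\n", s,
               can_upgrade(&l, s) ? "granted" : "must wait");

    /* both must wait on each other: a deadlock with identical lock ordering */
    return 0;
}
```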
Kevin Brown <kevin@sysexperts.com> writes:
> - when requesting a type of lock, one must first acquire all lesser lock types against the object in order of strength. Hence, one must acquire AccessShareLock before acquiring AccessExclusiveLock.

This is exactly wrong ...

			regards, tom lane
Tom Lane wrote:
> Kevin Brown <kevin@sysexperts.com> writes:
> > - when requesting a type of lock, one must first acquire all lesser lock types against the object in order of strength. Hence, one must acquire AccessShareLock before acquiring AccessExclusiveLock.
>
> This is exactly wrong ...

And now I see why, since it will introduce deadlocks (sigh). But what of the other rule (always acquiring locks against the table before the index)? You may have stopped reading at the above...

One thing I don't quite understand about the discussion is why there's particular attention being paid to deadlocks with respect to REINDEX when it clearly can happen in the general case when lock promotion is involved. Why is REINDEX special here?

If the problem is that REINDEX has to hold an AccessShareLock to prevent the table or index from being dropped, but does not need to prevent writers in general (because the presumption is that there is some way of efficiently discovering the additional modifications made during the bulk of REINDEX processing), then doesn't that mean that an AccessShareLock is the wrong kind of lock for REINDEX to be holding, and that the appropriate type of lock should be created if it doesn't already exist?

Additionally, I was under the impression that normal INSERTs, UPDATEs, and DELETEs didn't generally need to acquire AccessExclusiveLock, because of MVCC. If that's the case, then aren't the operations that could deadlock REINDEX relatively rare? And if those operations *do* need to acquire that lock type, then what exactly does MVCC buy you?

--
Kevin Brown	kevin@sysexperts.com
On Sun, Dec 04, 2005 at 10:40:55PM -0800, Kevin Brown wrote:
> One thing I don't quite understand about the discussion is why there's particular attention being paid to deadlocks with respect to REINDEX when it clearly can happen in the general case when lock promotion is involved. Why is REINDEX special here?

Although there is a general case, I don't believe it is a common case. INSERT creates the rows invisible to other transactions. UPDATE and DELETE would lock the row for writing. Normally, lock promotion isn't required. If, however, a person locks the row for reading using SELECT, and then attempts to upgrade the lock by SELECT, UPDATE, or DELETE, this would result in a lock promotion.

> If the problem is that REINDEX has to hold an AccessShareLock to prevent the table or index from being dropped, but does not need to prevent writers in general (because the presumption is that there is some way of efficiently discovering the additional modifications made during the bulk of REINDEX processing), then doesn't that mean that an AccessShareLock is the wrong kind of lock for REINDEX to be holding, and that the appropriate type of lock should be created if it doesn't already exist?

I think the problem is the 'efficiently discovering the additional modifications during' REINDEX processing. A few ideas have been floating around - but none proven.

Even the (rather clever in my opinion) solution that would create the index incomplete, allowing new updates to be written to the index automatically, while REINDEX fills in the rest of it in the background before marking it complete, still has problems. If it is a new unique index, then for a time, the unique constraint would not be enforced. I think it has to be tweaked some, such that the new incomplete index would be created alongside the other index, and once the indexes match up, remove the original, and rename the new one into place.

> Additionally, I was under the impression that normal INSERTs, UPDATEs, and DELETEs didn't generally need to acquire AccessExclusiveLock, because of MVCC. If that's the case, then aren't the operations that could deadlock REINDEX relatively rare? And if those operations *do* need to acquire that lock type, then what exactly does MVCC buy you?

REINDEX needs to see visible and invisible rows. This goes back to the previous point: efficiently and reliably discovering newly created rows.

Cheers,
mark

--
mark@mielke.cc / markm@ncf.ca / markm@nortel.com
Thread: Concurrent CREATE INDEX, try 2 (was Re: Reducing relation locking overhead)
On Fri, 2005-12-02 at 02:14, Tom Lane wrote:
> Greg Stark <gsstark@mit.edu> writes:
> > It was a *major* new feature that many people were waiting for when Oracle finally implemented live CREATE INDEX and REINDEX. The ability to create an index without blocking any operations on a table, even updates, was absolutely critical for 24x7 operation.
>
> Well, we're still not in *that* ballpark and I haven't seen any serious proposals to make us so. How "absolutely critical" is it really? Is REINDEX-in-parallel-with-reads-but-not-writes, which is what we actually have at the moment, an "absolutely critical" facility?

I don't think REINDEX is very critical (except if for some reason you have failed to vacuum a high-traffic table for some time and need to get index sizes down).

OTOH, being able to add indexes on a live database is sometimes required, and neither of the current ways (accept a few hours of downtime, or use a Slony replica and do a switchover) is always acceptable. This capability is reportedly present in MSSQL and available for Oracle if you get the more expensive Enterprise Edition.

So, after more thinking, I have come up with a proposal for fully concurrent (read+write) create index, which should need a minimal amount of locking.

Concurrent CREATE INDEX
========================

Concurrent index NDX1 on table TAB1 is created like this:

1) start transaction. take a snapshot SNAP1

1.1) optionally, remove pages for TAB1 from FSM to force (?) all newer inserts/updates to happen at end of table (won't work for in-page updates without code changes)

2) create the index as we do now, but only for pages which are visible in SNAP1

3) record the index in pg_class, but mark it as "do not use for lookups" in a new field. Take snapshot SNAP2. commit transaction.

-- at this point all new inserts and updates will be recorded in NDX1

4) Run a full scan over TAB1 and add all rows that are visible in SNAP2 but not in SNAP1 to NDX1. (if there is some way (like p1.1) to restrict or record the area in heap that new tuples go to, then this can be done more efficiently than a full scan)

5) record the status of the index as "ready for use".

-- now the index is fully created and usable

This is in no way a final proposal, but rather a starting point for discussion of how things might be doable.

For example p.3 is probably tricky to do in a way that all backends pick up at the right time. This will need most places that do table updates to be reviewed to make sure that they check for new indexes.

Any comments are appreciated.

-------------
Hannu Krosing
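[Editor's note: a toy C walk-through of steps 1-5 above — not PostgreSQL code. Tuples are just creation-order integers, a "snapshot" is simply the highest number visible when it was taken, and the race between SNAP2 and the commit that is discussed in the following messages is not modeled.]

```c
/* Toy model of the two-snapshot concurrent CREATE INDEX plan. */
#include <stdio.h>

#define INDEXED_MAX 4096
static int indexed[INDEXED_MAX];
static int nindexed = 0;

static void index_insert(int tuple) { indexed[nindexed++] = tuple; }

int main(void)
{
    int heap_end = 100;             /* tuples 0..99 exist at the start */

    int snap1 = heap_end;           /* step 1: take SNAP1 */

    for (int t = 0; t < snap1; t++) /* step 2: build only what SNAP1 sees */
        index_insert(t);

    heap_end += 20;                 /* inserts that happened during step 2 */

    int snap2 = heap_end;           /* step 3: index becomes visible but is
                                     * flagged "do not use for lookups";
                                     * take SNAP2 and commit.  Writers from
                                     * here on maintain the index themselves. */

    for (int t = snap1; t < snap2; t++)
        index_insert(t);            /* step 4: add rows visible in SNAP2
                                     * but not in SNAP1 */

    /* step 5: mark the index "ready for use" */
    printf("index holds %d of %d tuples\n", nindexed, heap_end);
    return 0;
}
```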
Kevin Brown <kevin@sysexperts.com> writes:
> And now I see why, since it will introduce deadlocks (sigh). But what of the other rule (always acquiring locks against the table before the index)? You may have stopped reading at the above...

We already do that.

> One thing I don't quite understand about the discussion is why there's particular attention being paid to deadlocks with respect to REINDEX when it clearly can happen in the general case when lock promotion is involved. Why is REINDEX special here?

Because what people are asking for is an on-line REINDEX, ie, something that will go through in the presence of concurrent updates. Most other sorts of DDL changes to tables take exclusive locks so they don't have to worry about concurrency. (There is of course some risk of deadlock from the exclusive lock, but at least it happens *before* you've done a lot of work, not *after*.)

> Additionally, I was under the impression that normal INSERTs, UPDATEs, and DELETEs didn't generally need to acquire AccessExclusiveLock, because of MVCC. If that's the case, then aren't the operations that could deadlock REINDEX relatively rare?

The concern about deadlock applies to the various proposals that involve upgrading to a write-prevention lock at some late point in the process. That clearly has the potential to deadlock against relatively weak lock requests.

			regards, tom lane
I wonder if it would work to release the AccessShareLock before acquiring the ExclusiveLock. Of course, this would let any ALTER TABLE or DROP TABLE do anything they wanted, but we could check that the table is still the same after reacquiring the exclusive lock. REINDEX would have to abort if anything unexpected happened to the table while the REINDEX transaction was waiting after releasing the shared lock.

What's not at all clear to me is at what time the lock is "upgraded". Just after scanning the heap for the first time? In this case, how do we detect all the tuples that need to be inserted after we acquire the exclusive lock? Are they listed somewhere?

I imagine each table could have a global flag telling "there is an online reindex running for this relation". If this flag is set, each insert/delete to the index needs to save aside somewhere the CTIDs of tuples it is inserting/deleting. So the protocol for reindex could be:

	acquire vacuum lock
	acquire read lock
	set REINDEX flag
	build the bulk of the index
	  -- this takes a lot of time ...
	  -- meanwhile other transactions save CTIDs in a "spill area"
	release read lock
	acquire exclusive lock
	recheck the table, abort if something weird happened
	read the spill area, insert/delete from the index as appropriate
	mark the index as complete
	release exclusive lock
	release vacuum lock

The "vacuum lock" is designed to leave any VACUUM out of the equation, but let any select/insert/update/delete run. Maybe this lock could leave ALTER TABLE and other stuff out too. Not sure if we have something like this in our lock table -- if not, can we create it?

Note that by this proposal, any DML gets more expensive -- but if it's the normal case (no REINDEX running), it's only a flag check.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
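[Editor's note: a toy C model of that protocol — not real code; the flag, the spill array, and the lock names are all stand-ins. It only shows that writers pay nothing but a flag check in the common case, and that the exclusive-lock window is reduced to replaying the spill area.]

```c
/* Toy model of the "spill area" protocol for online REINDEX. */
#include <stdio.h>
#include <stdbool.h>

typedef struct { int ctid; bool is_insert; } SpillEntry;

static bool       reindex_in_progress = false;
static SpillEntry spill[1024];
static int        nspill = 0;

/* what every writer does when it touches the table */
static void writer_touch(int ctid, bool is_insert)
{
    if (reindex_in_progress)               /* normally just a flag check */
        spill[nspill++] = (SpillEntry){ ctid, is_insert };
}

int main(void)
{
    reindex_in_progress = true;             /* set under the "vacuum lock" */

    /* ... long, unlocked bulk build of the new index happens here ... */
    writer_touch(101, true);                /* a concurrent INSERT */
    writer_touch(55, false);                /* a concurrent DELETE */

    /* exclusive-lock window: replay the spill area into the new index */
    for (int i = 0; i < nspill; i++)
        printf("%s ctid %d in the new index\n",
               spill[i].is_insert ? "insert" : "remove", spill[i].ctid);

    reindex_in_progress = false;            /* mark index complete, unlock */
    return 0;
}
```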
Tom Lane wrote:
> The concern about deadlock applies to the various proposals that involve upgrading to a write-prevention lock at some late point in the process. That clearly has the potential to deadlock against relatively weak lock requests.

After looking at the various lock types, I don't see how this is the case at all (which may mean that I'm more confused than ever. But please read on).

It seems to me that only ops that promote to AccessExclusiveLock can deadlock against at least some of the proposed REINDEX implementations. REINDEX would have to initially grab AccessShareLock, of course, but AccessExclusiveLock is the only lock type that blocks against it, so in the case of lock promotion the only operations that would cause REINDEX to really deadlock (as opposed to simply blocking) are operations on the entire table (ALTER TABLE, DROP TABLE, etc.). None of the common operations block against an AccessShareLock, and the order of acquisition against objects (table vs index) is already enforced, so where's the deadlock potential?

REINDEX would, I expect, promote its lock to ShareLock when it's time for it to block writers. That would block against quite a number of operations, of course, but that's not a problem in and of itself, because it need only wait until the operations in question are finished. The lock won't be granted until those other operations are finished, and nothing aside from table-level ops will block against the REINDEX until that lock is granted.

A true deadlock won't happen against common operations unless REINDEX promotes its lock again to something stronger than ShareLock, and that's easy to avoid: just have REINDEX promote directly from AccessShareLock to the strongest lock it will ever need.

--
Kevin Brown	kevin@sysexperts.com
On 12/5/05, Hannu Krosing wrote:
>
> Concurrent CREATE INDEX
> ========================
>
> Concurrent index NDX1 on table TAB1 is created like this:
>
> 1) start transaction. take a snapshot SNAP1
>
> 1.1) optionally, remove pages for TAB1 from FSM to force (?) all newer inserts/updates to happen at end of table (won't work for in-page updates without code changes)
>
> 2) create the index as we do now, but only for pages which are visible in SNAP1
>
> 3) record the index in pg_class, but mark it as "do not use for lookups" in a new field. Take snapshot SNAP2. commit transaction.

What happens if another transaction takes a snapshot between SNAP2 and the commit? Wouldn't you need a lock to guard against that? (Not that I know whether that is possible or desirable.)

Jochem
Jochem van Dieten <jochemd@gmail.com> writes:
> On 12/5/05, Hannu Krosing wrote:
> > 3) record the index in pg_class, but mark it as "do not use for lookups" in a new field. Take snapshot SNAP2. commit transaction.
>
> What happens if another transaction takes a snapshot between SNAP2 and the commit? Wouldn't you need a lock to guard against that? (Not that I know whether that is possible or desirable.)

It's worse than that, because an updating command that is already running has already made its list of which indexes to update. You can't say "commit" and expect transactions already in flight to react magically to the presence of the new index. If you take a lock that excludes writes, and then release that lock with your commit (lock release actually happens after commit btw), then you can be sure that subsequent write transactions will see your new index, because they take their writer's lock before they inspect pg_index to see what indexes they need to update. Short of taking such a lock, you have a race condition.

There's another little problem: it's not clear that "present in SNAP2 but not in SNAP1" has anything to do with the condition you need. This would exclude rows made by transactions still in progress as of SNAP2, but you can't know whether such rows were made before or after your commit of the index. It doesn't do the right thing for deleted rows either (deleted rows may still need to be entered into the index), though perhaps you could fix that with a creative reinterpretation of what "present in a snap" means.

			regards, tom lane
On Tue, 2005-12-06 at 20:50, Jochem van Dieten wrote: > On 12/5/05, Hannu Krosing wrote: > > > > Concurrent CREATE INDEX > > ======================== > > > > Concurrent index NDX1 on table TAB1 is created like this: > > > > 1) start transaction. take a snapshot SNAP1 > > > > 1.1) optionally, remove pages for TAB1 from FSM to force (?) all newer > > inserts/updates to happen at end of table (won't work for in-page > > updates without code changes) > > > > 2) create the index as we do now, but only for pages which are visible > > in SNAP1 > > > > 3) record the index in pg_class, but mark it as "do not use for lookups" > > in a new field. Take snapshot SNAP2. commit transaction. > > What happens if another transaction takes a snapshot between SNAP2 and > the commit? I'm hoping there is some clever way to circumvent (the effects of) it. But I can't see it yet. > Wouldn't you need a lock to guard against that? (Not that > I don't know if that is possible or desirable.) That may be needed. At least I hope it is possible in a way that can guarantee avoiding deadlocks. What I have in mind would be something like this to get both SNAP2 and the commit between any transactions:

LOOP:
    LOCK AGAINST STARTING NEW TRANSACTIONS
    LOOP UP TO N SEC:
        IF NO OTHER TRANSACTIONS: BREAK
        ELSE: CONTINUE
    IF NO OTHER TRANSACTIONS: BREAK
    ELSE:
        UNLOCK AGAINST STARTING NEW TRANSACTIONS
        SLEEP N SEC
TAKE SNAP2
COMMIT (AND UNLOCK)

This will eventually succeed (given the right value for N) and will guarantee that SNAP2 and COMMIT are atomic wrt other backends. -------------- Hannu
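As a rough Python sketch of that retry loop (every db.* primitive is invented for illustration; this is not backend code):

    import time

    def at_quiet_point(db, action, max_wait=0.1, retry_sleep=0.5):
        """Repeatedly try to hit a moment with no other transactions running;
        once found, run `action` (e.g. take SNAP2) and commit before letting
        new transactions start again."""
        while True:
            db.block_new_transactions()
            deadline = time.monotonic() + max_wait
            while time.monotonic() < deadline and db.other_transaction_count() > 0:
                time.sleep(0.005)                  # wait for running ones to drain
            if db.other_transaction_count() == 0:
                result = action()                  # e.g. TAKE SNAP2
                db.commit_and_unblock()            # COMMIT (AND UNLOCK)
                return result
            db.unblock_new_transactions()          # give up this round,
            time.sleep(retry_sleep)                # let the backlog drain, retry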
Ühel kenal päeval, T, 2005-12-06 kell 15:12, kirjutas Tom Lane: > Jochem van Dieten <jochemd@gmail.com> writes: > > On 12/5/05, Hannu Krosing wrote: > >> 3) record the index in pg_class, but mark it as "do not use for lookups" > >> in a new field. Take snapshot SNAP2. commit transaction. > > > What happens if another transaction takes a snapshot between SNAP2 and > > the commit? Wouldn't you need a lock to guard against that? (Not that > > I don't know if that is possible or desirable.) > > It's worse than that, because an updating command that is already > running has already made its list of which indexes to update. You can't > say "commit" and expect transactions already in flight to react > magically to the presence of the new index. If you take a lock that > excludes writes, and then release that lock with your commit (lock > release actually happens after commit btw), Is it possible to release a lock without commit ? > then you can be sure that > subsequent write transactions will see your new index, because they take > their writer's lock before they inspect pg_index to see what indexes > they need to update. > > Short of taking such a lock, you have a race condition. > > There's another little problem: it's not clear that "present in SNAP2 > but not in SNAP1" has anything to do with the condition you need. This > would exclude rows made by transactions still in progress as of SNAP2, > but you can't know whether such rows were made before or after your > commit of the index. It doesn't do the right thing for deleted rows > either (deleted rows may still need to be entered into the index), > though perhaps you could fix that with a creative reinterpretation of > what "present in a snap" means. It seems that taking SNAP1 also needs to be fitted between any other transactions (i.e no transaction can be running at the time) which can hopefully be done as I outlined to my other answer to grandparent. ----------------- Hannu
Hannu Krosing <hannu@skype.net> writes: > What I have in mind would be something like this to get both SNAP2 and > commit between any transactions: > LOOP: > LOCK AGAINST STARTING NEW TRANSACTIONS I can hardly credit that "let's block startup of ALL new transactions" is a more desirable restriction than "let's block writers to the table we wish to reindex". regards, tom lane
Hannu Krosing <hannu@skype.net> writes: > Is it possible to release a lock without commit ? Yes, but I don't see where that helps you here. (To do any of this, you'd need to use the same kluge VACUUM does to hold selected locks across a series of transactions. So in reality you'd probably be thinking of committing a startup transaction and letting some of the locks be released by that.) regards, tom lane
On Tue, 2005-12-06 at 15:38, Tom Lane wrote: > Hannu Krosing <hannu@skype.net> writes: > > What I have in mind would be something like this to get both SNAP2 and > > commit between any transactions: > > > LOOP: > > LOCK AGAINST STARTING NEW TRANSACTIONS > > I can hardly credit that "let's block startup of ALL new transactions" > is a more desirable restriction than "let's block writers to the table > we wish to reindex". If the block is short-lived (it will be removed one way or another in a few (tenths of) seconds), then this is much more desirable than blocking writers for a few hours. The scenario where the concurrent create index command is needed is 24/7 OLTP databases, which can't be taken down for maintenance. Usually they can be arranged to tolerate postponing a few transactions for one second. ---------- Hannu
On Tue, 2005-12-06 at 15:41, Tom Lane wrote: > Hannu Krosing <hannu@skype.net> writes: > > Is it possible to release a lock without commit ? > > Yes, but I don't see where that helps you here. > > (To do any of this, you'd need to use the same kluge VACUUM does to hold > selected locks across a series of transactions. So in reality you'd > probably be thinking of committing a startup transaction and letting > some of the locks be released by that.) Hmm, that sounds like a plan:

1) run a transaction repeatedly, trying to hit a point of no concurrent transactions; enhance the odds by locking out starting other transactions for a few (tenths or hundredths of) seconds. If it succeeds, record SNAP1, commit and continue, else rollback, then sleep a little and retry.

2) build index on all rows inserted before SNAP1

3) run a transaction repeatedly, trying to hit a point of no concurrent transactions by locking out other transactions for a few (tenths or hundredths of) seconds. If it succeeds, record SNAP2, mark index as visible for inserts, commit. Now all new transactions see the index and use it when inserting new tuples.

4) scan over table, add all tuples between SNAP1 and SNAP2 to index

5) mark index as usable for query plans

------------- Hannu
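Putting the five steps together as a sketch, reusing the at_quiet_point() helper from the earlier sketch (again, all db.* primitives are hypothetical, not backend code):

    def concurrent_create_index(db, table, index_def):
        # 1) at a quiet point, record SNAP1 and commit
        snap1 = at_quiet_point(db, db.take_snapshot)
        # 2) long, non-blocking build over everything visible in SNAP1
        db.build_index(table, index_def, visible_in=snap1)
        # 3) at a quiet point, record SNAP2 and make the index visible for
        #    inserts, so every transaction started afterwards maintains it
        snap2 = at_quiet_point(db, lambda: db.take_snapshot_and_open_index(index_def))
        # 4) add the tuples inserted between SNAP1 and SNAP2
        for ctid in db.tuples_inserted_between(table, snap1, snap2):
            db.add_to_index(index_def, ctid)
        # 5) finally let the planner use it
        db.mark_index_usable_for_plans(index_def)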
Hannu Krosing <hannu@skype.net> writes: > 1) run a transaction repeatedly, trying to hit a point of no concurrent > transactions, In the sort of 24x7 environment that people are arguing this is needed for, it's entirely possible that that will *never* succeed. regards, tom lane
Ühel kenal päeval, T, 2005-12-06 kell 16:01, kirjutas Tom Lane: > Hannu Krosing <hannu@skype.net> writes: > > 1) run a transaction repeatedly, trying to hit a point of no concurrent > > transactions, > > In the sort of 24x7 environment that people are arguing this is needed > for, it's entirely possible that that will *never* succeed. My OLTP transactions are usually 5-50ms in length. common sense tells me, that if I disallow new transactions for 100ms, I am more than likely to have waited for all existing ones to have finished and can do my 1 ms of "take snapshot + commit" and let all the waiting transactions to commence. If the database is running longer transactions, there can be a GUC to adjust the time CUNCURRENT CREATE INDEX will wait. For example for trx'es mostly in 0.5-2 sec range the wait could be set to 3 sec. ------------- Hannu
Re: Concurrent CREATE INDEX, try 2 (was Re: Reducing relation locking overhead)
From: Greg Stark
Hannu Krosing <hannu@skype.net> writes: > The scenario where concurrent create index command is be needed is 24/7 > OLTP databases, which can't be taken down for maintenance. Usully they > can be arranged to tolerate postponing a few transactions for one > second. Well, the dominant defining characteristic of "OLTP" is precisely that you do *not* have under your control the timing requirements and can't make such arrangements. That is, you have to process requests as fast as they come in whatever that might be. But that said, realistically *any* solution has to obtain a lock at some time to make the schema change. I would say pretty much any O(1) (constant time) outage is at least somewhat acceptable as contrasted with the normal index build which locks out other writers for at least O(n lg n) time. Anything on the order of 100ms is probably as good as it gets here. -- greg
Ühel kenal päeval, T, 2005-12-06 kell 19:32, kirjutas Greg Stark: > Hannu Krosing <hannu@skype.net> writes: > > > The scenario where concurrent create index command is be needed is 24/7 > > OLTP databases, which can't be taken down for maintenance. Usully they > > can be arranged to tolerate postponing a few transactions for one > > second. > > Well, the dominant defining characteristic of "OLTP" is precisely that you do > *not* have under your control the timing requirements and can't make such > arrangements. That is, you have to process requests as fast as they come in > whatever that might be. While "as fast as possible" is a good goal when designing and optimising a DB engine proper, you never need to design a real system to a spec "as fast as possible" but rather to some given expected performance. For me a 24/7 OLTP is more like a "Real Time" system, where all queries have to be processed in "not more than" a certain time v.s. "as fast as possible". There "as fast as possible" is a secondary goal, a lot less important than meeting the deadlines. For example one real db processes requests usually in 50-200ms, but the maximum the client is prepared to wait is set to 20 sec. Anything longer than that and the bells start ringing. > But that said, realistically *any* solution has to obtain a lock at some time > to make the schema change. I would say pretty much any O(1) (constant time) > outage is at least somewhat acceptable as contrasted with the normal index > build which locks out other writers for at least O(n lg n) time. Anything on > the order of 100ms is probably as good as it gets here. For me any delay less than the client timeout is acceptable and anything more than that is not. N sec is ok, N+1 is not. It's as simple as that. And if the CREATE INDEX takes 2 weeks in order to let other OLTP processing go on uninterrupted then it is completely OK. I can afford to set the deadline for it accordingly. Thinking of it, maybe concurrent CREATE INDEX should also honour vacuum_cost_* GUC's and throttle its progress accordingly in order to not starve others on IO/CPU . -------------------- Hannu
On 12/6/05, Hannu Krosing wrote: > > 1) run a transaction repeatedly, trying to hit a point of no concurrent > transactions, encance the odds by locking out starting other > transactions for a few (tenths or hundredths of) seconds, if it > succeeds, record SNAP1, commit and and continue, else rollback, then > sleep a little and retry. Which locks can be released by committing here? > 2) build index on all rows inserted before SNAP1 > > 3) run a transaction repeatedly, trying to hit a point of no concurrent > transactions by locking out other transactions for a few (tenths or > hundredths of) seconds, if it succeeds, record SNAP2, mark index as > visible for inserts, commit. now all new transactions see the index and > use it when inserting new tuples. > > 4) scan over table, add all tuples between SNAP1 and SNAP2 to index You can not guarantee that every tuple inserted in the table will be visible to SNAP 2 if you take SNAP2 before the commit of the insert-only index has dropped below the global XMIN-horizon. > 5) mark index as usable for query plans How about: - begin transaction X1 - insert all visible tuples in an index - mark index incomplete - commit - wait for X1 to become visible to all running transactions (X1 is known from the XMIN in pg_class / pg_index) - begin transaction X2 - insert all missing tuples in index - mark index complete - commit Jochem
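A sketch of that alternative (made-up db.* primitives again); the key difference from the quiet-point scheme is that it never hunts for a gap between transactions, it simply waits until the first transaction is older than every running one:

    def create_index_two_pass(db, table, index_def):
        # Pass 1 (transaction X1): index everything currently visible, but
        # leave the index marked incomplete so the planner ignores it.
        db.begin()
        db.build_index(table, index_def, snapshot=db.take_snapshot())
        db.mark_index_incomplete(index_def)
        x1 = db.commit()

        # Wait until X1 is visible to every running transaction (i.e. X1 has
        # dropped below the global xmin horizon); from then on, every open
        # transaction knows about the index and maintains it on insert.
        db.wait_until_older_than_all_running(x1)

        # Pass 2 (transaction X2): pick up whatever pass 1 missed.
        db.begin()
        db.add_missing_tuples_to_index(table, index_def)
        db.mark_index_complete(index_def)
        db.commit()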
Hannu Krosing <hannu@skype.net> writes: > > But that said, realistically *any* solution has to obtain a lock at some time > > to make the schema change. I would say pretty much any O(1) (constant time) > > outage is at least somewhat acceptable as contrasted with the normal index > > build which locks out other writers for at least O(n lg n) time. Anything on > > the order of 100ms is probably as good as it gets here. > > For me any delay less than the client timeout is acceptable and anything > more than that is not. N sec is ok, N+1 is not. It's as simple as that. I don't think the client timeout is directly relevant here. If your client timeout is 20s and you take 19s, how many requests have queued up behind you? If you normally process requests in under 200ms and receive 10 requests per second (handling at least 2 simultaneously) then you now have 190 requests queued up. Those requests take resources and will slow down your server. If they slow things down too much then you will start failing to meet your 200ms deadline. It's more likely that your system is engineered to use queueing and simultaneous dispatch to deal with spikes in load up to a certain margin. Say you know it can deal with spikes in load of up to 2x the regular rate. Then you can deal with service outage of up to the 200ms deadline. If you can deal with spikes of up to 4x the regular rate then you can deal with an outage of up to 600ms. Moreover even if you had the extra resources available to handle a 19s backlog of requests, how long would it take you to clear that backlog? If you have a narrow headroom on meeting the deadline in the first place, and now you have even less headroom because of the resources dedicated to the queue, it'll take you a long time to clear the backlog. We periodically ran into problems with load spikes or other performance problems causing things to get very slow and stay slow for a while. Letting things settle out usually worked but occasionally we had to restart the whole system to clear out the queue of requests. -- greg
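Spelling out the arithmetic in that example (the 12 requests/second service rate is an assumption added purely for illustration):

    arrival_rate = 10.0              # requests per second, from the example
    outage = 19.0                    # seconds stalled, just under the 20 s timeout
    backlog = arrival_rate * outage
    print(backlog)                   # 190 requests queued when service resumes

    service_rate = 12.0              # assumed capacity left after the outage
    drain_time = backlog / (service_rate - arrival_rate)
    print(drain_time)                # 95 seconds before the queue is empty again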
Ühel kenal päeval, K, 2005-12-07 kell 13:36, kirjutas Greg Stark: > Hannu Krosing <hannu@skype.net> writes: > > > > But that said, realistically *any* solution has to obtain a lock at some time > > > to make the schema change. I would say pretty much any O(1) (constant time) > > > outage is at least somewhat acceptable as contrasted with the normal index > > > build which locks out other writers for at least O(n lg n) time. Anything on > > > the order of 100ms is probably as good as it gets here. > > > > For me any delay less than the client timeout is acceptable and anything > > more than that is not. N sec is ok, N+1 is not. It's as simple as that. > > I don't think the client timeout is directly relevant here. It is relevant. It is the ultimate check of success or failure :) > If your client > timeout is 20s and you take 19s, how many requests have queued up behind you? > If you normally process requests in under 200ms and receive 10 requests per > second (handling at least 2 simultaneously) then you now have 190 requests > queued up. Again, I'm handling 20 to 200 simultaneously quite nicely. > Those requests take resources and will slow down your server. If > they slow things down too much then you will start failing to meet your 200ms > deadline. If I can't meet the deadline, I've got a problem. The rest is implementation detail. > It's more likely that your system is engineered to use queueing and > simultaneous dispatch to deal with spikes in load up to a certain margin. Say > you know it can deal with spikes in load of up to 2x the regular rate. I know it can, just that the 3x spike lasts for 6 hours :P > Then > you can deal with service outage of up to the 200ms deadline. If you can deal > with spikes of up to 4x the regular rate then you can deal with an outage of > up to 600ms. Small local fluctuations happen all the time. As a rule of a thumb I want to stay below 50% of resource usage on average for any noticable period and will start looking for code optimisations or additional hardware if this is crossed. > Moreover even if you had the extra resources available to handle a 19s backlog > of requests, how long would it take you to clear that backlog? If you have a > narrow headroom on meeting the deadline in the first place, and now you have > even less headroom because of the resources dedicated to the queue, it'll take > you a long time to clear the backlog. While it feels heroic to run at 90% capacity, it is not usually a good policy. All kinds of unforeseen stuff happens all the time - checkpoints, backups, vacuums, unexpected growth, system cronjobs, ... With too little headroom you are screwed anyway. What I am aiming at with this CONCURRENT CREATE INDEX proposal, is being no more disruptive than other stuff that keeps happening anyway. That would be the baseline. Anything better is definitely desirable, but should not be a stopper for implementing the baseline functionality. ----------------- Hannu
On Fri, Dec 02, 2005 at 03:25:58PM -0500, Greg Stark wrote: > Postgres would have no trouble building an index of the existing data using > only shared locks. The problem is that any newly inserted (or updated) records > could be missing from such an index. > > To do it you would then have to gather up all those newly inserted records. > And of course while you're doing that new records could be inserted. And so > on. There's no guarantee it would ever finish, though I suppose you could > detect the situation if the size of the new batch wasn't converging to 0 and > throw an error. Why throw an error? Just grab a lock that would prevent any new inserts from occurring. Or at least make that an option. -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
On Sat, Dec 03, 2005 at 10:15:25AM -0500, Greg Stark wrote: > Tom Lane <tgl@sss.pgh.pa.us> writes: > > What's worse, once you have excluded writes you have to rescan the entire > > table to be sure you haven't missed anything. So in the scenarios where this > > whole thing is actually interesting, ie enormous tables, you're still > > talking about a fairly long interval with writes locked out. Maybe not as > > long as a complete REINDEX, but long. > > I was thinking you would set a flag to disable use of the FSM for > inserts/updates while the reindex was running. So you would know where to find > the new tuples, at the end of the table after the last tuple you read. What about keeping a separate list of new tuples? Obviously we'd only do this when an index was being built on a table. Since it would probably be problematic and expensive to check for this every time you accessed a table, it would make sense to check only at the start of a transaction and have an index build wait until all running transactions knew that an index build was going to happen. -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
Ühel kenal päeval, N, 2005-12-08 kell 00:16, kirjutas Jim C. Nasby: > On Sat, Dec 03, 2005 at 10:15:25AM -0500, Greg Stark wrote: > > Tom Lane <tgl@sss.pgh.pa.us> writes: > > > What's worse, once you have excluded writes you have to rescan the entire > > > table to be sure you haven't missed anything. So in the scenarios where this > > > whole thing is actually interesting, ie enormous tables, you're still > > > talking about a fairly long interval with writes locked out. Maybe not as > > > long as a complete REINDEX, but long. > > > > I was thinking you would set a flag to disable use of the FSM for > > inserts/updates while the reindex was running. So you would know where to find > > the new tuples, at the end of the table after the last tuple you read. > > What about keeping a seperate list of new tuples? Obviously we'd only do > this when an index was being built on a table. The problem with separate list is that it can be huge. For example on a table with 200 inserts/updates per second an index build lasting 6 hours would accumulate total on 6*3600*200 = 4320000 new tuples. ---------------- Hannu
On Thu, Dec 08, 2005 at 08:57:42AM +0200, Hannu Krosing wrote: > ??hel kenal p??eval, N, 2005-12-08 kell 00:16, kirjutas Jim C. Nasby: > > On Sat, Dec 03, 2005 at 10:15:25AM -0500, Greg Stark wrote: > > > Tom Lane <tgl@sss.pgh.pa.us> writes: > > > > What's worse, once you have excluded writes you have to rescan the entire > > > > table to be sure you haven't missed anything. So in the scenarios where this > > > > whole thing is actually interesting, ie enormous tables, you're still > > > > talking about a fairly long interval with writes locked out. Maybe not as > > > > long as a complete REINDEX, but long. > > > > > > I was thinking you would set a flag to disable use of the FSM for > > > inserts/updates while the reindex was running. So you would know where to find > > > the new tuples, at the end of the table after the last tuple you read. > > > > What about keeping a seperate list of new tuples? Obviously we'd only do > > this when an index was being built on a table. > > The problem with separate list is that it can be huge. For example on a > table with 200 inserts/updates per second an index build lasting 6 hours > would accumulate total on 6*3600*200 = 4320000 new tuples. Sure, but it's unlikely that such a table would be very wide, so 4.3M tuples would probably only amount to a few hundred MB of data. It's also possible that this list could be vacuumed by whatever the regular vacuum process is for the table. -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
On Thu, 2005-12-01 at 21:37 -0500, Tom Lane wrote: > In looking at the current pgbench results, I notice that one > considerable contribution to LWLock contention is access to the > heavyweight-lock manager. > 4. The only reason we need to take relation-level locks on indexes > at all is to make the world safe for REINDEX being done concurrently > with read-only accesses to the table (that don't use the index being > reindexed). If we went back to requiring exclusive lock for reindex we > could forget all about both #2 and #3. Particularly for updates of > relations with lots of indexes, this could be a pretty significant win. > However we'd definitely be giving up something that was seen as a > feature at one point, so I'm not sold on this idea ... unless someone > can see a way to reduce the overhead without giving up concurrent > REINDEX. In a high txn rate workload the majority of lock requests will definitely be towards indexes. This is a major overhead and contention point and as you say, exists only to allow REINDEX (and online CREATE TABLE when it exists) to operate. There is a considerable price to be paid for this flexibility, so although I think that flexibility is important, I do think we need to look at ways of reducing the cost for the 95%+ of the time we pay the price for no benefit. It seems clear that if we do not, then we are going to have to completely redesign the lock manager, so any way forward from here is a reasonable size task. Putting extra restrictions on REINDEX-like operations would be acceptable in these circumstances. Now follows various ideas, many of which are quite possibly bogus, but the main point is that we need a way, even if it isn't one of these: One such idea might be to require that they occur outside of a transaction block, just like VACUUM. Further thoughts: 1. Normally, we do not lock indexes via the LockMgrLock 2. When a REINDEX-like operation comes along, it first of all updates an MaintIntentLock flag on the index relation, which causes a relcache invalidation. It then waits until all backends have updated their relcache. 3. While the MaintIntentLock is set, all operations acquire and release LockMgrLocks on the index. 4. The REINDEX-like operation proceeds, having accepted a possibly extensive delay in acquiring the MaintIntentLock, in the name of concurrent access. 5. When the REINDEX completes, we unset the MaintIntentLock and other backends return to normal operation. Best Regards, Simon Riggs
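Very roughly, and keeping the hand-waving, the scheme above might look like this (hypothetical names, not actual backend structures):

    class IndexLockPolicy:
        """Sketch of the MaintIntentLock idea: index-level lock manager traffic
        is skipped unless a maintenance operation has announced itself first."""

        def __init__(self, lockmgr, backends):
            self.lockmgr = lockmgr
            self.backends = backends
            self.maint_intent = False    # flag propagated via relcache invalidation

        def lock_index_for_dml(self, index):
            # Normal DML path: only pay the lock-manager cost while some
            # maintenance operation has announced itself (step 3); otherwise
            # skip the heavyweight lock entirely (the common case, step 1).
            if self.maint_intent:
                self.lockmgr.acquire(index, "RowExclusiveLock")

        def begin_maintenance(self, index):
            # Steps 1-2: set the flag, broadcast the invalidation, and wait
            # until every backend has seen it before relying on index locks.
            self.maint_intent = True
            self.backends.broadcast_relcache_invalidation(index)
            self.backends.wait_until_all_have_seen(index)
            # Step 4: from here the REINDEX-like operation can take and trust
            # whatever index-level lock it needs.

        def end_maintenance(self, index):
            # Step 5: maintenance done; back to lock-free index access.
            self.maint_intent = False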
Ühel kenal päeval, N, 2005-12-08 kell 01:08, kirjutas Jim C. Nasby: > On Thu, Dec 08, 2005 at 08:57:42AM +0200, Hannu Krosing wrote: > > ??hel kenal p??eval, N, 2005-12-08 kell 00:16, kirjutas Jim C. Nasby: > > > On Sat, Dec 03, 2005 at 10:15:25AM -0500, Greg Stark wrote: > > > > Tom Lane <tgl@sss.pgh.pa.us> writes: > > > > > What's worse, once you have excluded writes you have to rescan the entire > > > > > table to be sure you haven't missed anything. So in the scenarios where this > > > > > whole thing is actually interesting, ie enormous tables, you're still > > > > > talking about a fairly long interval with writes locked out. Maybe not as > > > > > long as a complete REINDEX, but long. > > > > > > > > I was thinking you would set a flag to disable use of the FSM for > > > > inserts/updates while the reindex was running. So you would know where to find > > > > the new tuples, at the end of the table after the last tuple you read. > > > > > > What about keeping a seperate list of new tuples? Obviously we'd only do > > > this when an index was being built on a table. > > > > The problem with separate list is that it can be huge. For example on a > > table with 200 inserts/updates per second an index build lasting 6 hours > > would accumulate total on 6*3600*200 = 4320000 new tuples. > > Sure, but it's unlikely that such a table would be very wide, so 4.3M > tuples would probably only amount to a few hundred MB of data. It's also > possible that this list could be vacuumed by whatever the regular vacuum > process is for the table. I think that keeping such list as part the table at well defined location (like pages from N to M) is the best strategy, as it will automatically make all new tuples available to parallel processes and avoids both duplicate storage as well as the the need for changing insert/update code. --------------- Hannu
On Wed, 2005-12-07 at 19:36, Greg Stark wrote: [snip] > We periodically ran into problems with load spikes or other performance > problems causing things to get very slow and stay slow for a while. Letting > things settle out usually worked but occasionally we had to restart the whole > system to clear out the queue of requests. Just as a personal opinion: I would love a REINDEX which does not block reads/writes, even if writes will be more expensive while it's running. There's always a period of time I can schedule the REINDEX so there's very low write activity, but it is impossible to find a time slot when there's NO write activity... and I think most of the real world applications are like this. I think it's very rare that an application is constantly getting high load, but most of them are constantly getting SOME important activity which makes downtime hard to schedule. Now if the slowdown of writes is not more than the acceptable service level, then it is a very workable solution to schedule the REINDEX on a not so busy time slot. Cheers, Csaba.
Simon Riggs <simon@2ndquadrant.com> writes: > Further thoughts: > 1. Normally, we do not lock indexes via the LockMgrLock > 2. When a REINDEX-like operation comes along, it first of all updates an > MaintIntentLock flag on the index relation, which causes a relcache > invalidation. It then waits until all backends have updated their > relcache. There isn't any way for it to do that (ie, be sure everyone else has adjusted to the new state of affairs), short of acquiring some sort of short-term exclusive lock on the table, which is a really bad idea. The pending lock would block other incoming requests on the table until all the current users exited their transactions. I think the idea of not doing locking on indexes was pretty much shot down in this thread, and we have to go looking for other solutions ... thus my other proposal. regards, tom lane
On Thu, 2005-12-08 at 16:05, Tom Lane wrote: [SNIP] > There isn't any way for it to do that (ie, be sure everyone else has > adjusted to the new state of affairs), short of acquiring some sort of > short-term exclusive lock on the table, which is a really bad idea. > The pending lock would block other incoming requests on the table until > all the current users exited their transactions. > But it is an acceptable compromise to lock the table until all current transactions are over... the alternative for reindexing a big table is usually to schedule a down-time, which is even worse... REINDEX is usually used to fix a big tables big index bloat, and that won't fly without a downtime, or, with this short-term full table lock in a low-traffic time-slot. For my usage patterns I would vote with the table lock if it is just a means of blocking new transactions until the running ones finish. I'll just make sure there are none long running when I issue the REINDEX... Cheers, Csaba.
On Thu, 2005-12-08 at 16:23 +0100, Csaba Nagy wrote: > On Thu, 2005-12-08 at 16:05, Tom Lane wrote: > [SNIP] > > There isn't any way for it to do that (ie, be sure everyone else has > > adjusted to the new state of affairs), short of acquiring some sort of > > short-term exclusive lock on the table, which is a really bad idea. > > The pending lock would block other incoming requests on the table until > > all the current users exited their transactions. It might be possible to do this with some new form of utility lock that can be unlocked before end of transaction, then re-locked again later. That would imply the use of optimistic locking, as already discussed. Only DDL and VACUUMs *need* to be locked out during the REINDEX. I'm hand-waving here, so I'll stop. But we do know there *is* a way, because this is already implemented elsewhere, somehow. > But it is an acceptable compromise to lock the table until all current > transactions are over... the alternative for reindexing a big table is > usually to schedule a down-time, which is even worse... > > REINDEX is usually used to fix a big tables big index bloat, and that > won't fly without a downtime, or, with this short-term full table lock > in a low-traffic time-slot. > > For my usage patterns I would vote with the table lock if it is just a > means of blocking new transactions until the running ones finish. I'll > just make sure there are none long running when I issue the REINDEX... Certainly that is my view. You only schedule these things when you have to, picking as light traffic period as you can. Most people would accept some pretty hefty restrictions in order to get it to work. Best Regards, Simon Riggs
On Thu, Dec 08, 2005 at 11:58:50AM +0200, Hannu Krosing wrote: > ??hel kenal p??eval, N, 2005-12-08 kell 01:08, kirjutas Jim C. Nasby: > > On Thu, Dec 08, 2005 at 08:57:42AM +0200, Hannu Krosing wrote: > > > ??hel kenal p??eval, N, 2005-12-08 kell 00:16, kirjutas Jim C. Nasby: > > > > On Sat, Dec 03, 2005 at 10:15:25AM -0500, Greg Stark wrote: > > > > > Tom Lane <tgl@sss.pgh.pa.us> writes: > > > > > > What's worse, once you have excluded writes you have to rescan the entire > > > > > > table to be sure you haven't missed anything. So in the scenarios where this > > > > > > whole thing is actually interesting, ie enormous tables, you're still > > > > > > talking about a fairly long interval with writes locked out. Maybe not as > > > > > > long as a complete REINDEX, but long. > > > > > > > > > > I was thinking you would set a flag to disable use of the FSM for > > > > > inserts/updates while the reindex was running. So you would know where to find > > > > > the new tuples, at the end of the table after the last tuple you read. > > > > > > > > What about keeping a separate list of new tuples? Obviously we'd only do > > > > this when an index was being built on a table. > > > > > > The problem with separate list is that it can be huge. For example on a > > > table with 200 inserts/updates per second an index build lasting 6 hours > > > would accumulate total on 6*3600*200 = 4320000 new tuples. > > > > Sure, but it's unlikely that such a table would be very wide, so 4.3M > > tuples would probably only amount to a few hundred MB of data. It's also > > possible that this list could be vacuumed by whatever the regular vacuum > > process is for the table. > > I think that keeping such list as part the table at well defined > location (like pages from N to M) is the best strategy, as it will > automatically make all new tuples available to parallel processes and > avoids both duplicate storage as well as the the need for changing > insert/update code. There's one thing I hate about that idea though: good luck trying to move those tuples somewhere else after the index build is done and you now want to shrink the table back down to a more normal size. If we had a better way to do that it would be much more palatable, but right now on a heavily updated table this would result in a lot of bloat. Along those lines, I've wondered if it makes sense to add more flexibility in how free space is reclaimed in a table. One obvious possibility is to have a strategy where new tuples will always look to the FSM for space (instead of going into the current page if possible), and the FSM will always hand out the earliest page in the table it has. This mode would have the effect of moving tuples towards the front of the table, allowing for space reclamation. A variation might be that this mode will not effect tuples that are generated as part of an UPDATE and are in the first x% of the table, since it doesn't make sense to move a tuple from page 2 to page 1 in a 1000 page table. Another possibility is to always go to the FSM and to have the FSM hand back the page that is closest to the new tuple according to a certain index. This would allow for ALTER TABLE CLUSTER to be much more self-maintaining. The downside is that you'd have to do a lookup on that index, but presumably if the index is important enough to cluster on then it should be well-cached. There's probably some other tweaks that could be done as well to make this more performant. -- Jim C. Nasby, Sr. 
Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
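A toy model of the first policy above (illustration only, not the real FSM structure): always hand out the lowest-numbered page that still has enough room, so tuples drift toward the front of the table and VACUUM can eventually truncate the tail.

    import heapq

    class FrontFillingFSM:
        """Toy free-space map that always returns the lowest page number with
        enough free space for the new tuple."""

        def __init__(self):
            self.pages = []                      # min-heap of (page_no, free_bytes)

        def note_free_space(self, page_no, free_bytes):
            heapq.heappush(self.pages, (page_no, free_bytes))

        def page_for(self, needed_bytes):
            too_small = []
            page = None
            while self.pages:
                page_no, free = heapq.heappop(self.pages)
                if free >= needed_bytes:
                    if free - needed_bytes > 0:
                        heapq.heappush(self.pages, (page_no, free - needed_bytes))
                    page = page_no
                    break
                too_small.append((page_no, free))
            for entry in too_small:              # keep the entries we skipped over
                heapq.heappush(self.pages, entry)
            return page                          # None => extend the relation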
Simon Riggs <simon@2ndquadrant.com> writes: > I'm hand-waving here, so I'll stop. But we do know there *is* a way, > because this is already implemented elsewhere, somehow. That's not really the point --- the question is whether the cure is worse than the disease. It's entirely possible that the tradeoffs needed to support fully concurrent REINDEX would represent a higher price than the feature is worth, or that it's simply too much work to get there from here. For instance, I would imagine that the way Oracle does this relies on their use of rollback segments, which is something we're certainly unlikely to emulate. Given the discussion so far, it seems likely to me that completely concurrent REINDEX is indeed out of reach, and that what we ought to be thinking about is what sort of compromise design (ie, partially concurrent REINDEX) is reasonable. Something that might work is: 1. Take ShareUpdateExclusiveLock (this blocks VACUUM and DDL changes), then run existing CREATE INDEX code. The finished index may be missing some tuples inserted during the run. 2. Commit transaction so that index becomes visible (we assume it's marked so that the planner will know not to rely on it). Continue to hold ShareUpdateExclusiveLock so VACUUM doesn't run. 3. Attempt to acquire ShareLock (possibly a ConditionalLockAcquire/sleep loop instead of just flat-out LockAcquire). Once we have this we know there are no active writer transactions. Release the lock immediately. 4. Make a new scan of the table and insert any rows not already present in the index. (This need not process anything inserted later than step 3, because any new transactions will know to insert rows in the index anyway.) 5. Mark index good and commit, releasing all locks. I don't think that it's worth the effort and complexity to try to avoid a full table scan in step 4. At best you would save much less than 50% of the total work, and the impact on normal operations is not free. If what you want is a REINDEX rather than creating an independent new index, then at step 5 you need to do a swap operation which'll require obtaining exclusive lock on the index. This creates another opportunity for deadlock failures, but again a conditional loop might help. There are still some issues about the behavior when the index is UNIQUE. Ideally you would like the CREATE INDEX to fail on a duplicate, not any concurrent writer transaction, but I don't think it's possible to guarantee that. Also, I'm not sure how we get rid of the broken index if there is a failure later than step 2. regards, tom lane
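In outline, the proposed sequence might look like the following (the db.* primitives are invented for the sketch; the lock-mode names are the real ones):

    def create_index_mostly_concurrent(db, table, index_def):
        # 1) Block VACUUM and DDL but not readers/writers, and run the normal
        #    build; tuples inserted while it runs may be missing from the index.
        db.lock(table, "ShareUpdateExclusiveLock")
        db.build_index(table, index_def)

        # 2) Commit so the index becomes visible, flagged "not usable for
        #    plans"; keep holding ShareUpdateExclusiveLock so VACUUM stays out.
        db.commit_keeping_lock(table, "ShareUpdateExclusiveLock")

        # 3) Once ShareLock could be granted, no writer predating the index is
        #    still active; release it immediately.
        while not db.conditional_lock(table, "ShareLock"):
            db.sleep_briefly()
        db.unlock(table, "ShareLock")

        # 4) Rescan the heap and insert any row not already in the index;
        #    transactions started after step 3 maintain the index themselves.
        for ctid in db.heap_scan(table):
            if not db.index_contains(index_def, ctid):
                db.insert_into_index(index_def, ctid)

        # 5) Mark the index good and commit, releasing the remaining locks.
        db.mark_index_usable_for_plans(index_def)
        db.commit()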
Simon Riggs wrote: > ...REINDEX...CREATE/DROP INDEX... I'm curious if partitioning can help provide online create/reindex. For example, if I could set up a table partitioned by "modified time", could I make a new partition so all new inserts go into a new partition and then I can re-index / create-index on the old partitions concurrently, because only reads and deletes would be happening on that old table? The new partition would have a new index anyway so no reindex would even be needed on that part.
Ühel kenal päeval, N, 2005-12-08 kell 12:57, kirjutas Jim C. Nasby: > On Thu, Dec 08, 2005 at 11:58:50AM +0200, Hannu Krosing wrote: > > > > The problem with separate list is that it can be huge. For example on a > > > > table with 200 inserts/updates per second an index build lasting 6 hours > > > > would accumulate total on 6*3600*200 = 4320000 new tuples. > > > > > > Sure, but it's unlikely that such a table would be very wide, so 4.3M > > > tuples would probably only amount to a few hundred MB of data. It's also > > > possible that this list could be vacuumed by whatever the regular vacuum > > > process is for the table. > > > > I think that keeping such list as part the table at well defined > > location (like pages from N to M) is the best strategy, as it will > > automatically make all new tuples available to parallel processes and > > avoids both duplicate storage as well as the the need for changing > > insert/update code. > > There's one thing I hate about that idea though: good luck trying to > move those tuples somewhere else after the index build is done and you > now want to shrink the table back down to a more normal size. I feel your pain. To solve similar problem I have been forced to write scripts that do updates of pk_column=pk_column until the tuple moves to another page as shown by ctid . Not a sensible thing to do (do a lot of updates to *increase* performance), but necessary nevertheless considering current postgres behaviour. > If we had > a better way to do that it would be much more palatable, but right now > on a heavily updated table this would result in a lot of bloat. Actually any long transaction would do that. > Along those lines, I've wondered if it makes sense to add more > flexibility in how free space is reclaimed in a table. One obvious > possibility is to have a strategy where new tuples will always look to > the FSM for space (instead of going into the current page if possible), > and the FSM will always hand out the earliest page in the table it has. > This mode would have the effect of moving tuples towards the front of > the table, allowing for space reclamation. A variation might be that > this mode will not effect tuples that are generated as part of an UPDATE > and are in the first x% of the table, since it doesn't make sense to > move a tuple from page 2 to page 1 in a 1000 page table. This % could be depending on some "fill factor" of the table, aiming not to move tuples, that would end up in the final volume of a balance table, which, in case of heavily updated table, would probably be 2 to 3 times the volume of densely populated table. > Another possibility is to always go to the FSM and to have the FSM hand > back the page that is closest to the new tuple according to a certain > index. This would allow for ALTER TABLE CLUSTER to be much more > self-maintaining. The downside is that you'd have to do a lookup on that > index, but presumably if the index is important enough to cluster on > then it should be well-cached. There's probably some other tweaks that > could be done as well to make this more performant. Yes, I agree on all your points about better placement of new tuples, all they would be useful indeed. -------------- Hannu
On Fri, Dec 09, 2005 at 12:00:14AM +0200, Hannu Krosing wrote: > > Along those lines, I've wondered if it makes sense to add more > > flexibility in how free space is reclaimed in a table. One obvious > > possibility is to have a strategy where new tuples will always look to > > the FSM for space (instead of going into the current page if possible), > > and the FSM will always hand out the earliest page in the table it has. > > This mode would have the effect of moving tuples towards the front of > > the table, allowing for space reclamation. A variation might be that > > this mode will not effect tuples that are generated as part of an UPDATE > > and are in the first x% of the table, since it doesn't make sense to > > move a tuple from page 2 to page 1 in a 1000 page table. > > This % could be depending on some "fill factor" of the table, aiming not > to move tuples, that would end up in the final volume of a balance > table, which, in case of heavily updated table, would probably be 2 to 3 > times the volume of densely populated table. > > > Another possibility is to always go to the FSM and to have the FSM hand > > back the page that is closest to the new tuple according to a certain > > index. This would allow for ALTER TABLE CLUSTER to be much more > > self-maintaining. The downside is that you'd have to do a lookup on that > > index, but presumably if the index is important enough to cluster on > > then it should be well-cached. There's probably some other tweaks that > > could be done as well to make this more performant. > > Yes, I agree on all your points about better placement of new tuples, > all they would be useful indeed. Sounds like a TODO, barring objections... -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
Ühel kenal päeval, N, 2005-12-08 kell 14:53, kirjutas Tom Lane: > Given the discussion so far, it seems likely to me that completely > concurrent REINDEX is indeed out of reach, and that what we ought to > be thinking about is what sort of compromise design (ie, partially > concurrent REINDEX) is reasonable. > > Something that might work is: > > 1. Take ShareUpdateExclusiveLock (this blocks VACUUM and DDL changes), > then run existing CREATE INDEX code. The finished index may be missing > some tuples inserted during the run. > > 2. Commit transaction so that index becomes visible (we assume it's > marked so that the planner will know not to rely on it). Continue to > hold ShareUpdateExclusiveLock so VACUUM doesn't run. > > 3. Attempt to acquire ShareLock (possibly a ConditionalLockAcquire/sleep > loop instead of just flat-out LockAcquire). Once we have this we know > there are no active writer transactions. Release the lock immediately. > > 4. Make a new scan of the table and insert any rows not already present > in the index. (This need not process anything inserted later than step > 3, because any new transactions will know to insert rows in the index > anyway.) How do you plan to determine "any rows not already present in the index" without explicitly remembering the start and end snapshots of existing CREATE INDEX (SNAP1 and SNAP2 in my proposal)? actually the end point seems to be covered, but what about start condition ? In the last round of discussion you pointed out that index itself can't be effectively used for this in case there are lots of equal index keys. (As I pointed out, this can be fixed if we will start using ctid to determine placement/order of equal keys, but I don't think we are building indexes this way now). I still think that wedging start of 1. and end of 2. into points where no concurrent transaction is running would be the easiest and most robust way to do it. And if the attempts (locking periods) to find/force that spot are short enough, they can be tolerated in practice. > 5. Mark index good and commit, releasing all locks. > > I don't think that it's worth the effort and complexity to try to avoid > a full table scan in step 4. At best you would save much less than 50% > of the total work, and the impact on normal operations is not free. Agreed. The usecase needing concurrent index, being already slow, can probably be made to tolerate another 2-3x slowdown. > If what you want is a REINDEX rather than creating an independent new > index, then at step 5 you need to do a swap operation which'll require > obtaining exclusive lock on the index. This creates another opportunity > for deadlock failures, but again a conditional loop might help. > > There are still some issues about the behavior when the index is UNIQUE. > Ideally you would like the CREATE INDEX to fail on a duplicate, not any > concurrent writer transaction, but I don't think it's possible to > guarantee that. Ideally, but probably not too important in practice. The point can be always made that there already is a unique index at the point where concurrent trx fails. If the point is before end of 2. the concurrent trx will probably wait until first commit before failing, no ? > Also, I'm not sure how we get rid of the broken index if there is a > failure later than step 2. What about expicit DROP INDEX ? Even for REINDEX the index has to be visible as a separate index after 2. so that inserts updates will be aware of it. -------------- Hannu
Hannu Krosing <hannu@skype.net> writes: > How do you plan to determine "any rows not already present in the index" > without explicitly remembering the start and end snapshots of existing > CREATE INDEX (SNAP1 and SNAP2 in my proposal)? I was thinking in terms of actually looking into the index to see if the particular TID is present or not. You could use snapshots to optimize this by avoiding index probes for tuples that must be present, which hopefully will be most of 'em. Also you need a snapshot to detect tuples that are new enough that they certainly will be indexed by their inserting transaction, so that you don't have a race condition between an active inserter and the REINDEX. (I think this is possible but maybe I missed something.) That leaves you looking at just the tuples inserted by transactions that might or might not have known about the index. So yeah, you do need SNAP1 and SNAP2 but they're being used in a different way than the original proposal. > In the last round of discussion you pointed out that index itself can't > be effectively used for this in case there are lots of equal index keys. True, but if you can avoid going to the index at all for the majority of the tuples, I think this is tolerable. In any case the design idea here seems to be "we don't care how long REINDEX takes as long as it's not blocking anyone". regards, tom lane
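A sketch of the per-tuple decision this implies for the validation scan, assuming SNAP1/SNAP2-style boundaries are kept around (hypothetical snapshot API, not backend code):

    def needs_index_probe(tuple_xmin, snap1, snap2):
        """Return True only for heap tuples whose inserter might not have known
        about the new index -- the only ones the validation scan must actually
        look up in the index."""
        if snap1.xid_visible(tuple_xmin):
            # Already visible when the build started: the initial scan indexed it.
            return False
        if tuple_xmin >= snap2.xmax:
            # The inserting transaction began after the index became visible in
            # the catalogs, so it inserted the index entry itself.
            return False
        # Inserted by a transaction concurrent with the build: probe the index.
        return True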
On Sat, 2005-12-10 at 21:07 -0500, Tom Lane wrote: > In any case the design idea here > seems to be "we don't care how long REINDEX takes as long as it's not > blocking anyone". All sounds great so far. I'd like this as an option for CREATE INDEX also. Best Regards, Simon Riggs
On Sat, 2005-12-10 at 21:07, Tom Lane wrote: > Hannu Krosing <hannu@skype.net> writes: > > How do you plan to determine "any rows not already present in the index" > > without explicitly remembering the start and end snapshots of existing > > CREATE INDEX (SNAP1 and SNAP2 in my proposal)? > > I was thinking in terms of actually looking into the index to see if the > particular TID is present or not. You could use snapshots to optimize > this by avoiding index probes for tuples that must be present, which > hopefully will be most of 'em. Also you need a snapshot to detect > tuples that are new enough that they certainly will be indexed by their > inserting transaction, so that you don't have a race condition between > an active inserter and the REINDEX. (I think this is possible but maybe > I missed something.) So would it be possible to do the whole indexing while another long transaction runs, possibly in SERIALIZABLE isolation level, inserting tuples all the way? > That leaves you looking at just the tuples > inserted by transactions that might or might not have known about the > index. So yeah, you do need SNAP1 and SNAP2 but they're being used in > a different way than the original proposal. By taking snaps at points with no concurrent transactions I was hoping to make things (both concept and code) simpler by reducing SNAP1 and SNAP2 to mere transaction ids and making sure that the tuples inserted by transactions which fall between them, and only those, need to be added to the index. If it does not actually make things clearer then it may not be a good idea to try to fit them between transactions. Otoh, we still do need to wait - possibly a long time - for "3. Attempt to acquire ShareLock". So why not wait also at the start, in case that would make things clearer/simpler? > > In the last round of discussion you pointed out that index itself can't > > be effectively used for this in case there are lots of equal index keys. > > True, but if you can avoid going to the index at all for the majority of > the tuples, I think this is tolerable. And it seems that we could store same-valued index leafs in TID order, at least during concurrent create index/reindex, making the index lookup cheaper. To me, it seems a good idea to always store similar values in an index in TID order, as this makes an index scan do the right thing (scan in heap order), at least for the key=const case, even without an intermediate bitmap index. Or are there downsides to this in the general case? > In any case the design idea here > seems to be "we don't care how long REINDEX takes as long as it's not > blocking anyone". Yes, that's the general idea. Within reason, of course; for example, making a seqscan over the index for each and every tuple inserted during building the first index would probably still be too slow :) > regards, tom lane
Hannu Krosing wrote: > Ühel kenal päeval, L, 2005-12-10 kell 21:07, kirjutas Tom Lane: > > In any case the design idea here > > seems to be "we don't care how long REINDEX takes as long as it's not > > blocking anyone". > > Yes, thats the general idea. > > Within reason of course, for example making a seqscan over the index for > each and every tuple inserted during building the first index would > probably still be too slow :) You don't need to seqscan the _index_. You need to scan the table. Those tuples that do not satisfy the snapshot or where you are in doubt, you examine the index to see whether they are there. The bulk of it you just skip. -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Added to TODO: * Allow FSM to return free space toward the beginning of the heap file, in hopes that empty pages at the end can be truncatedby VACUUM --------------------------------------------------------------------------- Jim C. Nasby wrote: > On Fri, Dec 09, 2005 at 12:00:14AM +0200, Hannu Krosing wrote: > > > Along those lines, I've wondered if it makes sense to add more > > > flexibility in how free space is reclaimed in a table. One obvious > > > possibility is to have a strategy where new tuples will always look to > > > the FSM for space (instead of going into the current page if possible), > > > and the FSM will always hand out the earliest page in the table it has. > > > This mode would have the effect of moving tuples towards the front of > > > the table, allowing for space reclamation. A variation might be that > > > this mode will not effect tuples that are generated as part of an UPDATE > > > and are in the first x% of the table, since it doesn't make sense to > > > move a tuple from page 2 to page 1 in a 1000 page table. > > > > This % could be depending on some "fill factor" of the table, aiming not > > to move tuples, that would end up in the final volume of a balance > > table, which, in case of heavily updated table, would probably be 2 to 3 > > times the volume of densely populated table. > > > > > Another possibility is to always go to the FSM and to have the FSM hand > > > back the page that is closest to the new tuple according to a certain > > > index. This would allow for ALTER TABLE CLUSTER to be much more > > > self-maintaining. The downside is that you'd have to do a lookup on that > > > index, but presumably if the index is important enough to cluster on > > > then it should be well-cached. There's probably some other tweaks that > > > could be done as well to make this more performant. > > > > Yes, I agree on all your points about better placement of new tuples, > > all they would be useful indeed. > > Sounds like a TODO, barring objections... > -- > Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com > Pervasive Software http://pervasive.com work: 512-231-6117 > vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461 > -- Bruce Momjian http://candle.pha.pa.us SRA OSS, Inc. http://www.sraoss.com + If your life is a hard drive, Christ can be your backup. +
At the risk of editorializing... * Allow for tables with more than a certain percent of free space to force new tuple allocation in the last few pages tocome only from the FSM (prefering the earliest pages), with the intention of allowing the table to shrink I also believe there's also a second TODO: * Allow FSM page hand-out on tables to be influenced by the clustered index. This would help maintain cluster order. On Thu, Mar 02, 2006 at 10:12:59PM -0500, Bruce Momjian wrote: > > Added to TODO: > > * Allow FSM to return free space toward the beginning of the heap file, in > hopes that empty pages at the end can be truncated by VACUUM > > > --------------------------------------------------------------------------- > > Jim C. Nasby wrote: > > On Fri, Dec 09, 2005 at 12:00:14AM +0200, Hannu Krosing wrote: > > > > Along those lines, I've wondered if it makes sense to add more > > > > flexibility in how free space is reclaimed in a table. One obvious > > > > possibility is to have a strategy where new tuples will always look to > > > > the FSM for space (instead of going into the current page if possible), > > > > and the FSM will always hand out the earliest page in the table it has. > > > > This mode would have the effect of moving tuples towards the front of > > > > the table, allowing for space reclamation. A variation might be that > > > > this mode will not effect tuples that are generated as part of an UPDATE > > > > and are in the first x% of the table, since it doesn't make sense to > > > > move a tuple from page 2 to page 1 in a 1000 page table. > > > > > > This % could be depending on some "fill factor" of the table, aiming not > > > to move tuples, that would end up in the final volume of a balance > > > table, which, in case of heavily updated table, would probably be 2 to 3 > > > times the volume of densely populated table. > > > > > > > Another possibility is to always go to the FSM and to have the FSM hand > > > > back the page that is closest to the new tuple according to a certain > > > > index. This would allow for ALTER TABLE CLUSTER to be much more > > > > self-maintaining. The downside is that you'd have to do a lookup on that > > > > index, but presumably if the index is important enough to cluster on > > > > then it should be well-cached. There's probably some other tweaks that > > > > could be done as well to make this more performant. > > > > > > Yes, I agree on all your points about better placement of new tuples, > > > all they would be useful indeed. > > > > Sounds like a TODO, barring objections... > > -- > > Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com > > Pervasive Software http://pervasive.com work: 512-231-6117 > > vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461 > > > > -- > Bruce Momjian http://candle.pha.pa.us > SRA OSS, Inc. http://www.sraoss.com > > + If your life is a hard drive, Christ can be your backup. + > > ---------------------------(end of broadcast)--------------------------- > TIP 2: Don't 'kill -9' the postmaster > -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
Jim C. Nasby wrote: > At the risk of editorializing... > > * Allow for tables with more than a certain percent of free space to > force new tuple allocation in the last few pages to come only from the > FSM (prefering the earliest pages), with the intention of allowing the > table to shrink That seems too confusing. > I also believe there's also a second TODO: > > * Allow FSM page hand-out on tables to be influenced by the clustered > index. This would help maintain cluster order. How would FSM know how to do that? I added it with a question mark. -- Bruce Momjian http://candle.pha.pa.us SRA OSS, Inc. http://www.sraoss.com + If your life is a hard drive, Christ can be your backup. +