Thread: proposed TODO: non-locking CREATE INDEX / REINDEX
It seems that currently we do not allow modifying a table while an index
is being built on it (at least my experience and some small tests I did
indicate so).  Strangely I did not find a TODO item for it, so I may be
overlooking something and we already have it.  In case we really don't:

For big 24/7 tables this can be a real annoyance, especially in an
evolving database.  There are many ways this could be made to work, so
it needs some discussion.  I propose the following approach:

1) when CREATE INDEX starts, the ctid position of the last tuple
   (CTID_INDEX_MIN) is remembered, and all new tuples are inserted
   after that point while the index is being built.

2) the index is built in the usual fast way up to the remembered ctid,
   and made visible to all backends for inserting, but not yet to the
   planner for use in queries.  We remember the last ctid inserted
   while the fast-build phase was running (CTID_INDEX_MAX).  The
   restriction to add new tuples only after CTID_INDEX_MIN is lifted.

3) then index entries are added for all tuples whose ctid is between
   CTID_INDEX_MIN and CTID_INDEX_MAX.

4) the index is declared usable for the planner.

Additionally, CREATE INDEX could be made to honour the vacuum_cost_*
variables and not hog a busy database.

-- 
Hannu Krosing <hannu@skype.net>
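A minimal single-process sketch of the four steps above, simulated in
Python: a list stands in for the heap, list positions play the role of
ctids, and concurrent traffic is faked with a parameter.  All names are
illustrative; none of this is PostgreSQL internals.

```python
class Heap:
    """Toy heap: a list of rows; a row's list position is its 'ctid'."""

    def __init__(self, rows):
        self.rows = list(rows)

    def insert(self, row):
        self.rows.append(row)             # new tuples always go at the end
        return len(self.rows) - 1         # "ctid" of the new tuple


def nonlocking_create_index(heap, key, rows_added_during_build=()):
    index = {}                            # value -> set of "ctids"

    # 1) remember the position of the last existing tuple
    ctid_index_min = len(heap.rows) - 1

    # 2) fast-build over tuples up to CTID_INDEX_MIN; concurrent inserts
    #    (simulated here by rows_added_during_build) land after that point
    for ctid in range(ctid_index_min + 1):
        index.setdefault(key(heap.rows[ctid]), set()).add(ctid)
    for row in rows_added_during_build:   # simulated concurrent traffic
        heap.insert(row)
    ctid_index_max = len(heap.rows) - 1   # last ctid of the fast phase

    # 3) catch up on tuples inserted during the fast-build phase
    for ctid in range(ctid_index_min + 1, ctid_index_max + 1):
        index.setdefault(key(heap.rows[ctid]), set()).add(ctid)

    # 4) the index is now complete and could be declared planner-usable
    return index
```

The real concurrency questions (visibility, locking, unique checks) are
exactly what the rest of the thread argues about; the sketch only shows
the intended phase ordering.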
Hannu Krosing <hannu@skype.net> writes:
> There are many ways this could be made to work, so it needs some
> discussion.

(1) when do you ever catch up?

(2) if your answer to (1) involves increasing the strength of a lock,
how do you avoid risk of deadlock?

			regards, tom lane
On R, 2005-06-10 at 09:47 -0400, Tom Lane wrote:
> Hannu Krosing <hannu@skype.net> writes:
> > There are many ways this could be made to work, so it needs some
> > discussion.
>
> (1) when do you ever catch up?
>
> (2) if your answer to (1) involves increasing the strength of a lock,
> how do you avoid risk of deadlock?

No.  I don't plan on locking the table at all.

The only thing that changes during the initial fast-build phase is that
new tuples are inserted after CTID_INDEX_MIN.  After the initial
fast-build index is done, the only restriction is that the index can't
be used in queries before the tuples between CTID_INDEX_MIN and
CTID_INDEX_MAX have been added to the index.

As the number of tuples between CTID_INDEX_MIN and CTID_INDEX_MAX is
finite, they must be added in finite time, by which time the index will
be up-to-date and usable by the query planner.  (i.e. (1) is done)

All tuples inserted after the initial fast-build index has finished and
CTID_INDEX_MAX is fixed are inserted into the index at the time the
tuple is inserted, like for any other index.

-- 
Hannu Krosing <hannu@skype.net>
On R, 2005-06-10 at 17:54 +0300, Hannu Krosing wrote:
> On R, 2005-06-10 at 09:47 -0400, Tom Lane wrote:
> > Hannu Krosing <hannu@skype.net> writes:
> > > There are many ways this could be made to work, so it needs some
> > > discussion.
> >
> > (1) when do you ever catch up?
> >
> > (2) if your answer to (1) involves increasing the strength of a lock,
> > how do you avoid risk of deadlock?
>
> No.  I don't plan on locking the table at all.
>
> The only thing that changes during the initial fast-build phase is
> that new tuples are inserted after CTID_INDEX_MIN.  After the initial
> fast-build index is done, the only restriction is that the index can't
> be used in queries before the tuples between CTID_INDEX_MIN and
> CTID_INDEX_MAX have been added to the index.

Maybe I did not make it very clear, but the initial fast-build index is
done only for the tuples whose ctid is below CTID_INDEX_MIN, so this
too will happen in a finite amount of time.  The total time will be:

  initial_index_build_time +
  time_to_insert_tuples_added_while_doing_the_initial_build

the latter will be slowed down a little, as these insertions compete
with index entries from inserts happening in real time.

> As the number of tuples between CTID_INDEX_MIN and CTID_INDEX_MAX is
> finite, they must be added in finite time, by which time the index will
> be up-to-date and usable by the query planner.  (i.e. (1) is done)
>
> All tuples inserted after the initial fast-build index has finished and
> CTID_INDEX_MAX is fixed are inserted into the index at the time the
> tuple is inserted, like for any other index.

-- 
Hannu Krosing <hannu@tm.ee>
Hannu Krosing <hannu@skype.net> writes:
> As the number of tuples between CTID_INDEX_MIN and CTID_INDEX_MAX is
> finite, they must be added in finite time, by which time the index will
> be up-to-date and usable by the query planner.  (i.e. (1) is done)

... and by which time, some more could have been added after
CTID_INDEX_MAX.  Have you forgotten Zeno's paradox?  I don't see a
reason to assume the indexer can *ever* catch up --- it's entirely
likely that adding a new unindexed row is faster than adding an index
entry for it.

> All tuples inserted after the initial fast-build index has finished and
> CTID_INDEX_MAX is fixed are inserted into the index at the time the
> tuple is inserted, like for any other index.

This implies that you are hoping for an asynchronous change in the
behavior of other processes, which you are not going to get without
taking out locks, which is what you wanted to avoid.

			regards, tom lane
On R, 2005-06-10 at 12:12 -0400, Tom Lane wrote:
> Hannu Krosing <hannu@skype.net> writes:
> > As the number of tuples between CTID_INDEX_MIN and CTID_INDEX_MAX is
> > finite, they must be added in finite time, by which time the index will
> > be up-to-date and usable by the query planner.  (i.e. (1) is done)
>
> ... and by which time, some more could have been added after
> CTID_INDEX_MAX.

But at that time they are inserted into the index as well, so they may
slow down the inserts, but they can't delay the index completion
indefinitely.

> Have you forgotten Zeno's paradox?  I don't see a
> reason to assume the indexer can *ever* catch up --- it's entirely
> likely that adding a new unindexed row is faster than adding an index
> entry for it.

The same is true of doing a lazy vacuum over a table where tuples are
constantly added - there is no guarantee that the vacuum will ever
finish.

> > All tuples inserted after the initial fast-build index has finished and
> > CTID_INDEX_MAX is fixed are inserted into the index at the time the
> > tuple is inserted, like for any other index.
>
> This implies that you are hoping for an asynchronous change in the
> behavior of other processes, which you are not going to get without
> taking out locks, which is what you wanted to avoid.

One way to avoid locking is to let the "add tuple to index" routine
silently succeed if the index already has the entry.

Then we can announce the change in behaviour to running backends, wait
for all backends to confirm they have learned about it, and only then
record CTID_INDEX_MAX.

This should work because all the backends that get the message start
adding their new tuples to both heap and index, and it is safe for them
to put tuples anywhere, including below CTID_INDEX_MIN; those that have
not yet received it will add tuples only between CTID_INDEX_MIN and the
final CTID_INDEX_MAX, and not to the index.

The index entries for tuples between CTID_INDEX_MIN and CTID_INDEX_MAX
will be added only after all running backends have confirmed the
change.  And if backends adding to both heap and index happen to insert
a tuple between CTID_INDEX_MIN and the eventual CTID_INDEX_MAX, there
is no harm done, as the process inserting the index entry for it will
silently succeed because the entry already exists.

-- 
Hannu Krosing <hannu@tm.ee>
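The "silently succeed" rule could look roughly like this - a
hypothetical Python sketch, not PostgreSQL code, using the semantics
clarified later in the thread: inserting is a no-op only when exactly
the same (value, ctid) pair is already present, so unique checking is
not weakened for genuinely different tuples.

```python
class UniqueIndex:
    """Toy unique index: maps an indexed value to a single ctid."""

    def __init__(self):
        self.entries = {}                 # value -> ctid

    def insert(self, value, ctid):
        existing = self.entries.get(value)
        if existing == ctid:
            # exactly this tuple is already indexed: silently succeed
            return False
        if existing is not None:
            # a *different* tuple with the same value: still an error
            raise ValueError("unique violation for %r" % (value,))
        self.entries[value] = ctid        # normal insert path
        return True
```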
Hannu Krosing <hannu@tm.ee> writes:
> On R, 2005-06-10 at 12:12 -0400, Tom Lane wrote:
>> Have you forgotten Zeno's paradox?  I don't see a
>> reason to assume the indexer can *ever* catch up --- it's entirely
>> likely that adding a new unindexed row is faster than adding an index
>> entry for it.

> The same is true of doing a lazy vacuum over a table where tuples are
> constantly added - there is no guarantee that the vacuum will ever
> finish.

No, there definitely is such a guarantee: the vacuum only scans as many
blocks as were in the relation when it started.  The vacuum need not
worry about tuples added after it starts, because it couldn't delete
them under MVCC rules.  And there is no logical-consistency requirement
for it to promise to scan every tuple, anyway.

>> This implies that you are hoping for an asynchronous change in the
>> behavior of other processes, which you are not going to get without
>> taking out locks, which is what you wanted to avoid.

> One way to avoid locking is to let the "add tuple to index" routine
> silently succeed if the index already has the entry.

... thereby silently breaking unique-index checking, you mean?

> Then we can announce the change in behaviour to running backends, wait
> for all backends to confirm they have learned about it, and only then
> record CTID_INDEX_MAX.

You can't wait for confirmation from all other backends without
expecting to create deadlock issues left and right.  And what is it you
are waiting for, anyway?  For a backend to confirm that it is prepared
to insert into an index that it can't even see yet (because the CREATE
INDEX hasn't committed)?

In the case of REINDEX, are you expecting that backends will be
prepared to insert into *both* old and new versions of the index?
They'd better, since there's still every prospect of the REINDEX
failing and rolling back, leaving the old version as the live copy.
Speaking of rollback, what happens when those backends try to insert
into the new copy just after the REINDEX has failed and rolled back and
deleted the new copy?  Or equivalently, what happens when they are
still trying to insert into the old copy just after the REINDEX commits
and deletes that one?

The complexity and fragility of what you are proposing vastly outweighs
any potential benefit, even if it could be made to work at all, which I
continue to doubt.

			regards, tom lane
On Fri, Jun 10, 2005 at 12:12:05PM -0400, Tom Lane wrote:
> Hannu Krosing <hannu@skype.net> writes:
> > As the number of tuples between CTID_INDEX_MIN and CTID_INDEX_MAX is
> > finite, they must be added in finite time, by which time the index will
> > be up-to-date and usable by the query planner.  (i.e. (1) is done)
>
> ... and by which time, some more could have been added after
> CTID_INDEX_MAX.  Have you forgotten Zeno's paradox?  I don't see a
> reason to assume the indexer can *ever* catch up --- it's entirely
> likely that adding a new unindexed row is faster than adding an index
> entry for it.
>
> > All tuples inserted after the initial fast-build index has finished and
> > CTID_INDEX_MAX is fixed are inserted into the index at the time the
> > tuple is inserted, like for any other index.
>
> This implies that you are hoping for an asynchronous change in the
> behavior of other processes, which you are not going to get without
> taking out locks, which is what you wanted to avoid.

Maybe I am misunderstanding something, but the scenario seems to be, in
even simpler pseudo-code:

1. Issue the CREATE INDEX command.
2. Record CTID_INDEX_MAX.
3. Mark the index as created with the "FAST" flag.  This means that all
   of the normal INSERT/DELETE processing with regards to the data
   tuples and the index tuples is performed, i.e. the new tuples cannot
   be added faster than the index entries, because the insert of the
   data tuple does not commit until the index entries are added.
4. Start generating the index entries until you reach CTID_INDEX_MAX.
5. Mark the index as LIVE.

Ken Marshall
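The five steps above could be sketched as a small state machine in
Python.  The FAST/LIVE flags and the ctids are simulated, and all names
are mine for illustration, not PostgreSQL's.

```python
class FastBuiltIndex:
    """Simulates steps 2-5: record CTID_INDEX_MAX, serve inserts while
    flagged FAST, backfill the older tuples, then flip to LIVE."""

    def __init__(self, heap, key):
        self.heap, self.key = heap, key
        self.ctid_index_max = len(heap) - 1   # step 2: record the cutoff
        self.state = "FAST"                   # step 3: FAST flag set
        self.entries = {}                     # value -> set of ctids

    def insert(self, ctid):
        # step 3: while FAST (and after), every new data tuple also gets
        # an index entry, so new tuples cannot outrun the index build
        row = self.heap[ctid]
        self.entries.setdefault(self.key(row), set()).add(ctid)

    def backfill(self):
        # step 4: generate entries up to the recorded CTID_INDEX_MAX
        for ctid in range(self.ctid_index_max + 1):
            self.insert(ctid)
        self.state = "LIVE"                   # step 5: planner may use it
```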
On L, 2005-06-11 at 12:25 -0400, Tom Lane wrote:
> Hannu Krosing <hannu@tm.ee> writes:
> > On R, 2005-06-10 at 12:12 -0400, Tom Lane wrote:
> >> Have you forgotten Zeno's paradox?  I don't see a
> >> reason to assume the indexer can *ever* catch up --- it's entirely
> >> likely that adding a new unindexed row is faster than adding an index
> >> entry for it.
>
> > The same is true of doing a lazy vacuum over a table where tuples are
> > constantly added - there is no guarantee that the vacuum will ever
> > finish.
>
> No, there definitely is such a guarantee: the vacuum only scans as many
> blocks as were in the relation when it started.  The vacuum need not
> worry about tuples added after it starts, because it couldn't delete
> them under MVCC rules.  And there is no logical-consistency requirement
> for it to promise to scan every tuple, anyway.

But why is such a guarantee needed at all?  I could not care less
whether there is a theoretical guarantee that some query will end, even
if that end will come in 3 billion years, after filling all available
disk space on earth and exhausting 32-bit int OIDs over and over again.

In other words - is there some fundamental reason why all
commands/queries need to finish in finite time (however long)?  Is
finite-time execution mandated by some SQL standard (ISO/ANSI)?  If
there is, can't it be solved by 'set statement_timeout=NNNN'?

> >> This implies that you are hoping for an asynchronous change in the
> >> behavior of other processes, which you are not going to get without
> >> taking out locks, which is what you wanted to avoid.

It seems that "asynchronous change" and "locks" are indeed synonymous
here, and the most sensible way to achieve such a change is to
introduce a new kind of lock.

> > One way to avoid locking is to let the "add tuple to index" routine
> > silently succeed if the index already has the entry.
>
> ... thereby silently breaking unique-index checking, you mean?

No, no.
I mean that it will silently succeed only if _exactly_ the same tuple
(the same value AND the same ctid) is already present.

> > Then we can announce the change in behaviour to running backends, wait
> > for all backends to confirm they have learned about it, and only then
> > record CTID_INDEX_MAX.
>
> You can't wait for confirmation from all other backends without
> expecting to create deadlock issues left and right.  And what is it you
> are waiting for, anyway?  For a backend to confirm that it is prepared
> to insert into an index that it can't even see yet (because the CREATE
> INDEX hasn't committed)?

I am ready to do the CREATE INDEX in several transactions, just like
VACUUM is done, so that after the fast-build pass is finished, all
backends will see the index.  It just has to be flagged so that the
planner will not consider it for queries until it is up-to-date.

> In the case of REINDEX, are you expecting
> that backends will be prepared to insert into *both* old and new
> versions of the index?  They'd better, since there's still every
> prospect of the REINDEX failing and rolling back, leaving the old
> version as the live copy.

Yes, I am prepared to (in fact I expect them to) insert into both
indexes until the new one is ready.  The cost of doing so is peanuts
compared to the alternatives.

> Speaking of rollback, what happens when
> those backends try to insert into the new copy just after the REINDEX
> has failed and rolled back and deleted the new copy?  Or equivalently,
> what happens when they are still trying to insert into the old copy
> just after the REINDEX commits and deletes that one?

I am perfectly prepared to have the CREATE INDEX take 24 hours if the
database can continue serving requests during that period, if that
saves me 24 minutes of "maintenance" downtime.  I am also willing to
wait for all other backends to commit or roll back their running
transactions.
Heck, I am even ready to restart every other backend (close and
reconnect all clients) if that enables me to avoid the downtime.

> The complexity and fragility of what you are proposing vastly outweighs
> any potential benefit, even if it could be made to work at all, which I
> continue to doubt.

The current workaround (using a second database and Slony replication -
log-shipping does not cut it) is far more expensive and complex still.
It requires at least 2X the write capacity of the whole database
(compared to 2X the writes for a single index), plus another computer,
plus a fast network.

And the ability to add new indexes to 24/7 databases without
significant downtime is a requirement for any database backing a
fast-growing business.

-- 
Hannu Krosing <hannu@skype.net>
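The insert-into-both-indexes behaviour discussed for REINDEX might be
sketched like this (hypothetical Python, purely illustrative): while a
replacement index is being built, every tuple insert also goes into the
in-progress copy, so whichever copy survives commit or rollback is
complete.  The rollback-race questions Tom raises are deliberately out
of scope here.

```python
class ToyIndex:
    def __init__(self):
        self.entries = {}                    # value -> set of ctids

    def add(self, value, ctid):
        self.entries.setdefault(value, set()).add(ctid)


def insert_tuple(heap, live_index, in_progress_index, row, key):
    """Append a row to the heap and index it in the live index and, if
    a REINDEX is running, in the new in-progress copy as well."""
    heap.append(row)
    ctid = len(heap) - 1
    live_index.add(key(row), ctid)
    if in_progress_index is not None:        # extra cost: one more insert
        in_progress_index.add(key(row), ctid)
    return ctid
```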