Thread: proposed TODO: non-locking CREATE INDEX / REINDEX

proposed TODO: non-locking CREATE INDEX / REINDEX

From: Hannu Krosing
It seems that currently we do not allow modifying a table while an index
is being built on it (at least my experience and some small tests I did
indicate so). Strangely, I did not find a TODO item for it, so I may be
overlooking something and we already have one.

In case we really don't have it, this seems worth adding as a TODO item.

For big 24/7 tables this can be a real annoyance, especially in an
evolving database.

There are many ways this could be made to work, so it needs some
discussion.

I propose the following approach:

1) when CREATE INDEX starts, the ctid position of the last tuple
(CTID_INDEX_MIN) is remembered, and all new tuples are inserted after
that point while the index is being built.

2) the index is built in the usual fast way up to the remembered ctid
and made visible to all backends for inserting, but not yet to the
planner for use in queries. We remember the last ctid inserted while the
fast-build phase was running (CTID_INDEX_MAX). The restriction to add
new tuples only after CTID_INDEX_MIN is then lifted.

3) then add index entries for all tuples whose ctid is between
CTID_INDEX_MIN and CTID_INDEX_MAX.

4) declare the index usable for the planner.
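The four steps above can be sketched as a toy simulation. This is not
PostgreSQL code: the Heap class, build_index_concurrently, and plain
integer ctids are illustrative stand-ins (real ctids are block/offset
pairs), and the concurrent inserts are modeled sequentially.

```python
# Toy simulation of the proposed two-phase index build.

class Heap:
    def __init__(self, rows):
        self.rows = list(rows)            # list position stands in for ctid

    def insert(self, value):
        self.rows.append(value)           # new tuples get ever-higher ctids
        return len(self.rows) - 1

def build_index_concurrently(heap, inserts_during_build):
    index = {}                            # value -> ctid (toy index)

    # Step 1: remember the current end of the heap.
    ctid_index_min = len(heap.rows) - 1

    # Step 2: fast-build over the tuples up to CTID_INDEX_MIN; other
    # backends keep inserting meanwhile (modeled sequentially here).
    for ctid in range(ctid_index_min + 1):
        index[heap.rows[ctid]] = ctid
    for value in inserts_during_build:
        heap.insert(value)
    ctid_index_max = len(heap.rows) - 1   # end of the catch-up window
    # From this point on, every new tuple also goes into the index.

    # Step 3: catch-up pass over (CTID_INDEX_MIN, CTID_INDEX_MAX].
    for ctid in range(ctid_index_min + 1, ctid_index_max + 1):
        index[heap.rows[ctid]] = ctid

    # Step 4: the index covers every heap tuple; open it to the planner.
    return index

heap = Heap(["a", "b", "c"])
index = build_index_concurrently(heap, ["d", "e"])
```

The property step 3 relies on is that the catch-up window is a fixed,
finite range of ctids.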

Additionally, CREATE INDEX could be made to honour the vacuum_cost_*
variables and not hog a busy database.

-- 
Hannu Krosing <hannu@skype.net>



Re: proposed TODO: non-locking CREATE INDEX / REINDEX

From: Tom Lane
Hannu Krosing <hannu@skype.net> writes:
> There are many ways this could be made to work, so it needs some
> discussion.

(1) when do you ever catch up?

(2) if your answer to (1) involves increasing the strength of a lock,
    how do you avoid risk of deadlock?

        regards, tom lane


Re: proposed TODO: non-locking CREATE INDEX / REINDEX

From: Hannu Krosing
On Fri, 2005-06-10 at 09:47 -0400, Tom Lane wrote:
> Hannu Krosing <hannu@skype.net> writes:
> > There are many ways this could be made to work, so it needs some
> > discussion.
> 
> (1) when do you ever catch up?
>
> (2) if your answer to (1) involves increasing the strength of a lock,
>     how do you avoid risk of deadlock?

No. I don't plan on locking the table at all. 

The only thing that changes during the initial fast-build phase is that
new tuples are inserted after CTID_INDEX_MIN; after the initial
fast-build index is done, the only restriction is that the index can't
be used in queries before the tuples between CTID_INDEX_MIN and
CTID_INDEX_MAX have been added to the index.

As the number of tuples between CTID_INDEX_MIN and CTID_INDEX_MAX is
finite, they must be added in finite time, by which time the index will
be up-to-date and usable for the query planner (i.e. (1) is done).

All tuples inserted after the initial fast-build-index has finished and
CTID_INDEX_MAX is fixed, are inserted into the index at the time of
inserting the tuple, like for any other index.

-- 
Hannu Krosing <hannu@skype.net>



Re: proposed TODO: non-locking CREATE INDEX / REINDEX

From: Hannu Krosing
On Fri, 2005-06-10 at 17:54 +0300, Hannu Krosing wrote:
> On Fri, 2005-06-10 at 09:47 -0400, Tom Lane wrote:
> > Hannu Krosing <hannu@skype.net> writes:
> > > There are many ways this could be made to work, so it needs some
> > > discussion.
> > 
> > (1) when do you ever catch up?
> >
> > (2) if your answer to (1) involves increasing the strength of a lock,
> >     how do you avoid risk of deadlock?
> 
> No. I don't plan on locking the table at all. 
> 
> The only thing that is changed during the initial fast-build-index is
> that new tuples are inserted after CTID_INDEX_MIN, and after the initial
> fastbuild index is done, the only restriction is that the index can't be
> used in queries before the tuples between  CTID_INDEX_MIN and
> CTID_INDEX_MAX are added to the index.

Maybe I did not make it very clear, but the initial fast-build index is
done only for the tuples whose ctid is below CTID_INDEX_MIN, so this
part, too, will happen in a finite amount of time.

The total time will be:
initial_index_build_time +
time_to_insert_tuples_added_while_doing_the_initial_build

The latter will be slowed down a little, as these catch-up insertions
compete with index entries from inserts happening in real time.

> As the number of tuples between CTID_INDEX_MIN and CTID_INDEX_MAX is
> finite, they must be added in finite time, by which time the index will
> be up-to-date and usable for the query planner (i.e. (1) is done)
> 
> All tuples inserted after the initial fast-build-index has finished and
> CTID_INDEX_MAX is fixed, are inserted into the index at the time of
> inserting the tuple, like for any other index.
> 
-- 
Hannu Krosing <hannu@tm.ee>


Re: proposed TODO: non-locking CREATE INDEX / REINDEX

From: Tom Lane
Hannu Krosing <hannu@skype.net> writes:
> As the number of tuples between CTID_INDEX_MIN and CTID_INDEX_MAX is
> finite, they must be added in finite time, by which time the index will
> be up-to-date and usable for the query planner (i.e. (1) is done)

... and by which time, some more could have been added after
CTID_INDEX_MAX.  Have you forgotten Zeno's paradox?  I don't see a
reason to assume the indexer can *ever* catch up --- it's entirely
likely that adding a new unindexed row is faster than adding an index
entry for it.

> All tuples inserted after the initial fast-build-index has finished and
> CTID_INDEX_MAX is fixed, are inserted into the index at the time of
> inserting the tuple, like for any other index.

This implies that you are hoping for an asynchronous change in the
behavior of other processes, which you are not going to get without
taking out locks, which is what you wanted to avoid.
        regards, tom lane


Re: proposed TODO: non-locking CREATE INDEX / REINDEX

From: Hannu Krosing
On Fri, 2005-06-10 at 12:12 -0400, Tom Lane wrote:
> Hannu Krosing <hannu@skype.net> writes:
> > As the number of tuples between CTID_INDEX_MIN and CTID_INDEX_MAX is
> > finite, they must be added in finite time, by which time the index will
> > be up-to-date and usable for the query planner (i.e. (1) is done)
> 
> ... and by which time, some more could have been added after
> CTID_INDEX_MAX.

But at that time they are inserted into the index as well, so this may
slow down the inserts, but it can't delay the index completion
indefinitely.

>   Have you forgotten Zeno's paradox?  I don't see a
> reason to assume the indexer can *ever* catch up --- it's entirely
> likely that adding a new unindexed row is faster than adding an index
> entry for it.

The same is true of doing a lazy vacuum over a table where tuples are
constantly added - there is no guarantee that the vacuum will ever
finish.

> > All tuples inserted after the initial fast-build-index has finished and
> > CTID_INDEX_MAX is fixed, are inserted into the index at the time of
> > inserting the tuple, like for any other index.
> 
> This implies that you are hoping for an asynchronous change in the
> behavior of other processes, which you are not going to get without
> taking out locks, which is what you wanted to avoid.

One way to avoid locking is to let the "add tuple to index" routine
silently succeed if the index already has the entry.

Then we can announce the change in behaviour to the running backends,
wait for all backends to confirm they have learned about it, and only
then record CTID_INDEX_MAX.

This should work because all the backends which get the message start
adding their new tuples to both the heap and the index, and it is safe
for them to put tuples anywhere, including below CTID_INDEX_MIN. Those
which have not yet received it will add tuples between CTID_INDEX_MIN
and the final CTID_INDEX_MAX, and not to the index.

The index entries for tuples between CTID_INDEX_MIN and CTID_INDEX_MAX
will be added only after all running backends have confirmed the change.

And if backends adding to both heap and index happen to insert a tuple
between CTID_INDEX_MIN and the eventual CTID_INDEX_MAX, there is no harm
done, as the process inserting the index entry for it will silently
succeed because the entry already exists.
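The duplicate-tolerant insert described above could look like this toy
sketch (a plain dict stands in for a unique index, and index_insert is
an illustrative name, not a PostgreSQL routine): the insert succeeds as
a no-op only when exactly the same (value, ctid) pair is already
present, so a backend and the catch-up indexer can race harmlessly.

```python
# Toy "silently succeed on exact duplicate" index insert.

def index_insert(index, value, ctid):
    existing = index.get(value)
    if existing is None:
        index[value] = ctid               # normal insert
    elif existing == ctid:
        pass                              # exact duplicate: the catch-up
                                          # pass got there first; no-op
    else:
        # same key, different tuple: a genuine unique violation
        raise ValueError(f"unique violation on {value!r}")

index = {}
index_insert(index, "a", 7)    # backend inserts into heap and index
index_insert(index, "a", 7)    # catch-up pass re-inserts: harmless
conflict = False
try:
    index_insert(index, "a", 9)  # different tuple, same key
except ValueError:
    conflict = True
```

Note that a same-value entry with a different ctid still raises, which
is what keeps unique-index checking intact.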

-- 
Hannu Krosing <hannu@tm.ee>


Re: proposed TODO: non-locking CREATE INDEX / REINDEX

From: Tom Lane
Hannu Krosing <hannu@tm.ee> writes:
> On Fri, 2005-06-10 at 12:12 -0400, Tom Lane wrote:
>> Have you forgotten Zeno's paradox?  I don't see a
>> reason to assume the indexer can *ever* catch up --- it's entirely
>> likely that adding a new unindexed row is faster than adding an index
>> entry for it.

> The same is true of doing a lazy vacuum over a table where tuples are
> constantly added -  there is no guarantee that the vacuum will ever
> finish.

No, there definitely is such a guarantee: the vacuum only scans as many
blocks as were in the relation when it started.  The vacuum need not
worry about tuples added after it starts, because it couldn't delete
them under MVCC rules.  And there is no logical-consistency requirement
for it to promise to scan every tuple, anyway.

>> This implies that you are hoping for an asynchronous change in the
>> behavior of other processes, which you are not going to get without
>> taking out locks, which is what you wanted to avoid.

> One way to avoid locking, is to allow the "add tuple to index" routine
> silently succeed if the index already has it.

... thereby silently breaking unique-index checking, you mean?

> Then we can announce the change in behaviour to running backends, wait
> for all backends to confirm they have learned about it and only then
> record CTID_INDEX_MAX.

You can't wait for confirmation from all other backends without
expecting to create deadlock issues left and right.  And what is it you
are waiting for, anyway?  For a backend to confirm that it is prepared
to insert into an index that it can't even see yet (because the CREATE
INDEX hasn't committed)?  In the case of REINDEX, are you expecting
that backends will be prepared to insert into *both* old and new
versions of the index?  They'd better, since there's still every
prospect of the REINDEX failing and rolling back, leaving the old
version as the live copy.  Speaking of rollback, what happens when
those backends try to insert into the new copy just after the REINDEX
has failed and rolled back and deleted the new copy?  Or equivalently,
what happens when they are still trying to insert into the old copy
just after the REINDEX commits and deletes that one?

The complexity and fragility of what you are proposing vastly outweighs
any potential benefit, even if it could be made to work at all, which I
continue to doubt.
        regards, tom lane


Re: proposed TODO: non-locking CREATE INDEX / REINDEX

From: Kenneth Marshall
On Fri, Jun 10, 2005 at 12:12:05PM -0400, Tom Lane wrote:
> Hannu Krosing <hannu@skype.net> writes:
> > As the number of tuples between CTID_INDEX_MIN and CTID_INDEX_MAX is
> > finite, they must be added in finite time, by which time the index will
> > be up-to-date and usable for the query planner (i.e. (1) is done)
> 
> ... and by which time, some more could have been added after
> CTID_INDEX_MAX.  Have you forgotten Zeno's paradox?  I don't see a
> reason to assume the indexer can *ever* catch up --- it's entirely
> likely that adding a new unindexed row is faster than adding an index
> entry for it.
> 
> > All tuples inserted after the initial fast-build-index has finished and
> > CTID_INDEX_MAX is fixed, are inserted into the index at the time of
> > inserting the tuple, like for any other index.
> 
> This implies that you are hoping for an asynchronous change in the
> behavior of other processes, which you are not going to get without
> taking out locks, which is what you wanted to avoid.
> 

Maybe I am misunderstanding something, but the scenario seems to be,
in even simpler pseudo-code:

1. Issue the CREATE INDEX command.
2. Record CTID_INDEX_MAX.
3. Mark the index as created with the "FAST" flag. This means that all
   of the normal INSERT/DELETE processing with regard to the data tuples
   and the index tuples is performed, i.e. new tuples cannot be added
   faster than the index entries because the insert of a data tuple does
   not commit until its index entries are added.
4. Start generating the index entries until you reach CTID_INDEX_MAX.
5. Mark the index as LIVE.
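The ordering above differs from the earlier proposal in that
CTID_INDEX_MAX is fixed *before* the build, so the backfill scans a
bounded range and cannot be outrun. A minimal sketch (names are
illustrative, and concurrency is again modeled sequentially):

```python
# Toy sketch of the FAST-flag ordering: record CTID_INDEX_MAX first,
# then have every new insert maintain the index, then backfill the
# fixed range [0, CTID_INDEX_MAX].

def fast_flag_build(heap_rows, inserts_after_flag):
    index = {}
    ctid_index_max = len(heap_rows) - 1   # step 2: fixed up front

    # step 3: "FAST" flag set; concurrent inserts maintain the index
    for value in inserts_after_flag:
        heap_rows.append(value)
        index[value] = len(heap_rows) - 1

    # step 4: backfill a bounded range; no Zeno-style chase possible
    for ctid in range(ctid_index_max + 1):
        index[heap_rows[ctid]] = ctid

    return index                          # step 5: mark LIVE

rows = ["a", "b"]
fast_index = fast_flag_build(rows, ["c"])
```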

Ken Marshall


Re: proposed TODO: non-locking CREATE INDEX / REINDEX

From: Hannu Krosing
On Sat, 2005-06-11 at 12:25 -0400, Tom Lane wrote:
> Hannu Krosing <hannu@tm.ee> writes:
> > On Fri, 2005-06-10 at 12:12 -0400, Tom Lane wrote:
> >> Have you forgotten Zeno's paradox?  I don't see a
> >> reason to assume the indexer can *ever* catch up --- it's entirely
> >> likely that adding a new unindexed row is faster than adding an index
> >> entry for it.
> 
> > The same is true of doing a lazy vacuum over a table where tuples are
> > constantly added -  there is no guarantee that the vacuum will ever
> > finish.
> 
> No, there definitely is such a guarantee: the vacuum only scans as many
> blocks as were in the relation when it started.  The vacuum need not
> worry about tuples added after it starts, because it couldn't delete
> them under MVCC rules.  And there is no logical-consistency requirement
> for it to promise to scan every tuple, anyway.

But why is such a guarantee needed at all?

I could not care less if there is a theoretical guarantee that some
query will end, even if that end will come in 3 billion years, after
filling all available disk space on earth and exhausting 32-bit int
OIDs over and over again.

In other words - is there some fundamental reason why all
commands/queries need to finish in finite time (however long)?

Is finite-time execution mandated by some SQL standard (ISO/ANSI)?

If it is, can't that be satisfied by 'set statement_timeout=NNNN'?

> >> This implies that you are hoping for an asynchronous change in the
> >> behavior of other processes, which you are not going to get without
> >> taking out locks, which is what you wanted to avoid.

It seems that an "asynchronous change" does indeed require "locks", and
the most sensible way to achieve such a change is to introduce a new
kind of lock.

> > One way to avoid locking, is to allow the "add tuple to index" routine
> > silently succeed if the index already has it.
> 
> ... thereby silently breaking unique-index checking, you mean?

No, no. I mean that it will silently succeed only if _exactly_ the same
tuple (the same value AND the same ctid) is already present.

> > Then we can announce the change in behaviour to running backends, wait
> > for all backends to confirm they have learned about it and only then
> > record CTID_INDEX_MAX.
> 
> You can't wait for confirmation from all other backends without
> expecting to create deadlock issues left and right.  And what is it you
> are waiting for, anyway?  For a backend to confirm that it is prepared
> to insert into an index that it can't even see yet (because the CREATE
> INDEX hasn't committed)? 

I am ready to do the CREATE INDEX in several transactions, just like
VACUUM does, so that after the fast-build phase is finished, all
backends will see the index. It just has to be flagged so that the
planner will not consider it for queries until it is up-to-date.

>  In the case of REINDEX, are you expecting
> that backends will be prepared to insert into *both* old and new
> versions of the index?  They'd better, since there's still every
> prospect of the REINDEX failing and rolling back, leaving the old
> version as the live copy.  

Yes, I am prepared to (in fact I expect them to) insert into both
indexes until the new one is ready. The cost of doing so is peanuts
compared to the alternatives.
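The double-write scheme can be sketched as follows (insert_row and the
dict indexes are illustrative only): while the replacement index is
being built, every row insert updates both copies, so whichever copy
survives a commit or a rollback is complete.

```python
# Toy double-write for REINDEX: backends maintain both the old and the
# new index for the duration of the rebuild.

def insert_row(heap, live_indexes, value):
    heap.append(value)
    ctid = len(heap) - 1
    for index in live_indexes:            # old AND new during the rebuild
        index[value] = ctid
    return ctid

heap = ["a"]
old_index = {"a": 0}
new_index = {"a": 0}                      # freshly rebuilt copy
insert_row(heap, [old_index, new_index], "b")
# Either copy alone covers every heap row, so a REINDEX rollback
# (keep the old copy) or commit (keep the new) leaves a complete index.
```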

> Speaking of rollback, what happens when
> those backends try to insert into the new copy just after the REINDEX
> has failed and rolled back and deleted the new copy?  Or equivalently,
> what happens when they are still trying to insert into the old copy
> just after the REINDEX commits and deletes that one?

I am perfectly prepared to have the CREATE INDEX take 24 hours, if the
database can continue serving requests during that period and that
saves me 24 minutes of "maintenance" downtime.

I am also willing to wait for all other backends to commit or roll back
their running transactions. Heck, I am even ready to restart every other
backend (close and reconnect all clients) if that enables me to avoid
the downtime.

> The complexity and fragility of what you are proposing vastly outweighs
> any potential benefit, even if it could be made to work at all, which I
> continue to doubt.

The current workaround (using a second database and Slony replication -
log-shipping does not cut it) is far more expensive and complex still.
It requires at least 2X the write capacity of the whole database
(compared to 2X the single-index writes), plus another computer, plus a
fast network.

And the ability to add new indexes to a 24/7 database without
significant downtime is a requirement for any fast-growing business.

-- 
Hannu Krosing <hannu@skype.net>