Thread: Tricky bugs in concurrent index build

Tricky bugs in concurrent index build

From: Tom Lane
I see fairly nasty problems in the concurrent-index patch when it's
trying to build a unique index.  I think it's solvable but want some
more eyeballs on my reasoning.

Look at the code in IndexBuildHeapScan where we are deciding whether or
not to include a tuple in the index ("indexIt") and also whether or not
to include it in the uniqueness check ("tupleIsAlive").  In a normal
non-concurrent build, we have to include recently-dead tuples in the
index because transactions started before ours might try to use the
index after we finish it.  (Remember system catalogs generally operate
on SnapshotNow, so a query could use a newly created index even though
it will be run with a serializable snapshot much older than the index.)
So we have to put tuples into the index if any active transaction might
wish to see those tuples.  OTOH, we should exclude dead tuples from the
uniqueness check: the uniqueness constraint ought to be across
currently-valid tuples only.  In particular, for tuples previously
created or deleted by our own transaction, we certainly must include
created ones and not include deleted ones in the uniqueness check.

In the past, the only way we could see HEAPTUPLE_INSERT_IN_PROGRESS
or HEAPTUPLE_DELETE_IN_PROGRESS was for tuples created/deleted by our
own transaction, and so the actions taken by IndexBuildHeapScan are
to include in the index in both cases, but exclude DELETE_IN_PROGRESS
tuples from the uniqueness check.
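
In rough C, that decision amounts to this (a sketch only, not the actual
IndexBuildHeapScan source --- the enum mirrors HTSV_Result but everything
else is simplified):

#include <stdbool.h>

typedef enum HTSV_Result
{
    HEAPTUPLE_DEAD,
    HEAPTUPLE_LIVE,
    HEAPTUPLE_RECENTLY_DEAD,
    HEAPTUPLE_INSERT_IN_PROGRESS,   /* historically: our own insert */
    HEAPTUPLE_DELETE_IN_PROGRESS    /* historically: our own delete */
} HTSV_Result;

static void
classify_for_build(HTSV_Result status, bool *indexIt, bool *tupleIsAlive)
{
    switch (status)
    {
        case HEAPTUPLE_LIVE:
            *indexIt = true;
            *tupleIsAlive = true;
            break;
        case HEAPTUPLE_DEAD:
            *indexIt = false;
            *tupleIsAlive = false;
            break;
        case HEAPTUPLE_RECENTLY_DEAD:
            /* index it for the benefit of old snapshots, but keep it
             * out of the uniqueness check */
            *indexIt = true;
            *tupleIsAlive = false;
            break;
        case HEAPTUPLE_INSERT_IN_PROGRESS:
            /* assumed to be our own uncommitted insert */
            *indexIt = true;
            *tupleIsAlive = true;
            break;
        case HEAPTUPLE_DELETE_IN_PROGRESS:
            /* assumed to be our own uncommitted delete */
            *indexIt = true;
            *tupleIsAlive = false;
            break;
    }
}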

This does not work for a concurrent build, though, because if the
in-progress delete is from another transaction, it could roll back after
we look.  In that case we have an entry that is in the index and has
escaped the uniqueness check.  If it conflicts with another tuple also
entered into the index in the first pass, we'll never notice that.

I think we can solve this by having IndexBuildHeapScan not index
DELETE_IN_PROGRESS tuples if it's doing a concurrent build.  The problem
of old transactions trying to use the index does not exist, because
we'll wait 'em out before marking the index valid, so we need not
worry about preserving validity for old snapshots.  And if the deletion
does in fact roll back, we'll insert the tuple during the second pass,
and catch any uniqueness violation at that point.
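
In terms of the sketch above, the DELETE_IN_PROGRESS arm would grow a
branch on a concurrent-build flag (again just illustrative):

#include <stdbool.h>

static void
classify_delete_in_progress(bool concurrent, bool *indexIt, bool *tupleIsAlive)
{
    *tupleIsAlive = false;      /* never include in the uniqueness check */
    if (concurrent)
        *indexIt = false;       /* pass 2 picks it up if the delete aborts */
    else
        *indexIt = true;        /* old snapshots may still need to see it */
}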

But wait, there's more: in the patch as it stands, the second pass
over the table ignores DELETE_IN_PROGRESS tuples, which is wrong.
It's entirely possible for a tuple that is RECENTLY_DEAD or
DELETE_IN_PROGRESS to have no entry in the index, if it was inserted
during the first pass, and then the deletion occurred after the first
pass (and either has or hasn't committed yet).  If we ignore
DELETE_IN_PROGRESS and then the deleter rolls back, the tuple never
gets into the index at all.  Furthermore, the patch also tries to
insert RECENTLY_DEAD tuples, which is good for MVCC coverage, but wrong
for uniqueness checking --- keep in mind that in the second pass,
we are just doing normal index insertions, and so anything we insert
into the index will be uniqueness-checked as though still alive.
We could get a uniqueness failure that should not occur, eg from
trying to insert the old version of a recently-updated row.

What I think we can do about this is to include DELETE_IN_PROGRESS
tuples in the set of candidate tuples to insert in the second pass.
During the merge step that verifies whether the tuple is already
in the index, if we find that it's not, then we must wait for the
deleter to commit or roll back.  If the deleter commits then we
ignore the tuple.  If the deleter rolls back then we have to insert
the tuple in the index.  (I think we have to actually take a FOR
UPDATE or possibly FOR SHARE lock on the tuple while we do this,
else we have race conditions against someone else starting a new
deletion attempt on the tuple.)  In the commit case we've essentially
waited long enough to transform DELETE_IN_PROGRESS into RECENTLY_DEAD,
and for both of those statuses we are not going to insert into the
index for fear of causing a false uniqueness violation.  What that
means is that after we finish the second pass, we need *another*
wait-for-everyone-to-die before we can mark the index valid.  Otherwise
we have the same MVCC problem that someone might try to use the index
for a query with a snapshot old enough that it should be able to see
the not-inserted tuple.
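
For one DELETE_IN_PROGRESS candidate tuple, the merge step would go
roughly like this (all stub names, not functions that actually exist in
the patch):

#include <stdbool.h>

typedef struct HeapTupleData *HeapTuple;

extern bool in_sorted_indexscan_data(HeapTuple tup);
extern void lock_tuple_share(HeapTuple tup);    /* also waits out the deleter */
extern void unlock_tuple(HeapTuple tup);
extern bool deleter_committed(HeapTuple tup);
extern void index_insert_checking_unique(HeapTuple tup);

static void
merge_delete_in_progress(HeapTuple tup)
{
    if (in_sorted_indexscan_data(tup))
        return;                 /* already indexed in pass 1, nothing to do */

    /*
     * Not in the index, so we must learn the deleter's fate.  The share
     * lock both waits for it to finish and fends off anyone starting a
     * fresh deletion attempt while we work.
     */
    lock_tuple_share(tup);
    if (!deleter_committed(tup))
        index_insert_checking_unique(tup);      /* deletion rolled back */
    /* else: now effectively RECENTLY_DEAD --- leave it out of the index,
     * relying on the extra wait before we mark the index valid */
    unlock_tuple(tup);
}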

Have I missed anything?  This is tricky stuff.

In any case it's clear that the patch still needs major work.
Greg, do you have cycles to spare now?
        regards, tom lane


Re: Tricky bugs in concurrent index build

From: Greg Stark
Tom Lane <tgl@sss.pgh.pa.us> writes:

> I think we can solve this by having IndexBuildHeapScan not index
> DELETE_IN_PROGRESS tuples if it's doing a concurrent build.  

Sure

> It's entirely possible for a tuple that is RECENTLY_DEAD or
> DELETE_IN_PROGRESS to have no entry in the index, if it was inserted
> during the first pass, and then the deletion occurred after the first
> pass (and either has or hasn't committed yet).  

Egads. That's nasty indeed.

> Furthermore, the patch also tries to insert RECENTLY_DEAD tuples, which is
> good for MVCC coverage, but wrong for uniqueness checking --- keep in mind
> that in the second pass, we are just doing normal index insertions, and so
> anything we insert into the index will be uniqueness-checked as though still
> alive. We could get a uniqueness failure that should not occur, eg from
> trying to insert the old version of a recently-updated row.

Hm, I hadn't absorbed the purpose of isAlive and the distinction between live
for uniqueness checks and live for index build purposes.

Is it not possible to brute-force this by adding an AM method to insert
without the uniqueness check? That would mean the index build would fail even
if the transaction eventually aborts though. (Or even if it has already
aborted?)

[ extended description of complex footwork involving more waiting while
holding locks ]

> Have I missed anything?  This is tricky stuff.

Wow, that seems pretty unsatisfactory, all the waiting and locking sounds
awful. If you have a lot of update transactions starting continuously you
could keep bumping into this situation and repeatedly have to wait for new
transactions to end.

It also seems like a lot of code :(

> In any case it's clear that the patch still needs major work.
> Greg, do you have cycles to spare now?

I do. But I'll have to spend some time just rereading the code and your
comments to convince myself that all this waiting and locking is the best
solution.

-- 
greg



Re: Tricky bugs in concurrent index build

From: Tom Lane
Greg Stark <gsstark@mit.edu> writes:
> Is it not possible to brute-force this by adding an AM method to insert
> without the uniqueness check?

Hm.  Actually there already is a feature of aminsert to allow
suppressing the unique check, but I'm not sure whether using it for
RECENTLY_DEAD tuples helps.  Seems like we have to wait to see whether
DELETE_IN_PROGRESS deleters commit in any case.

>> Have I missed anything?  This is tricky stuff.

> Wow, that seems pretty unsatisfactory, all the waiting and locking sounds
> awful.

Yeah, I'm very unhappy.  The whole idea may be going down in flames :-(
It's fairly clear that we could support concurrent builds of nonunique
indexes, but is that enough of a use-case to justify it?
        regards, tom lane


Re: Tricky bugs in concurrent index build

From: "Joshua D. Drake"
> 
>> Wow, that seems pretty unsatisfactory, all the waiting and locking sounds
>> awful.
> 
> Yeah, I'm very unhappy.  The whole idea may be going down in flames :-(
> It's fairly clear that we could support concurrent builds of nonunique
> indexes, but is that enough of a use-case to justify it?

I believe there would be. Most PostgreSQL users I run into develop in 
production, which means being able to add an index they forgot when 
doing query analysis.

Most of the time (I would say >95%) this is not a unique index.

Sincerely,

Joshua D. Drake




-- 
   === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive PostgreSQL solutions since 1997
http://www.commandprompt.com/




Re: Tricky bugs in concurrent index build

From: Tom Lane
"Joshua D. Drake" <jd@commandprompt.com> writes:
>> It's fairly clear that we could support concurrent builds of nonunique
>> indexes, but is that enough of a use-case to justify it?

> I believe there would be. Most PostgreSQL users I run into develop in 
> production, which means being able to add an index they forgot when 
> doing query analysis.

True, unique constraints are usually something you should get right to
start with.  But it'll be annoying if we can do everything BUT that :-(
        regards, tom lane


Re: Tricky bugs in concurrent index build

From: "Joshua D. Drake"
Tom Lane wrote:
> "Joshua D. Drake" <jd@commandprompt.com> writes:
>>> It's fairly clear that we could support concurrent builds of nonunique
>>> indexes, but is that enough of a use-case to justify it?
> 
>> I believe there would be. Most PostgreSQL users I run into develop in 
>> production, which means being able to add an index they forgot when 
>> doing query analysis.
> 
> True, unique constraints are usually something you should get right to
> start with.  But it'll be annoying if we can do everything BUT that :-(

Agreed, but better than nothing :).

Sincerely,

Joshua D. Drake



-- 
   === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive PostgreSQL solutions since 1997
http://www.commandprompt.com/




Re: Tricky bugs in concurrent index build

From: Greg Stark
Tom Lane <tgl@sss.pgh.pa.us> writes:

> Greg Stark <gsstark@mit.edu> writes:
> > Is it not possible to brute force this adding an AM method to insert without
> > the uniqueness check?
> 
> Hm.  Actually there already is a feature of aminsert to allow
> suppressing the unique check, but I'm not sure whether using it for
> RECENTLY_DEAD tuples helps.  Seems like we have to wait to see whether
> DELETE_IN_PROGRESS deleters commit in any case.

Hm, actually don't we need both of these to make it work? We need to see
whether the deleter commits to determine whether to enforce the uniqueness
constraint on the missing tuple.

. If the deleter aborts we need to insert the tuple normally, including
  enforcing the constraint.

. If the deleter commits then we still need to insert the tuple, but without
  enforcing the constraint, in case some old transaction queries the index.

What would happen if we just insert DELETE_IN_PROGRESS tuples normally? Would
the only risk be that the index build would fail with a spurious unique
constraint violation? I suppose it would be pretty common though given how
updates work.


Incidentally does this point out a problem with the planner depending on
unique constraints? For old (serializable) transactions the unique index
exists and the constraint is enforced but they can still find tuples that were
deleted before the index was built and might violate the unique constraint.
Even if we had the plan invalidation mechanism that's frequently mentioned, you
would still have to check constraints against your snapshot and not just
SnapshotNow for planning purposes.

-- 
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com



Re: Tricky bugs in concurrent index build

From: Tom Lane
Greg Stark <gsstark@mit.edu> writes:
> What would happen if we just insert DELETE_IN_PROGRESS tuples normally? Would
> the only risk be that the index build would fail with a spurious unique
> constraint violation? I suppose it would be pretty common though given how
> updates work.

Yeah, that's the problem: if we can't support UPDATEs that don't change
the to-be-unique column, it ain't much of a feature.

> Incidentally does this point out a problem with the planner depending on
> unique constraints?

Good point.  It doesn't depend on them yet, but we've been hoping to
make it do so once we have plan invalidation capability.  We shall have
to think very carefully about timing semantics of all that.
        regards, tom lane


Re: Tricky bugs in concurrent index build

From: Hannu Krosing
On one fine day, Tue, 2006-08-22 at 16:48, Tom Lane wrote:
> "Joshua D. Drake" <jd@commandprompt.com> writes:
> >> It's fairly clear that we could support concurrent builds of nonunique
> >> indexes, but is that enough of a use-case to justify it?
> 
> > I believe there would be. Most PostgreSQL users I run into develop in 
> > production, which means being able to add an index they forgot when 
> > doing query analysis.
> 
> True, unique constraints are usually something you should get right to
> start with.  But it'll be annoying if we can do everything BUT that :-(

Maybe we could find a way to build a non-unique index first and then
convert it to a unique one later, in yet another pass?

-- 
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com




Re: Tricky bugs in concurrent index build

From: Hannu Krosing
On one fine day, Wed, 2006-08-23 at 11:05, Hannu Krosing wrote:
> On one fine day, Tue, 2006-08-22 at 16:48, Tom Lane wrote:
> > "Joshua D. Drake" <jd@commandprompt.com> writes:
> > >> It's fairly clear that we could support concurrent builds of nonunique
> > >> indexes, but is that enough of a use-case to justify it?
> > 
> > > I believe there would be. Most PostgreSQL users I run into develop in 
> > > production, which means being able to add an index they forgot when 
> > > doing query analysis.
> > 
> > True, unique constraints are usually something you should get right to
> > start with.  But it'll be annoying if we can do everything BUT that :-(
> 
> Maybe we could find a way to build a non-unique index first and then
> convert it to a unique one later, in yet another pass?

Or even add ALTER INDEX myindex ADD/DROP UNIQUE; command

-- 
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com



Re: Tricky bugs in concurrent index build

From: "Zeugswetter Andreas DCP SD"
> > Is it not possible to brute-force this by adding an AM method to insert
> > without the uniqueness check?
>
> Hm.  Actually there already is a feature of aminsert to allow
> suppressing the unique check, but I'm not sure whether using
> it for RECENTLY_DEAD tuples helps.  Seems like we have to
> wait to see whether DELETE_IN_PROGRESS deleters commit in any case.

Um, but if we wait for the DELETE_IN_PROGRESS tuple, after the wait we can
add it either with or without the unique check (depending on commit/abort).

Then at least we don't need to wait in a 3rd pass for readers?

Andreas


Re: Tricky bugs in concurrent index build

From: Greg Stark
Tom Lane <tgl@sss.pgh.pa.us> writes:

> What I think we can do about this is to include DELETE_IN_PROGRESS
> tuples in the set of candidate tuples to insert in the second pass.
> During the merge step that verifies whether the tuple is already
> in the index, if we find that it's not, then we must wait for the
> deleter to commit or roll back.  If the deleter commits then we
> ignore the tuple.  If the deleter rolls back then we have to insert
> the tuple in the index.  (I think we have to actually take a FOR
> UPDATE or possibly FOR SHARE lock on the tuple while we do this,
> else we have race conditions against someone else starting a new
> deletion attempt on the tuple.)  

Hm, my first thought was to just try to get a lock on the record which would
inherently wait until the deleter commits or aborts.

But then wouldn't we have deadlock risks? If we come across these records in a
different order from someone else (possibly even the deleter) who also wants
to lock them? Or would it be safe to lock and release them one by one so we
only ever hold one lock at a time?

I'm also pondering whether it might be worth saving up all the
DELETE_IN_PROGRESS tuples in a second tuplesort and processing them all in a
third phase. That seems like it would reduce the amount of waiting that might
be involved. The fear I have though is that this third phase could become
quite large.

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com



Re: Tricky bugs in concurrent index build

From: Greg Stark
Tom Lane <tgl@sss.pgh.pa.us> writes:

> I think we can solve this by having IndexBuildHeapScan not index
> DELETE_IN_PROGRESS tuples if it's doing a concurrent build.  The problem
> of old transactions trying to use the index does not exist, because
> we'll wait 'em out before marking the index valid, so we need not
> worry about preserving validity for old snapshots.  And if the deletion
> does in fact roll back, we'll insert the tuple during the second pass,
> and catch any uniqueness violation at that point.

Actually that's a bit of a pain. This function is called from the AM functions
so it doesn't have any access to whether it's doing a concurrent build or not.
I would have to stuff a flag in the IndexInfo or change the AM API.

The API is pretty bizarre around this. The core calls the AM to build the
index, the AM calls back this function in core to do the scan, then this
function calls back into the AM to handle the inserts. 

It seems like it would be simpler to leave the core in charge the whole time.
It would call an AM method to initialize state, then call an AM method for
each tuple that should be indexed, and lastly call a finalize method.
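
Something along these lines, say (hypothetical names --- no such API
exists today):

typedef struct AmBuildState AmBuildState;   /* opaque, AM-private */

typedef struct IndexAmBuildApi
{
    AmBuildState *(*build_init)(void *indexRelation);
    void          (*build_tuple)(AmBuildState *state,
                                 void *values, void *heapTid);
    void          (*build_finish)(AmBuildState *state);
} IndexAmBuildApi;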


Also, I think there may be another problem here with INSERT_IN_PROGRESS. I'm
currently testing unique index builds while pgbench is running and I'm
consistently getting unique index violations from phase 1. I think what's
happening is that inserts that haven't committed yet (and hence ought to be
invisible to us) are hitting unique constraint violations against older
versions that are still alive to us. 

-- 
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com



Re: Tricky bugs in concurrent index build

From: Greg Stark
Hannu Krosing <hannu@skype.net> writes:

> On one fine day, Wed, 2006-08-23 at 11:05, Hannu Krosing wrote:
> > 
> > Maybe we could find a way to build a non-unique index first and then
> > convert it to a unique one later, in yet another pass?
> 
> Or even add ALTER INDEX myindex ADD/DROP UNIQUE; command

That would be great. But note that it suffers from precisely the same problem.
If you come across a couple of records with the same key and one of them is
DELETE_IN_PROGRESS then you'll have to wait until you can acquire a sharelock
on it before you can determine if there's a constraint violation.


Hmmm. Or is that true. The problem may be somewhat easier since at least you
can be sure every tuple in the heap is in the index. So if you see a
DELETE_IN_PROGRESS either it *was* a constraint violation prior to the delete
and failing is reasonable or it's an update in which case maybe it's possible
to detect that they're part of the same chain?

(Actually there is another corner case: a transaction that inserts a value,
then deletes it in the same transaction then inserts that same value again.
Now you have a INSERT_IN_PROGRESS and a DELETE_IN_PROGRESS that conflict but
should be allowed since they come from the same transaction. Hopefully the
ALTER INDEX command would be able to determine they come from the same
transaction.)

In the case of concurrent index builds that's not really safe, since you don't
have the other tuples you're conflicting with together at the same time, and
even if you did you may or may not have a complete set of them.

Tom's right. This stuff is tricky.

-- 
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com



Re: Tricky bugs in concurrent index build

From: Tom Lane
Greg Stark <gsstark@mit.edu> writes:
> But then wouldn't we have deadlock risks? If we come across these records in a
> different order from someone else (possibly even the deleter) who also wants
> to lock them? Or would it be safe to lock and release them one by one so we
> only ever hold one lock at a time?

AFAICS we could release the lock as soon as we've inserted the index
entry.  (Whether there is any infrastructure to do that is another
question...)

> I'm also pondering whether it might be worth saving up all the
> DELETE_IN_PROGRESS tuples in a second tuplesort and processing them all in a
> third phase. That seems like it would reduce the amount of waiting that might
> be involved. The fear I have though is that this third phase could become
> quite large.

Actually --- a tuple that is live when we do the "second pass" scan
could well be DELETE_IN_PROGRESS (or even RECENTLY_DEAD) by the time we
do the merge and discover that it hasn't got an index entry.  So offhand
I'm thinking that we *must* take a tuple lock on *every* tuple we insert
in stage two.  Ugh.
        regards, tom lane


Re: Tricky bugs in concurrent index build

From: Tom Lane
Greg Stark <gsstark@mit.edu> writes:
> It seems like it would be simpler to leave the core in charge the whole time.
> It would call an AM method to initialize state, then call an AM method for
> each tuple that should be indexed, and lastly call a finalize method.

[ shrug... ]  I'm uninterested in refactoring the AM API right now.
We've got enough stuff to deal with before beta, not to mention an
uncommitted bitmap AM patch that it would certainly break.

> Also, I think there may be another problem here with INSERT_IN_PROGRESS. I'm
> currently testing unique index builds while pgbench is running and I'm
> consistently getting unique index violations from phase 1. I think what's
> happening is that inserts that haven't committed yet (and hence ought to be
> invisible to us) are hitting unique constraint violations against older
> versions that are still alive to us. 

Hmm ... it's certainly highly likely that we could pick up multiple
versions of a row during pass 1, but the uniqueness checker should
notice that some versions are dead?  Oooh, no there's a problem:
the tuple we are inserting could become dead in the interval between
when we pick it up and when we put it into the index.  So we could
try to put multiple versions of the same row into the uniqueness check.

Right at the moment, unique index builds with this mechanism are looking
unfixably broken :-(.  Anyone see any chance at all of making them work?
Maybe we should just cut our losses and go with the nonunique case only.
That one is pretty easy: just stuff everything in the index ...
        regards, tom lane


Re: Tricky bugs in concurrent index build

From: Tom Lane
Greg Stark <gsstark@mit.edu> writes:
> Hmmm. Or is that true. The problem may be somewhat easier since at least you
> can be sure every tuple in the heap is in the index. So if you see a
> DELETE_IN_PROGRESS either it *was* a constraint violation prior to the delete
> and failing is reasonable or it's an update in which case maybe it's possible
> to detect that they're part of the same chain?

Unless we are willing to lock every single tuple while we insert it,
this seems unfixable to me.  Without a lock, the tuple could become
DELETE_IN_PROGRESS immediately after we look at it.

Actually it's worse than that.  We could examine a tuple, see that
it's good, include it in the uniqueness check.  Then someone updates
the tuple and puts the new version near the end of the table.  By
the time we reach that version, it could be committed good.  There
is absolutely no way that we could notice an issue without applying
extremely expensive tests to *every* apparently-good tuple.

[ thinks for a bit... ]  At least, it seems hopeless if we use
SnapshotNow.  Does it help if we use a real snapshot?  I'm thinking
pass 1 inserts exactly those tuples that are good according to a
snap taken at its beginning, and then pass 2 considers only tuples
that are good according to a snap taken at *its* beginning.  But
having consumed no caffeine yet this morning, I'm not sure I can
spot any flaws that might exist in this idea.
        regards, tom lane


Re: Tricky bugs in concurrent index build

From: Hannu Krosing
On one fine day, Wed, 2006-08-23 at 09:01, Tom Lane wrote:
> Greg Stark <gsstark@mit.edu> writes:
> > Hmmm. Or is that true. The problem may be somewhat easier since at least you
> > can be sure every tuple in the heap is in the index. So if you see a
> > DELETE_IN_PROGRESS either it *was* a constraint violation prior to the delete
> > and failing is reasonable or it's an update in which case maybe it's possible
> > to detect that they're part of the same chain?
> 
> Unless we are willing to lock every single tuple while we insert it,
> this seems unfixable to me.  Without a lock, the tuple could become
> DELETE_IN_PROGRESS immediately after we look at it.
> 
> Actually it's worse than that.  We could examine a tuple, see that
> it's good, include it in the uniqueness check.  Then someone updates
> the tuple and puts the new version near the end of the table.  By
> the time we reach that version, it could be committed good. 

Perhaps we should scan the index in index order, and set the unique flag
per index page.

That would
a) help us avoid going to the heap unless there are multiple entries for
some value, and
b) enable us, once all index pages containing pointers to some possibly
duplicate value are checked, to release locks on those tuples.

-- 
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com




Re: Tricky bugs in concurrent index build

From: Tom Lane
stark <stark@enterprisedb.com> writes:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
>> [ thinks for a bit... ]  At least, it seems hopeless if we use
>> SnapshotNow.  Does it help if we use a real snapshot?  I'm thinking
>> pass 1 inserts exactly those tuples that are good according to a
>> snap taken at its beginning, and then pass 2 considers only tuples
>> that are good according to a snap taken at *its* beginning.  But
>> having consumed no caffeine yet this morning, I'm not sure I can
>> spot any flaws that might exist in this idea.

> What about tuples that are inserted and committed in the window between the
> two phases. Ie, they're RECENTLY_DEAD but not in phase2's snapshot.

We'd put them in the index but skip the uniqueness check.

> Or do you mean we use SatisfiesVacuum to determine what to insert but
> SatisfiesSnapshot to determine whether to check uniqueness?

Right.  The problems seem to all stem from the risk of trying to
unique-check more than one version of a tuple, and using a snap would
stop that.  We need to think through all the cases though and be sure
they all work.
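
That is, per tuple, roughly (pseudo-C with stub visibility tests standing
in for HeapTupleSatisfiesVacuum and the snapshot test):

#include <stdbool.h>

typedef struct HeapTupleData *HeapTuple;
typedef struct SnapshotData *Snapshot;

extern bool vacuum_visibility_says_index_it(HeapTuple tup);
extern bool visible_to_snapshot(HeapTuple tup, Snapshot snap);

static void
decide(HeapTuple tup, Snapshot snap, bool *indexIt, bool *checkUnique)
{
    *indexIt = vacuum_visibility_says_index_it(tup);

    /*
     * Only versions live under our snapshot enter the uniqueness check,
     * so at most one version of any given row can ever be checked.
     */
    *checkUnique = *indexIt && visible_to_snapshot(tup, snap);
}
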
        regards, tom lane


Re: Tricky bugs in concurrent index build

From: stark
Tom Lane <tgl@sss.pgh.pa.us> writes:

>> Or do you mean we use SatisfiesVacuum to determine what to insert but
>> SatisfiesSnapshot to determine whether to check uniqueness?
>
> Right.  The problems seem to all stem from the risk of trying to
> unique-check more than one version of a tuple, and using a snap would
> stop that.  We need to think through all the cases though and be sure
> they all work.

What happens if someone inserts a record that we miss, but it gets deleted by
the time phase 2 starts? So it's not visible to phase 2 but conflicts with
some other record we find. I suppose that's OK since the delete would have to
have committed for that to happen. It just means that having a unique
constraint doesn't guarantee uniqueness if your transaction started before the
index was finished being built.

Or what if there's an insert that occurs before phase 2 starts and hasn't
committed yet? There's a conflicting record in the heap that's missing in the
index. I guess the build would have to block when it finds the missing record
until the new insert either commits or aborts, just like inserts do when a user
inserts a potential conflict. Would I have to handle that myself or does
index_insert handle that automatically?

-- 
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com



Re: Tricky bugs in concurrent index build

From: Tom Lane
stark <stark@enterprisedb.com> writes:
> What happens if someone inserts a record that we miss, but it gets deleted by
> the time phase 2 starts? So it's not visible to phase 2 but conflicts with
> some other record we find. I suppose that's OK since the delete would have to
> have committed for that to happen.

I think that's OK, but the whole idea of using an MVCC snap in phase 2
doesn't work on closer inspection.  The problem is still the same one
that you need to take (at least) share lock on each tuple you insert
into the index.  Telling aminsert to check uniqueness implicitly assumes
the new tuple is live, and without any lock on the tuple you can't
promise that.  So there's a significant risk of falsely declaring a
uniqueness failure due to someone else's perfectly innocent UPDATE.
Using an MVCC snap would actually make this worse, by widening the
window for that UPDATE to happen.

Even though we're hoping not to have to process too many tuples here,
having to row-lock each one sounds pretty awful.  Remember that that
involves writing a WAL record these days.
        regards, tom lane


Re: Tricky bugs in concurrent index build

From: Gregory Stark
Tom Lane <tgl@sss.pgh.pa.us> writes:

> I think that's OK, but the whole idea of using an MVCC snap in phase 2
> doesn't work on closer inspection.  The problem is still the same one
> that you need to take (at least) share lock on each tuple you insert
> into the index.  Telling aminsert to check uniqueness implicitly assumes
> the new tuple is live, and without any lock on the tuple you can't
> promise that.  

No wait. It's still "live" according to my snapshot. How could it be possible
for a single snapshot to see two different versions of the same tuple as live?

-- 
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com



Re: Tricky bugs in concurrent index build

From: Tom Lane
Gregory Stark <stark@enterprisedb.com> writes:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
>> I think that's OK, but the whole idea of using an MVCC snap in phase 2
>> doesn't work on closer inspection.  The problem is still the same one
>> that you need to take (at least) share lock on each tuple you insert
>> into the index.  Telling aminsert to check uniqueness implicitly assumes
>> the new tuple is live, and without any lock on the tuple you can't
>> promise that.  

> No wait. It's still "live" according to my snapshot. How could it be possible
> for a single snapshot to see two different versions of the same tuple as live?

The problem case is that we take a tuple and try to insert it into the index.
Meanwhile someone else updates the tuple, and they're faster than us so
they get the new version into the index first.  Now our aminsert sees a
conflicting index entry, and as soon as it commits good aminsert will
raise a uniqueness error.  There's no backoff for "oh, the tuple I'm
inserting stopped being live while I was inserting it".
        regards, tom lane


Re: Tricky bugs in concurrent index build

From: Tom Lane
I wrote:
> The problem case is that we take a tuple and try to insert it into the index.
> Meanwhile someone else updates the tuple, and they're faster than us so
> they get the new version into the index first.  Now our aminsert sees a
> conflicting index entry, and as soon as it commits good aminsert will
> raise a uniqueness error.  There's no backoff for "oh, the tuple I'm
> inserting stopped being live while I was inserting it".

It's possible that the problem could be solved by introducing such a
backoff, ie, make aminsert recheck liveness of the tuple-to-be-inserted
before declaring error.  Since we're about to fail anyway, performance
of this code path probably isn't a huge issue.  But I haven't thought
through whether it can be made to work with that addition.
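
The shape would be something like this, at the point where the btree
uniqueness check has found a committed-good conflict (sketch only; the
stub stands in for a fresh visibility test on our own tuple):

#include <stdbool.h>

typedef struct HeapTupleData *HeapTuple;

extern bool still_live(HeapTuple tup);      /* re-test liveness right now */
extern void report_unique_violation(void);

static void
on_committed_conflict(HeapTuple our_tuple)
{
    /*
     * If our tuple died while we were inserting it (say, a concurrent
     * UPDATE replaced it), the conflict is not a real violation.
     */
    if (still_live(our_tuple))
        report_unique_violation();
}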

Unless someone's got a brilliant idea, my recommendation at this point
is that we restrict the patch to building only non-unique indexes.
Per discussion upthread, that's still a useful feature.  We can revisit
the problem of doing uniqueness checks correctly in some future release,
but time to work on it for 8.2 is running out fast.
        regards, tom lane


Re: Tricky bugs in concurrent index build

From: Gregory Stark
Tom Lane <tgl@sss.pgh.pa.us> writes:

> I wrote:
>> The problem case is that we take a tuple and try to insert it into the index.
>> Meanwhile someone else updates the tuple, and they're faster than us so
>> they get the new version into the index first.  Now our aminsert sees a
>> conflicting index entry, and as soon as it commits good aminsert will
>> raise a uniqueness error.  There's no backoff for "oh, the tuple I'm
>> inserting stopped being live while I was inserting it".
>
> It's possible that the problem could be solved by introducing such a
> backoff, ie, make aminsert recheck liveness of the tuple-to-be-inserted
> before declaring error.  Since we're about to fail anyway, performance
> of this code path probably isn't a huge issue.  But I haven't thought
> through whether it can be made to work with that addition.

Yesterday I considered if I could just catch the error in validate_index and
retest HeapTupleSatisfiesVacuum after the insert but found that didn't work any
better. I don't remember the problem though and it's possible it would work if
it were inside aminsert.

> Unless someone's got a brilliant idea, my recommendation at this point
> is that we restrict the patch to building only non-unique indexes.
> Per discussion upthread, that's still a useful feature.  We can revisit
> the problem of doing uniqueness checks correctly in some future release,
> but time to work on it for 8.2 is running out fast.

I agree. There's other functionality in this area that would be nice too, such
as REINDEX CONCURRENTLY and deleting the invalid index in case of error. Once
one chunk gets into CVS it becomes easier to extend it without building up a
bigger divergence to merge some day.

I was also considering going ahead and implementing Hannu's ALTER INDEX SET
UNIQUE too. We would have the option of making CREATE UNIQUE INDEX
CONCURRENTLY automatically invoke that code afterwards. It would require a
second waiting phase though and a full index scan so it would be a much slower
option than handling it in the index build. On the plus side it would never
have to lock anything -- locking things inside a command explicitly billed as
concurrent strikes me as undesirable.

-- 
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com



Re: Tricky bugs in concurrent index build

From: Tom Lane
Gregory Stark <stark@enterprisedb.com> writes:
> I was also considering going ahead and implementing Hannu's ALTER INDEX SET
> UNIQUE too.

Don't waste your time, when we don't have an algorithm that would make
it work.  It's too late for 8.2 anyway...
        regards, tom lane


Re: Tricky bugs in concurrent index build

From: Gregory Stark
Tom Lane <tgl@sss.pgh.pa.us> writes:

> Gregory Stark <stark@enterprisedb.com> writes:
>> I was also considering going ahead and implementing Hannu's ALTER INDEX SET
>> UNIQUE too.
>
> Don't waste your time, when we don't have an algorithm that would make
> it work.  It's too late for 8.2 anyway...

Oh, I think ALTER INDEX SET UNIQUE is easy, at least algorithm-wise. We set
the index to be unique, then wait until everyone can see it. Then we scan to
make sure there's only one live tuple for each key value. As long as it's
unique in our snapshot and we can be sure that any concurrent changes will
preserve that property then we can be sure it's good.

Hm, I suppose we have to keep the index marked invalid during this operation
so it doesn't appear as if there's a promise that the data is unique. That's
fine for freshly built concurrent indexes but not so nice for an existing
non-unique index. We might need a new column induniqueisvalid that indicates
that a unique constraint can't be trusted yet.

I suppose it's 8.3 material. And maybe not even the most urgent item either.


-- 
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com



Re: Tricky bugs in concurrent index build

From: Tom Lane
Gregory Stark <stark@enterprisedb.com> writes:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
>> Unless someone's got a brilliant idea, my recommendation at this point
>> is that we restrict the patch to building only non-unique indexes.

> I assume you'll add the check?

Yeah, I'll take care of it.

At the moment it may be moot, because I've realized that validate_index
doesn't work anyway.  It is scanning the index and then assuming that
any tuple inserted into the index subsequent to that scan will still be
INSERT_IN_PROGRESS when the heapscan reaches it.  That's completely
bogus of course --- the tuple could be committed live or even already
deleted by the time the heapscan reaches it.  This would result in
making duplicate index entries, which is unacceptable even in a
nonunique index (it'd cause indexscans to return the same row twice).

I'm trying to work out whether we can fix this by taking an MVCC
snapshot before we scan the index, and then inserting all tuples we find
in the heap that are live according to the snap but are not present in
the indexscan data.  There are still race conditions in this but I think
sufficient delay would fix them.  Considerations:

* If a tuple is good according to the snap, it must have been inserted
by an xact that committed before the snap, therefore there is no
still-in-progress index insertion happening for it.  So if it's not in
the sorted indexscan data we know we must insert it.  If it is in the
indexscan data, of course we don't have to.

* If a tuple is not good according to the snap, there are three
possibilities:

** It was never committed good at all.  We need not index it.

** It was inserted by a transaction that committed post-snap.  We assume
that that transaction has or will index it.

** It was deleted by a transaction that committed pre-snap.  We assume
we need not index it because no remaining transaction will care about it.

The trick is that we must wait long enough to ensure that those two
assumptions are good.  We already know about waiting long enough to
ensure that all live transactions are aware of the index, which takes
care of the first assumption as long as we take the snap after that
wait.  However, the second assumption means that we must be sure there
are no other backends with serializable snapshots older than the snap
we are using for this.  I think this means that we wait for all xacts
to be aware of our index, take the reference snap, then wait for all
xacts except ours to die --- without exception --- and then we can
get on with the second phase of the work.

It might be OK to do the indexscan and sort before we do the second
wait, though, so we could get at least something done.
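
In pseudo-C the per-tuple decision reduces to this (stub names; the real
thing would work off the sorted indexscan TID data):

#include <stdbool.h>

typedef struct HeapTupleData *HeapTuple;
typedef struct SnapshotData *Snapshot;

extern bool good_per_snapshot(HeapTuple tup, Snapshot snap);
extern bool found_by_indexscan(HeapTuple tup);

static bool
must_insert_in_validate(HeapTuple tup, Snapshot ref_snap)
{
    if (!good_per_snapshot(tup, ref_snap))
        return false;   /* never committed good, or inserted post-snap
                         * (its inserter will index it), or deleted
                         * pre-snap (nobody left who cares) */

    /*
     * Good per the snap means its inserter committed before the snap,
     * so no index insertion can still be in flight for it: the
     * indexscan data is authoritative.
     */
    return !found_by_indexscan(tup);
}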

Comments?  Have I missed any case for a tuple to fail the snap?

Does this analysis change our conclusions about whether we can
support unique index builds?
        regards, tom lane


Re: Tricky bugs in concurrent index build

From: Tom Lane
I wrote:
> I'm trying to work out whether we can fix this by taking an MVCC
> snapshot before we scan the index, and then inserting all tuples we find
> in the heap that are live according to the snap but are not present in
> the indexscan data.  There are still race conditions in this but I think
> sufficient delay would fix them.  Considerations:

> * If a tuple is good according to the snap, it must have been inserted
> by an xact that committed before the snap, therefore there is no
> still-in-progress index insertion happening for it.  So if it's not in
> the sorted indexscan data we know we must insert it.  If it is in the
> indexscan data, of course we don't have to.

> * If a tuple is not good according to the snap, there are three
> possibilities:

> ** It was never committed good at all.  We need not index it.

> ** It was inserted by a transaction that committed post-snap.  We assume
> that that transaction has or will index it.

> ** It was deleted by a transaction that committed pre-snap.  We assume
> we need not index it because no remaining transaction will care about it.

> The trick is that we must wait long enough to ensure that those two
> assumptions are good.  We already know about waiting long enough to
> ensure that all live transactions are aware of the index, which takes
> care of the first assumption as long as we take the snap after that
> wait.  However, the second assumption means that we must be sure there
> are no other backends with serializable snapshots older than the snap
> we are using for this.  I think this means that we wait for all xacts
> to be aware of our index, take the reference snap, then wait for all
> xacts except ours to die --- without exception --- and then we can
> get on with the second phase of the work.

After a couple hours' thought I have not found any flaws in the above
analysis; furthermore I believe that the second wait need not
happen until just before we are ready to mark the index valid and
commit.  So that means that after the indexscan and second heapscan,
we wait for any xacts alive before that to end.  That's not too bad.

> Does this analysis change our conclusions about whether we can
> support unique index builds?

I also believe that we can support unique index builds after all.
There are three further changes needed besides the above:

* The first pass has to insert tuples that are good according to a
snapshot, rather than using HeapTupleSatisfiesVacuum.  This prevents
the problem of trying to unique-check multiple versions of the same
row that all chanced to be committed live at the instants we passed
over them.  If ambuild gets a failure it means the dataset was not
unique at the instant of the snap, so the user can hardly complain
about getting the error.  This will somewhat reduce the coverage of
the initial index build, but that seems like a small price.  (Perhaps
the docs should recommend using a smaller-than-normal fillfactor during
concurrent index creation?)

* We have to add the already-discussed logic to bt_check_unique to
recheck liveness of the to-be-inserted tuple immediately before it
raises a unique-index-violation error.  This is a waste of cycles
in the normal case, but since it's in an error path that doesn't
seem like a big problem.  (Is it worth extending the aminsert API
so that aminsert knows whether it's being called by CREATE INDEX
CONCURRENTLY or not?  I'm inclined not to bother, at least not
right now with the bitmap AM patch still unmerged.)

* When about to insert a tuple in pass 2, we can make the code
check liveness of the tuple and suppress the unique check if it's
already committed dead.  This means the bt_check_unique addition
only matters if the tuple goes dead in the interval between starting the
index insert and finishing it.  However that's an important case
to handle the concurrent-UPDATE scenario: if we find a conflicting
entry in the index, we'll wait for its inserter to commit, and that
is the same transaction that will have committed our new entry dead.
So the recheck is necessary and sufficient to cover this case.

Barring objections, I'm off to program this.
        regards, tom lane


Re: Tricky bugs in concurrent index build

From: Gregory Stark
Tom Lane <tgl@sss.pgh.pa.us> writes:

> At the moment it may be moot, because I've realized that validate_index
> doesn't work anyway.  It is scanning the index and then assuming that
> any tuple inserted into the index subsequent to that scan will still be
> INSERT_IN_PROGRESS when the heapscan reaches it.  

EGADS

Boy I feel stupid now. In fairness I think what happened is that the original
plan was, like your new plan, based on snapshots. And I only switched to using
HeapTupleSatisfiesVacuum after several iterations. I guess there were some
assumptions in the original thinking that I never revisited.

Because of the way the AM API works, changing how the initial heap scan works
is a bit of a pain. It would require either having some global state or
passing the concurrent flag through the AM methods or alternatively having a
whole new AM method.

I'll have to read (and reread) your description again in the morning.

-- 
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com


Re: Tricky bugs in concurrent index build

From: stark
Tom Lane <tgl@sss.pgh.pa.us> writes:

> Greg Stark <gsstark@mit.edu> writes:
>> Hmmm. Or is that true. The problem may be somewhat easier since at least you
>> can be sure every tuple in the heap is in the index. So if you see a
>> DELETE_IN_PROGRESS either it *was* a constraint violation prior to the delete
>> and failing is reasonable or it's an update in which case maybe it's possible
>> to detect that they're part of the same chain?
>
> Unless we are willing to lock every single tuple while we insert it,
> this seems unfixable to me.  Without a lock, the tuple could become
> DELETE_IN_PROGRESS immediately after we look at it.

I think there's some confusion here. This above paragraph was taken from some
thoughts about Hannu's suggestion of having a separate ALTER INDEX SET UNIQUE
command. That command might have an advantage over CREATE INDEX CONCURRENTLY
because it knows the index is already complete; it doesn't have to worry about
potential conflicts with tuples that it will only find later in the scan.

Effectively this is equivalent to making CREATE UNIQUE INDEX CONCURRENTLY
three phases. The first two phases would be a regular CREATE INDEX
CONCURRENTLY and the third phase would be what ALTER INDEX SET UNIQUE does
which is scan the index and verify that it's unique.

ALTER INDEX SET UNIQUE would have to perform a similar two-transaction dance
though. It would have to set the index unique, wait until everyone has seen
the new constraint. Then verify that the property is indeed unique, possibly
rolling back the constraint creation if it's not.

That would make the whole process of creating a unique index quite long. On
the plus side it would be a useful command in itself. Doing an index scan
might be pretty slow but if the table is mostly clean of dead and recently
dead tuples it won't have to visit the heap much and should still be much
quicker than building a new index. And it would itself be a concurrent
command.

> Actually it's worse than that.  We could examine a tuple, see that
> it's good, include it in the uniqueness check.  Then someone updates
> the tuple and puts the new version near the end of the table.  By
> the time we reach that version, it could be committed good.  There
> is absolutely no way that we could notice an issue without applying
> extremely expensive tests to *every* apparently-good tuple.

I think ALTER INDEX SET UNIQUE would not have this problem. It would only have
to look at tuples using its own snapshot and see if there's a violation. If
there isn't a violation as of its own snapshot then it can be sure later
transactions will preserve this property since the index was always complete
and it waited after creating the constraint.

> [ thinks for a bit... ]  At least, it seems hopeless if we use
> SnapshotNow.  Does it help if we use a real snapshot?  I'm thinking
> pass 1 inserts exactly those tuples that are good according to a
> snap taken at its beginning, and then pass 2 considers only tuples
> that are good according to a snap taken at *its* beginning.  But
> having consumed no caffeine yet this morning, I'm not sure I can
> spot any flaws that might exist in this idea.

What about tuples that are inserted and committed in the window between the
two phases. Ie, they're RECENTLY_DEAD but not in phase2's snapshot.

Or do you mean we use SatisfiesVacuum to determine what to insert but
SatisfiesSnapshot to determine whether to check uniqueness?

-- 
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com



Re: Tricky bugs in concurrent index build

From: Tom Lane
Gregory Stark <stark@enterprisedb.com> writes:
> Because of the way the AM API works, changing how the initial heap scan works
> is a bit of a pain. It would require either having some global state or
> passing the concurrent flag through the AM methods or alternatively having a
> whole new AM method.

Yeah, I dealt with that by adding a 'concurrent build' flag to IndexInfo.
A bit grotty but it beats changing a lot of function signatures.
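
I.e. along these lines (abridged sketch --- the real struct has many more
fields; ii_Concurrent is the flag in question):

#include <stdbool.h>

typedef struct IndexInfo
{
    int         ii_NumIndexAttrs;
    /* ... expression, predicate, etc. fields elided ... */
    bool        ii_Unique;
    bool        ii_Concurrent;  /* are we doing a concurrent build? */
} IndexInfo;
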
        regards, tom lane


Re: Tricky bugs in concurrent index build

From: Gregory Stark
Do we want something like this? I just made this error myself so unless I'm
special (pauses for jokes) I imagine others would be prone to it as well.

I would normally be pretty leery of code like this but it seems unlikely
anyone would actually want an index named "concurrently" and the consequences
if you get it wrong in a production environment are pretty dire. We might even
consider making it an outright error.


--- gram.y    25 Aug 2006 10:14:17 +0100    2.558
+++ gram.y    25 Aug 2006 14:04:54 +0100
@@ -56,6 +56,7 @@
 #include "commands/defrem.h"
 #include "nodes/makefuncs.h"
 #include "parser/gramparse.h"
+#include "parser/scansup.h"
 #include "storage/lmgr.h"
 #include "utils/date.h"
 #include "utils/datetime.h"
@@ -3653,6 +3654,12 @@
             opt_definition OptTableSpace where_clause
                 {
                     IndexStmt *n = makeNode(IndexStmt);
+
+                    if (!strcmp(downcase_truncate_identifier($4,20,false), "concurrently"))
+                        ereport(WARNING,
+                                (errcode(ERRCODE_SYNTAX_ERROR),
+                                 errmsg("performing non-concurrent index build of index named \"concurrently\"")));
+
                     n->unique = $2;
                     n->concurrent = false;
                     n->idxname = $4;
-- 
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com


Re: Tricky bugs in concurrent index build

From: Andrew Dunstan
Gregory Stark wrote:
> Do we want something like this? I just made this error myself so unless I'm
> special (pauses for jokes) I imagine others would be prone to it as well.
>
> I would normally be pretty leery of code like this but it seems unlikely
> anyone would actually want an index named "concurrently" and the consequences
> if you get it wrong in a production environment are pretty dire. We might even
> consider making it an outright error.
>
>
> --- gram.y    25 Aug 2006 10:14:17 +0100    2.558
> +++ gram.y    25 Aug 2006 14:04:54 +0100    
> @@ -56,6 +56,7 @@
>  #include "commands/defrem.h"
>  #include "nodes/makefuncs.h"
>  #include "parser/gramparse.h"
> +#include "parser/scansup.h"
>  #include "storage/lmgr.h"
>  #include "utils/date.h"
>  #include "utils/datetime.h"
> @@ -3653,6 +3654,12 @@
>              opt_definition OptTableSpace where_clause
>                  {
>                      IndexStmt *n = makeNode(IndexStmt);
> +
> +                    if (!strcmp(downcase_truncate_identifier($4,20,false), "concurrently"))
> +                        ereport(WARNING,
> +                                (errcode(ERRCODE_SYNTAX_ERROR),
> +                                 errmsg("performing non-concurrent index build of index named \"concurrently\"")));
> +
>                      n->unique = $2;
>                      n->concurrent = false;
>                      n->idxname = $4;
>   


I see we have:
CREATE index_opt_unique INDEX CONCURRENTLY index_name ...


which explains how this error occurs. But might it not be better to have 
this instead?
 CREATE CONCURRENTLY index_opt_unique INDEX index_name ...

Then ISTM no ambiguity could arise (and it's also closer to grammatical 
English, if that matters).

Just a thought

cheers

andrew


Re: Tricky bugs in concurrent index build

From: Tom Lane
Andrew Dunstan <andrew@dunslane.net> writes:
> I see we have:
>  CREATE index_opt_unique INDEX CONCURRENTLY index_name ...
> which explains how this error occurs.

Maybe to you, but I'm still caffeine-deprived and don't exactly see what
it was that Greg mistyped.  AFAICS he'd have to type CONCURRENTLY twice
to get into a scenario where the proposed warning would fire.

> But might it not be better to have this instead?
>   CREATE CONCURRENTLY index_opt_unique INDEX index_name ...

When I was fooling with gram.y I was thinking that actually
CREATE [UNIQUE] INDEX indexname [CONCURRENTLY] ...

would be the most grammatical thing.  But I can live with putting
it right after CREATE, too.  Or there was the proposal to put it
first:
[CONCURRENTLY] CREATE [UNIQUE] INDEX indexname ...
        regards, tom lane


Re: Tricky bugs in concurrent index build

From: Andrew Dunstan
Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>   
>> I see we have:
>>  CREATE index_opt_unique INDEX CONCURRENTLY index_name ...
>> which explains how this error occurs.
>>     
>
> Maybe to you, but I'm still caffeine-deprived and don't exactly see what
> it was that Greg mistyped.  AFAICS he'd have to type CONCURRENTLY twice
> to get into a scenario where the proposed warning would fire.
>
>   

AIUI, he left off the index name, so the first rule was matched rather 
than the second, with "concurrently" being the index name.

cheers

andrew


Re: Tricky bugs in concurrent index build

From: Stefan Kaltenbrunner
Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>> I see we have:
>>  CREATE index_opt_unique INDEX CONCURRENTLY index_name ...
>> which explains how this error occurs.
> 
> Maybe to you, but I'm still caffeine-deprived and don't exactly see what
> it was that Greg mistyped.  AFAICS he'd have to type CONCURRENTLY twice
> to get into a scenario where the proposed warning would fire.

I guess Greg is talking about something like this (ie just forgetting to
name the index):


devel=# create index concurrently on foo ( a);
CREATE INDEX
devel=# \d foo
     Table "public.foo"
 Column |  Type   | Modifiers
--------+---------+-----------
 a      | integer |
 b      | text    |
 c      | integer |
 d      | text    |
Indexes:
    "concurrently" btree (a)
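
With Greg's proposed warning applied, the same slip would presumably produce
something like this instead (a sketch of the expected session, not actual
output; the WARNING text is taken from the patch upthread):

devel=# create index concurrently on foo ( a);
WARNING:  performing non-concurrent index build of index named "concurrently"
CREATE INDEX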


Stefan


Re: Tricky bugs in concurrent index build

From
Bruce Momjian
Date:
Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
> > I see we have:
> >  CREATE index_opt_unique INDEX CONCURRENTLY index_name ...
> > which explains how this error occurs.
> 
> Maybe to you, but I'm still caffeine-deprived and don't exactly see what
> it was that Greg mistyped.  AFAICS he'd have to type CONCURRENTLY twice
> to get into a scenario where the proposed warning would fire.
> 
> > But might it not be better to have this instead?
> >   CREATE CONCURRENTLY index_opt_unique INDEX index_name ...
> 
> When I was fooling with gram.y I was thinking that actually
> 
>     CREATE [UNIQUE] INDEX indexname [CONCURRENTLY] ...
> 
> would be the most grammatical thing.  But I can live with putting

The original thinking was to use CONCURRENT, and CREATE CONCURRENT INDEX
sounded like a different type of index, not a different way to build the
index.  I don't think CONCURRENTLY has that problem, so CREATE
CONCURRENTLY INDEX sounds good.  To read in English, it would be read as
CREATE CONCURRENTLY, INDEX ii.

> it right after CREATE, too.  Or there was the proposal to put it
> first:
> 
>     [CONCURRENTLY] CREATE [UNIQUE] INDEX indexname ...

I think this suggested the command was CONCURRENTLY, which isn't good.

-- 
  Bruce Momjian   bruce@momjian.us
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +


Re: Tricky bugs in concurrent index build

From
Tom Lane
Date:
Bruce Momjian <bruce@momjian.us> writes:
> The original thinking was to use CONCURRENT, and CREATE CONCURRENT INDEX
> sounded like a different type of index, not a different way to build the
> index.  I don't think CONCURRENTLY has that problem, so CREATE
> CONCURRENTLY INDEX sounds good.  To read in English, it would be read as
> CREATE CONCURRENTLY, INDEX ii.

OK, we've got two votes for that, so I'll make it so.
        regards, tom lane


Re: Tricky bugs in concurrent index build

From
Gregory Stark
Date:
Bruce Momjian <bruce@momjian.us> writes:

> The original thinking was to use CONCURRENT, and CREATE CONCURRENT INDEX
> sounded like a different type of index, not a different way to build the
> index.  I don't think CONCURRENTLY has that problem, so CREATE
> CONCURRENTLY INDEX sounds good.  To read in English, it would be read as
> CREATE CONCURRENTLY, INDEX ii.

That doesn't sound like English at all to me.

Fwiw, I think the best option was what Tom did. The gotcha I tripped on seems
pretty minor to me.

-- 
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com


Re: Tricky bugs in concurrent index build

From
Bruce Momjian
Date:
Gregory Stark wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> 
> > The original thinking was to use CONCURRENT, and CREATE CONCURRENT INDEX
> > sounded like a different type of index, not a different way to build the
> > index.  I don't think CONCURRENTLY has that problem, so CREATE
> > CONCURRENTLY INDEX sounds good.  To read in English, it would be read as
> > CREATE CONCURRENTLY, INDEX ii.
> 
> That doesn't sound like English at all to me.
> 
> Fwiw, I think the best option was what Tom did. The gotcha I tripped on seems
> pretty minor to me.

What bothers me about what we have now is that we have optional keywords
before and after INDEX, rather than only between CREATE and INDEX.

-- 
  Bruce Momjian   bruce@momjian.us
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +


Re: Tricky bugs in concurrent index build

From
Tom Lane
Date:
Bruce Momjian <bruce@momjian.us> writes:
> What bothers me about what we have now is that we have optional keywords
> before and after INDEX, rather than only between CREATE and INDEX.

Yeah, putting them both into that space seems consistent to me, and
it will fix the problem of making an omitted index name look like
a valid command.

I'm not sure I should be opening this can of worms, but do we want to
use a different keyword than CONCURRENTLY to make it read better there?
        regards, tom lane


Re: Tricky bugs in concurrent index build

From
Alvaro Herrera
Date:
Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > What bothers me about what we have now is that we have optional keywords
> > before and after INDEX, rather than only between CREATE and INDEX.
> 
> Yeah, putting them both into that space seems consistent to me, and
> it will fix the problem of making an omitted index name look like
> a valid command.
> 
> I'm not sure I should be opening this can of worms, but do we want to
> use a different keyword than CONCURRENTLY to make it read better there?

The problem is that what the qualifier is doing is modifying the
operation itself, not the properties of the index to be created, like
UNIQUE, which modifies the index.  So the qualifier should be something
that modifies the CREATE, that is, an adverb (AFAIK).

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: Tricky bugs in concurrent index build

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> The problem is that what the qualifier is doing is modifying the
> operation itself, not the properties of the index to be created, like
> UNIQUE, which modifies the index.

Right, which was the same point Bruce made earlier.  And really you
can't respect that difference while putting them into the same place in
the word order.  So I'm starting to feel like maybe we should leave
well enough alone.
        regards, tom lane


Re: Tricky bugs in concurrent index build

From
Andrew Dunstan
Date:
Gregory Stark wrote:
> Bruce Momjian <bruce@momjian.us> writes:
>
>   
>> The original thinking was to use CONCURRENT, and CREATE CONCURRENT INDEX
>> sounded like a different type of index, not a different way to build the
>> index.  I don't think CONCURRENTLY has that problem, so CREATE
>> CONCURRENTLY INDEX sounds good.  To read in English, it would be read as
>> CREATE CONCURRENTLY, INDEX ii.
>>     
>
> That doesn't sound like English at all to me.
>
> Fwiw, I think the best option was what Tom did. The gotcha I tripped on seems
> pretty minor to me.
>
>   

It's a form of construction my father (who was a noted orator) loved to 
use, maybe a little too much. It is arguably slightly archaic, but 
nevertheless quite grammatical ;-) I agree that these days it is more 
idiomatic to defer the adverb until after the object of the verb in most 
cases.

cheers

andrew


Re: Tricky bugs in concurrent index build

From
"Zeugswetter Andreas DCP SD"
Date:
> > What bothers me about what we have now is that we have optional
> > keywords before and after INDEX, rather than only between
> > CREATE and INDEX.
>
> Yeah, putting them both into that space seems consistent to
> me, and it will fix the problem of making an omitted index
> name look like a valid command.
>
> I'm not sure I should be opening this can of worms, but do we
> want to use a different keyword than CONCURRENTLY to make it
> read better there?

precedent syntax (Oracle, Informix) uses the keyword ONLINE at the end:
 CREATE INDEX blabla_x0 ON blabla (a,b) ONLINE;

I'd stick with that.

Andreas


Re: Tricky bugs in concurrent index build

From
Alvaro Herrera
Date:
Zeugswetter Andreas DCP SD wrote:
> 
> > > What bothers me about what we have now is that we have optional 
> > > keywords before and after INDEX, rather than only between 
> > > CREATE and INDEX.
> > 
> > Yeah, putting them both into that space seems consistent to 
> > me, and it will fix the problem of making an omitted index 
> > name look like a valid command.
> > 
> > I'm not sure I should be opening this can of worms, but do we 
> > want to use a different keyword than CONCURRENTLY to make it 
> > read better there?
> 
> precedent syntax (Oracle, Informix) uses the keyword ONLINE at the end:
>  CREATE INDEX blabla_x0 ON blabla (a,b) ONLINE;

That was what the patch originally used, but it was changed because it
made it difficult for psql to auto-complete that.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: Tricky bugs in concurrent index build

From
Tom Lane
Date:
"Zeugswetter Andreas DCP SD" <ZeugswetterA@spardat.at> writes:
> precedent syntax (Oracle, Informix) uses the keyword ONLINE at the end:
>  CREATE INDEX blabla_x0 ON blabla (a,b) ONLINE;

We rejected that one already ...
        regards, tom lane


Re: Tricky bugs in concurrent index build

From
"Zeugswetter Andreas DCP SD"
Date:
> > precedent syntax (Oracle, Informix) uses the keyword ONLINE at the end:
> >  CREATE INDEX blabla_x0 ON blabla (a,b) ONLINE;
>
> That was what the patch originally used, but it was changed
> because it made it difficult for psql to auto-complete that.

That is imho not enough of a reason to deviate.

Andreas


Re: Tricky bugs in concurrent index build

From
Gregory Stark
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:

> That was what the patch originally used, but it was changed because it
> made it difficult for psql to auto-complete that.

Psql has to be able to parse it not for auto-completion but because it needs
to know that it's not a transactional command. The regular CREATE INDEX can be
run from within a transaction but online index builds use two transactions on
their own so psql has to know not to insert a BEGIN and savepoint around it.
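
For illustration, a sketch of the failure mode being described, assuming the
CONCURRENTLY-after-INDEX spelling: the server refuses the command inside an
explicit transaction block, so a psql that silently wrapped it in BEGIN
would break it:

devel=# BEGIN;
BEGIN
devel=# CREATE INDEX CONCURRENTLY foo_a_idx ON foo (a);
ERROR:  CREATE INDEX CONCURRENTLY cannot run inside a transaction block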

I'll use this opportunity to plug that feature again. I think most people
should use autocommit off with on_error_rollback on for most of their daily
use. Being able to double check the results of my ad-hoc updates before
committing them saved me more headaches than I can count with Oracle.
Autocommit off only became practical for interactive use with postgres when
savepoints showed up.
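
For anyone who wants to try this, the corresponding settings (interactively
in psql, or in ~/.psqlrc) are:

\set AUTOCOMMIT off
\set ON_ERROR_ROLLBACK on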


-- 
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com


Re: Tricky bugs in concurrent index build

From
Tom Lane
Date:
"Zeugswetter Andreas DCP SD" <ZeugswetterA@spardat.at> writes:
>> That was what the patch originally used, but it was changed 
>> because it made difficult for psql to auto-complete that.

> That is imho not enough of a reason to divert.

My recollection is that the principal argument against ONLINE was
that it didn't convey the function of the option to anyone who
didn't already know Oracle's usage of the term.

Also, psql's problem is not with auto-completion, it's with
detecting whether the command is allowed inside a transaction
block.  That's not a function we can just blow off.
        regards, tom lane


Re: Tricky bugs in concurrent index build

From
"Jim C. Nasby"
Date:
On Fri, Aug 25, 2006 at 06:57:58PM +0100, Gregory Stark wrote:
> I'll use this opportunity to plug that feature again. I think most people
> should use autocommit off with on_error_rollback on for most of their daily
> use. Being able to double check the results of my ad-hoc updates before
> committing them saved me more headaches than I can count with Oracle.
> Autocommit off only became practical for interactive use with postgres when
> savepoints showed up.

+1
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: Tricky bugs in concurrent index build

From
"Jim C. Nasby"
Date:
On Fri, Aug 25, 2006 at 11:25:43AM -0400, Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
> > The problem is that what the qualifier is doing is modifying the
> > operation itself, not the properties of the index to be created, like
> > UNIQUE, which modifies the index.
> 
> Right, which was the same point Bruce made earlier.  And really you
> can't respect that difference while putting them into the same place in
> the word order.  So I'm starting to feel like maybe we should leave
> well enough alone.

Since we might eventually have other 'concurrent commands', perhaps

CONCURRENT CREATE INDEX ...

would be best.

BTW, if we started to consider lazy vacuum a concurrent command we could
ditch the use of FULL, which is always confusing if you're talking about
database-wide vacuums. I know it'd take many versions to fully make that
change, but it seems worth it to me to reduce confusion. There's also an
issue of newbies thinking they should use vacuum full regularly because
it's somehow better than lazyvac.
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: Tricky bugs in concurrent index build

From
Gregory Stark
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:

> Barring objections, I'm off to program this.

A few concerns

a) The use of ShareUpdateExclusiveLock is supposed to lock out concurrent
   vacuums. I just tried it and vacuum seemed to be unaffected. I'm going to
   retry it with a clean cvs checkout to be sure it isn't something in my
   local tree that's broken.

   Do we still need to block concurrent vacuums if we're using snapshots?
   Obviously we have to block them during phase 1 because it won't have a
   chance of removing the tuples from our private collection of index tuples
   that haven't been pushed live yet. But if phase 2 is ignoring tuples too
   new to be visible in its snapshot then it shouldn't care if dead tuples
   are deleted even if those slots are later reused.

b) You introduced a LockRelationIdForSession() call (I even didn't realize we
   had this capability when I wrote the patch). Does this introduce the
   possibility of a deadlock though? If one of the transactions we're waiting
   to finish has a shared lock on the relation and is waiting for an
   exclusive lock on the relation then it seems we'll wait forever for it to
   finish and never see either of our conditions for continuing. That would
   be fine except because we're waiting manually the deadlock detection code
   doesn't have a chance of firing.

   To solve that we would have to replace the pg_sleep call with a
   XactLockTableWait. But I'm not clear how to find a transaction id to wait
   on. What we would want to find is any transaction id that has an xmin
   older than our xmin. Even that isn't ideal since it wouldn't give us a
   chance to test our other out, so if we choose a transaction to wait on
   that doesn't hold even a share lock on our table we could end up stuck
   longer than necessary (ie when we would have been able to momentarily
   acquire the exclusive lock on the table earlier).

c) It's a shame we don't support multiple concurrent concurrent index builds.
   We could create a ShareUpdateShareLock that conflicts with the same list
   of locks that ShareUpdateExclusiveLock conflicts with but not itself. From
   a UI point of view there's no excuse for not doing this, but from an
   implementation point of view there's a limit of 10 lock types and this
   would get up to 9. This is my first time looking at this code so I'm not
   sure how hard that limit is.

   One caveat is that the two jobs would see each other and that would make
   it hard for them to proceed to phase 2. I think what would happen is that
   the first one to finish phase 1 would be able to continue as soon as the
   other finishes phase 1. The second one would have to wait until the first
   one's phase 2 finished.


-- 
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com


Re: Tricky bugs in concurrent index build

From
David Fetter
Date:
On Fri, Aug 25, 2006 at 08:02:21PM -0500, Jim C. Nasby wrote:
> On Fri, Aug 25, 2006 at 06:57:58PM +0100, Gregory Stark wrote:
> > I'll use this opportunity to plug that feature again. I think most
> > people should use autocommit off with on_error_rollback on for most
> > of their daily use. Being able to double check the results of my
> > ad-hoc updates before committing them saved me more headaches than
> > I can count with Oracle.  Autocommit off only became practical for
> > interactive use with postgres when savepoints showed up.
> 
> +1

+1 here, too :)

Cheers,
D
-- 
David Fetter <david@fetter.org> http://fetter.org/
phone: +1 415 235 3778   AIM: dfetter666   Skype: davidfetter

Remember to vote!


Re: Tricky bugs in concurrent index build

From
Tom Lane
Date:
Gregory Stark <stark@enterprisedb.com> writes:
> A few concerns

> a) The use of ShareUpdateExclusiveLock is supposed to lock out concurrent
>    vacuums. I just tried it and vacuum seemed to be unaffected.

Can't reproduce such a problem here.

>    Do we still need to block concurrent vacuums if we're using snapshots?

That's an interesting question.  I don't think we need to lock out
vacuum in either pass as far as tuple removal goes: we are only
interested in tuples that are visible to a live transaction (ie ours)
and vacuum shouldn't remove them.  However one of the properties of
ShareUpdateExclusive is supposed to be that it locks out schema changes
... such as creation of a new index.  I wouldn't want to bet that vacuum
works correctly if a new index appears partway through its collection of
info about the relation.  I think we need to keep that lock so that
vacuum is safe against create index, rather than vice versa.

> b) You introduced a LockRelationIdForSession() call (I even didn't realize we
>    had this capability when I wrote the patch). Does this introduce the
>    possibility of a deadlock though?

Indeed, I had noticed this while testing your point (a).  I thought that
ConditionalLockRelation had enough smarts to grant itself a lock if
there would be a deadlock possibility otherwise, but it seems not to be
noticing.  We'll need to look into that.

> c) It's a shame we don't support multiple concurrent concurrent index builds.

I can't get excited about that.  The point of this facility is to build
indexes with minimal impact on production queries, and the I/O load
imposed by even one build is going to make that a bit questionable, let
alone multiple ones.  If you want to do some useful improvement on the
patch, look into making it throttle its I/O rate like vacuum can.
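
For reference, the vacuum throttling mentioned here is the cost-based delay
machinery; an index-build equivalent would presumably expose similar knobs.
Today these settings affect only vacuum:

SET vacuum_cost_delay = 10;    -- ms to sleep each time the cost limit is hit
SET vacuum_cost_limit = 200;   -- page-cost budget accumulated between sleeps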
        regards, tom lane


Re: Tricky bugs in concurrent index build

From
Tom Lane
Date:
I wrote:
> Gregory Stark <stark@enterprisedb.com> writes:
>> b) You introduced a LockRelationIdForSession() call (I even didn't realize we
>> had this capability when I wrote the patch). Does this introduce the
>> possibility of a deadlock though?

> Indeed, I had noticed this while testing your point (a).  I thought that
> ConditionalLockRelation had enough smarts to grant itself a lock if
> there would be a deadlock possibility otherwise, but it seems not to be
> noticing.  We'll need to look into that.

OK, the code I was remembering is the bit in ProcSleep to grant our proc
the lock if we can prove that deadlock detection would do so anyway.
It is probably possible to refactor things so that that's done before
we fail in the ConditionalLockAcquire case, but it would involve
uglifying ProcSleep's API even more.  And it doesn't sound like a
complete solution anyway --- the ProcSleep test is not capable of
detecting deadlocks involving more than one lock, and I'm not sure it
even intends to find all cases involving just one lock.  It was only
ever meant as a simple optimization to avoid a sleep in easy cases.

>> To solve that we would have to replace the pg_sleep call with a
>> XactLockTableWait. But I'm not clear how to find a transaction id to wait
>> on.

I'm toying with the idea of adding a lock manager call defined as
"give me a list of XIDs that currently hold locks conflicting with
lockmode X on object Y" --- but not including XIDs merely waiting
for such a lock.  Then we could get the list of XIDs currently blocking
ExclusiveLock, and do XactLockTableWait on each one.  Once they are
all gone we are safe; we don't care about later acquirers of locks
because they must see the index.  This avoids both the arbitrary
pg_sleep and the need for a check on "am I the oldest remaining XID";
and if we are in a deadlock situation it will be detected.
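
For concreteness, a sketch of that loop (GetConflictingXids is the
hypothetical new lock-manager call being proposed here, not existing code;
the locktag macro and XactLockTableWait are existing primitives, and
heaprelid stands for the OID of the table being indexed):

    LOCKTAG        heaplocktag;
    TransactionId *xids;
    int            nxids;
    int            i;

    /* identify the heap relation we are indexing */
    SET_LOCKTAG_RELATION(heaplocktag, MyDatabaseId, heaprelid);

    /* hypothetical: XIDs now holding locks conflicting with ExclusiveLock */
    xids = GetConflictingXids(&heaplocktag, ExclusiveLock, &nxids);

    /* wait each one out; normal deadlock detection applies in each wait */
    for (i = 0; i < nxids; i++)
        XactLockTableWait(xids[i]);

    /* anyone acquiring a conflicting lock later must already see the index */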

This looks like it should be easy enough to code (though I haven't
actually tried yet) ... do you see any logical flaws?
        regards, tom lane


Re: Tricky bugs in concurrent index build

From
Tom Lane
Date:
I wrote:
> I'm toying with the idea of adding a lock manager call defined as
> "give me a list of XIDs that currently hold locks conflicting with
> lockmode X on object Y" --- but not including XIDs merely waiting
> for such a lock.  Then we could get the list of XIDs currently blocking
> ExclusiveLock, and do XactLockTableWait on each one.

I've committed a patch along these lines.  I also reduced the strength
of the lock we check for from ExclusiveLock to ShareLock, which seems
sufficient --- did you have a reason for selecting ExclusiveLock in the
original coding?
        regards, tom lane