Thread: Support for REINDEX CONCURRENTLY
Hi all,
One of the outcomes of the discussions about integrating pg_reorg into core
was that Postgres should provide a way to run REINDEX, CLUSTER and ALTER
TABLE concurrently, with low-level locks, in a way similar to CREATE INDEX CONCURRENTLY.
Those discussions can be found in this thread:
http://archives.postgresql.org/pgsql-hackers/2012-09/msg00746.php
Well, I spent some spare time working on an implementation of REINDEX CONCURRENTLY.
This basically allows read and write operations on a table while its index(es) are
being reindexed, which is pretty useful for a production environment. The caveat of this
feature is that it is slower than a normal reindex, and it impacts other backends with the extra CPU,
memory and IO it uses. The implementation is based on the same ideas
as pg_reorg and on an idea of Andres.
Please find attached a version that I consider a base for further discussion, and perhaps
a version that could be submitted to the commitfest next month. The patch is based on postgres
master at commit 09ac603.
With this feature, you can rebuild the indexes of a table, or a single index, with commands such as:
REINDEX INDEX ind CONCURRENTLY;
REINDEX TABLE tab CONCURRENTLY;
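For instance, assuming a table and index created like this (the names match the commands above, the definitions themselves are only an example):
  CREATE TABLE tab (id int PRIMARY KEY, val text);
  CREATE INDEX ind ON tab (val);
  REINDEX INDEX ind CONCURRENTLY;
  REINDEX TABLE tab CONCURRENTLY;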
The following restrictions apply:
- REINDEX [ DATABASE | SYSTEM ] cannot be run concurrently.
- REINDEX CONCURRENTLY cannot run inside a transaction block.
- Shared tables cannot be reindexed concurrently.
- Indexes for exclusion constraints cannot be reindexed concurrently.
- Toast relations are reindexed non-concurrently when reindexing a table
that has toast relations.
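For example, running the command inside a transaction block is expected to fail along these lines (the exact error wording is indicative only):
  BEGIN;
  REINDEX INDEX ind CONCURRENTLY;
  -- ERROR:  REINDEX CONCURRENTLY cannot run inside a transaction block
  ROLLBACK;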
Here is a description of what happens when reindexing an index concurrently
(the beginning of the process is similar to CREATE INDEX CONCURRENTLY):
1) Create a new index based on the same columns and restrictions as the
index being rebuilt (called here the old index). The new index is named
$OLDINDEX_cct, so only a suffix _cct is added. It is marked as invalid and not ready.
2) Take session locks on old and new index(es), and the parent table to prevent
unfortunate drops.
3) Commit and start a new transaction
4) Wait until no running transactions could have the table open with the old list of indexes.
5) Build the new indexes. All the new indexes are marked as indisready.
6) Commit and start a new transaction
7) Wait until no running transactions could have the table open with the old list of indexes.
8) Take a reference snapshot and validate the new indexes
9) Wait for the old snapshots based on the reference snapshot
10) Mark the new indexes as indisvalid
11) Commit and start a new transaction. At this point the old and new indexes are both valid
12) Take a new reference snapshot and wait for the old snapshots, to ensure that the old
indexes are not corrupted
13) Mark the old indexes as invalid
14) Swap the new and old indexes; here this consists of switching their names
15) Old indexes are marked as invalid
16) Commit and start a new transaction
17) Wait for transactions that might use the old indexes
18) Old indexes are marked as not ready
19) Commit and start a new transaction
20) Drop the old indexes
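At each step, the state of the old and new indexes can be followed through the existing pg_index flags, for example with:
  SELECT indexrelid::regclass AS index_name, indisvalid, indisready
  FROM pg_index
  WHERE indrelid = 'tab'::regclass;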
The process above might be reducible, but I would like that to be decided based on
community feedback and experience with such concurrent features.
For the time being I took an approach that looks slower, but which is, to my mind, safer, with multiple
waits (perhaps sometimes unnecessary?) and subtransactions.
If an error occurs during the process, the table will end up with either the old or the new index
marked as invalid. In this case the user is in charge of dropping the invalid index himself.
The concurrent index can easily be identified by its suffix *_cct.
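For example, a leftover invalid index can be found with a simple catalog query:
  SELECT indexrelid::regclass AS invalid_index
  FROM pg_index
  WHERE NOT indisvalid;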
This patch required some refactoring effort, as I noticed that the index code
for concurrent operations was not very generic. To address that, I created some
new functions in index.c, called index_concurrent_*, which are used by CREATE INDEX
and REINDEX in my patch. Some refactoring has also been done around the wait phases.
REINDEX TABLE and REINDEX INDEX follow the same code path (ReindexConcurrentIndexes
in indexcmds.c). The patch relies as much as possible on the functions of index.c
when creating, building and validating the concurrent indexes.
Based on the comments in this thread, I would like to submit the patch to the next
commitfest. Just let me know if the approach taken by the current implementation
is OK or if it needs some modifications. That would be really helpful.
The patch includes some regression tests for error checks and also some documentation.
The regression tests pass, and the code has no trailing whitespace and no compilation warnings.
I have also tested read and write operations using index scans on the parent table
at each step of the process (by using gdb to stop the reindex process at precise places).
Thanks, and looking forward to your feedback,
--
Michael Paquier
http://michael.otacoo.com
On 3 October 2012 02:14, Michael Paquier <michael.paquier@gmail.com> wrote:

> Well, I spent some spare time working on the implementation of REINDEX
> CONCURRENTLY.

Thanks

> The following restrictions are applied.
> - REINDEX [ DATABASE | SYSTEM ] cannot be run concurrently.

Fair enough

> - indexes for exclusion constraints cannot be reindexed concurrently.
> - toast relations are reindexed non-concurrently when table reindex is done
> and that this table has toast relations

Those restrictions are important ones to resolve since they prevent the
CONCURRENTLY word from being true in a large proportion of cases.

We need to be clear that the remainder of this can be done in user space
already, so the proposal doesn't move us forwards very far, except in terms
of packaging. IMHO this needs to be more than just moving a useful script
into core.

> Here is a description of what happens when reorganizing an index
> concurrently

There are four waits for every index, again similar to what is possible in
user space.

When we refactor that, I would like to break things down into N discrete
steps, if possible. Each time we hit a wait barrier, a top-level process
would be able to switch to another task to avoid waiting. This would then
allow us to proceed more quickly through the task. I would admit that is a
later optimisation, but it would be useful to have the innards refactored to
allow for that more easily later. I'd accept "not yet", if doing that becomes
a problem in the short term.

--
Simon Riggs
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
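(For reference, the user-space equivalent alluded to here is roughly the following sequence; the index and column names are placeholders, and it does not cover indexes backing constraints:)
  CREATE INDEX CONCURRENTLY ind_new ON tab (val);  -- build the replacement without blocking writes
  DROP INDEX ind;                                  -- brief exclusive lock on the old index
  ALTER INDEX ind_new RENAME TO ind;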
Hi,

On Wednesday, October 03, 2012 03:14:17 AM Michael Paquier wrote:
> One of the outputs on the discussions about the integration of pg_reorg in
> core was that Postgres should provide some ways to do REINDEX, CLUSTER and
> ALTER TABLE concurrently with low-level locks in a way similar to CREATE
> INDEX CONCURRENTLY.
>
> The discussions done can be found on this thread:
> http://archives.postgresql.org/pgsql-hackers/2012-09/msg00746.php
>
> Well, I spent some spare time working on the implementation of REINDEX
> CONCURRENTLY.

Very cool!

> The following restrictions are applied.
> - REINDEX [ DATABASE | SYSTEM ] cannot be run concurrently.

I would like to support something like REINDEX USER TABLES; or similar at
some point, but that very well can be a second phase.

> - REINDEX CONCURRENTLY cannot run inside a transaction block.
> - toast relations are reindexed non-concurrently when table reindex is done
> and that this table has toast relations

Why that restriction?

> Here is a description of what happens when reorganizing an index
> concurrently
> (the beginning of the process is similar to CREATE INDEX CONCURRENTLY):
> 1) creation of a new index based on the same columns and restrictions as
> the index that is rebuilt (called here old index). This new index has as
> name $OLDINDEX_cct. So only a suffix _cct is added. It is marked as
> invalid and not ready.

You probably should take a SHARE UPDATE EXCLUSIVE lock on the table at that
point already, to prevent schema changes.

> 8) Take a reference snapshot and validate the new indexes

Hm. Unless you factor in corrupt indices, why should this be needed?

> 14) Swap new and old indexes, consisting here in switching their names.

I think switching based on their names is not going to work very well because
indexes are referenced by oid at several places. Swapping pg_index.indexrelid
or pg_class.relfilenode seems to be the better choice to me. We expect
relfilenode changes for such commands, but not ::regclass oid changes.
Such a behaviour would at least be complicated for pg_depend and
pg_constraint.

> If during the process an error occurs, the table will finish with either
> the old or new index as invalid. In this case the user will be in charge to
> drop the invalid index himself.
> The concurrent index can be easily identified with its suffix *_cct.

I am not really happy about relying on some arbitrary naming here. That still
can result in conflicts and such.

> This patch has required some refactorisation effort as I noticed that the
> code of index for concurrent operations was not very generic. In order to do
> that, I created some new functions in index.c called index_concurrent_*
> which are used by CREATE INDEX and REINDEX in my patch. Some refactoring has
> also been done regarding the wait processes.
> REINDEX TABLE and REINDEX INDEX follow the same code path
> (ReindexConcurrentIndexes in indexcmds.c). The patch structure is relying a
> maximum on the functions of index.c when creating, building and validating
> concurrent index.

I haven't looked at the patch yet, but I was pretty sure that you would need
to do quite some refactoring to implement this and this looks like roughly
the right direction...

> Thanks, and looking forward to your feedback,

I am very happy that you're taking this on!

Greetings,

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Oct 3, 2012 at 5:10 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> The following restrictions are applied.
>> - REINDEX [ DATABASE | SYSTEM ] cannot be run concurrently.
> I would like to support something like REINDEX USER TABLES; or similar at
> some point, but that very well can be a second phase.

This is something out of scope for the time being, honestly. Later? Why not...

>> - REINDEX CONCURRENTLY cannot run inside a transaction block.
>> - toast relations are reindexed non-concurrently when table reindex is done
>> and that this table has toast relations
> Why that restriction?

This is the state of the current version of the patch, and not what the final
version should do. I agree that toast relations should also be reindexed
concurrently like the others. Regarding this current restriction, my point was
just to get some feedback before digging deeper. I should have said that,
though...

>> 1) creation of a new index based on the same columns and restrictions as
>> the index that is rebuilt (called here old index). This new index has as
>> name $OLDINDEX_cct. So only a suffix _cct is added. It is marked as
>> invalid and not ready.
> You probably should take a SHARE UPDATE EXCLUSIVE lock on the table at that
> point already, to prevent schema changes.

>> 8) Take a reference snapshot and validate the new indexes
> Hm. Unless you factor in corrupt indices, why should this be needed?

>> 14) Swap new and old indexes, consisting here in switching their names.
> I think switching based on their names is not going to work very well because
> indexes are referenced by oid at several places. Swapping pg_index.indexrelid
> or pg_class.relfilenode seems to be the better choice to me. We expect
> relfilenode changes for such commands, but not ::regclass oid changes.
> Such a behaviour would at least be complicated for pg_depend and
> pg_constraint.

OK, so you mean to create an index, then switch only the relfilenode. Why not;
this is largely doable. I think that what is important here is to choose one
way of doing it and keep it until the end.

>> The concurrent index can be easily identified with its suffix *_cct.
> I am not really happy about relying on some arbitrary naming here. That still
> can result in conflicts and such.

The concurrent names are generated automatically with a function in
indexcmds.c, the same way as pkey indexes. Let's imagine that the REINDEX
CONCURRENTLY command is run twice after a failure: the second concurrent index
will not have _cct as suffix but _cct1. However, I am open to more ideas here.
What I feel about the concurrent index is that it needs a pg_class entry, even
if it is just temporary, and this entry needs a name.

> I haven't looked at the patch yet, but I was pretty sure that you would need
> to do quite some refactoring to implement this and this looks like roughly
> the right direction...

Thanks for spending time on it.
--
Michael Paquier
http://michael.otacoo.com
On 3 October 2012 09:10, Andres Freund <andres@2ndquadrant.com> wrote:
>> The following restrictions are applied.
>> - REINDEX [ DATABASE | SYSTEM ] cannot be run concurrently.
> I would like to support something like REINDEX USER TABLES; or similar at
> some point, but that very well can be a second phase.

Yes, that would be a nice feature anyway, even without concurrently.

--
Simon Riggs
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Just for background. The showstopper for REINDEX concurrently was not that it
was particularly hard to actually do the reindexing. But it's not obvious how
to obtain a lock on both the old and new index without creating a deadlock
risk. I don't remember exactly where the deadlock risk lies, but there are two
indexes to lock, and whichever order you obtain the locks it might be possible
for someone else to be waiting to obtain them in the opposite order.

I'm sure it's possible to solve the problem. But the footwork needed to
release locks then reobtain them in the right order and verify that the index
hasn't changed out from under you might be a lot of headache.

Perhaps a good way to tackle it is to have a generic "verify two indexes are
equivalent and swap the underlying relfilenodes" operation that can be called
from both regular reindex and reindex concurrently. As long as it's the only
function that ever locks two indexes then it can just determine what locking
discipline it wants to use.

--
greg
On Wednesday, October 03, 2012 12:59:25 PM Greg Stark wrote:
> Just for background. The showstopper for REINDEX concurrently was not
> that it was particularly hard to actually do the reindexing. But it's
> not obvious how to obtain a lock on both the old and new index without
> creating a deadlock risk. I don't remember exactly where the deadlock
> risk lies but there are two indexes to lock and whichever order you
> obtain the locks it might be possible for someone else to be waiting
> to obtain them in the opposite order.
>
> I'm sure it's possible to solve the problem. But the footwork needed
> to release locks then reobtain them in the right order and verify that
> the index hasn't changed out from under you might be a lot of
> headache.

Maybe I am missing something here, but reindex concurrently should do
1) BEGIN
2) Lock table in share update exclusive
3) lock old index
3) create new index
4) obtain session locks on table, old index, new index
5) commit
6) process till newindex->indisready (no new locks)
7) process till newindex->indisvalid (no new locks)
8) process till !oldindex->indisvalid (no new locks)
9) process till !oldindex->indisready (no new locks)
10) drop all session locks
11) lock old index exclusively, which should be "invisible" now
12) drop old index

I don't see where the deadlock danger is hidden in that?
I didn't find anything relevant in a quick search of the archives...

Greetings,

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Oct 3, 2012 at 8:08 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On Wednesday, October 03, 2012 12:59:25 PM Greg Stark wrote:
>> Just for background. The showstopper for REINDEX concurrently was not
>> that it was particularly hard to actually do the reindexing. But it's
>> not obvious how to obtain a lock on both the old and new index without
>> creating a deadlock risk. I don't remember exactly where the deadlock
>> risk lies but there are two indexes to lock and whichever order you
>> obtain the locks it might be possible for someone else to be waiting
>> to obtain them in the opposite order.
>>
>> I'm sure it's possible to solve the problem. But the footwork needed
>> to release locks then reobtain them in the right order and verify that
>> the index hasn't changed out from under you might be a lot of
>> headache.
> Maybe I am missing something here, but reindex concurrently should do
> 1) BEGIN
> 2) Lock table in share update exclusive
> 3) lock old index
> 3) create new index
> 4) obtain session locks on table, old index, new index
> 5) commit

Build new index.

> 6) process till newindex->indisready (no new locks)

Validate new index.

> 7) process till newindex->indisvalid (no new locks)

You forgot the swap of old and new indexes here.

> 8) process till !oldindex->indisvalid (no new locks)
> 9) process till !oldindex->indisready (no new locks)
> 10) drop all session locks
> 11) lock old index exclusively which should be "invisible" now
> 12) drop old index

The code I sent already does that, more or less, btw. Just that it can be
simplified further...

> I don't see where the deadlock danger is hidden in that?
> I didn't find anything relevant in a quick search of the archives...

About the deadlock issues, do you mean the case where 2 sessions are running
REINDEX and/or REINDEX CONCURRENTLY on the same table or index in parallel?

--
Michael Paquier
http://michael.otacoo.com
On Wednesday, October 03, 2012 01:15:27 PM Michael Paquier wrote:
> On Wed, Oct 3, 2012 at 8:08 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> Maybe I am missing something here, but reindex concurrently should do
>> 1) BEGIN
>> 12) drop old index
> The code I sent already does that more or less btw. Just that it can be
> more simplified...

The above just tried to describe the stuff that's relevant for locking; maybe
I wasn't clear enough on that ;)

>> I don't see where the deadlock danger is hidden in that?
>> I didn't find anything relevant in a quick search of the archives...
> About the deadlock issues, do you mean the case where 2 sessions are
> running REINDEX and/or REINDEX CONCURRENTLY on the same table or index in
> parallel?

No idea. The bit about deadlocks originally came from Greg, not me ;)

I guess it's more the interaction with normal sessions, because the locking
used (SHARE UPDATE EXCLUSIVE) prevents another CONCURRENT action running at
the same time. I don't really see the danger there though, because we should
never need to acquire locks that we don't already have, except the final
AccessExclusiveLock, but that's after we dropped other locks and after the
index is made unusable.

Greetings,

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes:
> Maybe I am missing something here, but reindex concurrently should do
> 1) BEGIN
> 2) Lock table in share update exclusive
> 3) lock old index
> 3) create new index
> 4) obtain session locks on table, old index, new index
> 5) commit
> 6) process till newindex->indisready (no new locks)
> 7) process till newindex->indisvalid (no new locks)
> 8) process till !oldindex->indisvalid (no new locks)
> 9) process till !oldindex->indisready (no new locks)
> 10) drop all session locks
> 11) lock old index exclusively which should be "invisible" now
> 12) drop old index

You can't drop the session locks until you're done. Consider somebody
else trying to do a DROP TABLE between steps 10 and 11, for instance.

regards, tom lane
On Wednesday, October 03, 2012 04:28:59 PM Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
>> 10) drop all session locks
>> 11) lock old index exclusively which should be "invisible" now
>> 12) drop old index
>
> You can't drop the session locks until you're done. Consider somebody
> else trying to do a DROP TABLE between steps 10 and 11, for instance.

Yea, the session lock on the table itself probably shouldn't be dropped. If
we're holding only that one, there shouldn't be any additional deadlock
dangers when dropping the index due to lock upgrades, as we're doing the
normal dance any DROP INDEX does. They seem pretty unlikely on a !valid
!ready index anyway.

Greetings,

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2012/10/03, at 23:52, Andres Freund <andres@2ndquadrant.com> wrote:
> On Wednesday, October 03, 2012 04:28:59 PM Tom Lane wrote:
>> You can't drop the session locks until you're done. Consider somebody
>> else trying to do a DROP TABLE between steps 10 and 11, for instance.
> Yea, the session lock on the table itself probably shouldn't be dropped. If
> we're holding only that one there shouldn't be any additional deadlock
> dangers when dropping the index due to lock upgrades as we're doing the
> normal dance any DROP INDEX does. They seem pretty unlikely in a !valid
> !ready table

Just a note... My patch drops the locks on the parent table and indexes at
the end of the process, after dropping the old indexes ;)

Michael
On Wednesday, October 03, 2012 10:12:58 PM Michael Paquier wrote:
> Just a note...
> My patch drops the locks on parent table and indexes at the end of process,
> after dropping the old indexes ;)

I think that might result in deadlocks with concurrent sessions in some
circumstances if those other sessions already have a lower level lock on the
index. That's why I think dropping the lock on the index and then reacquiring
an access exclusive might be necessary.
It's not a too likely scenario, but why not do it right if it's just 3
lines...

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2012/10/04, at 5:41, Andres Freund <andres@2ndquadrant.com> wrote:
> I think that might result in deadlocks with concurrent sessions in some
> circumstances if those other sessions already have a lower level lock on
> the index. That's why I think dropping the lock on the index and then
> reacquiring an access exclusive might be necessary.
> It's not a too likely scenario, but why not do it right if it's just 3
> lines...

Tom is right. This scenario does not cover the case where, from a different
session, you drop the parent table or you drop the index (which is indeed
invisible, but still has a pg_class and a pg_index entry) after step 10 and
before step 11. So you cannot drop the locks on the indexes either until you
are done at step 12.
On Wednesday, October 03, 2012 11:42:25 PM Michael Paquier wrote:
> Tom is right. This scenario does not cover the case where you drop the
> parent table or you drop the index, which is indeed invisible, but still
> has a pg_class and a pg_index entry, from a different session after step
> 10 and before step 11. So you cannot either drop the locks on indexes
> until you are done at step 12.

Yep:

> Yea, the session lock on the table itself probably shouldn't be dropped.

But that does *not* mean you cannot avoid lock upgrade issues by dropping the
lower level lock on the index first and only then acquiring the access
exclusive lock. Note that dropping an index always includes *first* getting a
lock on the table, so doing it that way is safe and just the same as a normal
DROP INDEX.

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Oct 3, 2012 at 5:10 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> 14) Swap new and old indexes, consisting here in switching their names.
> I think switching based on their names is not going to work very well because
> indexes are referenced by oid at several places. Swapping pg_index.indexrelid
> or pg_class.relfilenode seems to be the better choice to me. We expect
> relfilenode changes for such commands, but not ::regclass oid changes.

OK, if there is a choice to be made, switching the relfilenode would be a
better choice as it points to the physical storage itself. It looks more
straightforward than switching OIDs, and makes the switch at the root.

Btw, there is still something I wanted to clarify. You mention in your ideas
"old" and "new" indexes, as in: we create a new index at the beginning and
drop the old one at the end. This is not completely true in the case of
switching relfilenodes. What happens is that we create a new index with new
physical storage, then at the swap step we switch the old storage and the new
storage. Once the swap is done, the index that needs to be set as invalid and
not ready is not the old index, but the index created at the beginning of the
process, which now has the old relfilenode. The relation that is dropped at
the end of the process is likewise the index with the old relfilenode, so the
index that was created at the beginning of the process. I understand that
this is playing with words, but I just wanted to confirm that we are on the
same page.

--
Michael Paquier
http://michael.otacoo.com
Michael Paquier <michael.paquier@gmail.com> writes:
> On Wed, Oct 3, 2012 at 5:10 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>>> 14) Swap new and old indexes, consisting here in switching their names.
>> I think switching based on their names is not going to work very well
>> because indexes are referenced by oid at several places. Swapping
>> pg_index.indexrelid or pg_class.relfilenode seems to be the better choice
>> to me. We expect relfilenode changes for such commands, but not ::regclass
>> oid changes.
> OK, if there is a choice to be made, switching relfilenode would be a
> better choice as it points to the physical storage itself. It looks more
> straight-forward than switching oids, and takes the switch at the root.

Andres is quite right that "switch by name" is out of the question ---
for the most part, the system pays no attention to index names at all.
It just gets a list of the OIDs of indexes belonging to a table and
works with that.

However, I'm pretty suspicious of the idea of switching relfilenodes as
well. You generally can't change the relfilenode of a relation (either
a table or an index) without taking an exclusive lock on it, because
changing the relfilenode *will* break any concurrent operations on the
index. And there is not anyplace in the proposed sequence where it's
okay to have exclusive lock on both indexes, at least not if the goal
is to not block concurrent updates at any time.

I think what you'd have to do is drop the old index (relying on the
assumption that no one is accessing it anymore after a certain point, so
you can take exclusive lock on it now) and then rename the new index
to have the old index's name. However, renaming an index without
exclusive lock on it still seems a bit risky. Moreover, what if you
crash right after committing the drop of the old index?

I'm really not convinced that we have a bulletproof solution yet,
at least not if you insist on the replacement index having the same name
as the original. How badly do we need that?

regards, tom lane
On 2012/10/04, at 10:00, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Andres is quite right that "switch by name" is out of the question ---
> for the most part, the system pays no attention to index names at all.
> It just gets a list of the OIDs of indexes belonging to a table and
> works with that.

Sure. The switch being done by changing the index name is just the direction
taken by the first version of the patch, and only that. I just wrote this
version without really looking for a bulletproof solution, but only to have
something to discuss.

> However, I'm pretty suspicious of the idea of switching relfilenodes as
> well. You generally can't change the relfilenode of a relation (either
> a table or an index) without taking an exclusive lock on it, because
> changing the relfilenode *will* break any concurrent operations on the
> index. And there is not anyplace in the proposed sequence where it's
> okay to have exclusive lock on both indexes, at least not if the goal
> is to not block concurrent updates at any time.

OK. As the goal is to allow concurrent operations, this is not reliable
either. So what remains is the method of switching the OIDs of the old and
new indexes in pg_index? Any other candidates?

> I think what you'd have to do is drop the old index (relying on the
> assumption that no one is accessing it anymore after a certain point, so
> you can take exclusive lock on it now) and then rename the new index
> to have the old index's name. However, renaming an index without
> exclusive lock on it still seems a bit risky. Moreover, what if you
> crash right after committing the drop of the old index?
>
> I'm really not convinced that we have a bulletproof solution yet,
> at least not if you insist on the replacement index having the same name
> as the original. How badly do we need that?

And we do not really need that, as I am not insisting on the method that
switches indexes by changing names. I am open to a reliable and robust
method, and I hope this method can be decided in this thread.

Thanks for those arguments, I feel they are really leading the discussion in
a good direction.

Michael
On Thu, Oct 4, 2012 at 2:19 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
>> I think what you'd have to do is drop the old index (relying on the
>> assumption that no one is accessing it anymore after a certain point, so
>> you can take exclusive lock on it now) and then rename the new index
>> to have the old index's name. However, renaming an index without
>> exclusive lock on it still seems a bit risky. Moreover, what if you
>> crash right after committing the drop of the old index?

I think this would require a new state which is the converse of indisvalid=f.
Right now there's no state the index can be in that means the index should be
ignored for both scans and maintenance but might have old sessions that might
be using it or maintaining it.

I'm a bit puzzled why we're so afraid of swapping the relfilenodes when
that's what the current REINDEX does. It seems flaky to have two different
mechanisms depending on which mode is being used. It seems more conservative
to use the same mechanism and just figure out what's required to ensure it's
safe in both modes. At least there won't be any bugs from unexpected
consequences that aren't locking related if it's using the same mechanics.

--
greg
Greg Stark <stark@mit.edu> writes:
> I'm a bit puzzled why we're so afraid of swapping the relfilenodes
> when that's what the current REINDEX does.

Swapping the relfilenodes is fine *as long as you have exclusive lock*.
The trick is to make it safe without that. It will definitely not work
to do that without exclusive lock, because at the instant you would try
it, people will be accessing the new index (by OID).

regards, tom lane
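(As an illustration of the OID vs. relfilenode distinction discussed here: a plain REINDEX already assigns a new relfilenode while the index keeps its OID; 'ind' below is just a placeholder name:)
  SELECT oid, relfilenode FROM pg_class WHERE relname = 'ind';
  REINDEX INDEX ind;
  -- same oid, but relfilenode now points at new physical storage
  SELECT oid, relfilenode FROM pg_class WHERE relname = 'ind';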
On Thu, Oct 4, 2012 at 11:51 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Greg Stark <stark@mit.edu> writes:
>> I'm a bit puzzled why we're so afraid of swapping the relfilenodes
>> when that's what the current REINDEX does.
> Swapping the relfilenodes is fine *as long as you have exclusive lock*.
> The trick is to make it safe without that. It will definitely not work
> to do that without exclusive lock, because at the instant you would try
> it, people will be accessing the new index (by OID).

OK, so index swapping could be done by:
1) Index name switch. This is not thought to be safe, as the system does not
pay attention to index names at all.
2) relfilenode switch. An ExclusiveLock is necessary. The lock that would be
taken is not compatible with a concurrent operation, except if we consider
that the lock will only be held for a short time, during the swap moment.
Reindex uses this mechanism, so it would be good for consistency.
3) Switch the OIDs of the indexes. Looks safe from the system perspective; it
will be necessary to invalidate the cache entries for both relations after
the swap. Any opinions on this one?

--
Michael Paquier
http://michael.otacoo.com
On Thursday, October 04, 2012 04:51:29 AM Tom Lane wrote:
> Greg Stark <stark@mit.edu> writes:
>> I'm a bit puzzled why we're so afraid of swapping the relfilenodes
>> when that's what the current REINDEX does.
>
> Swapping the relfilenodes is fine *as long as you have exclusive lock*.
> The trick is to make it safe without that. It will definitely not work
> to do that without exclusive lock, because at the instant you would try
> it, people will be accessing the new index (by OID).

I can understand hesitation around that... I would like to make sure I
understand the problem correctly. When we get to the point where we switch
indexes we should be in the following state:
- both indexes are indisready
- the old index should be invalid
- the new index should be valid
- both have the same indcheckxmin
- both are locked by us, preventing anybody else from making changes

Let's assume we have index a_old (relfilenode 1) as the old index and a
rebuilt index a_new (relfilenode 2) as the one we just built. If we do it
properly, nobody will have 'a' open for querying, just for modifications
(it's indisready), as we had waited for everyone that could have seen a as
valid to finish. As far as I understand the code, a session using a_new will
also have built a relcache entry for a_old.

Two problems:
* relying on the relcache to be built for both indexes seems hinky
* As the relcache is built with SnapshotNow, it could read the old definition
for a_new and the new one for a_old (or the reverse) and thus end up with
both pointing to the same relfilenode. Which would be ungood.

Greetings,

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
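(The pre-swap state described here could be checked with a catalog query along these lines, using the example index names a_old and a_new from the message above:)
  SELECT indexrelid::regclass AS index_name, indisvalid, indisready, indcheckxmin
  FROM pg_index
  WHERE indexrelid IN ('a_old'::regclass, 'a_new'::regclass);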
On Fri, Oct 5, 2012 at 6:58 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> I can understand hesitation around that... I would like to make sure I
> understand the problem correctly. When we get to the point where we switch
> indexes we should be in the following state:
> - both indexes are indisready
> - old should be invalid
> - new index should be valid
> - have the same indcheckxmin
> - be locked by us preventing anybody else from making changes

Looks like a good presentation of the problem. I am not sure if marking the
new index as valid is necessary though. As long as it is done inside the same
transaction as the swap there are no problems, no?

> Lets assume we have index a_old (relfilenode 1) as the old index and a
> rebuilt index a_new (relfilenode 2) as the one we just built. If we do it
> properly nobody will have 'a' open for querying, just for modifications
> (its indisready) as we had waited for everyone that could have seen a as
> valid to finish. As far as I understand the code a session using a_new will
> also have built a relcache entry for a_old.
> Two problems:
> * relying on the relcache to be built for both indexes seems hinky
> * As the relcache is built with SnapshotNow it could read the old definition
> for a_new and the new one for a_old (or the reverse) and thus end up with
> both pointing to the same relfilenode. Which would be ungood.

OK, so the problem here is that the relcache, like the syscache, relies on
SnapshotNow, which cannot be used safely, as a false index definition could
be read by other backends. So this brings the discussion back to the point
where a higher lock level is necessary to perform a safe switch of the
indexes.

I assume that the switch phase is not the longest phase of the concurrent
operation, as you also need to build and validate the new index in prior
steps. I am just wondering if it is acceptable to you guys to take a stronger
lock only during this switch phase. This won't make the reindex concurrent
all the time, but it would avoid any visibility issues and give an index
switch that is more consistent with the existing implementation, as it could
rely on the same relfilenode-switch mechanism as a normal reindex.

--
Michael Paquier
http://michael.otacoo.com
Michael Paquier <michael.paquier@gmail.com> writes:
> OK, so the problem here is that the relcache, as the syscache, are relying
> on SnapshotNow which cannot be used safely as the false index definition
> could be read by other backends.

That's one problem. It's definitely not the only one, if we're trying
to change an index's definition while an index-accessing operation is
in progress.

> I assume that the switch phase is not the longest phase of the concurrent
> operation, as you also need to build and validate the new index at prior
> steps. I am just wondering if it is acceptable to you guys to take a
> stronger lock only during this switch phase.

We might be forced to fall back on such a solution, but it's pretty
undesirable. Even though the exclusive lock would only need to be held
for a short time, it can create a big hiccup in processing. The key
reason is that once the ex-lock request is queued, it blocks ordinary
operations coming in behind it. So effectively it's stopping operations
not just for the length of time the lock is *held*, but for the length
of time it's *awaited*, which could be quite long.

Note that allowing subsequent requests to jump the queue would not be a
good fix for this; if you do that, it's likely the ex-lock will never be
granted, at least not till the next system idle time. Which if you've
got one, you don't need a feature like this at all; you might as well
just reindex normally during your idle time.

regards, tom lane
Tom Lane escribió:
> Note that allowing subsequent requests to jump the queue would not be a
> good fix for this; if you do that, it's likely the ex-lock will never be
> granted, at least not till the next system idle time. Which if you've
> got one, you don't need a feature like this at all; you might as well
> just reindex normally during your idle time.

Not really. The time to run a complete reindex might be several hours.
If the idle time is just a few minutes or seconds long, it may be more
than enough to complete the switch operation, but not to run the
complete reindex.

Maybe another idea is that the reindexing is staged: the user would
first run a command to create the replacement index, and leave both
present until the user runs a second command (which acquires a strong
lock) that executes the switch. Somehow similar to a constraint created
as NOT VALID (which runs without a strong lock) which can be later
validated separately.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
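(For reference, the existing NOT VALID pattern being compared to looks like the following; the table, column and constraint names are placeholders:)
  -- add the constraint without scanning existing rows at creation time
  ALTER TABLE tab ADD CONSTRAINT tab_val_check CHECK (val IS NOT NULL) NOT VALID;
  -- validate it later, as a separate step
  ALTER TABLE tab VALIDATE CONSTRAINT tab_val_check;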
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Maybe another idea is that the reindexing is staged: the user would
> first run a command to create the replacement index, and leave both
> present until the user runs a second command (which acquires a strong
> lock) that executes the switch. Somehow similar to a constraint created
> as NOT VALID (which runs without a strong lock) which can be later
> validated separately.

Yeah. We could consider

CREATE INDEX CONCURRENTLY (already exists)
SWAP INDEXES (requires ex-lock, swaps names and constraint dependencies;
or maybe just implement as swap of relfilenodes?)
DROP INDEX CONCURRENTLY

The last might have some usefulness in its own right, anyway.

regards, tom lane
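(Spelled out, such a staged rebuild might look like the sketch below from the user's side; SWAP INDEXES is purely hypothetical syntax from the message above, and the object names are placeholders:)
  CREATE INDEX CONCURRENTLY ind_new ON tab (val);  -- exists today
  SWAP INDEXES ind, ind_new;   -- hypothetical command, short exclusive lock
  DROP INDEX CONCURRENTLY ind_new;  -- after the swap this name points at the old index's contents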
On Sat, Oct 6, 2012 at 6:14 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> > Maybe another idea is that the reindexing is staged: the user would
> > first run a command to create the replacement index, and leave both
> > present until the user runs a second command (which acquires a strong
> > lock) that executes the switch. Somehow similar to a constraint created
> > as NOT VALID (which runs without a strong lock) which can be later
> > validated separately.
> Yeah. We could consider
> CREATE INDEX CONCURRENTLY (already exists)
> SWAP INDEXES (requires ex-lock, swaps names and constraint dependencies;
> or maybe just implement as swap of relfilenodes?)
> DROP INDEX CONCURRENTLY
OK. That is a different approach and would limit strictly the amount of code necessary for the feature, but I feel that it breaks the nature of CONCURRENTLY which should run without any exclusive locks. The possibility to do that in a single command would perhaps also be better from the user's point of view.
Until now all the approaches investigated (switch of relfilenode, switch of index OID) need to have an exclusive lock because we try to maintain index OID as consistent. In the patch I submitted, the new index created has a different OID than the old index, and simply switches names. So after the REINDEX CONCURRENTLY the OID of index on the table is different, but seen from user the name is the same. Is it acceptable to consider that a reindex concurrently could change the OID of the index rebuild? Is it a Postgres requirement to keep object OIDs consistent across DDL operations?
If the OIDs of the old and new indexes are different, the relcache entries of each index will be completely separated, and this would take care of any visibility problems. pg_reorg, for example, changes the relation OID of the reorganized table after the operation is completed.
Thoughts about that?
--
Michael Paquier
http://michael.otacoo.com
Michael Paquier <michael.paquier@gmail.com> writes: > On Sat, Oct 6, 2012 at 6:14 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> CREATE INDEX CONCURRENTLY (already exists) >> SWAP INDEXES (requires ex-lock, swaps names and constraint dependencies; >> or maybe just implement as swap of relfilenodes?) >> DROP INDEX CONCURRENTLY > OK. That is a different approach and would limit strictly the amount of > code necessary for the feature, but I feel that it breaks the nature of > CONCURRENTLY which should run without any exclusive locks. Hm? The whole point is that the CONCURRENTLY commands don't require exclusive locks. Only the SWAP command would. > Until now all the approaches investigated (switch of relfilenode, switch of > index OID) need to have an exclusive lock because we try to maintain index > OID as consistent. In the patch I submitted, the new index created has a > different OID than the old index, and simply switches names. So after the > REINDEX CONCURRENTLY the OID of index on the table is different, but seen > from user the name is the same. Is it acceptable to consider that a reindex > concurrently could change the OID of the index rebuild? That is not going to work without ex-lock somewhere. If you change the index's OID then you will have to change pg_constraint and pg_depend entries referencing it, and that creates race condition hazards for other processes looking at those catalogs. I'm not convinced that you can even do a rename safely without ex-lock. Basically, any DDL update on an active index is going to be dangerous and probably impossible without lock, IMO. To answer your question, I don't think anyone would object to the index's OID changing if the operation were safe otherwise. But I don't think that allowing that gets us to a safe solution. regards, tom lane
On Sat, Oct 6, 2012 at 8:40 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Michael Paquier <michael.paquier@gmail.com> writes:
> > On Sat, Oct 6, 2012 at 6:14 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > OK. That is a different approach and would limit strictly the amount of
> > code necessary for the feature, but I feel that it breaks the nature of
> > CONCURRENTLY which should run without any exclusive locks.
> Hm? The whole point is that the CONCURRENTLY commands don't require
> exclusive locks. Only the SWAP command would.
Yes, but my point is that it is more user-friendly to have such functionality as a single command.
By having something without exclusive locks, you could use the concurrent APIs to perform a REINDEX automatically in autovacuum, for example.
Also, the possibility to perform concurrent operations entirely without exclusive locks is not a problem limited to REINDEX; there would surely be similar problems if CLUSTER CONCURRENTLY or ALTER TABLE CONCURRENTLY are wanted.
> > Until now all the approaches investigated (switch of relfilenode, switch of
> > index OID) need to have an exclusive lock because we try to maintain index
> > OID as consistent. In the patch I submitted, the new index created has a
> > different OID than the old index, and simply switches names. So after the
> > REINDEX CONCURRENTLY the OID of index on the table is different, but seen
> > from user the name is the same. Is it acceptable to consider that a reindex
> > concurrently could change the OID of the index rebuild?
> That is not going to work without ex-lock somewhere. If you change the
> index's OID then you will have to change pg_constraint and pg_depend
> entries referencing it, and that creates race condition hazards for
> other processes looking at those catalogs. I'm not convinced that you
> can even do a rename safely without ex-lock. Basically, any DDL update
> on an active index is going to be dangerous and probably impossible
> without lock, IMO.
In the current version of the patch, at the beginning of process a new index is created. It is a twin of the index it has to replace, meaning that it copies the dependencies of old index and creates twin entries of the old index even in pg_depend and pg_constraint also if necessary. So the old index and the new index have exactly the same data in catalog, they are completely decoupled, and you do not need to worry about the OID replacements and the visibility consequences.
Knowing that both indexes are completely separate entities, isn't this enough to put the new index in place of the old one with only a low-level lock? In the case of my patch only the names are exchanged, which keeps the user unaware of what is happening in the background. This behaves similarly to pg_reorg, which explains why the OIDs of reorganized tables are changed after being pg_reorg'ed.
> To answer your question, I don't think anyone would object to the
> index's OID changing if the operation were safe otherwise. But I don't
> think that allowing that gets us to a safe solution.
OK thanks.
Michael Paquier
http://michael.otacoo.com
On 10/05/2012 09:03 PM, Tom Lane wrote: > Note that allowing subsequent requests to jump the queue would not be a > good fix for this; if you do that, it's likely the ex-lock will never be > granted, at least not till the next system idle time. Offering that option to the admin sounds like a good thing, since (as Alvaro points out) the build of the replacement index could take considerable time but be done without the lock. Then the swap done in the first quiet period (but without further admin action), and the drop started. One size doesn't fit all. It doesn't need to be the only method. -- Cheers, Jeremy
On 10/5/12 9:57 PM, Michael Paquier wrote: > In the current version of the patch, at the beginning of process a new index is created. It is a twin of the index it has to replace, meaning that it copies the dependencies of old index and creates twin entries of the old index even in pg_depend and pg_constraint also if necessary. So the old index and the new index have exactly the same data in catalog, they are completely decoupled, and you do not need to worry about the OID replacements and the visibility consequences. Yeah, what's the risk to renaming an index during concurrent access? The only thing I can think of is an "old" backend referring to the wrong index name in an elog. That's certainly not great, but could possibly be dealt with. Are there any other things that are directly tied to the name of an index (or of any object for that matter)? -- Jim C. Nasby, Database Architect jim@nasby.net 512.569.9461 (cell) http://jim.nasby.net
On Monday, October 08, 2012 11:57:46 PM Jim Nasby wrote: > On 10/5/12 9:57 PM, Michael Paquier wrote: > > In the current version of the patch, at the beginning of process a new > > index is created. It is a twin of the index it has to replace, meaning > > that it copies the dependencies of old index and creates twin entries of > > the old index even in pg_depend and pg_constraint also if necessary. So > > the old index and the new index have exactly the same data in catalog, > > they are completely decoupled, and you do not need to worry about the > > OID replacements and the visibility consequences. > > Yeah, what's the risk to renaming an index during concurrent access? The > only thing I can think of is an "old" backend referring to the wrong index > name in an elog. That's certainly not great, but could possibly be dealt > with. We cannot have two indexes with the same oid in the catalog, so the two different names will have to have different oids. Unfortunately the index's oid is referred to by other tables (e.g. pg_constraint), so renaming the indexes while differing in the oid isn't really helpful :(... Right now I don't see anything that would make switching oids easier than relfilenodes. Andres -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
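One way to see the dependency Andres is referring to is to query pg_constraint, which records the backing index by OID (the table name is a placeholder):
SELECT conname, contype, conindid::regclass AS backing_index
FROM pg_constraint
WHERE conrelid = 'some_table'::regclass
  AND conindid <> 0;   -- only constraints enforced by an index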
Jim Nasby <jim@nasby.net> writes: > Yeah, what's the risk to renaming an index during concurrent access? SnapshotNow searches for the pg_class row could get broken by *any* transactional update of that row, whether it's for a change of relname or some other field. A lot of these problems would go away if we rejiggered the definition of SnapshotNow to be more like MVCC. We have discussed that in the past, but IIRC it's not exactly a simple or risk-free change in itself. Still, maybe we should start thinking about doing that instead of trying to make REINDEX CONCURRENTLY safe given the existing infrastructure. regards, tom lane
On 10/8/12 5:08 PM, Andres Freund wrote: > On Monday, October 08, 2012 11:57:46 PM Jim Nasby wrote: >> >On 10/5/12 9:57 PM, Michael Paquier wrote: >>> > >In the current version of the patch, at the beginning of process a new >>> > >index is created. It is a twin of the index it has to replace, meaning >>> > >that it copies the dependencies of old index and creates twin entries of >>> > >the old index even in pg_depend and pg_constraint also if necessary. So >>> > >the old index and the new index have exactly the same data in catalog, >>> > >they are completely decoupled, and you do not need to worry about the >>> > >OID replacements and the visibility consequences. >> > >> >Yeah, what's the risk to renaming an index during concurrent access? The >> >only thing I can think of is an "old" backend referring to the wrong index >> >name in an elog. That's certainly not great, but could possibly be dealt >> >with. > We cannot have two indexes with the same oid in the catalog, so the two > different names will have to have different oids. Unfortunately the indexes oid > is referred to by other tables (e.g. pg_constraint), so renaming the indexes > while differering in the oid isn't really helpful :(... Hrm... the claim was made that everything relating to the index, including pg_depend and pg_constraint, got duplicated. But I don't know how you could duplicate a constraint without also playing name games. Perhaps name games are being played there as well... > Right now I don't see anything that would make switching oids easier than > relfilenodes. Yeah... in order to make either of those schemes work I think there would need to be non-trivial internal changes so that we weren't just passing around raw OIDs/filenodes. BTW, it occurs to me that this problem might be easier to deal with if we had support for accessing the catalog with the same snapshot as the main query was using... IIRC that's been discussed in the past for other issues. -- Jim C. Nasby, Database Architect jim@nasby.net 512.569.9461 (cell) http://jim.nasby.net
On 10/8/12 6:12 PM, Tom Lane wrote: > Jim Nasby <jim@nasby.net> writes: >> Yeah, what's the risk to renaming an index during concurrent access? > > SnapshotNow searches for the pg_class row could get broken by *any* > transactional update of that row, whether it's for a change of relname > or some other field. > > A lot of these problems would go away if we rejiggered the definition of > SnapshotNow to be more like MVCC. We have discussed that in the past, > but IIRC it's not exactly a simple or risk-free change in itself. > Still, maybe we should start thinking about doing that instead of trying > to make REINDEX CONCURRENTLY safe given the existing infrastructure. Yeah, I was just trying to remember what other situations this has come up in. My recollection is that there's been a couple other cases where that would be useful. My recollection is also that such a change would be rather large... but it might be smaller than all the other work-arounds that are needed because we don't have that... -- Jim C. Nasby, Database Architect jim@nasby.net 512.569.9461 (cell) http://jim.nasby.net
On Tue, Oct 9, 2012 at 8:12 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Jim Nasby <jim@nasby.net> writes:
> > Yeah, what's the risk to renaming an index during concurrent access?
> SnapshotNow searches for the pg_class row could get broken by *any*
> transactional update of that row, whether it's for a change of relname
> or some other field.
Does that include updates of relation names in pg_class, or of the ready and valid flags in pg_index? Tables refer to their indexes by OID only, so if the index and its concurrent twin are completely separate entries in pg_index, pg_constraint and pg_class, what is the problem?
Or is it that the Relation fetched from the system cache might become inconsistent because of SnapshotNow?
> A lot of these problems would go away if we rejiggered the definition of
> SnapshotNow to be more like MVCC. We have discussed that in the past,
> but IIRC it's not exactly a simple or risk-free change in itself.
> Still, maybe we should start thinking about doing that instead of trying
> to make REINDEX CONCURRENTLY safe given the existing infrastructure.
+1. This is something to dig into if operations like an OID switch are envisaged for concurrent operations. This does not concern only REINDEX; things like CLUSTER or ALTER TABLE would need something similar.
Michael Paquier
http://michael.otacoo.com
On Tue, Oct 9, 2012 at 8:14 AM, Jim Nasby <jim@nasby.net> wrote:
> Hrm... the claim was made that everything relating to the index, including pg_depend and pg_constraint, got duplicated. But I don't know how you could duplicate a constraint without also playing name games. Perhaps name games are being played there as well...
Yes, that is what was originally intended. Please note that the pg_constraint entry was not duplicated correctly in the first version of the patch because of a bug I have already fixed.
I will provide another version soon if necessary.
> > Right now I don't see anything that would make switching oids easier than
> > relfilenodes.
> Yeah... in order to make either of those schemes work I think there would need to be non-trivial internal changes so that we weren't just passing around raw OIDs/filenodes.
> BTW, it occurs to me that this problem might be easier to deal with if we had support for accessing the catalog with the same snapshot as the main query was using... IIRC that's been discussed in the past for other issues.
Yes, it would be better and helpful to have such a mechanism even for other operations.
Michael Paquier
http://michael.otacoo.com
* Jim Nasby (jim@nasby.net) wrote: > Yeah, I was just trying to remember what other situations this has come up in. My recollection is that there's been a couple other cases where that would be useful. Yes, I've run into similar issues in the past also. It'd be really neat to somehow make the SnapshotNow (and I'm guessing the whole SysCache system) behave more like MVCC. > My recollection is also that such a change would be rather large... but it might be smaller than all the other work-arounds that are needed because we don't have that... Perhaps.. Seems like it'd be a lot of work tho, to do it 'right', and I suspect there's a lot of skeletons out there that we'd run into.. Thanks, Stephen
Hi all,
Please find attached the version 2 of the patch for this feature, it corrects the following things:
- toast relations are now rebuilt concurrently as well as other indexes
- concurrent constraint indexes (PRIMARY KEY, UNIQUE, EXCLUSION) are dropped correctly at the end of process
- exclusion constraints are supported, at least it looks to work correctly.
- Fixed a couple of bugs when constraint indexes were involved in process.
I am adding this version to the commit fest of next month for review.
Regards,
--
Michael Paquier
http://michael.otacoo.com
Hi all,
Long time this thread has not been updated...
Please find attached the version 3 of the patch for support of REINDEX CONCURRENTLY.
The code has been realigned with master up to commit da07a1e (6th December).
Here are the things modified:
- Improve code to use index_set_state_flag introduced by Tom in commit 3c84046
- One transaction is used for each index swap (N transactions if N indexes reindexed at the same time)
- Fixed a bug to drop the old indexes concurrently at the end of process
The index swap is managed by switching the names of the new and old indexes using RenameRelationInternal several times. This API takes an exclusive lock on the relation that is renamed until the end of the transaction managing the swap. This has been discussed in this thread and other threads, but it is important to mention it for people who have not read the patch.
There are still two things that are missing in this patch, but I would like to have more feedback before moving forward:
- REINDEX CONCURRENTLY needs tests in src/test/isolation
- There is still a problem with toast indexes. If the concurrent reindex of a toast index fails for a reason or another, pg_relation will finish with invalid toast index entries. I am still wondering about how to clean up that. Any ideas?
Comments?
--
Michael Paquier
http://michael.otacoo.com
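To picture the swap step described in the message above, here is a rough SQL-level equivalent of the rename dance (index names are placeholders; the patch does this internally through RenameRelationInternal rather than through ALTER INDEX):
BEGIN;
ALTER INDEX idx_foo RENAME TO idx_foo_tmp;       -- old index moved out of the way
ALTER INDEX idx_foo_cct RENAME TO idx_foo;       -- new index takes the original name
ALTER INDEX idx_foo_tmp RENAME TO idx_foo_cct;   -- old index now carries the _cct suffix
COMMIT;   -- the exclusive locks taken by the renames are held until here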
On 7 December 2012 12:37, Michael Paquier <michael.paquier@gmail.com> wrote: > There are still two things that are missing in this patch, but I would like > to have more feedback before moving forward: > - REINDEX CONCURRENTLY needs tests in src/test/isolation Yes, it needs those > - There is still a problem with toast indexes. If the concurrent reindex of > a toast index fails for a reason or another, pg_relation will finish with > invalid toast index entries. I am still wondering about how to clean up > that. Any ideas? Build another toast index, rather than reindexing the existing one, then just use the new oid. -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 2012-12-07 21:37:06 +0900, Michael Paquier wrote: > Hi all, > > Long time this thread has not been updated... > Please find attached the version 3 of the patch for support of REINDEX > CONCURRENTLY. > The code has been realigned with master up to commit da07a1e (6th December). > > Here are the things modified: > - Improve code to use index_set_state_flag introduced by Tom in commit > 3c84046 > - One transaction is used for each index swap (N transactions if N indexes > reindexed at the same time) > - Fixed a bug to drop the old indexes concurrently at the end of process > > The index swap is managed by switching the names of the new and old indexes > using RenameRelationInternal several times. This API takes an exclusive > lock on the relation that is renamed until the end of the transaction > managing the swap. This has been discussed in this thread and other > threads, but it is important to mention it for people who have not read the > patch. Won't working like this cause problems when dependencies towards that index exist? E.g. an index-based constraint? As you have an access exlusive lock you should be able to just switch the relfilenodes of both and concurrently drop the *_cci index with the old relfilenode afterwards, that would preserve the index states. Right now I think clearing checkxmin is all you would need to other than that. We know we don't need it in the concurrent context. Greetings, Andres Freund --Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Simon Riggs <simon@2ndQuadrant.com> writes: > On 7 December 2012 12:37, Michael Paquier <michael.paquier@gmail.com> wrote: >> - There is still a problem with toast indexes. If the concurrent reindex of >> a toast index fails for a reason or another, pg_relation will finish with >> invalid toast index entries. I am still wondering about how to clean up >> that. Any ideas? > Build another toast index, rather than reindexing the existing one, > then just use the new oid. Um, I don't think you can swap in a new toast index OID without taking exclusive lock on the parent table at some point. One sticking point is the need to update pg_class.reltoastidxid. I wonder how badly we need that field though --- could we get rid of it and treat toast-table indexes just the same as normal ones? (Whatever code is looking at the field could perhaps instead rely on RelationGetIndexList.) regards, tom lane
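For context, in the PostgreSQL versions discussed here pg_class.reltoastidxid lives on the toast table's own pg_class row and points at its index; a query such as the following, with a placeholder table name, shows the linkage:
SELECT c.relname                 AS table_name,
       t.relname                 AS toast_table,
       t.reltoastidxid::regclass AS toast_index
FROM pg_class c
JOIN pg_class t ON t.oid = c.reltoastrelid
WHERE c.relname = 'some_table';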
On 2012-12-07 12:01:52 -0500, Tom Lane wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: > > On 7 December 2012 12:37, Michael Paquier <michael.paquier@gmail.com> wrote: > >> - There is still a problem with toast indexes. If the concurrent reindex of > >> a toast index fails for a reason or another, pg_relation will finish with > >> invalid toast index entries. I am still wondering about how to clean up > >> that. Any ideas? > > > Build another toast index, rather than reindexing the existing one, > > then just use the new oid. Thats easier said than done in the first place. toast_save_datum() explicitly opens/modifies the one index it needs and updates it. > Um, I don't think you can swap in a new toast index OID without taking > exclusive lock on the parent table at some point. The whole swapping issue isn't solved satisfyingly as whole yet :(. If we just swap the index relfilenodes in the pg_index entries itself, we wouldn't need to modify the main table's pg_class at all. > One sticking point is the need to update pg_class.reltoastidxid. I > wonder how badly we need that field though --- could we get rid of it > and treat toast-table indexes just the same as normal ones? (Whatever > code is looking at the field could perhaps instead rely on > RelationGetIndexList.) We could probably just set Relation->rd_toastidx when building the relcache entry for the toast table so it doesn't have to search the whole indexlist all the time. Not that that would be too big, but... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 7 December 2012 17:19, Andres Freund <andres@2ndquadrant.com> wrote: > On 2012-12-07 12:01:52 -0500, Tom Lane wrote: >> Simon Riggs <simon@2ndQuadrant.com> writes: >> > On 7 December 2012 12:37, Michael Paquier <michael.paquier@gmail.com> wrote: >> >> - There is still a problem with toast indexes. If the concurrent reindex of >> >> a toast index fails for a reason or another, pg_relation will finish with >> >> invalid toast index entries. I am still wondering about how to clean up >> >> that. Any ideas? >> >> > Build another toast index, rather than reindexing the existing one, >> > then just use the new oid. > > Thats easier said than done in the first place. toast_save_datum() > explicitly opens/modifies the one index it needs and updates it. Well, yeh, I know what I'm saying: it would need to maintain 2 indexes for a while. The point is to use the same trick we do manually now, which works fine for normal indexes and can be made to work for toast indexes also. >> Um, I don't think you can swap in a new toast index OID without taking >> exclusive lock on the parent table at some point. > > The whole swapping issue isn't solved satisfyingly as whole yet :(. > > If we just swap the index relfilenodes in the pg_index entries itself, > we wouldn't need to modify the main table's pg_class at all. yes -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Dec 7, 2012 at 10:33 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 7 December 2012 12:37, Michael Paquier <michael.paquier@gmail.com> wrote:
> > - There is still a problem with toast indexes. If the concurrent reindex of
> > a toast index fails for a reason or another, pg_relation will finish with
> > invalid toast index entries. I am still wondering about how to clean up
> > that. Any ideas?
> Build another toast index, rather than reindexing the existing one,
> then just use the new oid.
Hum? The patch already does that. It concurrently creates a new index which is a duplicate of the existing one, then the old and new indexes are swapped. Finally the old index is dropped concurrently.
The problem I still see is the following one:
If a toast index, or a relation having a toast index, is being reindexed concurrently and the server crashes during the process, there will be invalid toast indexes left in the server. If the crash happens before the swap, the new toast index is invalid. If the crash happens after the swap, the old toast index is invalid.
I am not sure the user is able to clean up such invalid toast indexes manually as they are not visible to him.
Michael Paquier
http://michael.otacoo.com
On Sat, Dec 8, 2012 at 2:19 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2012-12-07 12:01:52 -0500, Tom Lane wrote:
> > Simon Riggs <simon@2ndQuadrant.com> writes:
> > > On 7 December 2012 12:37, Michael Paquier <michael.paquier@gmail.com> wrote:
> > >> - There is still a problem with toast indexes. If the concurrent reindex of
> > >> a toast index fails for a reason or another, pg_relation will finish with
> > >> invalid toast index entries. I am still wondering about how to clean up
> > >> that. Any ideas?
> > > Build another toast index, rather than reindexing the existing one,
> > > then just use the new oid.
> Thats easier said than done in the first place. toast_save_datum()
> explicitly opens/modifies the one index it needs and updates it.
> > Um, I don't think you can swap in a new toast index OID without taking
> > exclusive lock on the parent table at some point.
> The whole swapping issue isn't solved satisfyingly as whole yet :(.
> If we just swap the index relfilenodes in the pg_index entries itself,
> we wouldn't need to modify the main table's pg_class at all.
I think you are mistaken here: relfilenode is a column of pg_class, not of pg_index.
So whatever the method used for swapping, relfilenode switch or relname switch, you need to modify the pg_class entries of the old and new indexes.
--
Michael Paquier
http://michael.otacoo.com
On Sat, Dec 8, 2012 at 2:01 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Um, I don't think you can swap in a new toast index OID without taking
> exclusive lock on the parent table at some point.
> One sticking point is the need to update pg_class.reltoastidxid. I
> wonder how badly we need that field though --- could we get rid of it
> and treat toast-table indexes just the same as normal ones? (Whatever
> code is looking at the field could perhaps instead rely on
> RelationGetIndexList.)
Yes. reltoastidxid refers to the index of the toast table, so it is necessary to take a lock on the parent relation in this case. I hadn't thought of that. I also do not really know how much this field is used by the toast machinery, but for safety taking a lock on the parent relation would be better.
For a normal index, locking the parent table is not necessary as we do not need to modify anything in the parent relation's entry in pg_class.
--
Michael Paquier
http://michael.otacoo.com
On 2012-12-08 21:24:47 +0900, Michael Paquier wrote: > On Sat, Dec 8, 2012 at 2:19 AM, Andres Freund <andres@2ndquadrant.com>wrote: > > > On 2012-12-07 12:01:52 -0500, Tom Lane wrote: > > > Simon Riggs <simon@2ndQuadrant.com> writes: > > > > On 7 December 2012 12:37, Michael Paquier <michael.paquier@gmail.com> > > wrote: > > > >> - There is still a problem with toast indexes. If the concurrent > > reindex of > > > >> a toast index fails for a reason or another, pg_relation will finish > > with > > > >> invalid toast index entries. I am still wondering about how to clean > > up > > > >> that. Any ideas? > > > > > > > Build another toast index, rather than reindexing the existing one, > > > > then just use the new oid. > > > > Thats easier said than done in the first place. toast_save_datum() > > explicitly opens/modifies the one index it needs and updates it. > > > > > Um, I don't think you can swap in a new toast index OID without taking > > > exclusive lock on the parent table at some point. > > > > The whole swapping issue isn't solved satisfyingly as whole yet :(. > > > > If we just swap the index relfilenodes in the pg_index entries itself, > > we wouldn't need to modify the main table's pg_class at all. > > > I think you are mistaking here, relfilenode is a column of pg_class and not > pg_index. > So whatever the method used for swapping: relfilenode switch or relname > switch, you need to modify the pg_class entry of the old and new indexes. The point is that with a relname switch the pg_class.oid of the index changes. Which is a bad idea because it will possibly be referred to by pg_depend entries. Relfilenodes - which certainly live in pg_class too, thats not the point - aren't referred to externally though. So if everything else in pg_class/pg_index stays the same a relfilenode switch imo saves you a lot of trouble. Greetings, Andres Freund --Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes: > On 2012-12-08 21:24:47 +0900, Michael Paquier wrote: >> So whatever the method used for swapping: relfilenode switch or relname >> switch, you need to modify the pg_class entry of the old and new indexes. > The point is that with a relname switch the pg_class.oid of the index > changes. Which is a bad idea because it will possibly be referred to by > pg_depend entries. Relfilenodes - which certainly live in pg_class too, > thats not the point - aren't referred to externally though. So if > everything else in pg_class/pg_index stays the same a relfilenode switch > imo saves you a lot of trouble. I do not believe that it is safe to modify an index's relfilenode *nor* its OID without exclusive lock; both of those are going to be in use to identify and access the index in concurrent sessions. The only things we could possibly safely swap in a REINDEX CONCURRENTLY are the index relnames, which are not used for identification by the system itself. (I think. It's possible that even this breaks something.) Even then, any such update of the pg_class rows is dependent on switching to MVCC-style catalog access, which frankly is pie in the sky at the moment; the last time pgsql-hackers talked seriously about that, there seemed to be multiple hard problems besides mere performance. If you want to wait for that, it's a safe bet that we won't see this feature for a few years. I'm tempted to propose that REINDEX CONCURRENTLY simply not try to preserve the index name exactly. Something like adding or removing trailing underscores would probably serve to generate a nonconflicting name that's not too unsightly. Or just generate a new name using the same rules that CREATE INDEX would when no name is specified. Yeah, it's a hack, but what about the CONCURRENTLY commands isn't a hack? regards, tom lane
On 2012-12-08 09:40:43 -0500, Tom Lane wrote: > Andres Freund <andres@2ndquadrant.com> writes: > > On 2012-12-08 21:24:47 +0900, Michael Paquier wrote: > >> So whatever the method used for swapping: relfilenode switch or relname > >> switch, you need to modify the pg_class entry of the old and new indexes. > > > The point is that with a relname switch the pg_class.oid of the index > > changes. Which is a bad idea because it will possibly be referred to by > > pg_depend entries. Relfilenodes - which certainly live in pg_class too, > > thats not the point - aren't referred to externally though. So if > > everything else in pg_class/pg_index stays the same a relfilenode switch > > imo saves you a lot of trouble. > > I do not believe that it is safe to modify an index's relfilenode *nor* > its OID without exclusive lock; both of those are going to be in use to > identify and access the index in concurrent sessions. The only things > we could possibly safely swap in a REINDEX CONCURRENTLY are the index > relnames, which are not used for identification by the system itself. > (I think. It's possible that even this breaks something.) Well, the patch currently *does* take an exlusive lock in an extra transaction just for the swapping. In that case it should actually be safe. Although that obviously removes part of the usefulness of the feature. > Even then, any such update of the pg_class rows is dependent on > switching to MVCC-style catalog access, which frankly is pie in the sky > at the moment; the last time pgsql-hackers talked seriously about that, > there seemed to be multiple hard problems besides mere performance. > If you want to wait for that, it's a safe bet that we won't see this > feature for a few years. Yea :( > I'm tempted to propose that REINDEX CONCURRENTLY simply not try to > preserve the index name exactly. Something like adding or removing > trailing underscores would probably serve to generate a nonconflicting > name that's not too unsightly. Or just generate a new name using the > same rules that CREATE INDEX would when no name is specified. Yeah, > it's a hack, but what about the CONCURRENTLY commands isn't a hack? I have no problem with ending up with a new name or something like that. If that is what it takes: fine, no problem. The issue I raised above is just about keeping the pg_depend entries pointing to something valid... And not changing the indexes pg_class.oid seems to be the easiest solution for that. I have some vague schemes in my had that we can solve the swapping issue with 3 entries for the index in pg_class, but they all only seem to come to my head while I don't have anything to write them down, so they are probably bogus. Greetings, Andres Freund --Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes: > The issue I raised above is just about keeping the pg_depend entries > pointing to something valid... And not changing the indexes pg_class.oid > seems to be the easiest solution for that. Yeah, we would have to update pg_depend, pg_constraint, maybe some other places if we go with that. I think that would be safe because we'd be holding ShareRowExclusive lock on the parent table throughout, so nobody else should be doing anything that's critically dependent on seeing such rows. But it'd be a lot of ugly code, for sure. Maybe the best way is to admit that we need a short-term exclusive lock for the swapping step. Or we could wait for MVCC catalog access ... regards, tom lane
On 8 December 2012 15:14, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Maybe the best way is to admit that we need a short-term exclusive lock > for the swapping step. Which wouldn't be so bad if this is just for the toast index, since in many cases the index itself is completely empty anyway, which must offer opportunities for optimization. > Or we could wait for MVCC catalog access ... If there was a published design for that, it would help believe in it more. Do you think one exists? -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Simon Riggs <simon@2ndQuadrant.com> writes: > On 8 December 2012 15:14, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Or we could wait for MVCC catalog access ... > If there was a published design for that, it would help believe in it more. > Do you think one exists? Well, there have been discussion threads about it in the past. I don't recall whether any insoluble issues were raised. I think the concerns were mostly about performance, if we start taking many more snapshots than we have in the past. The basic idea isn't hard: anytime a catalog scan is requested with SnapshotNow, replace that with a freshly taken MVCC snapshot. I think we'd agreed that this could safely be optimized to "only take a new snapshot if any new heavyweight lock has been acquired since the last one". But that'll still be a lot of snapshots, and we know the snapshot-getting code is a bottleneck already. I think the discussions mostly veered off at this point into how to make snapshots cheaper. regards, tom lane
I have updated the patch (v4) to take care of updating reltoastidxid for toast parent relations at the swap step by using index_update_stats. In prior versions of the patch this was done when concurrent index was built, leading to toast relations using invalid indexes if there was a failure before the swap phase. The update of reltoastidxids of toast relation is done with RowExclusiveLock.
I also added a couple of tests in src/test/isolation. Btw, as for the time being the swap step uses AccessExclusiveLock to switch old and new relnames, it does not have any meaning to run them...
On Sat, Dec 8, 2012 at 11:55 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2012-12-08 09:40:43 -0500, Tom Lane wrote:
> > Andres Freund <andres@2ndquadrant.com> writes:
> > I'm tempted to propose that REINDEX CONCURRENTLY simply not try to
> > preserve the index name exactly. Something like adding or removing
> > trailing underscores would probably serve to generate a nonconflicting
> > name that's not too unsightly. Or just generate a new name using the
> > same rules that CREATE INDEX would when no name is specified. Yeah,
> > it's a hack, but what about the CONCURRENTLY commands isn't a hack?
> I have no problem with ending up with a new name or something like
> that. If that is what it takes: fine, no problem.
For the indexes that are created internally by the system like toast or internal primary keys this is acceptable. However in the case of indexes that have been created externally I do not think it is acceptable as this impacts the user that created those indexes with a specific name.
pg_reorg itself also uses the relname switch method when rebuilding indexes, and people using it did not complain about the heavy lock taken at the swap phase, but praised it as it really helps in reducing the lock taken during index rebuild and validation, which are the phases that take the largest amount of time in the REINDEX process.
Michael Paquier
http://michael.otacoo.com
On 10 December 2012 06:03, Michael Paquier <michael.paquier@gmail.com> wrote: >> On 2012-12-08 09:40:43 -0500, Tom Lane wrote: >> > Andres Freund <andres@2ndquadrant.com> writes: >> > I'm tempted to propose that REINDEX CONCURRENTLY simply not try to >> > preserve the index name exactly. Something like adding or removing >> > trailing underscores would probably serve to generate a nonconflicting >> > name that's not too unsightly. Or just generate a new name using the >> > same rules that CREATE INDEX would when no name is specified. Yeah, >> > it's a hack, but what about the CONCURRENTLY commands isn't a hack? >> >> I have no problem with ending up with a new name or something like >> that. If that is what it takes: fine, no problem. > > For the indexes that are created internally by the system like toast or > internal primary keys this is acceptable. However in the case of indexes > that have been created externally I do not think it is acceptable as this > impacts the user that created those indexes with a specific name. If I have to choose between (1) keeping the same name OR (2) avoiding an AccessExclusiveLock then I would choose (2). Most other people would also, especially when all we would do is add/remove an underscore. Even if that is user visible. And if it is we can support a LOCK option that does (1) instead. If we make it an additional constraint on naming, it won't be a problem... namely that you can't create an index with/without an underscore at the end, if a similar index already exists that has an identical name apart from the suffix. There are few, if any, commands that need the index name to remain the same. For those, I think we can bend them to accept the index name and then add/remove the underscore to get that to work. That's all a little bit crappy, but this is too small a problem with an important feature to allow us to skip. -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 2012/12/10, at 18:28, Simon Riggs <simon@2ndQuadrant.com> wrote:
> On 10 December 2012 06:03, Michael Paquier <michael.paquier@gmail.com> wrote:
>>> On 2012-12-08 09:40:43 -0500, Tom Lane wrote:
>>>> Andres Freund <andres@2ndquadrant.com> writes:
>>>> I'm tempted to propose that REINDEX CONCURRENTLY simply not try to
>>>> preserve the index name exactly. Something like adding or removing
>>>> trailing underscores would probably serve to generate a nonconflicting
>>>> name that's not too unsightly. Or just generate a new name using the
>>>> same rules that CREATE INDEX would when no name is specified. Yeah,
>>>> it's a hack, but what about the CONCURRENTLY commands isn't a hack?
>>>
>>> I have no problem with ending up with a new name or something like
>>> that. If that is what it takes: fine, no problem.
>>
>> For the indexes that are created internally by the system like toast or
>> internal primary keys this is acceptable. However in the case of indexes
>> that have been created externally I do not think it is acceptable as this
>> impacts the user that created those indexes with a specific name.
>
> If I have to choose between (1) keeping the same name OR (2) avoiding
> an AccessExclusiveLock then I would choose (2). Most other people
> would also, especially when all we would do is add/remove an
> underscore. Even if that is user visible. And if it is we can support
> a LOCK option that does (1) instead.
>
> If we make it an additional constraint on naming, it won't be a
> problem... namely that you can't create an index with/without an
> underscore at the end, if a similar index already exists that has an
> identical name apart from the suffix.
>
> There are few, if any, commands that need the index name to remain the
> same. For those, I think we can bend them to accept the index name and
> then add/remove the underscore to get that to work.
>
> That's all a little bit crappy, but this is too small a problem with
> an important feature to allow us to skip.
Ok. Removing the switch name part is only deleting 10 lines of code in index_concurrent_swap.
Then, do you guys have a preferred format for the concurrent index name? For the time being an inelegant _cct suffix is used. The underscore at the end?
Michael
--
Michael Paquier
http://michael.otacoo.com
On 2012-12-10 15:03:59 +0900, Michael Paquier wrote:
> I have updated the patch (v4) to take care of updating reltoastidxid for
> toast parent relations at the swap step by using index_update_stats. In
> prior versions of the patch this was done when concurrent index was built,
> leading to toast relations using invalid indexes if there was a failure
> before the swap phase. The update of reltoastidxids of toast relation is
> done with RowExclusiveLock.
> I also added a couple of tests in src/test/isolation. Btw, as for the time
> being the swap step uses AccessExclusiveLock to switch old and new
> relnames, it does not have any meaning to run them...
Btw, as an example of the problems caused by renaming:

postgres=# CREATE TABLE a (id serial primary key);
CREATE TABLE
Time: 137.840 ms
postgres=# CREATE TABLE b(id serial primary key, a_id int REFERENCES a);
CREATE TABLE
Time: 143.500 ms
postgres=# \d b
                            Table "public.b"
 Column |  Type   |                   Modifiers
--------+---------+------------------------------------------------
 id     | integer | not null default nextval('b_id_seq'::regclass)
 a_id   | integer |
Indexes:
    "b_pkey" PRIMARY KEY, btree (id)
Foreign-key constraints:
    "b_a_id_fkey" FOREIGN KEY (a_id) REFERENCES a(id)

postgres=# REINDEX TABLE a CONCURRENTLY;
NOTICE: drop cascades to constraint b_a_id_fkey on table b
REINDEX
Time: 248.992 ms
postgres=# \d b
                            Table "public.b"
 Column |  Type   |                   Modifiers
--------+---------+------------------------------------------------
 id     | integer | not null default nextval('b_id_seq'::regclass)
 a_id   | integer |
Indexes:
    "b_pkey" PRIMARY KEY, btree (id)

Looking at the patch for a bit now.
Regards,
Andres
--
Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 2012-12-10 15:51:40 +0100, Andres Freund wrote:
> On 2012-12-10 15:03:59 +0900, Michael Paquier wrote:
> > I have updated the patch (v4) to take care of updating reltoastidxid for
> > toast parent relations at the swap step by using index_update_stats. In
> > prior versions of the patch this was done when concurrent index was built,
> > leading to toast relations using invalid indexes if there was a failure
> > before the swap phase. The update of reltoastidxids of toast relation is
> > done with RowExclusiveLock.
> > I also added a couple of tests in src/test/isolation. Btw, as for the time
> > being the swap step uses AccessExclusiveLock to switch old and new
> > relnames, it does not have any meaning to run them...
>
> Btw, as an example of the problems caused by renaming:
> Looking at the patch for a bit now.

Some review comments:
* Some of the added !is_reindex in index_create don't seem safe to me. Why do we now support reindexing exclusion constraints?
* REINDEX DATABASE .. CONCURRENTLY doesn't work; a variant that does the concurrent reindexing for user tables and non-concurrent for system tables would be very useful. E.g. for the upgrade from 9.1.5->9.1.6...
* ISTM index_concurrent_swap should get exclusive locks on the relations *before* printing their names. This shouldn't be required because we have a lock prohibiting schema changes on the parent table, but it feels safer.
* temporary index names during swapping should also be named via ChooseIndexName
* why does create_toast_table pass an unconditional 'is_reindex' to index_create?
* would be nice (but that's probably a step #2 thing) to do the individual steps of concurrent reindex over multiple relations to avoid too much overall waiting for other transactions.
* ReindexConcurrentIndexes:
  * says "Such indexes are simply bypassed if caller has not specified anything." but ERROR's. Imo ERROR is fine, but the comment should be adjusted...
  * should perhaps be named ReindexIndexesConcurrently?
  * Imo the PHASE 1 comment should be after gathering/validating the chosen indexes
  * It seems better to me to use individual transactions + snapshots for each index, no need to keep very long transactions open (PHASE 2/3)
  * s/same whing/same thing/
  * Shouldn't a CacheInvalidateRelcacheByRelid be done after PHASE 2 and 5 as well?
  * PHASE 6 should acquire exclusive locks on the indexes
* can some of the index_concurrent_* infrastructure be reused for DROP INDEX CONCURRENTLY?
* in CREATE/DROP INDEX CONCURRENTLY, CONCURRENTLY comes before the object name; should we keep that convention?

That's all I have for now. Very nice work! Imo the code looks cleaner after your patch...
Greetings,
Andres
--
Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Michael Paquier <michael.paquier@gmail.com> writes: > On 2012/12/10, at 18:28, Simon Riggs <simon@2ndQuadrant.com> wrote: >> If I have to choose between (1) keeping the same name OR (2) avoiding >> an AccessExclusiveLock then I would choose (2). Most other people >> would also, especially when all we would do is add/remove an >> underscore. Even if that is user visible. And if it is we can support >> a LOCK option that does (1) instead. > Ok. Removing the switch name part is only deleting 10 lines of code in index_concurrent_swap. > Then, do you guys have a preferred format for the concurrent index name? For the time being an inelegant _cct suffix isused. The underscore at the end? You still need to avoid conflicting name assignments, so my recommendation would really be to use the select-a-new-name code already in use for CREATE INDEX without an index name. The underscore idea is cute, but I doubt it's worth the effort to implement, document, or explain it in a way that copes with repeated REINDEXes and conflicts. regards, tom lane
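For illustration, this is the naming behavior referred to above: when CREATE INDEX is given no index name, the system picks a non-conflicting one by itself (table and column names are placeholders):
CREATE TABLE t (a int);
CREATE INDEX ON t (a);   -- system chooses a name such as t_a_idx
CREATE INDEX ON t (a);   -- a second, identical definition gets a distinct name, e.g. t_a_idx1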
On 12/8/12 9:40 AM, Tom Lane wrote: > I'm tempted to propose that REINDEX CONCURRENTLY simply not try to > preserve the index name exactly. Something like adding or removing > trailing underscores would probably serve to generate a nonconflicting > name that's not too unsightly. If you think you can rename an index without an exclusive lock, then why not rename it back to the original name when you're done?
On 10 December 2012 22:18, Peter Eisentraut <peter_e@gmx.net> wrote: > On 12/8/12 9:40 AM, Tom Lane wrote: >> I'm tempted to propose that REINDEX CONCURRENTLY simply not try to >> preserve the index name exactly. Something like adding or removing >> trailing underscores would probably serve to generate a nonconflicting >> name that's not too unsightly. > > If you think you can rename an index without an exclusive lock, then why > not rename it back to the original name when you're done? Because the index isn't being renamed. An alternate equivalent index is being created instead. -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 12/10/12 5:21 PM, Simon Riggs wrote: > On 10 December 2012 22:18, Peter Eisentraut <peter_e@gmx.net> wrote: >> On 12/8/12 9:40 AM, Tom Lane wrote: >>> I'm tempted to propose that REINDEX CONCURRENTLY simply not try to >>> preserve the index name exactly. Something like adding or removing >>> trailing underscores would probably serve to generate a nonconflicting >>> name that's not too unsightly. >> >> If you think you can rename an index without an exclusive lock, then why >> not rename it back to the original name when you're done? > > Because the index isn't being renamed. An alternate equivalent index > is being created instead. Right, basically, you can do this right now using CREATE INDEX CONCURRENTLY ${name}_tmp ... DROP INDEX CONCURRENTLY ${name}; ALTER INDEX ${name}_tmp RENAME TO ${name}; The only tricks here are if ${name}_tmp is already taken, in which case you might as well just error out (or try a few different names), and if ${name} is already in use by the time you get to the last line, in which case you can log a warning or an error. What am I missing?
On 10 December 2012 22:27, Peter Eisentraut <peter_e@gmx.net> wrote: > On 12/10/12 5:21 PM, Simon Riggs wrote: >> On 10 December 2012 22:18, Peter Eisentraut <peter_e@gmx.net> wrote: >>> On 12/8/12 9:40 AM, Tom Lane wrote: >>>> I'm tempted to propose that REINDEX CONCURRENTLY simply not try to >>>> preserve the index name exactly. Something like adding or removing >>>> trailing underscores would probably serve to generate a nonconflicting >>>> name that's not too unsightly. >>> >>> If you think you can rename an index without an exclusive lock, then why >>> not rename it back to the original name when you're done? >> >> Because the index isn't being renamed. An alternate equivalent index >> is being created instead. > > Right, basically, you can do this right now using > > CREATE INDEX CONCURRENTLY ${name}_tmp ... > DROP INDEX CONCURRENTLY ${name}; > ALTER INDEX ${name}_tmp RENAME TO ${name}; > > The only tricks here are if ${name}_tmp is already taken, in which case > you might as well just error out (or try a few different names), and if > ${name} is already in use by the time you get to the last line, in which > case you can log a warning or an error. > > What am I missing? That this is already recorded in my book> ;-) And also that REINDEX CONCURRENTLY doesn't work like that, yet. -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 2012-12-10 17:27:45 -0500, Peter Eisentraut wrote: > On 12/10/12 5:21 PM, Simon Riggs wrote: > > On 10 December 2012 22:18, Peter Eisentraut <peter_e@gmx.net> wrote: > >> On 12/8/12 9:40 AM, Tom Lane wrote: > >>> I'm tempted to propose that REINDEX CONCURRENTLY simply not try to > >>> preserve the index name exactly. Something like adding or removing > >>> trailing underscores would probably serve to generate a nonconflicting > >>> name that's not too unsightly. > >> > >> If you think you can rename an index without an exclusive lock, then why > >> not rename it back to the original name when you're done? > > > > Because the index isn't being renamed. An alternate equivalent index > > is being created instead. > > Right, basically, you can do this right now using > > CREATE INDEX CONCURRENTLY ${name}_tmp ... > DROP INDEX CONCURRENTLY ${name}; > ALTER INDEX ${name}_tmp RENAME TO ${name}; > > The only tricks here are if ${name}_tmp is already taken, in which case > you might as well just error out (or try a few different names), and if > ${name} is already in use by the time you get to the last line, in which > case you can log a warning or an error. > > What am I missing? I don't think this is the problematic side of the patch. The question is rather how to transfer the dependencies without too much ugliness or how to swap oids without a race. Either by accepting an exlusive lock or by playing some games, the latter possibly being easier with renaming... Greetings, Andres Freund --Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 2012-12-10 22:33:50 +0000, Simon Riggs wrote: > On 10 December 2012 22:27, Peter Eisentraut <peter_e@gmx.net> wrote: > > On 12/10/12 5:21 PM, Simon Riggs wrote: > >> On 10 December 2012 22:18, Peter Eisentraut <peter_e@gmx.net> wrote: > >>> On 12/8/12 9:40 AM, Tom Lane wrote: > >>>> I'm tempted to propose that REINDEX CONCURRENTLY simply not try to > >>>> preserve the index name exactly. Something like adding or removing > >>>> trailing underscores would probably serve to generate a nonconflicting > >>>> name that's not too unsightly. > >>> > >>> If you think you can rename an index without an exclusive lock, then why > >>> not rename it back to the original name when you're done? > >> > >> Because the index isn't being renamed. An alternate equivalent index > >> is being created instead. > > > > Right, basically, you can do this right now using > > > > CREATE INDEX CONCURRENTLY ${name}_tmp ... > > DROP INDEX CONCURRENTLY ${name}; > > ALTER INDEX ${name}_tmp RENAME TO ${name}; > > > > The only tricks here are if ${name}_tmp is already taken, in which case > > you might as well just error out (or try a few different names), and if > > ${name} is already in use by the time you get to the last line, in which > > case you can log a warning or an error. > > > > What am I missing? > > That this is already recorded in my book> ;-) > > And also that REINDEX CONCURRENTLY doesn't work like that, yet. The last submitted patch works pretty similar: CREATE INDEX CONCURRENTLY $name_cct; ALTER INDEX $name RENAME TO cct_$name; ALTER INDEX $name_tmp RENAME TO $tmp; ALTER INDEX $name_tmp RENAME TO $name_cct; DROP INDEX CONURRENCTLY $name_cct; It does that under an exlusive locks, but doesn't handle dependencies yet... Greetings, Andres Freund --Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
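For readers following the thread, the swap described above, applied to an index originally named idx on table tab, corresponds to roughly this SQL (the intermediate name and the table/column are placeholders, not taken from the patch):

CREATE INDEX CONCURRENTLY idx_cct ON tab (col);  -- build the replacement index
ALTER INDEX idx RENAME TO idx_tmp;               -- move the old index out of the way
ALTER INDEX idx_cct RENAME TO idx;               -- the new index takes the original name
ALTER INDEX idx_tmp RENAME TO idx_cct;           -- the old index now carries the _cct suffix
DROP INDEX CONCURRENTLY idx_cct;                 -- finally drop the old index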
On Mon, Dec 10, 2012 at 11:51 PM, Andres Freund <andres@2ndquadrant.com> wrote:
Btw, as an example of the problems caused by renaming:
postgres=# CREATE TABLE a (id serial primary key); CREATE TABLE b(id
serial primary key, a_id int REFERENCES a);
CREATE TABLE
Time: 137.840 ms
CREATE TABLE
Time: 143.500 ms
postgres=# \d b
Table "public.b"
Column | Type | Modifiers
--------+---------+------------------------------------------------
id | integer | not null default nextval('b_id_seq'::regclass)
a_id | integer |
Indexes:
"b_pkey" PRIMARY KEY, btree (id)
Foreign-key constraints:
"b_a_id_fkey" FOREIGN KEY (a_id) REFERENCES a(id)
postgres=# REINDEX TABLE a CONCURRENTLY;
NOTICE: drop cascades to constraint b_a_id_fkey on table b
REINDEX
Time: 248.992 ms
postgres=# \d b
Table "public.b"
Column | Type | Modifiers
--------+---------+------------------------------------------------
id | integer | not null default nextval('b_id_seq'::regclass)
a_id | integer |
Indexes:
"b_pkey" PRIMARY KEY, btree (id)
Oops. I will fix that in the next version of the patch. There should be an elegant way to change the dependencies at the swap phase.
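For reference, the link that makes the drop cascade here is recorded in pg_constraint: a foreign key stores in conindid the index on the referenced table that it relies on. A sketch of how to see it for the example above (not output from the patch):

SELECT conname, conindid::regclass
FROM pg_constraint
WHERE contype = 'f' AND confrelid = 'a'::regclass;
-- expected to show b_a_id_fkey pointing at a_pkey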
Michael Paquier
http://michael.otacoo.com
On Mon, Dec 10, 2012 at 5:18 PM, Peter Eisentraut <peter_e@gmx.net> wrote: > On 12/8/12 9:40 AM, Tom Lane wrote: >> I'm tempted to propose that REINDEX CONCURRENTLY simply not try to >> preserve the index name exactly. Something like adding or removing >> trailing underscores would probably serve to generate a nonconflicting >> name that's not too unsightly. > > If you think you can rename an index without an exclusive lock, then why > not rename it back to the original name when you're done? Yeah... and also, why do you think that? I thought the idea that we could do any such thing had been convincingly refuted. Frankly, I think that if REINDEX CONCURRENTLY is just shorthand for "CREATE INDEX CONCURRENTLY with a different name and then DROP INDEX CONCURRENTLY on the old name", it's barely worth doing. People can do that already, and do, and then we don't have to explain the wart that the name changes under you. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2012-12-11 15:23:52 -0500, Robert Haas wrote: > On Mon, Dec 10, 2012 at 5:18 PM, Peter Eisentraut <peter_e@gmx.net> wrote: > > On 12/8/12 9:40 AM, Tom Lane wrote: > >> I'm tempted to propose that REINDEX CONCURRENTLY simply not try to > >> preserve the index name exactly. Something like adding or removing > >> trailing underscores would probably serve to generate a nonconflicting > >> name that's not too unsightly. > > > > If you think you can rename an index without an exclusive lock, then why > > not rename it back to the original name when you're done? > > Yeah... and also, why do you think that? I thought the idea that we > could do any such thing had been convincingly refuted. > > Frankly, I think that if REINDEX CONCURRENTLY is just shorthand for > "CREATE INDEX CONCURRENTLY with a different name and then DROP INDEX > CONCURRENTLY on the old name", it's barely worth doing. People can do > that already, and do, and then we don't have to explain the wart that > the name changes under you. Its fundamentally different in that you can do it with constraints referencing the index present. And that it works with toast tables. Greetings, Andres Freund --Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
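As a concrete illustration of the constraint point, here is a minimal sketch (table and index names are invented for the example) of why the manual CREATE/DROP CONCURRENTLY route cannot replace an index that backs a constraint:

CREATE TABLE t (id int PRIMARY KEY);
CREATE UNIQUE INDEX CONCURRENTLY t_pkey_new ON t (id);
DROP INDEX CONCURRENTLY t_pkey;
-- fails: the index is required by the table's primary key constraint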
Thanks for all your comments.
The new version (v5) of this patch fixes the error you found when reindexing indexes being referenced in foreign keys.
The fix is done with switchIndexConstraintOnForeignKey:pg_constraint.c, in charge of scanning pg_constraint for foreign keys that refer the parent relation (confrelid) of the index being swapped and then switch conindid to the new index if the old index was referenced.
This API also takes care of switching the dependency between the foreign key and the old index by calling changeDependencyFor.
I also added a regression test for this purpose.
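The dependency side of this, for the earlier a/b example, lives in pg_depend; this is the entry that changeDependencyFor has to repoint to the new index (a sketch for illustration, not code from the patch):

SELECT classid::regclass AS dependent_catalog, objid, deptype
FROM pg_depend
WHERE refclassid = 'pg_class'::regclass
AND refobjid = 'a_pkey'::regclass;
-- the row with dependent_catalog = pg_constraint is the b_a_id_fkey entry,
-- the one that previously made the drop of the old index cascade to the foreign key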
Michael Paquier
http://michael.otacoo.com
On Tue, Dec 11, 2012 at 12:28 AM, Andres Freund <andres@2ndquadrant.com> wrote:
Some review comments:
* Some of the added !is_reindex in index_create don't seem safe to
me.
This is added to allow concurrent index creation for toast relations. If we do not add an additional flag for that, it will not be possible to reindex a toast index concurrently.
* Why do we now support reindexing exclusion constraints?
CREATE INDEX CONCURRENTLY is not supported for exclusion constraints, but I played around with exclusion constraints with my patch and did not particularly see any problems in supporting them, as for example index_build performs a second scan of the heap when running, so it looks solid enough for that. Is it because the structure of the REINDEX CONCURRENTLY patch is different? Honestly I think not, so is there something I am not aware of?
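For context, this is the kind of constraint in question; a minimal example (nothing here comes from the patch, and tsrange's built-in GiST support avoids needing any extension):

CREATE TABLE booking (during tsrange, EXCLUDE USING gist (during WITH &&));
-- building the backing index concurrently is not supported for exclusion constraints;
-- the question above is whether an existing one can nevertheless be rebuilt concurrently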
* REINDEX DATABASE .. CONCURRENTLY doesn't work, a variant that does the
concurrent reindexing for user-tables and non-concurrent for system
tables would be very useful. E.g. for the upgrade from 9.1.5->9.1.6...
OK. I thought that this was out of scope for the time being. I haven't done anything about that yet. Supporting that will not be complicated as ReindexRelationsConcurrently (new API) is more flexible now; the only thing needed is to gather the list of relations that need to be reindexed.
* ISTM index_concurrent_swap should get exclusive locks on the relation
*before* printing their names. This shouldn't be required because we
have a lock prohibiting schema changes on the parent table, but it
feels safer.
Done. AccessExclusiveLock is taken before calling RenameRelationInternal now.
* temporary index names during swapping should also be named via
ChooseIndexName
Done. I used instead ChooseRelationName which is externalized through defrem.h.
* why does create_toast_table pass an unconditional 'is_reindex' to
index_create?
Done. The flag is changed to false.
* would be nice (but thats probably a step #2 thing) to do the
individual steps of concurrent reindex over multiple relations to
avoid too much overall waiting for other transactions.
I think I did that now by using one transaction per index for each operation except the drop phase...
* ReindexConcurrentIndexes:
I renamed ReindexConcurrentIndexes to ReindexRelationsConcurrently and changed the arguments it used to something more generic:
ReindexRelationsConcurrently(List *relationIds)
relationIds is a list of relation Oids that can be include tables and/or indexes Oid.
Based on this list of relation Oid, we build the list of indexes that are rebuilt, including the toast indexes if necessary.
* says " Such indexes are simply bypassed if caller has not specified
anything." but ERROR's. Imo ERROR is fine, but the comment should be
adjusted...
Done.
* should perhaps be named ReindexIndexesConcurrently?
Kind of done.
* Imo the PHASE 1 comment should be after gathering/validating the
chosen indexes
Comment is moved. Thanks.
* It seems better to me to use individual transactions + snapshots
for each index, no need to keep very long transactions open (PHASE
2/3)
Good point. I did that. Now individual transactions are used for each index.
* s/same whing/same thing/
Done.
* Shouldn't a CacheInvalidateRelcacheByRelid be done after PHASE 2 and
5 as well?
Done. Nice catch.
* PHASE 6 should acquire exclusive locks on the indexes
The necessary lock is taken when calling index_drop through performMultipleDeletions. Do you think it is not enough and that I should add an exclusive lock inside index_concurrent_drop?
* can some of index_concurrent_* infrastructure be reused for
DROP INDEX CONCURRENTLY?
Indeed. After looking at the code I found that 2 steps are done in a concurrent context: invalidating the index and setting it as dead.
As REINDEX CONCURRENTLY does these 2 steps in batch for a list of indexes, I added index_concurrent_set_dead to mark the dropped indexes as dead, and index_concurrent_clear_valid. Those 2 functions are used by both REINDEX CONCURRENTLY and DROP INDEX CONCURRENTLY.
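For anyone following the flags involved, they are all visible in pg_index; a quick way to watch an index move through these states (the table name is a placeholder):

SELECT indexrelid::regclass AS index, indisvalid, indisready, indislive
FROM pg_index
WHERE indrelid = 'tab'::regclass;
-- an index on its way out is first marked invalid (queries stop using it),
-- then dead (indisready/indislive cleared, writes stop maintaining it), then dropped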
* in CREATE/DROP INDEX CONCURRENTLY 'CONCURRENTLY' comes before the
object name, should we keep that convention?
Good point. I changed the grammar to REINDEX obj [ CONCURRENTLY ] objname.
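So with the revised grammar the commands look like this (object names are placeholders):

REINDEX INDEX CONCURRENTLY idx_name;
REINDEX TABLE CONCURRENTLY tab_name;
-- same keyword placement as CREATE INDEX CONCURRENTLY and DROP INDEX CONCURRENTLY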
Thanks,
Michael Paquier
http://michael.otacoo.com
On 2012-12-17 11:44:00 +0900, Michael Paquier wrote: > Thanks for all your comments. > The new version (v5) of this patch fixes the error you found when > reindexing indexes being referenced in foreign keys. > The fix is done with switchIndexConstraintOnForeignKey:pg_constraint.c, in > charge of scanning pg_constraint for foreign keys that refer the parent > relation (confrelid) of the index being swapped and then switch conindid to > the new index if the old index was referenced. > This API also takes care of switching the dependency between the foreign > key and the old index by calling changeDependencyFor. > I also added a regression test for this purpose. Ok. Are there no other depencencies towards indexes? I don't know of any right now, but I have the feeling there were some other cases. > On Tue, Dec 11, 2012 at 12:28 AM, Andres Freund <andres@2ndquadrant.com>wrote: > > > Some review comments: > > > > * Some of the added !is_reindex in index_create don't seem safe to > > me. > > > This is added to control concurrent index relation for toast indexes. If we > do not add an additional flag for that it will not be possible to reindex > concurrently a toast index. I think some of them were added for cases that didn't seem to be related to that. I'll recheck in the current version. > > * Why do we now support reindexing exclusion constraints? > > > CREATE INDEX CONCURRENTLY is not supported for exclusive constraints but I > played around with exclusion constraints with my patch and did not > particularly see any problems in supporting them as for example index_build > performs a second scan of the heap when running so it looks enough solid > for that. Is it because the structure of REINDEX CONCURRENTLY patch is > different? Honestly I think no so is there something I am not aware of? I think I asked because you had added an && !is_reindex to one of the checks. If I recall the reason why concurrent index builds couldn't support exclusion constraints correctly - namely that we cannot use them to check for new row versions when the index is in the ready && !valid state - that shouldn't be a problem when we have a valid version of an old index arround because that enforces everything. It would maybe need an appropriate if (!isvalid) in the exclusion constraint code, but that should be it. > * REINDEX DATABASE .. CONCURRENTLY doesn't work, a variant that does the > > concurrent reindexing for user-tables and non-concurrent for system > > tables would be very useful. E.g. for the upgrade from 9.1.5->9.1.6... > > > OK. I thought that this was out of scope for the time being. I haven't done > anything about that yet. Supporting that will not be complicated as > ReindexRelationsConcurrently (new API) is more flexible now, the only thing > needed is to gather the list of relations that need to be reindexed. Imo that so greatly reduces the usability of this patch that you should treat it as in scope ;). Especially as you say, it really shouldn't be that much work with all the groundwork built. > > * would be nice (but thats probably a step #2 thing) to do the > > individual steps of concurrent reindex over multiple relations to > > avoid too much overall waiting for other transactions. > > > I think I did that by now using one transaction per index for each > operation except the drop phase... Without yet having read the new version, I think thats not what I meant. There currently is a wait for concurrent transactions to end after most of the phases for every relation, right? 
If you have a busy database with somewhat longrunning transactions thats going to slow everything down with waiting quite bit. I wondered whether it would make sense to do PHASE1 for all indexes in all relations, then wait once, then PHASE2... That obviously has some space and index maintainece overhead issues, but its probably sensible anyway in many cases. > > * PHASE 6 should acquire exlusive locks on the indexes > > > The necessary lock is taken when calling index_drop through > performMultipleDeletion. Do you think it is not enough and that i should > add an Exclusive lock inside index_concurrent_drop? It seems to be safer to acquire it earlier, otherwise the likelihood for deadlocks seems to be slightly higher as youre increasing the lock severity. And it shouldn't cause any disadvantages,s o ... Starts to look really nice now! Isn't the following block content thats mostly available somewhere else already? > + <refsect2 id="SQL-REINDEX-CONCURRENTLY"> > + <title id="SQL-REINDEX-CONCURRENTLY-title">Rebuilding Indexes Concurrently</title> > + > + <indexterm zone="SQL-REINDEX-CONCURRENTLY"> > + <primary>index</primary> > + <secondary>rebuilding concurrently</secondary> > + </indexterm> > + > + <para> > + Rebuilding an index can interfere with regular operation of a database. > + Normally <productname>PostgreSQL</> locks the table whose index is rebuilt > + against writes and performs the entire index build with a single scan of the > + table. Other transactions can still read the table, but if they try to > + insert, update, or delete rows in the table they will block until the > + index rebuild is finished. This could have a severe effect if the system is > + a live production database. Very large tables can take many hours to be > + indexed, and even for smaller tables, an index rebuild can lock out writers > + for periods that are unacceptably long for a production system. > + </para> ... > + <para> > + Regular index builds permit other regular index builds on the > + same table to occur in parallel, but only one concurrent index build > + can occur on a table at a time. In both cases, no other types of schema > + modification on the table are allowed meanwhile. Another difference > + is that a regular <command>REINDEX TABLE</> or <command>REINDEX INDEX</> > + command can be performed within a transaction block, but > + <command>REINDEX CONCURRENTLY</> cannot. <command>REINDEX DATABASE</> is > + by default not allowed to run inside a transaction block, so in this case > + <command>CONCURRENTLY</> is not supported. > + </para> > + > - if (concurrent && is_exclusion) > + if (concurrent && is_exclusion && !is_reindex) > ereport(ERROR, > (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > errmsg_internal("concurrent index creation for exclusion constraints is not supported"))); This is what I referred to above wrt reindex and CONCURRENTLY. We shouldn't pass concurrently if we don't deem it to be safe for exlusion constraints. > +/* > + * index_concurrent_drop > + * > + * Drop a list of indexes as the last step of a concurrent process. Deletion is > + * done through performDeletion or dependencies of the index are not dropped. > + * At this point all the indexes are already considered as invalid and dead so > + * they can be dropped without using any concurrent options. 
> + */ > +void > +index_concurrent_drop(List *indexIds) > +{ > + ListCell *lc; > + ObjectAddresses *objects = new_object_addresses(); > + > + Assert(indexIds != NIL); > + > + /* Scan the list of indexes and build object list for normal indexes */ > + foreach(lc, indexIds) > + { > + Oid indexOid = lfirst_oid(lc); > + Oid constraintOid = get_index_constraint(indexOid); > + ObjectAddress object; > + > + /* Register constraint or index for drop */ > + if (OidIsValid(constraintOid)) > + { > + object.classId = ConstraintRelationId; > + object.objectId = constraintOid; > + } > + else > + { > + object.classId = RelationRelationId; > + object.objectId = indexOid; > + } > + > + object.objectSubId = 0; > + > + /* Add object to list */ > + add_exact_object_address(&object, objects); > + } > + > + /* Perform deletion for normal and toast indexes */ > + performMultipleDeletions(objects, > + DROP_RESTRICT, > + 0); > +} Just for warm and fuzzy feeling I think it would be a good idea to recheck that indexes are !indislive here. > diff --git a/src/backend/catalog/pg_constraint.c b/src/backend/catalog/pg_constraint.c > index 5e8c6da..55c092d 100644 > + > +/* > + * switchIndexConstraintOnForeignKey > + * > + * Switch foreign keys references for a given index to a new index created > + * concurrently. This process is used when swapping indexes for a concurrent > + * process. All the constraints that are not referenced externally like primary > + * keys or unique indexes should be switched using the structure of index.c for > + * concurrent index creation and drop. > + * This function takes care of also switching the dependencies of the foreign > + * key from the old index to the new index in pg_depend. > + * > + * In order to complete this process, the following process is done: > + * 1) Scan pg_constraint and extract the list of foreign keys that refer to the > + * parent relation of the index being swapped as conrelid. > + * 2) Check in this list the foreign keys that use the old index as reference > + * here with conindid > + * 3) Update field conindid to the new index Oid on all the foreign keys > + * 4) Switch dependencies of the foreign key to the new index > + */ > +void > +switchIndexConstraintOnForeignKey(Oid parentOid, > + Oid oldIndexOid, > + Oid newIndexOid) > +{ > + ScanKeyData skey[1]; > + SysScanDesc conscan; > + Relation conRel; > + HeapTuple htup; > + > + /* > + * Search pg_constraint for the foreign key constraints associated > + * with the index by scanning using conrelid. > + */ > + ScanKeyInit(&skey[0], > + Anum_pg_constraint_confrelid, > + BTEqualStrategyNumber, F_OIDEQ, > + ObjectIdGetDatum(parentOid)); > + > + conRel = heap_open(ConstraintRelationId, AccessShareLock); > + conscan = systable_beginscan(conRel, ConstraintForeignRelidIndexId, > + true, SnapshotNow, 1, skey); > + > + while (HeapTupleIsValid(htup = systable_getnext(conscan))) > + { > + Form_pg_constraint contuple = (Form_pg_constraint) GETSTRUCT(htup); > + > + /* Check if a foreign constraint uses the index being swapped */ > + if (contuple->contype == CONSTRAINT_FOREIGN && > + contuple->confrelid == parentOid && > + contuple->conindid == oldIndexOid) > + { > + /* Found an index, so update its pg_constraint entry */ > + contuple->conindid = newIndexOid; > + /* And write it back in place */ > + heap_inplace_update(conRel, htup); I am pretty doubtful that using heap_inplace_update is the correct thing to do here. What if we fail later? Even if there's some justification for it being safe it deserves a big comment. 
The other cases where heap_inplace_update is used in the context of CONCURRENTLY are pretty careful about where to do it and have special state flags of indicating that this has been done... > > +bool > +ReindexRelationsConcurrently(List *relationIds) > +{ > + foreach(lc, relationIds) > + { > + Oid relationOid = lfirst_oid(lc); > + > + switch (get_rel_relkind(relationOid)) > + { > + case RELKIND_RELATION: > + { > + /* > + * In the case of a relation, find all its indexes > + * including toast indexes. > + */ > + Relation heapRelation = heap_open(relationOid, > + ShareUpdateExclusiveLock); > + > + /* Relation on which is based index cannot be shared */ > + if (heapRelation->rd_rel->relisshared) > + ereport(ERROR, > + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > + errmsg("concurrent reindex is not supported for shared relations"))); > + > + /* Add all the valid indexes of relation to list */ > + foreach(lc2, RelationGetIndexList(heapRelation)) > + { > + Oid cellOid = lfirst_oid(lc2); > + Relation indexRelation = index_open(cellOid, > + ShareUpdateExclusiveLock); > + > + if (!indexRelation->rd_index->indisvalid) > + ereport(WARNING, > + (errcode(ERRCODE_INDEX_CORRUPTED), > + errmsg("cannot reindex concurrently invalid index \"%s.%s\", bypassing", > + get_namespace_name(get_rel_namespace(cellOid)), > + get_rel_name(cellOid)))); > + else > + indexIds = list_append_unique_oid(indexIds, > + cellOid); > + > + index_close(indexRelation, ShareUpdateExclusiveLock); > + } Why are we releasing the locks here if we are going to reindex the relations? They might change inbetween. I think we should take an appropriate lock here, including the locks on the parent relations. Yes, its slightly more duplicative code, and not acquiring locks multiple times is somewhat complicated, but I think its required. I think you should also explicitly do the above in a transaction... > + /* > + * Phase 2 of REINDEX CONCURRENTLY > + * > + * Build concurrent indexes in a separate transaction for each index to > + * avoid having open transactions for an unnecessary long time. We also > + * need to wait until no running transactions could have the parent table > + * of index open. A concurrent build is done for each concurrent > + * index that will replace the old indexes. > + */ > + > + /* Get the first element of concurrent index list */ > + lc2 = list_head(concurrentIndexIds); > + > + foreach(lc, indexIds) > + { > + Relation indexRel; > + Oid indOid = lfirst_oid(lc); > + Oid concurrentOid = lfirst_oid(lc2); > + Oid relOid; > + bool primary; > + LOCKTAG *heapLockTag = NULL; > + ListCell *cell; > + > + /* Move to next concurrent item */ > + lc2 = lnext(lc2); > + > + /* Start new transaction for this index concurrent build */ > + StartTransactionCommand(); > + > + /* Get the parent relation Oid */ > + relOid = IndexGetRelation(indOid, false); > + > + /* > + * Find the locktag of parent table for this index, we need to wait for > + * locks on it. > + */ > + foreach(cell, lockTags) > + { > + LOCKTAG *localTag = (LOCKTAG *) lfirst(cell); > + if (relOid == localTag->locktag_field2) > + heapLockTag = localTag; > + } > + > + Assert(heapLockTag && heapLockTag->locktag_field2 != InvalidOid); > + WaitForVirtualLocks(*heapLockTag, ShareLock); Why do we have to do the WaitForVirtualLocks here? Shouldn't we do this once for all relations after each phase? Otherwise the waiting time will really start to hit when you do this on a somewhat busy server. 
> + /* > + * Invalidate the relcache for the table, so that after this commit all > + * sessions will refresh any cached plans taht might reference the index. > + */ > + CacheInvalidateRelcacheByRelid(relOid); I am not sure whether I suggested adding a CacheInvalidateRelcacheByRelid here, but afaics its not required yet, the plan isn't valid yet, so no need for replanning. > + indexRel = index_open(indOid, ShareUpdateExclusiveLock); I wonder we should directly open it exlusive here given its going to opened exclusively in a bit anyway. Not that that will really reduce the deadlock likelihood since we already hold the ShareUpdateExclusiveLock in session mode ... > + /* > + * Phase 5 of REINDEX CONCURRENTLY > + * > + * The old indexes need to be marked as not ready. We need also to wait for > + * transactions that might use them. Each operation is performed with a > + * separate transaction. > + */ > + > + /* Mark the old indexes as not ready */ > + foreach(lc, indexIds) > + { > + LOCKTAG *heapLockTag; > + Oid indOid = lfirst_oid(lc); > + Oid relOid; > + > + StartTransactionCommand(); > + relOid = IndexGetRelation(indOid, false); > + > + /* > + * Find the locktag of parent table for this index, we need to wait for > + * locks on it. > + */ > + foreach(lc2, lockTags) > + { > + LOCKTAG *localTag = (LOCKTAG *) lfirst(lc2); > + if (relOid == localTag->locktag_field2) > + heapLockTag = localTag; > + } > + > + Assert(heapLockTag && heapLockTag->locktag_field2 != InvalidOid); > + > + /* Finish the index invalidation and set it as dead */ > + index_concurrent_set_dead(indOid, relOid, *heapLockTag); > + > + /* Commit this transaction to make the update visible. */ > + CommitTransactionCommand(); > + } No waiting here? > + StartTransactionCommand(); > + > + /* Get fresh snapshot for next step */ > + PushActiveSnapshot(GetTransactionSnapshot()); > + > + /* > + * Phase 6 of REINDEX CONCURRENTLY > + * > + * Drop the old indexes. This needs to be done through performDeletion > + * or related dependencies will not be dropped for the old indexes. The > + * internal mechanism of DROP INDEX CONCURRENTLY is not used as here the > + * indexes are already considered as dead and invalid, so they will not > + * be used by other backends. > + */ > + index_concurrent_drop(indexIds); > + > + /* > + * Last thing to do is release the session-level lock on the parent table > + * and the indexes of table. > + */ > + foreach(lc, relationLocks) > + { > + LockRelId lockRel = * (LockRelId *) lfirst(lc); > + UnlockRelationIdForSession(&lockRel, ShareUpdateExclusiveLock); > + } > + > + /* We can do away with our snapshot */ > + PopActiveSnapshot(); I think I would do the drop in individual transactions as well. More at another time, shouldn't have started doing this now... Greetings, Andres Freund --Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
OK. I am back to this patch after too long a time.
Please find an updated version of the patch attached (v6). I addressed all the previous comments, except regarding the support for REINDEX DATABASE CONCURRENTLY. I am working on that precisely but I am not sure it is that straightforward...
Michael Paquier
http://michael.otacoo.com
On Wed, Dec 19, 2012 at 11:24 AM, Andres Freund <andres@2ndquadrant.com> wrote:
On 2012-12-17 11:44:00 +0900, Michael Paquier wrote:
> Thanks for all your comments.
> The new version (v5) of this patch fixes the error you found when
> reindexing indexes being referenced in foreign keys.
> The fix is done with switchIndexConstraintOnForeignKey:pg_constraint.c, in
> charge of scanning pg_constraint for foreign keys that refer the parent
> relation (confrelid) of the index being swapped and then switch conindid to
> the new index if the old index was referenced.
> This API also takes care of switching the dependency between the foreign
> key and the old index by calling changeDependencyFor.
> I also added a regression test for this purpose.
Ok. Are there no other dependencies towards indexes? I don't know of any
right now, but I have the feeling there were some other cases.
The patch covers the cases of PRIMARY KEY, UNIQUE and normal indexes, exclusion constraints and foreign keys. Just based on the docs, I don't think there is anything missing.
http://www.postgresql.org/docs/9.2/static/ddl-constraints.html
> > * REINDEX DATABASE .. CONCURRENTLY doesn't work, a variant that does the
> > concurrent reindexing for user-tables and non-concurrent for system
> > tables would be very useful. E.g. for the upgrade from 9.1.5->9.1.6...
> >
> OK. I thought that this was out of scope for the time being. I haven't done
> anything about that yet. Supporting that will not be complicated as
> ReindexRelationsConcurrently (new API) is more flexible now, the only thing
> needed is to gather the list of relations that need to be reindexed.
Imo that so greatly reduces the usability of this patch that you should
treat it as in scope ;). Especially as you say, it really shouldn't be
that much work with all the groundwork built.
OK. So... What should we do when a REINDEX DATABASE CONCURRENTLY is done?
- only reindex user tables and bypass system tables?
- reindex user tables concurrently and system tables non-concurrently?
- forbid this operation when it is run on a database having system tables?
Some input?
Btw, the attached version of the patch does not include this feature yet but I am working on it.
> > * would be nice (but thats probably a step #2 thing) to do the
> > individual steps of concurrent reindex over multiple relations to
> > avoid too much overall waiting for other transactions.
> >
> I think I did that by now using one transaction per index for each
> operation except the drop phase...
Without yet having read the new version, I think that's not what I
meant. There currently is a wait for concurrent transactions to end
after most of the phases for every relation, right? If you have a busy
database with somewhat long-running transactions that's going to slow
everything down with waiting quite a bit. I wondered whether it would make
sense to do PHASE1 for all indexes in all relations, then wait once,
then PHASE2...
That obviously has some space and index maintenance overhead issues, but
it's probably sensible anyway in many cases.
OK, phase 1 is done with only one transaction for all the indexes. Do you mean that we should do that with a single transaction for each index?
Isn't the following block content that's mostly available somewhere else
already?
[... doc extract ...]
Yes, this portion of the docs is pretty similar to what can be found in CREATE INDEX CONCURRENTLY. Why not create a new common documentation section that CREATE INDEX CONCURRENTLY and REINDEX CONCURRENTLY could refer to? I think we should first work on the code and then do the docs properly though.
> - if (concurrent && is_exclusion)
> + if (concurrent && is_exclusion && !is_reindex)
> ereport(ERROR,
> (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> errmsg_internal("concurrent index creation for exclusion constraints is not supported")));
This is what I referred to above wrt reindex and CONCURRENTLY. We
shouldn't pass concurrently if we don't deem it to be safe for exlusion
constraints.
So does that mean that it is not possible to create an exclusion constraint in a concurrent context? The code path used by REINDEX CONCURRENTLY permits creating an index in parallel of an existing one, and not a completely new index. Shouldn't this work for indexes used by exclusion constraints also?
> +/*
> + * index_concurrent_drop
> + *
> + * Drop a list of indexes as the last step of a concurrent process. Deletion is
> + * done through performDeletion or dependencies of the index are not dropped.
> + * At this point all the indexes are already considered as invalid and dead so
> + * they can be dropped without using any concurrent options.
> + */
> +void
> +index_concurrent_drop(List *indexIds)
> +{
> + ListCell *lc;
> + ObjectAddresses *objects = new_object_addresses();
> +
> + Assert(indexIds != NIL);
> +
> + /* Scan the list of indexes and build object list for normal indexes */
> + foreach(lc, indexIds)
> + {
> + Oid indexOid = lfirst_oid(lc);
> + Oid constraintOid = get_index_constraint(indexOid);
> + ObjectAddress object;
> +
> + /* Register constraint or index for drop */
> + if (OidIsValid(constraintOid))
> + {
> + object.classId = ConstraintRelationId;
> + object.objectId = constraintOid;
> + }
> + else
> + {
> + object.classId = RelationRelationId;
> + object.objectId = indexOid;
> + }
> +
> + object.objectSubId = 0;
> +
> + /* Add object to list */
> + add_exact_object_address(&object, objects);
> + }
> +
> + /* Perform deletion for normal and toast indexes */
> + performMultipleDeletions(objects,
> + DROP_RESTRICT,
> + 0);
> +}
Just for warm and fuzzy feeling I think it would be a good idea to
recheck that indexes are !indislive here.
OK, done. The indexes with indislive set to true are not bypassed now.
> diff --git a/src/backend/catalog/pg_constraint.c b/src/backend/catalog/pg_constraint.c
> index 5e8c6da..55c092d 100644
> +
> +/*
> + * switchIndexConstraintOnForeignKey
> + *
> + * Switch foreign keys references for a given index to a new index created
> + * concurrently. This process is used when swapping indexes for a concurrent
> + * process. All the constraints that are not referenced externally like primary
> + * keys or unique indexes should be switched using the structure of index.c for
> + * concurrent index creation and drop.
> + * This function takes care of also switching the dependencies of the foreign
> + * key from the old index to the new index in pg_depend.
> + *
> + * In order to complete this process, the following process is done:
> + * 1) Scan pg_constraint and extract the list of foreign keys that refer to the
> + * parent relation of the index being swapped as conrelid.
> + * 2) Check in this list the foreign keys that use the old index as reference
> + * here with conindid
> + * 3) Update field conindid to the new index Oid on all the foreign keys
> + * 4) Switch dependencies of the foreign key to the new index
> + */
> +void
> +switchIndexConstraintOnForeignKey(Oid parentOid,
> + Oid oldIndexOid,
> + Oid newIndexOid)
> +{
> + ScanKeyData skey[1];
> + SysScanDesc conscan;
> + Relation conRel;
> + HeapTuple htup;
> +
> + /*
> + * Search pg_constraint for the foreign key constraints associated
> + * with the index by scanning using conrelid.
> + */
> + ScanKeyInit(&skey[0],
> + Anum_pg_constraint_confrelid,
> + BTEqualStrategyNumber, F_OIDEQ,
> + ObjectIdGetDatum(parentOid));
> +
> + conRel = heap_open(ConstraintRelationId, AccessShareLock);
> + conscan = systable_beginscan(conRel, ConstraintForeignRelidIndexId,
> + true, SnapshotNow, 1, skey);
> +
> + while (HeapTupleIsValid(htup = systable_getnext(conscan)))
> + {
> + Form_pg_constraint contuple = (Form_pg_constraint) GETSTRUCT(htup);
> +
> + /* Check if a foreign constraint uses the index being swapped */
> + if (contuple->contype == CONSTRAINT_FOREIGN &&
> + contuple->confrelid == parentOid &&
> + contuple->conindid == oldIndexOid)
> + {
> + /* Found an index, so update its pg_constraint entry */
> + contuple->conindid = newIndexOid;
> + /* And write it back in place */
> + heap_inplace_update(conRel, htup);
I am pretty doubtful that using heap_inplace_update is the correct thing
to do here. What if we fail later? Even if there's some justification
for it being safe it deserves a big comment.
The other cases where heap_inplace_update is used in the context of
CONCURRENTLY are pretty careful about where to do it and have special
state flags of indicating that this has been done...
Oops, fixed. I changed it to simple_heap_update.
>
> +bool
> +ReindexRelationsConcurrently(List *relationIds)
> +{
> + foreach(lc, relationIds)
> + {
> + Oid relationOid = lfirst_oid(lc);
> +
> + switch (get_rel_relkind(relationOid))
> + {
> + case RELKIND_RELATION:
> + {
> + /*
> + * In the case of a relation, find all its indexes
> + * including toast indexes.
> + */
> + Relation heapRelation = heap_open(relationOid,
> + ShareUpdateExclusiveLock);
> +
> + /* Relation on which is based index cannot be shared */
> + if (heapRelation->rd_rel->relisshared)
> + ereport(ERROR,
> + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> + errmsg("concurrent reindex is not supported for shared relations")));
> +
> + /* Add all the valid indexes of relation to list */
> + foreach(lc2, RelationGetIndexList(heapRelation))
> + {
> + Oid cellOid = lfirst_oid(lc2);
> + Relation indexRelation = index_open(cellOid,
> + ShareUpdateExclusiveLock);
> +
> + if (!indexRelation->rd_index->indisvalid)
> + ereport(WARNING,
> + (errcode(ERRCODE_INDEX_CORRUPTED),
> + errmsg("cannot reindex concurrently invalid index \"%s.%s\", bypassing",
> + get_namespace_name(get_rel_namespace(cellOid)),
> + get_rel_name(cellOid))));
> + else
> + indexIds = list_append_unique_oid(indexIds,
> + cellOid);
> +
> + index_close(indexRelation, ShareUpdateExclusiveLock);
> + }
Why are we releasing the locks here if we are going to reindex the
relations? They might change inbetween. I think we should take an
appropriate lock here, including the locks on the parent relations. Yes,
its slightly more duplicative code, and not acquiring locks multiple
times is somewhat complicated, but I think its required.
OK, the locks are now maintained until the end of the transaction, by which point the session locks on those relations have been taken, so it will not be possible to have schema changes between the moment the list of indexes is built and the moment the session locks are taken.
I think you should also explicitly do the above in a transaction...
I am not sure I get your point here. This phase is in place to gather the list of all the indexes to reindex, based on the list of relations given by the caller.
> + /*
> + * Phase 2 of REINDEX CONCURRENTLY
> + *
> + * Build concurrent indexes in a separate transaction for each index to
> + * avoid having open transactions for an unnecessary long time. We also
> + * need to wait until no running transactions could have the parent table
> + * of index open. A concurrent build is done for each concurrent
> + * index that will replace the old indexes.
> + */
> +
> + /* Get the first element of concurrent index list */
> + lc2 = list_head(concurrentIndexIds);
> +
> + foreach(lc, indexIds)
> + {
> + Relation indexRel;
> + Oid indOid = lfirst_oid(lc);
> + Oid concurrentOid = lfirst_oid(lc2);
> + Oid relOid;
> + bool primary;
> + LOCKTAG *heapLockTag = NULL;
> + ListCell *cell;
> +
> + /* Move to next concurrent item */
> + lc2 = lnext(lc2);
> +
> + /* Start new transaction for this index concurrent build */
> + StartTransactionCommand();
> +
> + /* Get the parent relation Oid */
> + relOid = IndexGetRelation(indOid, false);
> +
> + /*
> + * Find the locktag of parent table for this index, we need to wait for
> + * locks on it.
> + */
> + foreach(cell, lockTags)
> + {
> + LOCKTAG *localTag = (LOCKTAG *) lfirst(cell);
> + if (relOid == localTag->locktag_field2)
> + heapLockTag = localTag;
> + }
> +
> + Assert(heapLockTag && heapLockTag->locktag_field2 != InvalidOid);
> + WaitForVirtualLocks(*heapLockTag, ShareLock);
Why do we have to do the WaitForVirtualLocks here? Shouldn't we do this
once for all relations after each phase? Otherwise the waiting time will
really start to hit when you do this on a somewhat busy server.
Each new index is built and set as ready in its own single transaction, so doesn't it make sense to wait for the parent relation each time? It is possible to wait for a parent relation only once during this phase, but in this case all the indexes of the same relation need to be set as ready in the same transaction. So here the choice is either to wait for the same relation multiple times (once per index), or to wait once per parent relation but build all its concurrent indexes within the same transaction. Choice 1 makes the code clearer and more robust to my mind, as phase 2 is done clearly for each index separately. Thoughts?
> + /*
> + * Invalidate the relcache for the table, so that after this commit all
> + * sessions will refresh any cached plans taht might reference the index.
> + */
> + CacheInvalidateRelcacheByRelid(relOid);
I am not sure whether I suggested adding a
CacheInvalidateRelcacheByRelid here, but afaics its not required yet,
the plan isn't valid yet, so no need for replanning.
Sure I removed it.
> + indexRel = index_open(indOid, ShareUpdateExclusiveLock);
I wonder whether we should directly open it exclusive here given it's going to be
opened exclusively in a bit anyway. Not that that will really reduce the
deadlock likelihood since we already hold the ShareUpdateExclusiveLock
in session mode ...
I tried to use an AccessExclusiveLock here but it happens that this is not compatible with index_set_state_flags. Does taking an exclusive lock increment the transaction ID of the running transaction? Because what I am seeing is that taking an AccessExclusiveLock on this index causes a transaction ID to be assigned.
For those reasons the current code sticks with ShareUpdateExclusiveLock. Not a big deal btw...
> + /*
> + * Phase 5 of REINDEX CONCURRENTLY
> + *
> + * The old indexes need to be marked as not ready. We need also to wait for
> + * transactions that might use them. Each operation is performed with a
> + * separate transaction.
> + */
> +
> + /* Mark the old indexes as not ready */
> + foreach(lc, indexIds)
> + {
> + LOCKTAG *heapLockTag;
> + Oid indOid = lfirst_oid(lc);
> + Oid relOid;
> +
> + StartTransactionCommand();
> + relOid = IndexGetRelation(indOid, false);
> +
> + /*
> + * Find the locktag of parent table for this index, we need to wait for
> + * locks on it.
> + */
> + foreach(lc2, lockTags)
> + {
> + LOCKTAG *localTag = (LOCKTAG *) lfirst(lc2);
> + if (relOid == localTag->locktag_field2)
> + heapLockTag = localTag;
> + }
> +
> + Assert(heapLockTag && heapLockTag->locktag_field2 != InvalidOid);
> +
> + /* Finish the index invalidation and set it as dead */
> + index_concurrent_set_dead(indOid, relOid, *heapLockTag);
> +
> + /* Commit this transaction to make the update visible. */
> + CommitTransactionCommand();
> + }
No waiting here?
A wait phase is done inside index_concurrent_set_dead, so no problem.
> + StartTransactionCommand();
> +
> + /* Get fresh snapshot for next step */
> + PushActiveSnapshot(GetTransactionSnapshot());
> +
> + /*
> + * Phase 6 of REINDEX CONCURRENTLY
> + *
> + * Drop the old indexes. This needs to be done through performDeletion
> + * or related dependencies will not be dropped for the old indexes. The
> + * internal mechanism of DROP INDEX CONCURRENTLY is not used as here the
> + * indexes are already considered as dead and invalid, so they will not
> + * be used by other backends.
> + */
> + index_concurrent_drop(indexIds);
> +
> + /*
> + * Last thing to do is release the session-level lock on the parent table
> + * and the indexes of table.
> + */
> + foreach(lc, relationLocks)
> + {
> + LockRelId lockRel = * (LockRelId *) lfirst(lc);
> + UnlockRelationIdForSession(&lockRel, ShareUpdateExclusiveLock);
> + }
> +
> + /* We can do away with our snapshot */
> + PopActiveSnapshot();
I think I would do the drop in individual transactions as well.
Done. Each drop is now done in its own transaction.
Michael Paquier
http://michael.otacoo.com
Hi,
Please find attached v7 of this patch, adding support for REINDEX DATABASE CONCURRENTLY.
When using REINDEX DATABASE with CONCURRENTLY, non-system tables are reindexed concurrently and system tables are reindexed in the normal way, ie non-concurrently.
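In other words, with v7 something like the following is expected to work, assuming the same CONCURRENTLY placement as for TABLE and INDEX (the database name is a placeholder):

REINDEX DATABASE CONCURRENTLY my_database;
-- user tables go through the concurrent machinery;
-- system catalogs fall back to a plain, locking reindex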
Thanks,
--
Michael Paquier
http://michael.otacoo.com
On 2013-01-15 18:16:59 +0900, Michael Paquier wrote: > OK. I am back to this patch after a too long time. Dito ;) > > > > * would be nice (but thats probably a step #2 thing) to do the > > > > individual steps of concurrent reindex over multiple relations to > > > > avoid too much overall waiting for other transactions. > > > > > > > I think I did that by now using one transaction per index for each > > > operation except the drop phase... > > > > Without yet having read the new version, I think thats not what I > > meant. There currently is a wait for concurrent transactions to end > > after most of the phases for every relation, right? If you have a busy > > database with somewhat longrunning transactions thats going to slow > > everything down with waiting quite bit. I wondered whether it would make > > sense to do PHASE1 for all indexes in all relations, then wait once, > > then PHASE2... > > That obviously has some space and index maintainece overhead issues, but > > its probably sensible anyway in many cases. > > > OK, phase 1 is done with only one transaction for all the indexes. Do you > mean that we should do that with a single transaction for each index? Yes. > > Isn't the following block content thats mostly available somewhere else > > already? > > [... doc extract ...] > > > Yes, this portion of the docs is pretty similar to what is findable in > CREATE INDEX CONCURRENTLY. Why not creating a new common documentation > section that CREATE INDEX CONCURRENTLY and REINDEX CONCURRENTLY could refer > to? I think we should first work on the code and then do the docs properly > though. Agreed. I just noticed it when scrolling through the patch. > > > - if (concurrent && is_exclusion) > > > + if (concurrent && is_exclusion && !is_reindex) > > > ereport(ERROR, > > > (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > > > errmsg_internal("concurrent index > > creation for exclusion constraints is not supported"))); > > > > This is what I referred to above wrt reindex and CONCURRENTLY. We > > shouldn't pass concurrently if we don't deem it to be safe for exlusion > > constraints. > > > So does that mean that it is not possible to create an exclusive constraint > in a concurrent context? Yes, its currently not safe in the general case. > Code path used by REINDEX concurrently permits to > create an index in parallel of an existing one and not a completely new > index. Shouldn't this work for indexes used by exclusion indexes also? But that fact might safe things. I don't immediately see any reason that adding a if (!indisvalid) return; to check_exclusion_constraint wouldn't be sufficient if there's another index with an equivalent definition. > > > + /* > > > + * Phase 2 of REINDEX CONCURRENTLY > > > + * > > > + * Build concurrent indexes in a separate transaction for each > > index to > > > + * avoid having open transactions for an unnecessary long time. > > We also > > > + * need to wait until no running transactions could have the > > parent table > > > + * of index open. A concurrent build is done for each concurrent > > > + * index that will replace the old indexes. 
> > > + */ > > > + > > > + /* Get the first element of concurrent index list */ > > > + lc2 = list_head(concurrentIndexIds); > > > + > > > + foreach(lc, indexIds) > > > + { > > > + Relation indexRel; > > > + Oid indOid = lfirst_oid(lc); > > > + Oid concurrentOid = lfirst_oid(lc2); > > > + Oid relOid; > > > + bool primary; > > > + LOCKTAG *heapLockTag = NULL; > > > + ListCell *cell; > > > + > > > + /* Move to next concurrent item */ > > > + lc2 = lnext(lc2); > > > + > > > + /* Start new transaction for this index concurrent build */ > > > + StartTransactionCommand(); > > > + > > > + /* Get the parent relation Oid */ > > > + relOid = IndexGetRelation(indOid, false); > > > + > > > + /* > > > + * Find the locktag of parent table for this index, we > > need to wait for > > > + * locks on it. > > > + */ > > > + foreach(cell, lockTags) > > > + { > > > + LOCKTAG *localTag = (LOCKTAG *) lfirst(cell); > > > + if (relOid == localTag->locktag_field2) > > > + heapLockTag = localTag; > > > + } > > > + > > > + Assert(heapLockTag && heapLockTag->locktag_field2 != > > InvalidOid); > > > + WaitForVirtualLocks(*heapLockTag, ShareLock); > > > > Why do we have to do the WaitForVirtualLocks here? Shouldn't we do this > > once for all relations after each phase? Otherwise the waiting time will > > really start to hit when you do this on a somewhat busy server. > > > Each new index is built and set as ready in a separate single transaction, > so doesn't it make sense to wait for the parent relation each time. It is > possible to wait for a parent relation only once during this phase but in > this case all the indexes of the same relation need to be set as ready in > the same transaction. So here the choice is either to wait for the same > relation multiple times for a single index or wait once for a parent > relation but we build all the concurrent indexes within the same > transaction. Choice 1 makes the code clearer and more robust to my mind as > the phase 2 is done clearly for each index separately. Thoughts? As far as I understand that code its purpose is to enforce that all potential users have an up2date definition available. For that we acquire a lock on all virtualxids of users using that table thus waiting for them to finish. Consider the scenario where you have a workload where most transactions are fairly long (say 10min) and use the same tables (a,b)/indexes(a_1, a_2, b_1, b_2). With the current strategy you will do: WaitForVirtualLocks(a_1) -- wait up to 10min index_build(a_1) WaitForVirtualLocks(a_2) -- wait up to 10min index_build(a_2) ... So instead of waiting up 10 minutes for that phase you have to wait up to 40. > > > + indexRel = index_open(indOid, ShareUpdateExclusiveLock); > > > > I wonder we should directly open it exlusive here given its going to > > opened exclusively in a bit anyway. Not that that will really reduce the > > deadlock likelihood since we already hold the ShareUpdateExclusiveLock > > in session mode ... > > > I tried to use an AccessExclusiveLock here but it happens that this is not > compatible with index_set_state_flags. Does taking an exclusive lock > increments the transaction ID of running transaction? Because what I am > seeing is that taking AccessExclusiveLock on this index does a transaction > update. Yep, it does when wal_level = hot_standby because it logs the exclusive lock to wal so the startup process on the standby can acquire it. Imo that Assert needs to be moved to the existing callsites if there isn't an equivalent one already. 
> For those reasons current code sticks with ShareUpdateExclusiveLock. Not a > big deal btw... Well, lock upgrades make deadlocks more likely. Ok, of to v7: + */ +void +index_concurrent_swap(Oid newIndexOid, Oid oldIndexOid) ... + /* + * Take a lock on the old and new index before switching their names. This + * avoids having index swapping relying on relation renaming mechanism to + * get a lock on the relations involved. + */ + oldIndexRel = relation_open(oldIndexOid, AccessExclusiveLock); + newIndexRel = relation_open(newIndexOid, AccessExclusiveLock); .. + /* + * If the index swapped is a toast index, take an exclusive lock on its + * parent toast relation and then update reltoastidxid to the new index Oid + * value. + */ + if (get_rel_relkind(parentOid) == RELKIND_TOASTVALUE) + { + Relation pg_class; + + /* Open pg_class and fetch a writable copy of the relation tuple */ + pg_class = heap_open(parentOid, RowExclusiveLock); + + /* Update the statistics of this pg_class entry with new toast index Oid */ + index_update_stats(pg_class, false, false, newIndexOid, -1.0); + + /* Close parent relation */ + heap_close(pg_class, RowExclusiveLock); + } ISTM the RowExclusiveLock on the toast table should be acquired before the locks on the indexes. +index_concurrent_set_dead(Oid indexId, Oid heapId, LOCKTAG locktag) +{ + Relation heapRelation; + Relation indexRelation; + + /* + * Now we must wait until no running transaction could be using the + * index for a query. To do this, inquire which xacts currently would + * conflict with AccessExclusiveLock on the table -- ie, which ones + * have a lock of any kind on the table. Then wait for each of these + * xacts to commit or abort. Note we do not need to worry about xacts + * that open the table for reading after this point; they will see the + * index as invalid when they open the relation. + * + * Note: the reason we use actual lock acquisition here, rather than + * just checking the ProcArray and sleeping, is that deadlock is + * possible if one of the transactions in question is blocked trying + * to acquire an exclusive lock on our table. The lock code will + * detect deadlock and error out properly. + * + * Note: GetLockConflicts() never reports our own xid, hence we need + * not check for that. Also, prepared xacts are not reported, which + * is fine since they certainly aren't going to do anything more. + */ + WaitForVirtualLocks(locktag, AccessExclusiveLock); Most of that comment seems to belong to WaitForVirtualLocks instead of this specific caller of WaitForVirtualLocks. A comment in the header that it is doing the waiting would also be good. In ReindexRelationsConcurrently I suggest s/bypassing/skipping/. Btw, seing that we have an indisvalid check the toast table's index, do we have any way to cleanup such a dead index? I don't think its allowed to drop the index of a toast table. I.e. we possibly need to relax that check for invalid indexes :/. I think the usage of list_append_unique_oids in ReindexRelationsConcurrently might get too expensive in larger schemas. Its O(n^2) in the current usage and schemas with lots of relations/indexes aren't unlikely candidates for this feature. The easist solution probably is to use a hashtable. ReindexRelationsConcurrently should do a CHECK_FOR_INTERRUPTS() every once in a while, its currently not gracefully interruptible which probably is bad in a bigger schema. Thats all I have for now. This patch is starting to look seriously cool and it seems realistic to get into a ready state for 9.3. 
I somewhat dislike the fact that CONCURRENTLY isn't really concurrent
here (for the listeners: swapping the indexes acquires exclusive locks),
but I don't see any other naming being better.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund wrote:
> I somewhat dislike the fact that CONCURRENTLY isn't really concurrent
> here (for the listeners: swapping the indexes acquires exclusive locks),
> but I don't see any other naming being better.

REINDEX ALMOST CONCURRENTLY?

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 24/01/13 07:45, Alvaro Herrera wrote:
> Andres Freund wrote:
>
>> I somewhat dislike the fact that CONCURRENTLY isn't really concurrent
>> here (for the listeners: swapping the indexes acquires exclusive locks),
>> but I don't see any other naming being better.
> REINDEX ALMOST CONCURRENTLY?
>
REINDEX BEST EFFORT CONCURRENTLY?
On Wed, Jan 23, 2013 at 1:45 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Andres Freund wrote:
>> I somewhat dislike the fact that CONCURRENTLY isn't really concurrent
>> here (for the listeners: swapping the indexes acquires exclusive locks),
>> but I don't see any other naming being better.
>
> REINDEX ALMOST CONCURRENTLY?

I'm kind of unconvinced of the value proposition of this patch.  I
mean, you can DROP INDEX CONCURRENTLY and CREATE INDEX CONCURRENTLY
today, so ... how is this better?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Jan 24, 2013 at 01:29:56PM -0500, Robert Haas wrote:
> On Wed, Jan 23, 2013 at 1:45 PM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
> > Andres Freund wrote:
> >> I somewhat dislike the fact that CONCURRENTLY isn't really concurrent
> >> here (for the listeners: swapping the indexes acquires exclusive locks),
> >> but I don't see any other naming being better.
> >
> > REINDEX ALMOST CONCURRENTLY?
>
> I'm kind of unconvinced of the value proposition of this patch.  I
> mean, you can DROP INDEX CONCURRENTLY and CREATE INDEX CONCURRENTLY
> today, so ... how is this better?

This has been on the TODO list for a while, and I don't think the
renaming in a transaction work needed to use drop/create is really
something we want to force on users.  In addition, doing that for all
tables in a database is even more work, so I would be disappointed _not_
to get this feature in 9.3.

--
Bruce Momjian  <bruce@momjian.us>        http://momjian.us
EnterpriseDB                             http://enterprisedb.com

+ It's impossible for everything to be true. +
Bruce Momjian <bruce@momjian.us> writes:
> On Thu, Jan 24, 2013 at 01:29:56PM -0500, Robert Haas wrote:
>> I'm kind of unconvinced of the value proposition of this patch.  I
>> mean, you can DROP INDEX CONCURRENTLY and CREATE INDEX CONCURRENTLY
>> today, so ... how is this better?

> This has been on the TODO list for a while, and I don't think the
> renaming in a transaction work needed to use drop/create is really
> something we want to force on users.  In addition, doing that for all
> tables in a database is even more work, so I would be disappointed _not_
> to get this feature in 9.3.

I haven't given the current patch a look, but based on previous
discussions, this isn't going to be more than a macro for things that
users can do already --- that is, it's going to be basically DROP
CONCURRENTLY plus CREATE CONCURRENTLY plus ALTER INDEX RENAME, including
the fact that the RENAME step will transiently need an exclusive lock.
(If that's not what it's doing, it's broken.)  So there's some
convenience argument for it, but it's hardly amounting to a stellar
improvement.

I'm kind of inclined to put it off till after we fix the SnapshotNow
race condition problems; at that point it should be possible to do
REINDEX CONCURRENTLY more simply and without any exclusive lock
anywhere.

			regards, tom lane
On 2013-01-24 13:29:56 -0500, Robert Haas wrote:
> On Wed, Jan 23, 2013 at 1:45 PM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
> > Andres Freund wrote:
> >> I somewhat dislike the fact that CONCURRENTLY isn't really concurrent
> >> here (for the listeners: swapping the indexes acquires exclusive locks),
> >> but I don't see any other naming being better.
> >
> > REINDEX ALMOST CONCURRENTLY?
>
> I'm kind of unconvinced of the value proposition of this patch.  I
> mean, you can DROP INDEX CONCURRENTLY and CREATE INDEX CONCURRENTLY
> today, so ... how is this better?

In the wake of beb850e1d873f8920a78b9b9ee27e9f87c95592f I wrote a script
to do this and it really is harder than one might think:

* you cannot do it in the database as CONCURRENTLY cannot be used in a TX
* you cannot do it to toast tables (this is currently broken in the patch
  but should be fixable)
* you cannot legally do it when foreign keys reference your unique key
* you cannot do it to exclusion constraints or non-immediate indexes

All of those are fixable (and most are) within REINDEX CONCURRENTLY, so I
find that to be a major feature even if its not as good as it could be.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
All the comments are addressed in version 8 attached, except for the hashtable part, which requires some heavy changes.
On Thu, Jan 24, 2013 at 3:41 AM, Andres Freund <andres@2ndquadrant.com> wrote:
On 2013-01-15 18:16:59 +0900, Michael Paquier wrote:
> Code path used by REINDEX concurrently permits to
> create an index in parallel of an existing one and not a completely new
> index. Shouldn't this work for indexes used by exclusion indexes also?
But that fact might save things. I don't immediately see any reason that
adding a
    if (!indisvalid)
        return;
to check_exclusion_constraint wouldn't be sufficient if there's another
index with an equivalent definition.
Indeed, this might be enough, as this code path cannot be taken for CREATE INDEX CONCURRENTLY and only indexes created concurrently can be invalid. Hence I am adding that check in the patch with a comment explaining why.
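Fleshed out, the suggested check could look roughly like the following (a sketch only; the exact placement inside check_exclusion_constraint and the way indisvalid is reached are assumptions, not the literal patch code):

    /*
     * Sketch: skip the exclusion check on an index that is not (yet) valid;
     * while REINDEX CONCURRENTLY is running, the valid twin of this index
     * still enforces the constraint.  "index" is assumed to be the already
     * opened Relation of the index being checked.
     */
    if (!index->rd_index->indisvalid)
        return;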
> > > + /*
> > > + * Phase 2 of REINDEX CONCURRENTLY
> > > + *
> > > + * Build concurrent indexes in a separate transaction for each
> > index to
> > > + * avoid having open transactions for an unnecessary long time.
> > We also
> > > + * need to wait until no running transactions could have the
> > parent table
> > > + * of index open. A concurrent build is done for each concurrent
> > > + * index that will replace the old indexes.
> > > + */
> > > +
> > > + /* Get the first element of concurrent index list */
> > > + lc2 = list_head(concurrentIndexIds);
> > > +
> > > + foreach(lc, indexIds)
> > > + {
> > > + Relation indexRel;
> > > + Oid indOid = lfirst_oid(lc);
> > > + Oid concurrentOid = lfirst_oid(lc2);
> > > + Oid relOid;
> > > + bool primary;
> > > + LOCKTAG *heapLockTag = NULL;
> > > + ListCell *cell;
> > > +
> > > + /* Move to next concurrent item */
> > > + lc2 = lnext(lc2);
> > > +
> > > + /* Start new transaction for this index concurrent build */
> > > + StartTransactionCommand();
> > > +
> > > + /* Get the parent relation Oid */
> > > + relOid = IndexGetRelation(indOid, false);
> > > +
> > > + /*
> > > + * Find the locktag of parent table for this index, we
> > need to wait for
> > > + * locks on it.
> > > + */
> > > + foreach(cell, lockTags)
> > > + {
> > > + LOCKTAG *localTag = (LOCKTAG *) lfirst(cell);
> > > + if (relOid == localTag->locktag_field2)
> > > + heapLockTag = localTag;
> > > + }
> > > +
> > > + Assert(heapLockTag && heapLockTag->locktag_field2 !=
> > InvalidOid);
> > > + WaitForVirtualLocks(*heapLockTag, ShareLock);
> >
> > Why do we have to do the WaitForVirtualLocks here? Shouldn't we do this
> > once for all relations after each phase? Otherwise the waiting time will
> > really start to hit when you do this on a somewhat busy server.
> >
> Each new index is built and set as ready in a separate single transaction,
> so doesn't it make sense to wait for the parent relation each time. It is
> possible to wait for a parent relation only once during this phase but in
> this case all the indexes of the same relation need to be set as ready in
> the same transaction. So here the choice is either to wait for the same
> relation multiple times for a single index or wait once for a parent
> relation but we build all the concurrent indexes within the same
> transaction. Choice 1 makes the code clearer and more robust to my mind as
> the phase 2 is done clearly for each index separately. Thoughts?
As far as I understand that code its purpose is to enforce that all
potential users have an up2date definition available. For that we
acquire a lock on all virtualxids of users using that table thus waiting
for them to finish.
Consider the scenario where you have a workload where most transactions
are fairly long (say 10min) and use the same tables (a,b)/indexes(a_1,
a_2, b_1, b_2). With the current strategy you will do:
WaitForVirtualLocks(a_1) -- wait up to 10min
index_build(a_1)
WaitForVirtualLocks(a_2) -- wait up to 10min
index_build(a_2)
...
So instead of waiting up 10 minutes for that phase you have to wait up
to 40.
This is necessary if you want to process each index in a different transaction, as WaitForVirtualLocks needs to wait for the locks held on the parent table. If you want to do this wait only once per relation, the solution would be to group the index builds in the same transaction for all the indexes of the relation. One index per transaction looks more robust to me: if there is a failure during the process, only one index will be incorrectly built. Also, when you run a REINDEX CONCURRENTLY, you should not need to worry about the time it takes. The point is that this operation is done in the background and the tables remain accessible during this time.
> > > + indexRel = index_open(indOid, ShareUpdateExclusiveLock);
> >
> > I wonder we should directly open it exlusive here given its going to
> > opened exclusively in a bit anyway. Not that that will really reduce the
> > deadlock likelihood since we already hold the ShareUpdateExclusiveLock
> > in session mode ...
> >
> I tried to use an AccessExclusiveLock here but it happens that this is not
> compatible with index_set_state_flags. Does taking an exclusive lock
> increments the transaction ID of running transaction? Because what I am
> seeing is that taking AccessExclusiveLock on this index does a transaction
> update.
Yep, it does when wal_level = hot_standby because it logs the exclusive
lock to wal so the startup process on the standby can acquire it.
Imo that Assert needs to be moved to the existing callsites if there
isn't an equivalent one already.
OK. Leaving the assertion inside index_set_state_flags makes the code more consistent with CREATE INDEX CONCURRENTLY, so the existing behavior is fine.
> For those reasons current code sticks with ShareUpdateExclusiveLock. Not a
> big deal btw...
Well, lock upgrades make deadlocks more likely.
Ok, off to v7:
+ */
+void
+index_concurrent_swap(Oid newIndexOid, Oid oldIndexOid)
...
+ /*
+ * Take a lock on the old and new index before switching their names. This
+ * avoids having index swapping relying on relation renaming mechanism to
+ * get a lock on the relations involved.
+ */
+ oldIndexRel = relation_open(oldIndexOid, AccessExclusiveLock);
+ newIndexRel = relation_open(newIndexOid, AccessExclusiveLock);
..
+ /*
+ * If the index swapped is a toast index, take an exclusive lock
on its
+ * parent toast relation and then update reltoastidxid to the
new index Oid
+ * value.
+ */
+ if (get_rel_relkind(parentOid) == RELKIND_TOASTVALUE)
+ {
+ Relation pg_class;
+
+ /* Open pg_class and fetch a writable copy of the relation tuple */
+ pg_class = heap_open(parentOid, RowExclusiveLock);
+
+ /* Update the statistics of this pg_class entry with new toast index Oid */
+ index_update_stats(pg_class, false, false, newIndexOid, -1.0);
+
+ /* Close parent relation */
+ heap_close(pg_class, RowExclusiveLock);
+ }
ISTM the RowExclusiveLock on the toast table should be acquired before
the locks on the indexes.
Done.
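For the record, the reordering could look roughly like this (a sketch; "parentRel" and the surrounding control flow are assumptions, not the v8 code):

    /*
     * Lock the toast parent first, then the indexes being swapped, so the
     * lock on the parent relation is never requested after the stronger
     * index locks are already held.
     */
    if (get_rel_relkind(parentOid) == RELKIND_TOASTVALUE)
        parentRel = heap_open(parentOid, RowExclusiveLock);

    oldIndexRel = relation_open(oldIndexOid, AccessExclusiveLock);
    newIndexRel = relation_open(newIndexOid, AccessExclusiveLock);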
+index_concurrent_set_dead(Oid indexId, Oid heapId, LOCKTAG locktag)
+{
+ Relation heapRelation;
+ Relation indexRelation;
+
+ /*
+ * Now we must wait until no running transaction could be using the
+ * index for a query. To do this, inquire which xacts currently would
+ * conflict with AccessExclusiveLock on the table -- ie, which ones
+ * have a lock of any kind on the table. Then wait for each of these
+ * xacts to commit or abort. Note we do not need to worry about xacts
+ * that open the table for reading after this point; they will see the
+ * index as invalid when they open the relation.
+ *
+ * Note: the reason we use actual lock acquisition here, rather than
+ * just checking the ProcArray and sleeping, is that deadlock is
+ * possible if one of the transactions in question is blocked trying
+ * to acquire an exclusive lock on our table. The lock code will
+ * detect deadlock and error out properly.
+ *
+ * Note: GetLockConflicts() never reports our own xid, hence we need
+ * not check for that. Also, prepared xacts are not reported, which
+ * is fine since they certainly aren't going to do anything more.
+ */
+ WaitForVirtualLocks(locktag, AccessExclusiveLock);
Most of that comment seems to belong to WaitForVirtualLocks instead of
this specific caller of WaitForVirtualLocks.
Done.
A comment in the header that it is doing the waiting would also be good.
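To illustrate what the helper plus the relocated comment could end up looking like (my own sketch built from the lmgr primitives that CREATE/DROP INDEX CONCURRENTLY already use, not a verbatim excerpt of the patch):

/*
 * WaitForVirtualLocks
 *
 * Wait until no running transaction holds a lock that would conflict with
 * a lock of the given mode on the relation identified by "heaplocktag".
 * We ask the lock manager which virtual transactions would conflict and
 * then wait on each of them.  Using real lock acquisition, rather than
 * polling the ProcArray and sleeping, lets the deadlock detector do its
 * job if one of those transactions is itself blocked on us.
 */
void
WaitForVirtualLocks(LOCKTAG heaplocktag, LOCKMODE lockmode)
{
    VirtualTransactionId *old_lockholders;

    old_lockholders = GetLockConflicts(&heaplocktag, lockmode);

    while (VirtualTransactionIdIsValid(*old_lockholders))
    {
        VirtualXactLock(*old_lockholders, true);
        old_lockholders++;
    }
}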
In ReindexRelationsConcurrently I suggest s/bypassing/skipping/.
Done.
Btw, seeing that we have an indisvalid check on the toast table's index, do
we have any way to cleanup such a dead index? I don't think its allowed
to drop the index of a toast table. I.e. we possibly need to relax that
check for invalid indexes :/.
For the time being, no, I don't think so, except by doing a manual cleanup and removing the invalid pg_class entry from the catalogs. One way to do that cleanly could be to have autovacuum remove the invalid toast indexes automatically, but it is not dedicated to that and this is another discussion.
I think the usage of list_append_unique_oids in
ReindexRelationsConcurrently might get too expensive in larger
schemas. Its O(n^2) in the current usage and schemas with lots of
relations/indexes aren't unlikely candidates for this feature.
The easist solution probably is to use a hashtable.
Hum... This requires some thinking that will change the basics inside ReindexRelationsConcurrently...
Let me play a bit with the hashtable APIs and I'll come back to that later.
ReindexRelationsConcurrently should do a CHECK_FOR_INTERRUPTS() every
once in a while, its currently not gracefully interruptible which
probably is bad in a bigger schema.
Done. I added some checks at each phase before beginning a new transaction.
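Just to show the shape of it, an illustrative fragment (not the exact patch hunk):

    foreach(lc, indexIds)
    {
        /*
         * Let a pending cancel or terminate request abort the reindex
         * cleanly between transactions rather than in the middle of an
         * index build.
         */
        CHECK_FOR_INTERRUPTS();

        StartTransactionCommand();
        /* ... phase work for this index ... */
        CommitTransactionCommand();
    }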
Michael Paquier
http://michael.otacoo.com
On Thu, Jan 24, 2013 at 3:41 AM, Andres Freund <andres@2ndquadrant.com> wrote:
I think the usage of list_append_unique_oids in
ReindexRelationsConcurrently might get too expensive in larger
schemas. Its O(n^2) in the current usage and schemas with lots of
relations/indexes aren't unlikely candidates for this feature.
The easist solution probably is to use a hashtable.
I just had a look at the hashtable APIs and I do not think they are well adapted to establishing the list of unique index OIDs that need to be built concurrently. They would be of better use for mapping the index OIDs to something else, like the concurrent OIDs, but even with that the code would be more readable if left as is.
--
http://michael.otacoo.com
On 2013-01-25 14:11:39 +0900, Michael Paquier wrote:
> On Thu, Jan 24, 2013 at 3:41 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>
> > I think the usage of list_append_unique_oids in
> > ReindexRelationsConcurrently might get too expensive in larger
> > schemas. Its O(n^2) in the current usage and schemas with lots of
> > relations/indexes aren't unlikely candidates for this feature.
> > The easist solution probably is to use a hashtable.
> >
> I just had a look at the hashtable APIs and I do not think it is adapted to
> establish the list of unique index OIDs that need to be built concurrently.
> It would be of a better use in case of mapping the indexOids with something
> else, like the concurrent Oids, but still even with that the code would be
> more readable if let as is.

It sure isn't optimal, but it should do the trick if you use the
hash_seq stuff to iterate the hash afterwards. And you could use it to
map to the respective locks et al.

If you prefer other ways to implement it I guess the other easy solution
is to add the values without preventing duplicates and then sort &
remove duplicates in the end. Probably ends up being slightly more code,
but I am not sure.

I don't think we can leave the quadratic part in there as-is.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2013-01-25 13:48:50 +0900, Michael Paquier wrote: > All the comments are addressed in version 8 attached, except for the > hashtable part, which requires some heavy changes. > > On Thu, Jan 24, 2013 at 3:41 AM, Andres Freund <andres@2ndquadrant.com>wrote: > > > On 2013-01-15 18:16:59 +0900, Michael Paquier wrote: > > > Code path used by REINDEX concurrently permits to > > > create an index in parallel of an existing one and not a completely new > > > index. Shouldn't this work for indexes used by exclusion indexes also? > > > > But that fact might safe things. I don't immediately see any reason that > > adding a > > if (!indisvalid) > > return; > > to check_exclusion_constraint wouldn't be sufficient if there's another > > index with an equivalent definition. > > > Indeed, this might be enough as for CREATE INDEX CONCURRENTLY this code > path cannot be taken and only indexes created concurrently can be invalid. > Hence I am adding that in the patch with a comment explaining why. I don't really know anything about those mechanics, so some input from somebody who does would be very much appreciated. > > > > > + /* > > > > > + * Phase 2 of REINDEX CONCURRENTLY > > > > > + */ > > > > > + > > > > > + /* Get the first element of concurrent index list */ > > > > > + lc2 = list_head(concurrentIndexIds); > > > > > + > > > > > + foreach(lc, indexIds) > > > > > + { > > > > > + WaitForVirtualLocks(*heapLockTag, ShareLock); > > > > > > > > Why do we have to do the WaitForVirtualLocks here? Shouldn't we do this > > > > once for all relations after each phase? Otherwise the waiting time will > > > > really start to hit when you do this on a somewhat busy server. > > > > > > > Each new index is built and set as ready in a separate single > > transaction, > > > so doesn't it make sense to wait for the parent relation each time. It is > > > possible to wait for a parent relation only once during this phase but in > > > this case all the indexes of the same relation need to be set as ready in > > > the same transaction. So here the choice is either to wait for the same > > > relation multiple times for a single index or wait once for a parent > > > relation but we build all the concurrent indexes within the same > > > transaction. Choice 1 makes the code clearer and more robust to my mind > > as > > > the phase 2 is done clearly for each index separately. Thoughts? > > > > As far as I understand that code its purpose is to enforce that all > > potential users have an up2date definition available. For that we > > acquire a lock on all virtualxids of users using that table thus waiting > > for them to finish. > > Consider the scenario where you have a workload where most transactions > > are fairly long (say 10min) and use the same tables (a,b)/indexes(a_1, > > a_2, b_1, b_2). With the current strategy you will do: > > > > WaitForVirtualLocks(a_1) -- wait up to 10min > > index_build(a_1) > > WaitForVirtualLocks(a_2) -- wait up to 10min > > index_build(a_2) > > > ... > > > > So instead of waiting up 10 minutes for that phase you have to wait up > > to 40. > > > This is necessary if you want to process each index entry in a different > transaction as WaitForVirtualLocks needs to wait for the locks held on the > parent table. If you want to fo this wait once per transaction, the > solution would be to group the index builds in the same transaction for all > the indexes of the relation. 
One index per transaction looks more solid in > this case if there is a failure during a process only one index will be > incorrectly built. I cannot really follow you here. The reason why we need to wait here is *only* to make sure that nobody still has the old list of indexes around (which probably could even be relaxed for reindex concurrently, but thats a separate optimization). So if we wait for all relevant transactions to end before starting phase 2 proper, we are fine, independent of how many indexes we build in a single transaction. > Also, when you run a REINDEX CONCURRENTLY, you should > not need to worry about the time it takes. The point is that this operation > is done in background and that the tables are still accessible during this > time. I don't think that arguments holds that much water. Having open transactions for too long *does* incur a rather noticeable overhead. And you definitely do want such operations to finish as quickly as possible, even if its just because you can go home only afterwards ;) Really, imagine doing this too 100 indexes on a system where transactions regularly take 30 minutes (only needs one at a time). Minus the actual build-time thats very approx 4h against like half a month. > > Btw, seing that we have an indisvalid check the toast table's index, do > > we have any way to cleanup such a dead index? I don't think its allowed > > to drop the index of a toast table. I.e. we possibly need to relax that > > check for invalid indexes :/. > > > For the time being, no I don't think so, except by doing a manual cleanup > and remove the invalid pg_class entry in catalogs. One way to do thath > cleanly could be to have autovacuum remove the invalid toast indexes > automatically, but it is not dedicated to that and this is another > discussion. Hm. Don't think thats acceptable :/ As I mentioned somewhere else, I don't see how to do an concurrent build of the toast index at all, given there is exactly one index hardcoded in tuptoaster.c so the second index won't get updated before the switch has been made. Haven't yet looked at the new patch - do you plan to provide an updated version addressing some of the remaining issues soon? Don't want to review this if you nearly have the next version available. Greetings, Andres -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Sun, Jan 27, 2013 at 1:37 AM, Andres Freund <andres@2ndquadrant.com> wrote:
On 2013-01-25 14:11:39 +0900, Michael Paquier wrote:
It sure isn't optimal, but it should do the trick if you use the
hash_seq stuff to iterate the hash afterwards. And you could use it to
map to the respective locks et al.
If you prefer other ways to implement it I guess the other easy solution
is to add the values without preventing duplicates and then sort &
remove duplicates in the end. Probably ends up being slightly more code,
but I am not sure.
Indeed, I began playing with the HTAB functions and it looks like the only correct way to use them would be a hash table keyed by the index OID, with each entry containing:
- the index OID itself
- the concurrent OID
And a second hash table with parent relation OID as key and as output the LOCKTAG for each parent relation.
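For reference, a minimal sketch of what the first of those dynahash tables could look like (entry layout and names are assumptions, not code from the patch):

typedef struct ConcurrentIndexEntry
{
    Oid         indexOid;       /* hash key: index to rebuild */
    Oid         concurrentOid;  /* its "_cct" twin, once created */
} ConcurrentIndexEntry;

HASHCTL     ctl;
HTAB       *indexHash;
ConcurrentIndexEntry *entry;
bool        found;

MemSet(&ctl, 0, sizeof(ctl));
ctl.keysize = sizeof(Oid);
ctl.entrysize = sizeof(ConcurrentIndexEntry);
ctl.hash = oid_hash;
indexHash = hash_create("REINDEX CONCURRENTLY indexes", 64, &ctl,
                        HASH_ELEM | HASH_FUNCTION);

/* hash_search with HASH_ENTER deduplicates each index OID in O(1) */
entry = (ConcurrentIndexEntry *) hash_search(indexHash, &indexOid,
                                             HASH_ENTER, &found);
if (!found)
    entry->concurrentOid = InvalidOid;

Iterating the entries afterwards would then go through hash_seq_init()/hash_seq_search(), as suggested above.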
I don't think we can leave the quadratic part in there as-is.
Sure, that is understandable.
--
Michael Paquier
http://michael.otacoo.com
On Sun, Jan 27, 2013 at 1:52 AM, Andres Freund <andres@2ndquadrant.com> wrote:
On 2013-01-25 13:48:50 +0900, Michael Paquier wrote:
> All the comments are addressed in version 8 attached, except for the
> hashtable part, which requires some heavy changes.
>
> On Thu, Jan 24, 2013 at 3:41 AM, Andres Freund <andres@2ndquadrant.com>wrote:
>
> > On 2013-01-15 18:16:59 +0900, Michael Paquier wrote:
> > > Code path used by REINDEX concurrently permits to
> > > create an index in parallel of an existing one and not a completely new
> > > index. Shouldn't this work for indexes used by exclusion indexes also?
> >
> > But that fact might safe things. I don't immediately see any reason that
> > adding a
> > if (!indisvalid)
> > return;
> > to check_exclusion_constraint wouldn't be sufficient if there's another
> > index with an equivalent definition.
> >
> Indeed, this might be enough as for CREATE INDEX CONCURRENTLY this code
> path cannot be taken and only indexes created concurrently can be invalid.
> Hence I am adding that in the patch with a comment explaining why.
I don't really know anything about those mechanics, so some input from
somebody who does would be very much appreciated.
> > > > > + /*
> > > > > + * Phase 2 of REINDEX CONCURRENTLY
> > > > > + */
> > > > > +
> > > > > + /* Get the first element of concurrent index list */
> > > > > + lc2 = list_head(concurrentIndexIds);
> > > > > +
> > > > > + foreach(lc, indexIds)
> > > > > + {
> > > > > + WaitForVirtualLocks(*heapLockTag, ShareLock);
> > > >
> > > > Why do we have to do the WaitForVirtualLocks here? Shouldn't we do this
> > > > once for all relations after each phase? Otherwise the waiting time will
> > > > really start to hit when you do this on a somewhat busy server.
> > > >
> > > Each new index is built and set as ready in a separate single
> > transaction,
> > > so doesn't it make sense to wait for the parent relation each time. It is
> > > possible to wait for a parent relation only once during this phase but in
> > > this case all the indexes of the same relation need to be set as ready in
> > > the same transaction. So here the choice is either to wait for the same
> > > relation multiple times for a single index or wait once for a parent
> > > relation but we build all the concurrent indexes within the same
> > > transaction. Choice 1 makes the code clearer and more robust to my mind
> > as
> > > the phase 2 is done clearly for each index separately. Thoughts?
> >
> > As far as I understand that code its purpose is to enforce that all
> > potential users have an up2date definition available. For that we
> > acquire a lock on all virtualxids of users using that table thus waiting
> > for them to finish.
> > Consider the scenario where you have a workload where most transactions
> > are fairly long (say 10min) and use the same tables (a,b)/indexes(a_1,
> > a_2, b_1, b_2). With the current strategy you will do:
> >
> > WaitForVirtualLocks(a_1) -- wait up to 10min
> > index_build(a_1)
> > WaitForVirtualLocks(a_2) -- wait up to 10min
> > index_build(a_2)
> >
> ...
> >
> > So instead of waiting up 10 minutes for that phase you have to wait up
> > to 40.
> >
> This is necessary if you want to process each index entry in a different
> transaction as WaitForVirtualLocks needs to wait for the locks held on the
> parent table. If you want to fo this wait once per transaction, the
> solution would be to group the index builds in the same transaction for all
> the indexes of the relation. One index per transaction looks more solid in
> this case if there is a failure during a process only one index will be
> incorrectly built.
I cannot really follow you here.
OK let's be more explicit...
The reason why we need to wait here is
*only* to make sure that nobody still has the old list of indexes
around (which probably could even be relaxed for reindex concurrently,
but thats a separate optimization).
In order to do that, you need to wait for the *parent relations* and not the indexes themselves, no?
Based on 2 facts:
- each index build is done in a single transaction
- a wait needs to be done on the parent relation before each transaction
You need to wait for the parent relation multiple times depending on the number of indexes in it. You could optimize that by building all the indexes of the *same parent relation* in a single transaction.
So, for example in the case of this table:
CREATE TABLE tab (col1 int PRIMARY KEY, col2 int);
CREATE INDEX ind ON tab (col2);
If the primary key index and the second index on col2 are built in a single transaction, you would wait for the locks on the parent relation 'tab' only once.
So if we wait for all relevant transactions to end before starting phase
2 proper, we are fine, independent of how many indexes we build in a
single transaction.
The reason why each index build is done in its own transaction is that you mentioned in a previous review (v3?) that we should do the builds in a single transaction for *each* index. That looked fair given that the transaction time for each index could be reduced, the downside being that you wait more on the parent relation.
> > Btw, seeing that we have an indisvalid check on the toast table's index, do
> > we have any way to cleanup such a dead index? I don't think its allowed
> > to drop the index of a toast table. I.e. we possibly need to relax that
> > check for invalid indexes :/.
> >
> For the time being, no I don't think so, except by doing a manual cleanup
> and remove the invalid pg_class entry in catalogs. One way to do thath
> cleanly could be to have autovacuum remove the invalid toast indexes
> automatically, but it is not dedicated to that and this is another
> discussion.
Hm. Don't think that's acceptable :/
As I mentioned somewhere else, I don't see how to do a concurrent build
of the toast index at all, given there is exactly one index hardcoded in
tuptoaster.c so the second index won't get updated before the switch has
been made.
Haven't yet looked at the new patch - do you plan to provide an updated
version addressing some of the remaining issues soon? Don't want to
review this if you nearly have the next version available.
Before putting more effort into coding, I think it is better to be clear about the strategy to use for the 2 following points:
1) At the index build phase, is it better to build each index in a single separate transaction? Or group the builds in a transaction for each parent table? This is solvable but the strategy should be clear.
2) Find a solution for invalid toast indexes, which is not that easy. One solution could be to use an autovacuum process to clean up the invalid indexes of toast tables automatically. Another solution is to skip the reindex for toast indexes, making the feature less usable.
If a solution or an agreement is not found for those 2 points, I think it will be fair to simply reject the patch.
It looks that this feature has still too many disadvantages compared to the advantages it could bring in the current infrastructure (SnapshotNow problems, what to do with invalid toast indexes, etc.), so I would tend to agree with Tom and postpone this feature once infrastructure is more mature, one of the main things being the non-MVCC'ed catalogs.
--
Michael Paquier
http://michael.otacoo.com
On 2013-01-27 07:54:43 +0900, Michael Paquier wrote: > On Sun, Jan 27, 2013 at 1:52 AM, Andres Freund <andres@2ndquadrant.com>wrote: > > On 2013-01-25 13:48:50 +0900, Michael Paquier wrote: > > > > As far as I understand that code its purpose is to enforce that all > > > > potential users have an up2date definition available. For that we > > > > acquire a lock on all virtualxids of users using that table thus waiting > > > > for them to finish. > > > > Consider the scenario where you have a workload where most transactions > > > > are fairly long (say 10min) and use the same tables (a,b)/indexes(a_1, > > > > a_2, b_1, b_2). With the current strategy you will do: > > > > > > > > WaitForVirtualLocks(a_1) -- wait up to 10min > > > > index_build(a_1) > > > > WaitForVirtualLocks(a_2) -- wait up to 10min > > > > index_build(a_2) > > > > > > > ... > > > > > > > > So instead of waiting up 10 minutes for that phase you have to wait up > > > > to 40. > > > > > > > This is necessary if you want to process each index entry in a different > > > transaction as WaitForVirtualLocks needs to wait for the locks held on the > > > parent table. If you want to fo this wait once per transaction, the > > > solution would be to group the index builds in the same transaction for all > > > the indexes of the relation. One index per transaction looks more solid in > > > this case if there is a failure during a process only one index will be > > > incorrectly built. > > > > I cannot really follow you here. > > > OK let's be more explicit... > > The reason why we need to wait here is > > *only* to make sure that nobody still has the old list of indexes > > around (which probably could even be relaxed for reindex concurrently, > > but thats a separate optimization). > > > In order to do that, you need to wait for the *parent relations* and not > the index themselves, no? > Based on 2 facts: > - each index build is done in a single transaction > - a wait needs to be done on the parent relation before each transaction > You need to wait for the parent relation multiple times depending on the > number of indexes in it. You could optimize that by building all the > indexes of the *same parent relation* in a single transaction. I think youre misunderstanding how this part works a bit. We don't acquire locks on the table itself, but we get a list of all transactions we would conflict with if we were to acquire a lock of a certain strength on the table (GetLockConflicts(locktag, mode)). We then wait for each transaction in the resulting list via the VirtualXact mechanism (VirtualXactLock(*lockholder)). It doesn't matter all that waiting happens in the same transaction the initial index build is done in as long as we keep the session locks preventing other schema modifications. Nobody can go back and see an older index list after we've done the above wait once. So the following should be perfectly fine: StartTransactionCommand(); BuildListOfIndexes(); foreach(index in indexes) DefineNewIndex(index); CommitTransactionCommand(); StartTransactionCommand(); foreach(table in tables) GetLockConflicts() foreach(conflict in conflicts) VirtualXactLocks() CommitTransactionCommand(); foreach(index in indexes) StartTransactionCommand(); InitialIndexBuild(index) CommitTransactionCommand(); ... 
> It looks that this feature has still too many disadvantages compared to the > advantages it could bring in the current infrastructure (SnapshotNow > problems, what to do with invalid toast indexes, etc.), so I would tend to > agree with Tom and postpone this feature once infrastructure is more > mature, one of the main things being the non-MVCC'ed catalogs. I think while catalog mvcc snapshots would make this easier, most problems, basically all but the switching of relations, are pretty much independent from that fact. All the waiting etc, will still be there. I can see an argument for pushing it to the next CF because its not really there yet... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Jan 28, 2013 at 7:39 PM, Andres Freund <andres@2ndquadrant.com> wrote:
On 2013-01-27 07:54:43 +0900, Michael Paquier wrote:
I think you're misunderstanding how this part works a bit. We don't
acquire locks on the table itself, but we get a list of all transactions
we would conflict with if we were to acquire a lock of a certain
strength on the table (GetLockConflicts(locktag, mode)). We then wait
for each transaction in the resulting list via the VirtualXact mechanism
(VirtualXactLock(*lockholder)).
It doesn't matter all that waiting happens in the same transaction the
initial index build is done in as long as we keep the session locks
preventing other schema modifications. Nobody can go back and see an
older index list after we've done the above wait once.
Don't worry, I got it. I just thought that it was necessary to wait for the locks taken on the parent relation by other backends just *before* building the index. It seemed more stable.
So the following should be perfectly fine:
StartTransactionCommand();
BuildListOfIndexes();
foreach(index in indexes)
DefineNewIndex(index);
CommitTransactionCommand();
StartTransactionCommand();
foreach(table in tables)
GetLockConflicts()
foreach(conflict in conflicts)
VirtualXactLocks()
CommitTransactionCommand();
foreach(index in indexes)
StartTransactionCommand();
InitialIndexBuild(index)
CommitTransactionCommand();
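Rendered in the patch's terms, that batched wait would look roughly like this (an illustrative sketch; "lockTags" and the loop variables are assumptions):

    /* Wait once, for all parent tables, in a single transaction... */
    StartTransactionCommand();
    foreach(lc, lockTags)
    {
        LOCKTAG    *heapLockTag = (LOCKTAG *) lfirst(lc);

        WaitForVirtualLocks(*heapLockTag, ShareLock);
    }
    CommitTransactionCommand();

    /* ...then build each index in its own transaction, with no per-index wait */
    foreach(lc, indexIds)
    {
        StartTransactionCommand();
        /* build the concurrent index for this entry */
        CommitTransactionCommand();
    }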
So your point is simply to wait for all the locks currently taken on each table in a different transaction, only once and for all, independently from the build and validation phases. Correct?
> It looks that this feature has still too many disadvantages compared to the
> advantages it could bring in the current infrastructure (SnapshotNow
> problems, what to do with invalid toast indexes, etc.), so I would tend to
> agree with Tom and postpone this feature once infrastructure is more
> mature, one of the main things being the non-MVCC'ed catalogs.
I think while catalog mvcc snapshots would make this easier, most
problems, basically all but the switching of relations, are pretty much
independent from that fact. All the waiting etc, will still be there.
I can see an argument for pushing it to the next CF because its not
really there yet...
Even if we get this patch in a shape that you think is sufficient to make it reviewable by a committer within a couple of days, there are still many doubts from many people regarding this feature, so this is going to take far more time to put it in a shape that would satisfy a vast majority. So it is honestly wiser to work on that later.
Another argument that would be enough for a rejection of this patch by a committer is the problem of invalid toast indexes that cannot be removed up cleanly by an operator. As long as there is not a clean solution for that...
Michael Paquier
http://michael.otacoo.com
Hi, On 2013-01-28 20:31:48 +0900, Michael Paquier wrote: > On Mon, Jan 28, 2013 at 7:39 PM, Andres Freund <andres@2ndquadrant.com>wrote: > > > On 2013-01-27 07:54:43 +0900, Michael Paquier wrote: > > I think you're misunderstanding how this part works a bit. We don't > > acquire locks on the table itself, but we get a list of all transactions > > we would conflict with if we were to acquire a lock of a certain > > strength on the table (GetLockConflicts(locktag, mode)). We then wait > > for each transaction in the resulting list via the VirtualXact mechanism > > (VirtualXactLock(*lockholder)). > > It doesn't matter all that waiting happens in the same transaction the > > initial index build is done in as long as we keep the session locks > > preventing other schema modifications. Nobody can go back and see an > > older index list after we've done the above wait once. > > > Don't worry I got it. I just thought that it was necessary to wait for the > locks taken on the parent relation by other backends just *before* building > the index. It seemed more stable. I don't see any need for that. Its really only about making sure their relcache entry for the indexlist - and by extension rd_indexattr - in all other transactions that could possibly write to the table is up2date. As a relation_open with a lock (which is done for every write) will always drain the invalidations thats guaranteed if we wait that way. > So the following should be perfectly fine: > > > > StartTransactionCommand(); > > BuildListOfIndexes(); > > foreach(index in indexes) > > DefineNewIndex(index); > > CommitTransactionCommand(); > > > > StartTransactionCommand(); > > foreach(table in tables) > > GetLockConflicts() > > foreach(conflict in conflicts) > > VirtualXactLocks() > > CommitTransactionCommand(); > > > > foreach(index in indexes) > > StartTransactionCommand(); > > InitialIndexBuild(index) > > CommitTransactionCommand(); > > > So you're point is simply to wait for all the locks currently taken on each > table in a different transaction only once and for all, independently from > the build and validation phases. Correct? Exactly. That will batch the wait for the transactions together and thus will greatly decrease the overhead of doing a concurrent reindex (wall, not cpu-clock wise). > > > It looks that this feature has still too many disadvantages compared to the > > > advantages it could bring in the current infrastructure (SnapshotNow > > > problems, what to do with invalid toast indexes, etc.), so I would tend to > > > agree with Tom and postpone this feature once infrastructure is more > > > mature, one of the main things being the non-MVCC'ed catalogs. > > > > I think while catalog mvcc snapshots would make this easier, most > > problems, basically all but the switching of relations, are pretty much > > independent from that fact. All the waiting etc, will still be there. > > > > I can see an argument for pushing it to the next CF because its not > > really there yet... > > > Even if we get this patch in a shape that you think is sufficient to make > it reviewable by a committer within a couple of days, there are still many > doubts from many people regarding this feature, so this is going to take > far more time to put it in a shape that would satisfy a vast majority. So > it is honestly wiser to work on that later. I really haven't heard too many arguments from other after the initial round. Right now I "only" recall Tom and Robert doubting the usefulness, right? 
I think most of the work in this patch is completely independent from the snapshot stuff, so I really don't see much of an argument to make it dependent on catalog snapshots. > Another argument that would be enough for a rejection of this patch by a > committer is the problem of invalid toast indexes that cannot be removed up > cleanly by an operator. As long as there is not a clean solution for > that... I think that part is relatively easy to fix, I wouldn't worry too much. The more complex part is how to get tuptoaster.c to update the concurrently created index. Thats what I worry about. Its not going through the normal executor paths but manually updates the toast index - which means it won't update the indisready && !indisvalid index... Greetings, Andres Freund --Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Jan 28, 2013 at 8:44 PM, Andres Freund <andres@anarazel.de> wrote:
> Another argument that would be enough for a rejection of this patch by a
> committer is the problem of invalid toast indexes that cannot be removed up
> cleanly by an operator. As long as there is not a clean solution for
> that...
I think that part is relatively easy to fix, I wouldn't worry too
much.
The more complex part is how to get tuptoaster.c to update the
concurrently created index. That's what I worry about. Its not going
through the normal executor paths but manually updates the toast
index - which means it won't update the indisready && !indisvalid
index...
I included in the patch some stuff to update the reltoastidxid of the parent relation of the toast index. Have a look at index.c:index_concurrent_swap. The particular case I had in mind was if there is a failure of the server during the concurrent reindex of a toast index. When server restarts, the toast relation will have an invalid index and this cannot be dropped by an operator via SQL.
Michael Paquier
http://michael.otacoo.com
On 2013-01-28 20:50:21 +0900, Michael Paquier wrote: > On Mon, Jan 28, 2013 at 8:44 PM, Andres Freund <andres@anarazel.de> wrote: > > > > Another argument that would be enough for a rejection of this patch by a > > > committer is the problem of invalid toast indexes that cannot be removed > > up > > > cleanly by an operator. As long as there is not a clean solution for > > > that... > > > > I think that part is relatively easy to fix, I wouldn't worry too > > much. > > The more complex part is how to get tuptoaster.c to update the > > concurrently created index. That's what I worry about. Its not going > > through the normal executor paths but manually updates the toast > > index - which means it won't update the indisready && !indisvalid > > index... > > > I included in the patch some stuff to update the reltoastidxid of the > parent relation of the toast index. Have a look at > index.c:index_concurrent_swap. The particular case I had in mind was if > there is a failure of the server during the concurrent reindex of a toast > index. Thats not enough unfortunately. The problem scenario is the following: toast table: pg_toast.pg_toast_16384 toast index (via reltoastidxid): pg_toast.pg_toast_16384_index REINDEX CONCURRENTLY PHASE #1 REINDEX CONCURRENTLY PHASE #2 toast table: pg_toast.pg_toast_16384 toast index (via reltoastidxid): pg_toast.pg_toast_16384_index, ready & valid toast index (via pg_index): pg_toast.pg_toast_16384_index_tmp, ready & !valid If a tuple gets toasted in this state tuptoaster.c will update 16384_index but not 16384_index_tmp. In normal tables this works because nodeModifyTable uses ExecInsertIndexTuples which updates all ready indexes. tuptoaster.c does something different though, it calls index_insert exactly on the one expected index, not on the other ones. Makes sense? > When server restarts, the toast relation will have an invalid index > and this cannot be dropped by an operator via SQL. That requires about two lines of special case code in RangeVarCallbackForDropRelation, that doesn't seem to be too bad to me. I.e. allow the case where its IsSystemClass(classform) && relkind == RELKIND_INDEX && !indisvalid. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Jan 28, 2013 at 8:59 PM, Andres Freund <andres@2ndquadrant.com> wrote:
On 2013-01-28 20:50:21 +0900, Michael Paquier wrote:
> On Mon, Jan 28, 2013 at 8:44 PM, Andres Freund <andres@anarazel.de> wrote:
>
> > > Another argument that would be enough for a rejection of this patch by a
> > > committer is the problem of invalid toast indexes that cannot be removed
> > up
> > > cleanly by an operator. As long as there is not a clean solution for
> > > that...
> >
> > I think that part is relatively easy to fix, I wouldn't worry too
> > much.
> > The more complex part is how to get tuptoaster.c to update the
> > concurrently created index. That's what I worry about. Its not going
> > through the normal executor paths but manually updates the toast
> > index - which means it won't update the indisready && !indisvalid
> > index...
> >
> I included in the patch some stuff to update the reltoastidxid of the
> parent relation of the toast index. Have a look at
> index.c:index_concurrent_swap. The particular case I had in mind was if
> there is a failure of the server during the concurrent reindex of a toast
> index.
That's not enough unfortunately. The problem scenario is the following:
toast table: pg_toast.pg_toast_16384
toast index (via reltoastidxid): pg_toast.pg_toast_16384_index
REINDEX CONCURRENTLY PHASE #1
REINDEX CONCURRENTLY PHASE #2
toast table: pg_toast.pg_toast_16384
toast index (via reltoastidxid): pg_toast.pg_toast_16384_index, ready & valid
toast index (via pg_index): pg_toast.pg_toast_16384_index_tmp, ready & !valid
If a tuple gets toasted in this state tuptoaster.c will update
16384_index but not 16384_index_tmp. In normal tables this works because
nodeModifyTable uses ExecInsertIndexTuples which updates all ready
indexes. tuptoaster.c does something different though, it calls
index_insert exactly on the one expected index, not on the other ones.
Makes sense?
I didn't know toast indexes followed this code path. Thanks for the details.
> When server restarts, the toast relation will have an invalid index
> and this cannot be dropped by an operator via SQL.
That requires about two lines of special case code in
RangeVarCallbackForDropRelation, that doesn't seem to be too bad to me.
I.e. allow the case where its IsSystemClass(classform) && relkind ==
RELKIND_INDEX && !indisvalid.
OK, I thought it was more complicated.
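Spelled out, the relaxed check could look roughly like this (my reading of the suggestion; the pg_index syscache lookup and the surrounding variable names in RangeVarCallbackForDropRelation are assumptions, not the final patch):

    bool        invalid_system_index = false;

    if (relkind == RELKIND_INDEX)
    {
        HeapTuple   indtup = SearchSysCache1(INDEXRELID,
                                             ObjectIdGetDatum(relOid));

        if (HeapTupleIsValid(indtup))
        {
            /* allow dropping an invalid index even on a system relation */
            invalid_system_index =
                !((Form_pg_index) GETSTRUCT(indtup))->indisvalid;
            ReleaseSysCache(indtup);
        }
    }

    if (!allowSystemTableMods && IsSystemClass(classform) &&
        !invalid_system_index)
        ereport(ERROR,
                (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
                 errmsg("permission denied: \"%s\" is a system catalog",
                        rel->relname)));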
Michael Paquier
http://michael.otacoo.com
Hi,
Please find attached a patch fixing 3 of the 4 problems reported before (the patch does not contain docs).
1) Removal of the quadratic dependency with list_append_unique_oid
2) Minimization of the wait phase for parent relations, this is done in a single transaction before phase 2
3) Authorization of the drop for invalid system indexes
The problem remaining is related to toast indexes. In current master code, tuptoaster.c assumes that the index attached to the toast relation is unique.
This creates a problem when running concurrent reindex on toast indexes, because after phase 2, there is this problem:
pg_toast_index valid && ready
pg_toast_index_cct valid && !ready
The concurrent toast index that went through index_build is set as valid. So at this instant, the index can be used when inserting new entries.
However, when inserting a new entry in the toast index, only the index registered in reltoastidxid is used for insertion in tuptoaster.c:toast_save_datum.
toastidx = index_open(toastrel->rd_rel->reltoastidxid, RowExclusiveLock);
This cannot work when there are concurrent toast indexes, as in this case the toast index is assumed to be unique.
In order to fix that, it is necessary to extend toast_save_datum to insert index entries into the other concurrent indexes as well (see the sketch after the two options below), and I am currently thinking about two possible approaches:
1) Change reltoastidxid from oid type to oidvector to be able to manage multiple toast index inserts. The concurrent indexes would be added in this vector once built and all the indexes in this vector would be used by tuptoaster.c:toast_save_datum. Not backward compatible but does it matter for toast relations?
2) Add new oidvector column in pg_class containing a vector of concurrent toast index Oids built but not validated. toast_save_datum would scan this vector and insert entries in index if there are any present in vector.
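Whichever of the two approaches is picked, the insertion part of toast_save_datum would end up as a loop of roughly this shape (a sketch based on the existing index_insert call in tuptoaster.c; "toastidxs" and the indisready filter are assumptions):

    ListCell   *lc;

    foreach(lc, toastidxs)
    {
        Relation    toastidx = index_open(lfirst_oid(lc), RowExclusiveLock);

        /*
         * Insert into every index that is ready, valid or not, so that a
         * concurrently built twin does not miss new toast chunks.
         */
        if (toastidx->rd_index->indisready)
            index_insert(toastidx, t_values, t_isnull,
                         &(toasttup->t_self), toastrel,
                         toastidx->rd_index->indisunique ?
                         UNIQUE_CHECK_YES : UNIQUE_CHECK_NO);

        index_close(toastidx, RowExclusiveLock);
    }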
Comments as well as other ideas are welcome.
Thanks,
--
Michael
Hi Michael, On 2013-02-07 16:45:57 +0900, Michael Paquier wrote: > Please find attached a patch fixing 3 of the 4 problems reported before > (the patch does not contain docs). Cool! > 1) Removal of the quadratic dependency with list_append_unique_oid > 2) Minimization of the wait phase for parent relations, this is done in a > single transaction before phase 2 > 3) Authorization of the drop for invalid system indexes I think there's also the issue of some minor changes required to make exclusion constraints work. > The problem remaining is related to toast indexes. In current master code, > tuptoastter.c assumes that the index attached to the toast relation is > unique > This creates a problem when running concurrent reindex on toast indexes, > because after phase 2, there is this problem: > pg_toast_index valid && ready > pg_toast_index_cct valid && !ready > The concurrent toast index went though index_build is set as valid. So at > this instant, the index can be used when inserting new entries. Um, isn't pg_toast_index_cct !valid && ready? > However, when inserting a new entry in the toast index, only the index > registered in reltoastidxid is used for insertion in > tuptoaster.c:toast_save_datum. > toastidx = index_open(toastrel->rd_rel->reltoastidxid, RowExclusiveLock); > This cannot work when there are concurrent toast indexes as in this case > the toast index is thought as unique. > > In order to fix that, it is necessary to extend toast_save_datum to insert > index data to the other concurrent indexes as well, and I am currently > thinking about two possible approaches: > 1) Change reltoastidxid from oid type to oidvector to be able to manage > multiple toast index inserts. The concurrent indexes would be added in this > vector once built and all the indexes in this vector would be used by > tuptoaster.c:toast_save_datum. Not backward compatible but does it matter > for toast relations? I don't see a problem breaking backward compat in that area. > 2) Add new oidvector column in pg_class containing a vector of concurrent > toast index Oids built but not validated. toast_save_datum would scan this > vector and insert entries in index if there are any present in vector. What about 3) Use reltoastidxid if != InvalidOid and manually build the list (using RelationGetIndexList) otherwise? That should keep the additional overhead minimal and should be relatively straightforward to implement? I think your patch accidentially squashed in some other changes (like 5a1cd89f8f), care to repost without? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes:
> What about
> 3) Use reltoastidxid if != InvalidOid and manually build the list (using
> RelationGetIndexList) otherwise?

Do we actually need reltoastidxid at all?  I always thought having that
field was a case of premature optimization.  There might be some case
for keeping it to avoid breaking any client-side code that might be
looking at it ... but if you're proposing changing the field contents
anyway, that argument goes right out the window.

			regards, tom lane
On Thu, Feb 7, 2013 at 4:55 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> 1) Removal of the quadratic dependency with list_append_unique_oid
> 2) Minimization of the wait phase for parent relations, this is done in a
> single transaction before phase 2
> 3) Authorization of the drop for invalid system indexes
I think there's also the issue of some minor changes required to make
exclusion constraints work.
Thanks for reminding me, I completely forgot this issue. I added a check with a comment in execUtils.c:check_exclusion_constraint for that.
> The problem remaining is related to toast indexes. In current master code,
> tuptoaster.c assumes that the index attached to the toast relation is
> unique.
> This creates a problem when running concurrent reindex on toast indexes,
> because after phase 2, there is this problem:
> pg_toast_index valid && ready
> pg_toast_index_cct valid && !ready
> The concurrent toast index went through index_build and is set as valid. So at
> this instant, the index can be used when inserting new entries.
Um, isn't pg_toast_index_cct !valid && ready?
You are right ;)
> However, when inserting a new entry in the toast index, only the index
> registered in reltoastidxid is used for insertion in
> tuptoaster.c:toast_save_datum.
> toastidx = index_open(toastrel->rd_rel->reltoastidxid, RowExclusiveLock);
> This cannot work when there are concurrent toast indexes as in this case
> the toast index is thought as unique.
>
> In order to fix that, it is necessary to extend toast_save_datum to insert
> index data to the other concurrent indexes as well, and I am currently
> thinking about two possible approaches:
> 1) Change reltoastidxid from oid type to oidvector to be able to manage
> multiple toast index inserts. The concurrent indexes would be added in this
> vector once built and all the indexes in this vector would be used by
> tuptoaster.c:toast_save_datum. Not backward compatible but does it matter
> for toast relations?
I don't see a problem breaking backward compat in that area.
Agreed. I thought so.
> 2) Add new oidvector column in pg_class containing a vector of concurrent
> toast index Oids built but not validated. toast_save_datum would scan this
> vector and insert entries in index if there are any present in vector.
What about
3) Use reltoastidxid if != InvalidOid and manually build the list (using
RelationGetIndexList) otherwise? That should keep the additional
overhead minimal and should be relatively straightforward to implement?
OK. Here is a new idea.
I think your patch accidentally squashed in some other changes (like
5a1cd89f8f), care to repost without?
That's... well... unfortunate... Updated version attached.
Michael
Attachment
On 2013-02-07 03:01:36 -0500, Tom Lane wrote: > Andres Freund <andres@2ndquadrant.com> writes: > > What about > > > 3) Use reltoastidxid if != InvalidOid and manually build the list (using > > RelationGetIndexList) otherwise? > > Do we actually need reltoastidxid at all? I always thought having that > field was a case of premature optimization. I am a bit doubtful its really measurable as well. Really supporting a dynamic number of indexes might be noticeable because we would need to allocate memory et al for each toasted Datum, but only supporting one or two seems easy enough. The only advantage besides the dubious performance advantage of my proposed solution is that less code needs to change as only toast_save_datum() would need to change. > There might be some case > for keeping it to avoid breaking any client-side code that might be > looking at it ... but if you're proposing changing the field contents > anyway, that argument goes right out the window. Well, it would only be 0/InvalidOid while being reindexed concurrently, but yea. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Feb 7, 2013 at 5:01 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Andres Freund <andres@2ndquadrant.com> writes:
> What about
> 3) Use reltoastidxid if != InvalidOid and manually build the list (using
> RelationGetIndexList) otherwise?
Do we actually need reltoastidxid at all? I always thought having that
field was a case of premature optimization. There might be some case
for keeping it to avoid breaking any client-side code that might be
looking at it ... but if you're proposing changing the field contents
anyway, that argument goes right out the window.
Here is an interesting idea. Could there be some performance impact if we remove this field and replace it with RelationGetIndexList to fetch the list of indexes that need to be inserted into?
Michael
On Thu, Feb 7, 2013 at 5:15 PM, Andres Freund <andres@2ndquadrant.com> wrote:
On 2013-02-07 03:01:36 -0500, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > What about
> > 3) Use reltoastidxid if != InvalidOid and manually build the list (using
> > RelationGetIndexList) otherwise?
> Do we actually need reltoastidxid at all? I always thought having that
> field was a case of premature optimization.
I am a bit doubtful its really measurable as well. Really supporting a
dynamic number of indexes might be noticeable because we would need to
allocate memory et al for each toasted Datum, but only supporting one or
two seems easy enough.
The only advantage besides the dubious performance advantage of my
proposed solution is that less code needs to change as only
toast_save_datum() would need to change.
> There might be some case
> for keeping it to avoid breaking any client-side code that might be
> looking at it ... but if you're proposing changing the field contents
> anyway, that argument goes right out the window.
Well, it would only be 0/InvalidOid while being reindexed concurrently,
but yea.
Removing reltoastidxid is more appealing for at least 2 reasons regarding the current implementation of REINDEX CONCURRENTLY:
1) if reltoastidxid is set to InvalidOid during a concurrent reindex and the reindex fails, how would it be possible to set it back to the correct value? This would need more special code, which could become a maintenance burden for sure.
2) There is already some special code in my patch to update reltoastidxid to the new Oid value when swapping indexes. Removing that would honestly make the index swapping cleaner.
Btw, I think that if this optimization for toast relations is done, it should be a separate patch. Also, as I am not a specialist in toast indexes, any opinion about potential performance impact (if any) is welcome if we remove reltoastidxid and use RelationGetIndexList instead.
--
Michael
On 2013-02-07 17:28:53 +0900, Michael Paquier wrote:
> On Thu, Feb 7, 2013 at 5:15 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > On 2013-02-07 03:01:36 -0500, Tom Lane wrote:
> > > Andres Freund <andres@2ndquadrant.com> writes:
> > > > What about
> > > > 3) Use reltoastidxid if != InvalidOid and manually build the list (using
> > > > RelationGetIndexList) otherwise?
> > > Do we actually need reltoastidxid at all? I always thought having that
> > > field was a case of premature optimization.
> > I am a bit doubtful its really measurable as well. Really supporting a
> > dynamic number of indexes might be noticeable because we would need to
> > allocate memory et al for each toasted Datum, but only supporting one or
> > two seems easy enough.
> > The only advantage besides the dubious performance advantage of my
> > proposed solution is that less code needs to change as only
> > toast_save_datum() would need to change.
> > > There might be some case
> > > for keeping it to avoid breaking any client-side code that might be
> > > looking at it ... but if you're proposing changing the field contents
> > > anyway, that argument goes right out the window.
> > Well, it would only be 0/InvalidOid while being reindexed concurrently,
> > but yea.
> Removing reltoastidxid is more appealing for at least 2 reasons regarding
> the current implementation of REINDEX CONCURRENTLY:
> 1) if reltoastidxid is set to InvalidOid during a concurrent reindex and
> the reindex fails, how would it be possible to set it back to the correct
> value? This would need more special code, which could become a maintenance
> burden for sure.
I would just let it stay slightly less efficient till the index is
dropped/reindexed.
> Btw, I think that if this optimization for toast relations is done, it
> should be a separate patch.
What do you mean by a separate patch? Commit it before committing
REINDEX CONCURRENTLY? If so, yes, sure. If you mean it can be fixed
later, I don't really see how, since this is an unresolved problem...
> Also, as I am not a specialist in toast
> indexes, any opinion about potential performance impact (if any) is welcome
> if we remove reltoastidxid and use RelationGetIndexList instead.
Tom doubted it will be really measurable, so did I... If anything I think
it will be measurable during querying toast tables. So possibly we would
have to retain reltoastidxid for querying...
The minimal (not so nice) patch to make this correct probably is fairly
easy. Changing only toast_save_datum:

    Relation   toastidx[2];
    ...
    if (toastrel->rd_indexvalid == 0)
        RelationGetIndexList(toastrel);
    num_indexes = list_length(toastrel->rd_indexlist);
    if (num_indexes == 1)
        toastidx[0] = index_open(toastrel->rd_rel->reltoastidxid);
    else if (num_indexes == 2)
    {
        int       off = 0;
        ListCell *l;

        foreach(l, RelationGetIndexList(toastrel))
            toastidx[off++] = index_open(lfirst_oid(l));
    }
    else
        elog(ERROR, "toast indexes with unsupported number of indexes");
    ...
    for (cur_index = 0; cur_index < num_indexes; cur_index++)
        index_insert(toastidx[cur_index], t_values, t_isnull,
                     &(toasttup->t_self),
                     toastrel,
                     toastidx[cur_index]->rd_index->indisunique ?
                     UNIQUE_CHECK_YES : UNIQUE_CHECK_NO);
    ...
    for (cur_index = 0; cur_index < num_indexes; cur_index++)
        index_close(toastidx[cur_index], RowExclusiveLock);

(that indisunique check seems like a copy & paste remnant btw).
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hi,
On 2013-02-07 16:45:57 +0900, Michael Paquier wrote:
> Please find attached a patch fixing 3 of the 4 problems reported before
> (the patch does not contain docs).
> 1) Removal of the quadratic dependency with list_append_unique_oid
Afaics you now simply lock objects multiple times, is that right?
> 2) Minimization of the wait phase for parent relations, this is done in a
> single transaction before phase 2
Unfortunately I don't think this did the trick. You currently have the
following:

    + /* Perform a wait on each session lock in a separate transaction */
    + StartTransactionCommand();
    + foreach(lc, lockTags)
    + {
    +     LOCKTAG *localTag = (LOCKTAG *) lfirst(lc);
    +     Assert(localTag && localTag->locktag_field2 != InvalidOid);
    +     WaitForVirtualLocks(*localTag, ShareLock);
    + }
    + CommitTransactionCommand();

and

    +void
    +WaitForVirtualLocks(LOCKTAG heaplocktag, LOCKMODE lockmode)
    +{
    +     VirtualTransactionId *old_lockholders;
    +
    +     old_lockholders = GetLockConflicts(&heaplocktag, lockmode);
    +
    +     while (VirtualTransactionIdIsValid(*old_lockholders))
    +     {
    +         VirtualXactLock(*old_lockholders, true);
    +         old_lockholders++;
    +     }
    +}

To get rid of the issue you need to batch all the GetLockConflicts calls
together before doing any of the VirtualXactLocks. Otherwise other
backends will produce new conflicts on relation n+1 while you wait for
relation n.
So it would need to be something like:

    void
    WaitForVirtualLocksList(List *heaplocktags, LOCKMODE lockmode)
    {
        VirtualTransactionId **old_lockholders;
        ListCell   *lc;
        int         off = 0;
        int         i;

        old_lockholders = palloc(sizeof(VirtualTransactionId *) *
                                 list_length(heaplocktags));

        /* collect the transactions we need to wait on for all relations */
        foreach(lc, heaplocktags)
        {
            LOCKTAG *tag = lfirst(lc);

            old_lockholders[off++] = GetLockConflicts(tag, lockmode);
        }

        /* wait on all those transactions */
        for (i = 0; i < off; i++)
        {
            VirtualTransactionId *lockholders = old_lockholders[i];

            while (VirtualTransactionIdIsValid(*lockholders))
            {
                VirtualXactLock(*lockholders, true);
                lockholders++;
            }
        }
    }

Makes sense?
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Feb 12, 2013 at 8:47 PM, Andres Freund <andres@2ndquadrant.com> wrote:
On 2013-02-07 17:28:53 +0900, Michael Paquier wrote:
> On Thu, Feb 7, 2013 at 5:15 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> Btw, I think that if this optimization for toast relations is done, it
> should be a separate patch.
What do you mean by a separate patch? Commit it before committing
REINDEX CONCURRENTLY? If so, yes, sure. If you mean it can be fixed
later, I don't really see how, since this is an unresolved problem...
Of course I meant that it would be necessary to validate the toast patch first, it is a prerequisite for REINDEX CONCURRENTLY. Sorry for not being that clear.
> Also, as I am not a specialist in toast
> indexes, any opinion about potential performance impact (if any) is welcome
> if we remove reltoastidxid and use RelationGetIndexList instead.
Tom doubted it will be really measurable, so did I... If anything I
think it will be measurable during querying toast tables. So possibly we
would have to retain reltoastidxid for querying...
The minimal (not so nice) patch to make this correct probably is fairly
easy.
Changing only toast_save_datum:
[... code ...]
Yes, I have spent a little bit of time looking at the code related to reltoastidxid and thought about this possibility. It would make the changes far easier with the existing patch, but it will also be necessary to update the catalog pg_statio_all_tables to make the case where the OID is InvalidOid correct with this catalog. However, I do not think it is as clean as simply removing reltoastidxid and having all the toast APIs run consistent operations, aka using only RelationGetIndexList.
--
Michael
On 2013-02-12 21:54:52 +0900, Michael Paquier wrote: > > Changing only toast_save_datum: > > > > [... code ...] > > > Yes, I have spent a little bit of time looking at the code related to > retoastindxid and thought about this possibility. It would make the changes > far easier with the existing patch, it will also be necessary to update the > catalog pg_statio_all_tables to make the case where OID is InvalidOid > correct with this catalog. What I proposed above wouldn't need the case where toastrelidx = InvalidOid, so no need to worry about that. > However, I do not think it is as clean as simply > removing retoastindxid and have all the toast APIs running consistent > operations, aka using only RelationGetIndexList. Sure. This just seems easier as it really only requires changes inside toast_save_datum() and which mostly avoids any overhead (not even additional palloc()s) if there is only one index. That would lower the burden of proof that no performance regressions exist (which I guess would be during querying) and the amount of possibly external breakage due to removing the field... Not sure whats the best way to do this when committing. But I think you could incorporate something like the proposed to continue working on the patch. It really should only take some minutes to incorporate it. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Feb 12, 2013 at 10:04 PM, Andres Freund <andres@2ndquadrant.com> wrote:
On 2013-02-12 21:54:52 +0900, Michael Paquier wrote:
> > Changing only toast_save_datum:
> >
> > [... code ...]
> >
> Yes, I have spent a little bit of time looking at the code related to
> reltoastidxid and thought about this possibility. It would make the changes
> far easier with the existing patch, it will also be necessary to update the
> catalog pg_statio_all_tables to make the case where OID is InvalidOid
> correct with this catalog.
What I proposed above wouldn't need the case where toastrelidx =
InvalidOid, so no need to worry about that.
[re-reading code...] Oh ok. I missed the point in your previous email. Yeah indeed you are right.
> However, I do not think it is as clean as simply
> removing reltoastidxid and have all the toast APIs running consistent
> operations, aka using only RelationGetIndexList.
Sure. This just seems easier as it really only requires changes inside
toast_save_datum() and which mostly avoids any overhead (not even
additional palloc()s) if there is only one index.
That would lower the burden of proof that no performance regressions
exist (which I guess would be during querying) and the amount of
possibly external breakage due to removing the field...
Not sure whats the best way to do this when committing. But I think you
could incorporate something like the proposed to continue working on the
patch. It really should only take some minutes to incorporate it.
OK I'll add the changes you are proposing. I still want to have a look at the approach for the removal of reltoastidxid btw.
Michael
Hi,
Please find attached a new version of the patch incorporating the 2 fixes requested:
- Fix to insert new data into multiple toast indexes in toast_save_datum if necessary
- Fix the lock wait phase with the new function WaitForMultipleVirtualLocks, which performs a wait on multiple locktags at the same time. WaitForVirtualLocks now also uses WaitForMultipleVirtualLocks, but on a single locktag.
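Roughly, the wait phase now batches all the lock tags into one call instead of looping and waiting per relation; simplified, the call site looks like this (lockTags being the list of LOCKTAG entries built for the parent relations):

    /* Simplified sketch of the new wait phase, done in its own transaction */
    StartTransactionCommand();
    WaitForMultipleVirtualLocks(lockTags, ShareLock);
    CommitTransactionCommand();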
I am still looking at the approach of removing reltoastidxid, an approach more complicated but cleaner than what is currently done in the patch.
Regards,
--
Michael
Attachment
Hi all,
Please find attached a new set of 3 patches for REINDEX CONCURRENTLY (v11).
- 20130214_1_remove_reltoastidxid.patch
- 20130214_2_reindex_concurrently_v11.patch
- 20130214_3_reindex_concurrently_docs_v11.patch
Patch 1 needs to be applied before patches 2 and 3.
20130214_1_remove_reltoastidxid.patch is the patch removing reltoastidxid (approach mentioned by Tom) to allow the server to manipulate multiple indexes of toast relations. Catalog views, system functions and pg_upgrade have been updated accordingly by replacing the use of reltoastidxid with a join on pg_index/pg_class. All the functions of tuptoaster.c now use RelationGetIndexList to fetch the list of indexes that depend on a given toast relation. There are no warnings and the regression tests pass (only an update of rules.out and oidjoins was necessary).
20130214_2_reindex_concurrently_v11.patch depends on patch 1. It includes the feature with all the fixes requested by Andres in his previous reviews. Regression tests pass and I haven't seen any warnings. In this patch, concurrent rebuild of toast indexes is fully supported thanks to patch 1. The kludge used in the previous version to change reltoastidxid when swapping indexes is not needed anymore, making the swap code far cleaner.
20130214_3_reindex_concurrently_docs_v11.patch includes the documentation of REINDEX CONCURRENTLY. This might need some reshuffling with what is written for CREATE INDEX CONCURRENTLY.
I am now pretty happy with the way the implementation is done, so I think that the basic implementation architecture does not need to be changed.
Andres, I think that only a single round of review would be necessary now before setting this patch as ready for committer. Thoughts?
Comments, as well as reviews are welcome.
--
Michael
Attachment
On Thu, Feb 14, 2013 at 4:08 PM, Michael Paquier <michael.paquier@gmail.com> wrote: > Hi all, > > Please find attached a new set of 3 patches for REINDEX CONCURRENTLY (v11). > - 20130214_1_remove_reltoastidxid.patch > - 20130214_2_reindex_concurrently_v11.patch > - 20130214_3_reindex_concurrently_docs_v11.patch > Patch 1 needs to be applied before patches 2 and 3. > > 20130214_1_remove_reltoastidxid.patch is the patch removing reltoastidxid > (approach mentioned by Tom) to allow server to manipulate multiple indexes > of toast relations. Catalog views, system functions and pg_upgrade have been > updated in consequence by replacing reltoastidxid use by a join on > pg_index/pg_class. All the functions of tuptoaster.c now use > RelationGetIndexList to fetch the list of indexes on which depend a given > toast relation. There are no warnings, regressions are passing (here only an > update of rules.out and oidjoins has been necessary). > 20130214_2_reindex_concurrently_v11.patch depends on patch 1. It includes > the feature with all the fixes requested by Andres in his previous reviews. > Regressions are passing and I haven't seen any warnings. in this patch > concurrent rebuild of toast indexes is fully supported thanks to patch 1. > The kludge used in previous version to change reltoastidxid when swapping > indexes is not needed anymore, making swap code far cleaner. > 20130214_3_reindex_concurrently_docs_v11.patch includes the documentation of > REINDEX CONCURRENTLY. This might need some reshuffling with what is written > for CREATE INDEX CONCURRENTLY. > > I am now pretty happy with the way implementation is done, so I think that > the basic implementation architecture does not need to be changed. > Andres, I think that only a single round of review would be necessary now > before setting this patch as ready for committer. Thoughts? > > Comments, as well as reviews are welcome. When I compiled the HEAD with the patches, I got the following warnings. index.c:1273: warning: unused variable 'parentRel' execUtils.c:1199: warning: 'return' with no value, in function returning non-void When I ran REINDEX CONCURRENTLY for the same index from two different sessions, I got the deadlock. The error log is: ERROR: deadlock detected DETAIL: Process 37121 waits for ShareLock on virtual transaction 2/196; blocked by process 36413.Process 36413 waits for ShareUpdateExclusiveLock on relation 16457 of database 12293; blocked by process 37121.Process 37121: REINDEX TABLE CONCURRENTLY pgbench_accounts;Process 36413: REINDEXTABLE CONCURRENTLY pgbench_accounts; HINT: See server log for query details. STATEMENT: REINDEX TABLE CONCURRENTLY pgbench_accounts; And, after the REINDEX CONCURRENTLY that survived the deadlock finished, I found that new index with another name was created. It was NOT marked as INVALID. Are these behaviors intentional? =# \di pgbench_accounts* List of relationsSchema | Name | Type | Owner | Table --------+---------------------------+-------+----------+------------------public | pgbench_accounts_pkey | index | postgres| pgbench_accountspublic | pgbench_accounts_pkey_cct | index | postgres | pgbench_accounts (2 rows) Regards, -- Fujii Masao
Thanks for your review!
On Wed, Feb 20, 2013 at 12:14 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
When I compiled the HEAD with the patches, I got the following warnings.
index.c:1273: warning: unused variable 'parentRel'
execUtils.c:1199: warning: 'return' with no value, in function
returning non-void
Oops, corrected.
When I ran REINDEX CONCURRENTLY for the same index from two different
sessions, I got the deadlock. The error log is:
ERROR: deadlock detected
DETAIL: Process 37121 waits for ShareLock on virtual transaction
2/196; blocked by process 36413.
Process 36413 waits for ShareUpdateExclusiveLock on relation 16457 of
database 12293; blocked by process 37121.
Process 37121: REINDEX TABLE CONCURRENTLY pgbench_accounts;
Process 36413: REINDEX TABLE CONCURRENTLY pgbench_accounts;
HINT: See server log for query details.
STATEMENT: REINDEX TABLE CONCURRENTLY pgbench_accounts;
And, after the REINDEX CONCURRENTLY that survived the deadlock finished,
I found that new index with another name was created. It was NOT marked as
INVALID. Are these behaviors intentional?
This happens because of the following scenario:
- session 1: REINDEX CONCURRENTLY, which has not yet reached phase 3 where indexes are validated; the necessary ShareUpdateExclusiveLock locks are taken on the relations rebuilt.
- session 2: REINDEX CONCURRENTLY, waiting for a ShareUpdateExclusiveLock lock to be obtained; its transaction begins before session 1 reaches phase 3
- session 1: enters phase 3, and fails at WaitForOldSnapshots as session 2 has an older snapshot and is currently waiting for a lock held by session 1
- session 2: succeeds, but the concurrent index created by session 1 still exists
A ShareUpdateExclusiveLock is taken on the index or table that is going to be rebuilt just before calling ReindexRelationConcurrently. So the solution I have here is to make REINDEX CONCURRENTLY fail for session 2. REINDEX CONCURRENTLY is made to allow a table to run DML in parallel to the operation, so it doesn't look strange to me to make session 2 fail if REINDEX CONCURRENTLY is done in parallel on the same relation.
This fixes the problem of the concurrent index *_cct appearing after session 1 failed due to the deadlock in Masao's report.
The patch correcting this problem is attached.
Error message could be improved, here is what it is now when session 2 fails:
postgres=# reindex table concurrently aa;
ERROR: could not obtain lock on relation "aa"
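Roughly, the change is to try the lock conditionally and error out instead of waiting; a simplified sketch, not the exact patch code (ConditionalLockRelationOid being the existing lmgr routine), would be:

    /*
     * Simplified sketch: fail immediately if another session already holds a
     * conflicting lock on the relation to be reindexed, instead of waiting.
     */
    if (!ConditionalLockRelationOid(relationOid, ShareUpdateExclusiveLock))
        ereport(ERROR,
                (errcode(ERRCODE_LOCK_NOT_AVAILABLE),
                 errmsg("could not obtain lock on relation \"%s\"",
                        get_rel_name(relationOid))));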
Comments?
--
Michael
Attachment
On Thu, Feb 21, 2013 at 11:55 AM, Michael Paquier <michael.paquier@gmail.com> wrote: > A ShareUpdateExclusiveLock is taken on index or table that is going to be > rebuilt just before calling ReindexRelationConcurrently. So the solution I > have here is to make REINDEX CONCURRENTLY fail for session 2. REINDEX > CONCURRENTLY is made to allow a table to run DML in parallel to the > operation so it doesn't look strange to me to make session 2 fail if REINDEX > CONCURRENTLY is done in parallel on the same relation. Thanks for updating the patch! With updated patch, REINDEX CONCURRENTLY seems to fail even when SharedUpdateExclusiveLock is taken by the command other than REINDEX CONCURRENTLY, for example, VACUUM. Is this intentional? This behavior should be avoided. Otherwise, users might need to disable autovacuum whenever they run REINDEX CONCURRENTLY. With updated patch, unfortunately, I got the similar deadlock error when I ran REINDEX CONCURRENTLY in session1 and ANALYZE in session2. ERROR: deadlock detected DETAIL: Process 70551 waits for ShareLock on virtual transaction 3/745; blocked by process 70652.Process 70652 waits for ShareUpdateExclusiveLock on relation 17460 of database 12293; blocked by process 70551.Process 70551: REINDEX TABLE CONCURRENTLY pgbench_accounts;Process 70652: ANALYZEpgbench_accounts; HINT: See server log for query details. STATEMENT: REINDEX TABLE CONCURRENTLY pgbench_accounts; Like original problem that I reported, temporary index created by REINDEX CONCURRENTLY was NOT marked as INVALID. =# \di pgbench_accounts* List of relationsSchema | Name | Type | Owner | Table --------+---------------------------+-------+----------+------------------public | pgbench_accounts_pkey | index | postgres| pgbench_accountspublic | pgbench_accounts_pkey_cct | index | postgres | pgbench_accounts (2 rows) Regards, -- Fujii Masao
On Sat, Feb 23, 2013 at 2:14 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
On Thu, Feb 21, 2013 at 11:55 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> A ShareUpdateExclusiveLock is taken on index or table that is going to be
> rebuilt just before calling ReindexRelationConcurrently. So the solution I
> have here is to make REINDEX CONCURRENTLY fail for session 2. REINDEX
> CONCURRENTLY is made to allow a table to run DML in parallel to the
> operation so it doesn't look strange to me to make session 2 fail if REINDEX
> CONCURRENTLY is done in parallel on the same relation.
Thanks for updating the patch!
With updated patch, REINDEX CONCURRENTLY seems to fail even when
ShareUpdateExclusiveLock is taken by a command other than REINDEX
CONCURRENTLY, for example, VACUUM. Is this intentional? This behavior
should be avoided. Otherwise, users might need to disable autovacuum
whenever they run REINDEX CONCURRENTLY.
With updated patch, unfortunately, I got the similar deadlock error when I
ran REINDEX CONCURRENTLY in session1 and ANALYZE in session2.
Such deadlocks are also possible when running manual VACUUM with CREATE INDEX CONCURRENTLY. This is because ANALYZE can be included in a transaction that might do arbitrary operations on the parent table (see comments in indexcmds.c) between the index build and validation. So the only problem I see here is that the concurrent index is marked as VALID in the transaction when a deadlock occurs and REINDEX CONCURRENTLY fails, right?
ERROR: deadlock detected
DETAIL: Process 70551 waits for ShareLock on virtual transaction
3/745; blocked by process 70652.
Process 70652 waits for ShareUpdateExclusiveLock on relation 17460 of
database 12293; blocked by process 70551.
Process 70551: REINDEX TABLE CONCURRENTLY pgbench_accounts;
Process 70652: ANALYZE pgbench_accounts;
HINT: See server log for query details.
STATEMENT: REINDEX TABLE CONCURRENTLY pgbench_accounts;
Like the original problem that I reported, the temporary index created by REINDEX
CONCURRENTLY was NOT marked as INVALID.
=# \di pgbench_accounts*
List of relations
Schema | Name | Type | Owner | Table
--------+---------------------------+-------+----------+------------------
public | pgbench_accounts_pkey | index | postgres | pgbench_accounts
public | pgbench_accounts_pkey_cct | index | postgres | pgbench_accounts
(2 rows)
Btw, \di also prints invalid indexes...
OK, so what you want to see is the index being marked as not valid when a deadlock occurs with REINDEX CONCURRENTLY when an ANALYZE kicks in (btw, deadlocks are also possible with CREATE INDEX CONCURRENTLY when ANALYZE is done on a table; in this case the index is marked as not valid). So indeed there was a bug in my code for v12 and prior: if a deadlock occurred, the concurrent index was marked as valid.
I have been able to fix that with the updated patch attached, which removes the change done in v12 and checks for a deadlock at phase 3 before actually marking the index as valid (the opposite was done in v11 and below, making the indexes be seen as valid when the deadlock appeared).
So now here is what happens with a deadlock:
ioltas=# create table aa (a int);
CREATE TABLE
ioltas=# create index aap on aa (a);
CREATE INDEX
ioltas=# reindex index concurrently aap;
ERROR: deadlock detected
DETAIL: Process 32174 waits for ShareLock on virtual transaction 3/2; blocked by process 32190.
Process 32190 waits for ShareUpdateExclusiveLock on relation 16385 of database 16384; blocked by process 32174.
HINT: See server log for query details.
And how the relation remains after the deadlock:
ioltas=# \d aa
Table "public.aa"
Column | Type | Modifiers
--------+---------+-----------
a | integer |
Indexes:
"aap" btree (a)
"aap_cct" btree (a) INVALID
ioltas=# \di aa*
List of relations
Schema | Name | Type | Owner | Table
--------+---------+-------+--------+-------
public | aap | index | ioltas | aa
public | aap_cct | index | ioltas | aa
(2 rows)
The potential *problem* (actually that looks more to be a non-problem) is the case of REINDEX CONCURRENTLY run on a table with multiple indexes.
For example, let's take the case of a table with 2 indexes.
1) Session 1: Run REINDEX CONCURRENTLY on this table.
2) Session 2: Run ANALYZE on this table after 1st index has been validated but before the 2nd index is validated
3) Session 1: fails due to a deadlock, the table containing 3 valid indexes, the former 2 indexes and the 1st concurrent one that has been validated. The 2nd concurrent index is marked as not valid.
This can happen when REINDEX CONCURRENTLY conflicts with the following commands: CREATE INDEX CONCURRENTLY, another REINDEX CONCURRENTLY and ANALYZE. Note that the 1st concurrent index is perfectly valid, so user can still drop the 1st old index after the deadlock.
So, in the case of a single index being rebuilt with REINDEX CONCURRENTLY there are no problems, but there is a risk of multiplying the number of indexes on a table when it is used to rebuild multiple indexes at the same time with REINDEX TABLE CONCURRENTLY, or even REINDEX DATABASE CONCURRENTLY. I think that this feature can live with that as long as the user is aware of the risks when doing a REINDEX CONCURRENTLY that rebuilds more than 1 index at the same time. Comments?
Michael
Attachment
Andres, Masao, do you need an extra round of review or do you think this is ready to be marked as ready for committer?
On my side I have nothing more to add to the existing patches.
Thanks,
--
Michael
Hi, Michael Paquier <michael.paquier@gmail.com> schrieb: >Andres, Masao, do you need an extra round or review or do you think >this is >ready to be marked as committer? >On my side I have nothing more to add to the existing patches. I think they do need review before that - I won't be able to do another review before the weekend though. Andres --- Please excuse brevity and formatting - I am writing this on my mobile phone.
On Thu, Feb 28, 2013 at 4:56 PM, anarazel@anarazel.de <andres@anarazel.de> wrote:
Hi,
Michael Paquier <michael.paquier@gmail.com> wrote:
>Andres, Masao, do you need an extra round or review or do you think
>this is
>ready to be marked as committer?
>On my side I have nothing more to add to the existing patches.
I think they do need review before that - I won't be able to do another review before the weekend though.
Sure. Thanks.
--
Michael
On Thu, Feb 28, 2013 at 3:21 PM, Michael Paquier <michael.paquier@gmail.com> wrote: > Andres, Masao, do you need an extra round or review or do you think this is > ready to be marked as committer? > On my side I have nothing more to add to the existing patches. Sorry for the late reply. I found one problem in the latest patch. I got the segmentation fault when I executed the following SQLs. CREATE TABLE hoge (i int); CREATE INDEX hogeidx ON hoge(abs(i)); INSERT INTO hoge VALUES (generate_series(1,10)); REINDEX TABLE CONCURRENTLY hoge; The error messages are: LOG: server process (PID 33641) was terminated by signal 11: Segmentation fault DETAIL: Failed process was running: REINDEX TABLE CONCURRENTLY hoge; Regards, -- Fujii Masao
On Thu, Feb 28, 2013 at 11:26 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
I found one problem in the latest patch. I got the segmentation fault
when I executed the following SQLs.
CREATE TABLE hoge (i int);
CREATE INDEX hogeidx ON hoge(abs(i));
INSERT INTO hoge VALUES (generate_series(1,10));
REINDEX TABLE CONCURRENTLY hoge;
The error messages are:
LOG: server process (PID 33641) was terminated by signal 11: Segmentation fault
DETAIL: Failed process was running: REINDEX TABLE CONCURRENTLY hoge;
Oops. Index expressions were not correctly extracted when building columnNames for index_create in index_concurrent_create.
Fixed in this new patch. Thanks for catching that.
Michael
Attachment
On Fri, Mar 1, 2013 at 12:57 PM, Michael Paquier <michael.paquier@gmail.com> wrote: > On Thu, Feb 28, 2013 at 11:26 PM, Fujii Masao <masao.fujii@gmail.com> wrote: >> >> I found one problem in the latest patch. I got the segmentation fault >> when I executed the following SQLs. >> >> CREATE TABLE hoge (i int); >> CREATE INDEX hogeidx ON hoge(abs(i)); >> INSERT INTO hoge VALUES (generate_series(1,10)); >> REINDEX TABLE CONCURRENTLY hoge; >> >> The error messages are: >> >> LOG: server process (PID 33641) was terminated by signal 11: Segmentation >> fault >> DETAIL: Failed process was running: REINDEX TABLE CONCURRENTLY hoge; > > Oops. Index expressions were not correctly extracted when building > columnNames for index_create in index_concurrent_create. > Fixed in this new patch. Thanks for catching that. I found another problem in the latest patch. When I issued the following SQLs, I got the assertion failure. CREATE EXTENSION pg_trgm; CREATE TABLE hoge (col1 text); CREATE INDEX hogeidx ON hoge USING gin (col1 gin_trgm_ops) WITH (fastupdate = off); INSERT INTO hoge SELECT random()::text FROM generate_series(1,100); REINDEX TABLE CONCURRENTLY hoge; The error message that I got is: TRAP: FailedAssertion("!(((array)->elemtype) == 25)", File: "reloptions.c", Line: 874) LOG: server process (PID 45353) was terminated by signal 6: Abort trap DETAIL: Failed process was running: REINDEX TABLE CONCURRENTLY hoge; ISTM that the patch doesn't handle the gin option "fastupdate = off" correctly. Anyway, I think you should test whether REINDEX CONCURRENTLY goes well with every type of indexes, before posting the next patch. Otherwise, I might find another problem ;P @@ -1944,7 +2272,8 @@ index_build(Relation heapRelation, Relation indexRelation, IndexInfo *indexInfo, bool isprimary, - bool isreindex) + bool isreindex, + bool istoastupdate) istoastupdate seems to be unused. Regards, -- Fujii Masao
On Sat, Mar 2, 2013 at 2:43 AM, Fujii Masao <masao.fujii@gmail.com> wrote: >> Fixed in this new patch. Thanks for catching that. After make installcheck finished, I connected to the "regression" database and issued "REINDEX DATABASE CONCURRENTLY regression", then I got the error: ERROR: constraints cannot have index expressions STATEMENT: REINDEX DATABASE CONCURRENTLY regression; OTOH "REINDEX DATABASE regression" did not generate an error. Is this a bug? Regards, -- Fujii Masao
REINDEX CONCURRENTLY resets the statistics in pg_stat_user_indexes, whereas plain REINDEX does not. I think they should be preserved in either case.
On 2013-03-01 16:32:19 -0500, Peter Eisentraut wrote: > REINDEX CONCURRENTLY resets the statistics in pg_stat_user_indexes, > whereas plain REINDEX does not. I think they should be preserved in > either case. Yes. Imo this further suggests that it would be better to switch the relfilenodes (+indisclustered) of the two indexes instead of switching the names. That would allow to get rid of the code for moving over dependencies as well. Given we use an exclusive lock for the switchover phase anyway, there's not much point in going for the name-based switch. Especially as some eventual mvcc-correct system access would be fine with the relfilenode method. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Hi,
Please find attached an updated patch fixing the following issues:
- gin and gist indexes are now rebuilt correctly. Some option values were not passed to the concurrent indexes (reported by Masao)
- swap is done with relfilenode and not names. As a consequence, pg_stat_user_indexes is no longer reset (reported by Peter).
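Condensed, the swap now boils down to exchanging the relfilenode values of the old and new index in their pg_class rows (simplified from index_concurrent_swap in the patch; the index keeps its OID and name, which is why its statistics are preserved):

    /* Simplified sketch of the relfilenode-based swap */
    Oid    tmpnode = oldIndexForm->relfilenode;

    oldIndexForm->relfilenode = newIndexForm->relfilenode;
    newIndexForm->relfilenode = tmpnode;

    simple_heap_update(pg_class, &oldIndexTuple->t_self, oldIndexTuple);
    simple_heap_update(pg_class, &newIndexTuple->t_self, newIndexTuple);
    CatalogUpdateIndexes(pg_class, oldIndexTuple);
    CatalogUpdateIndexes(pg_class, newIndexTuple);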
I am looking at the issue reported previously with make installcheck.
Regards,
--
Michael
Please find attached an updated patch fixing the following issues:
- gin and gist indexes are now rebuilt correctly. Some option values were not passed to the concurrent indexes (reported by Masao)
- swap is done with relfilenode and not names. In consequence pg_stat_user_indexes is not reset (reported by Peter).
I am looking at the issue reported previously with make installcheck.
Regards,
On Sun, Mar 3, 2013 at 9:54 PM, Andres Freund <andres@2ndquadrant.com> wrote:
Yes. Imo this further suggests that it would be better to switch theOn 2013-03-01 16:32:19 -0500, Peter Eisentraut wrote:
> REINDEX CONCURRENTLY resets the statistics in pg_stat_user_indexes,
> whereas plain REINDEX does not. I think they should be preserved in
> either case.
relfilenodes (+indisclustered) of the two indexes instead of switching
the names. That would allow to get rid of the code for moving over
dependencies as well.
Given we use an exclusive lock for the switchover phase anyway, there's
not much point in going for the name-based switch. Especially as some
eventual mvcc-correct system access would be fine with the relfilenode
method.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Michael
Attachment
Hi all,
Please find attached a patch fixing the last issue that Masao found with make installcheck. Now REINDEX DATABASE CONCURRENTLY on the regression database passes. There were 2 problems:
- Concurrent indexes for unique indexes using expressions were not correctly created
- Concurrent indexes for indexes with duplicate column names were not correctly created.
So, this solves the last issue currently on the stack. I added some new regression tests to cover those problems.
Regards,
--
Michael
Attachment
Hi, Have you benchmarked the toastrelidx removal stuff in any way? If not, thats fine, but if yes I'd be interested. On 2013-03-04 22:33:53 +0900, Michael Paquier wrote: > --- a/src/backend/access/heap/tuptoaster.c > +++ b/src/backend/access/heap/tuptoaster.c > @@ -1238,7 +1238,7 @@ toast_save_datum(Relation rel, Datum value, > struct varlena * oldexternal, int options) > { > Relation toastrel; > - Relation toastidx; > + Relation *toastidxs; > HeapTuple toasttup; > TupleDesc toasttupDesc; > Datum t_values[3]; > @@ -1257,15 +1257,26 @@ toast_save_datum(Relation rel, Datum value, > char *data_p; > int32 data_todo; > Pointer dval = DatumGetPointer(value); > + ListCell *lc; > + int count = 0; I find count a confusing name for a loop iteration variable... i of orr, idxno, or ... > + int num_indexes; > > /* > * Open the toast relation and its index. We can use the index to check > * uniqueness of the OID we assign to the toasted item, even though it has > - * additional columns besides OID. > + * additional columns besides OID. A toast table can have multiple identical > + * indexes associated to it. > */ > toastrel = heap_open(rel->rd_rel->reltoastrelid, RowExclusiveLock); > toasttupDesc = toastrel->rd_att; > - toastidx = index_open(toastrel->rd_rel->reltoastidxid, RowExclusiveLock); > + if (toastrel->rd_indexvalid == 0) > + RelationGetIndexList(toastrel); Hm, I think we should move this into a macro, this is cropping up at more and more places. > - index_insert(toastidx, t_values, t_isnull, > - &(toasttup->t_self), > - toastrel, > - toastidx->rd_index->indisunique ? > - UNIQUE_CHECK_YES : UNIQUE_CHECK_NO); > + for (count = 0; count < num_indexes; count++) > + index_insert(toastidxs[count], t_values, t_isnull, > + &(toasttup->t_self), > + toastrel, > + toastidxs[count]->rd_index->indisunique ? > + UNIQUE_CHECK_YES : UNIQUE_CHECK_NO); The indisunique check looks like a copy & pasto to me, albeit not yours... > > /* > * Create the TOAST pointer value that we'll return > @@ -1475,10 +1493,13 @@ toast_delete_datum(Relation rel, Datum value) > struct varlena *attr = (struct varlena *) DatumGetPointer(value); > struct varatt_external toast_pointer; > + /* > + * We actually use only the first index but taking a lock on all is > + * necessary. > + */ Hm, is it guaranteed that the first index is valid? > + foreach(lc, toastrel->rd_indexlist) > + toastidxs[count++] = index_open(lfirst_oid(lc), RowExclusiveLock); > /* > - * If we're swapping two toast tables by content, do the same for their > - * indexes. > + * If we're swapping two toast tables by content, do the same for all of > + * their indexes. The swap can actually be safely done only if all the indexes > + * have valid Oids. > */ What's an index without a valid oid? > if (swap_toast_by_content && > - relform1->reltoastidxid && relform2->reltoastidxid) > - swap_relation_files(relform1->reltoastidxid, > - relform2->reltoastidxid, > - target_is_pg_class, > - swap_toast_by_content, > - InvalidTransactionId, > - InvalidMultiXactId, > - mapped_tables); > + relform1->reltoastrelid && > + relform2->reltoastrelid) > + { > + Relation toastRel1, toastRel2; > + > + /* Open relations */ > + toastRel1 = heap_open(relform1->reltoastrelid, RowExclusiveLock); > + toastRel2 = heap_open(relform2->reltoastrelid, RowExclusiveLock); Shouldn't those be Access Exlusive Locks? 
> + /* Obtain index list if necessary */ > + if (toastRel1->rd_indexvalid == 0) > + RelationGetIndexList(toastRel1); > + if (toastRel2->rd_indexvalid == 0) > + RelationGetIndexList(toastRel2); > + > + /* Check if the swap is possible for all the toast indexes */ So there's no error being thrown if this turns out not to be possible? > + if (!list_member_oid(toastRel1->rd_indexlist, InvalidOid) && > + !list_member_oid(toastRel2->rd_indexlist, InvalidOid) && > + list_length(toastRel1->rd_indexlist) == list_length(toastRel2->rd_indexlist)) > + { > + ListCell *lc1, *lc2; > + > + /* Now swap each couple */ > + lc2 = list_head(toastRel2->rd_indexlist); > + foreach(lc1, toastRel1->rd_indexlist) > + { > + Oid indexOid1 = lfirst_oid(lc1); > + Oid indexOid2 = lfirst_oid(lc2); > + swap_relation_files(indexOid1, > + indexOid2, > + target_is_pg_class, > + swap_toast_by_content, > + InvalidTransactionId, > + InvalidMultiXactId, > + mapped_tables); > + lc2 = lnext(lc2); > + } > + } > + > + heap_close(toastRel1, RowExclusiveLock); > + heap_close(toastRel2, RowExclusiveLock); > + } > /* rename the toast table ... */ > @@ -1528,11 +1563,23 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap, > RenameRelationInternal(newrel->rd_rel->reltoastrelid, > NewToastName); > > - /* ... and its index too */ > - snprintf(NewToastName, NAMEDATALEN, "pg_toast_%u_index", > - OIDOldHeap); > - RenameRelationInternal(toastidx, > - NewToastName); > + /* ... and its indexes too */ > + foreach(lc, toastrel->rd_indexlist) > + { > + /* > + * The first index keeps the former toast name and the > + * following entries are thought as being concurrent indexes. > + */ > + if (count == 0) > + snprintf(NewToastName, NAMEDATALEN, "pg_toast_%u_index", > + OIDOldHeap); > + else > + snprintf(NewToastName, NAMEDATALEN, "pg_toast_%u_index_cct%d", > + OIDOldHeap, count); > + RenameRelationInternal(lfirst_oid(lc), > + NewToastName); > + count++; > + } Hm. It seems wrong that this layer needs to know about _cct. > /* > - * Calculate total on-disk size of a TOAST relation, including its index. > + * Calculate total on-disk size of a TOAST relation, including its indexes. > * Must not be applied to non-TOAST relations. > */ > static int64 > @@ -340,8 +340,8 @@ calculate_toast_table_size(Oid toastrelid) > { > ... > + /* Size is evaluated based on the first index available */ Uh. Why? Imo all indexes should be counted. > + foreach(lc, toastRel->rd_indexlist) > + { > + Relation toastIdxRel; > + toastIdxRel = relation_open(lfirst_oid(lc), > + AccessShareLock); > + for (forkNum = 0; forkNum <= MAX_FORKNUM; forkNum++) > + size += calculate_relation_size(&(toastIdxRel->rd_node), > + toastIdxRel->rd_backend, forkNum); > + > + relation_close(toastIdxRel, AccessShareLock); > + } > -#define CATALOG_VERSION_NO 201302181 > +#define CATALOG_VERSION_NO 20130219 Think you forgot a digit here ;) > /* > * This case is currently not supported, but there's no way to ask for it > - * in the grammar anyway, so it can't happen. > + * in the grammar anyway, so it can't happen. This might be called during a > + * conccurrent reindex operation, in this case sufficient locks are already > + * taken on the related relations. > */ I'd rather change that to something like /** This case is currently only supported during a concurrent index* rebuild, but there is no way to ask for it in the grammarotherwise* anyway.*/ Or similar. > + > +/* > + * index_concurrent_create > + * > + * Create an index based on the given one that will be used for concurrent > + * operations. 
The index is inserted into catalogs and needs to be built later > + * on. This is called during concurrent index processing. The heap relation > + * on which is based the index needs to be closed by the caller. > + */ > +Oid > +index_concurrent_create(Relation heapRelation, Oid indOid, char *concurrentName) > +{ > ... > + /* > + * Determine if index is initdeferred, this depends on its dependent > + * constraint. > + */ > + if (OidIsValid(constraintOid)) > + { > + /* Look for the correct value */ > + HeapTuple constTuple; > + Form_pg_constraint constraint; > + > + constTuple = SearchSysCache1(CONSTROID, > + ObjectIdGetDatum(constraintOid)); > + if (!HeapTupleIsValid(constTuple)) > + elog(ERROR, "cache lookup failed for constraint %u", > + constraintOid); > + constraint = (Form_pg_constraint) GETSTRUCT(constTuple); > + initdeferred = constraint->condeferred; > + > + ReleaseSysCache(constTuple); > + } Very, very nitpicky, but I find "constTuple" to be confusing, I thought at first it meant that the tuple shouldn't be modified or something. > + /* > + * Index is considered as a constraint if it is PRIMARY KEY or EXCLUSION. > + */ > + isconstraint = indexRelation->rd_index->indisprimary || > + indexRelation->rd_index->indisexclusion; unique constraints aren't mattering here? > +/* > + * index_concurrent_swap > + * > + * Replace old index by old index in a concurrent context. For the time being > + * what is done here is switching the relation relfilenode of the indexes. If > + * extra operations are necessary during a concurrent swap, processing should > + * be added here. AccessExclusiveLock is taken on the index relations that are > + * swapped until the end of the transaction where this function is called. > + */ > +void > +index_concurrent_swap(Oid newIndexOid, Oid oldIndexOid) > +{ > + Relation oldIndexRel, newIndexRel, pg_class; > + HeapTuple oldIndexTuple, newIndexTuple; > + Form_pg_class oldIndexForm, newIndexForm; > + Oid tmpnode; > + > + /* > + * Take an exclusive lock on the old and new index before swapping them. > + */ > + oldIndexRel = relation_open(oldIndexOid, AccessExclusiveLock); > + newIndexRel = relation_open(newIndexOid, AccessExclusiveLock); > + > + /* Now swap relfilenode of those indexes */ Any chance to reuse swap_relation_files here? Not sure whether it would be beneficial given that it is more generic and normally works on a relation level... We probably should remove the fsm of the index altogether after this? 
> + pg_class = heap_open(RelationRelationId, RowExclusiveLock); > + > + oldIndexTuple = SearchSysCacheCopy1(RELOID, > + ObjectIdGetDatum(oldIndexOid)); > + if (!HeapTupleIsValid(oldIndexTuple)) > + elog(ERROR, "could not find tuple for relation %u", oldIndexOid); > + newIndexTuple = SearchSysCacheCopy1(RELOID, > + ObjectIdGetDatum(newIndexOid)); > + if (!HeapTupleIsValid(newIndexTuple)) > + elog(ERROR, "could not find tuple for relation %u", newIndexOid); > + oldIndexForm = (Form_pg_class) GETSTRUCT(oldIndexTuple); > + newIndexForm = (Form_pg_class) GETSTRUCT(newIndexTuple); > + > + /* Here is where the actual swapping happens */ > + tmpnode = oldIndexForm->relfilenode; > + oldIndexForm->relfilenode = newIndexForm->relfilenode; > + newIndexForm->relfilenode = tmpnode; > + > + /* Then update the tuples for each relation */ > + simple_heap_update(pg_class, &oldIndexTuple->t_self, oldIndexTuple); > + simple_heap_update(pg_class, &newIndexTuple->t_self, newIndexTuple); > + CatalogUpdateIndexes(pg_class, oldIndexTuple); > + CatalogUpdateIndexes(pg_class, newIndexTuple); > + > + /* Close relations and clean up */ > + heap_close(pg_class, RowExclusiveLock); > + > + /* The lock taken previously is not released until the end of transaction */ > + relation_close(oldIndexRel, NoLock); > + relation_close(newIndexRel, NoLock); It might be worthwile adding a heap_freetuple here for (old, new)IndexTuple, just to spare the reader the thinking whether it needs to be done. > +/* > + * index_concurrent_drop > + * > + * Drop a single index concurrently as the last step of an index concurrent > + * process Deletion is done through performDeletion or dependencies of the > + * index are not dropped. At this point all the indexes are already considered > + * as invalid and dead so they can be dropped without using any concurrent > + * options. > + */ "or dependencies of the index would not get dropped"? > +void > +index_concurrent_drop(Oid indexOid) > +{ > + Oid constraintOid = get_index_constraint(indexOid); > + ObjectAddress object; > + Form_pg_index indexForm; > + Relation pg_index; > + HeapTuple indexTuple; > + bool indislive; > + > + /* > + * Check that the index dropped here is not alive, it might be used by > + * other backends in this case. > + */ > + pg_index = heap_open(IndexRelationId, RowExclusiveLock); > + > + indexTuple = SearchSysCacheCopy1(INDEXRELID, > + ObjectIdGetDatum(indexOid)); > + if (!HeapTupleIsValid(indexTuple)) > + elog(ERROR, "cache lookup failed for index %u", indexOid); > + indexForm = (Form_pg_index) GETSTRUCT(indexTuple); > + indislive = indexForm->indislive; > + > + /* Clean up */ > + heap_close(pg_index, RowExclusiveLock); > + > + /* Leave if index is still alive */ > + if (indislive) > + return; This seems like a confusing path? Why is it valid to get here with a valid index and why is it ok to silently ignore that case? > /* > + * ReindexRelationConcurrently > + * > + * Process REINDEX CONCURRENTLY for given relation Oid. The relation can be > + * either an index or a table. If a table is specified, each reindexing step > + * is done in parallel with all the table's indexes as well as its dependent > + * toast indexes. 
> + */ > +bool > +ReindexRelationConcurrently(Oid relationOid) > +{ > + List *concurrentIndexIds = NIL, > + *indexIds = NIL, > + *parentRelationIds = NIL, > + *lockTags = NIL, > + *relationLocks = NIL; > + ListCell *lc, *lc2; > + Snapshot snapshot; > + > + /* > + * Extract the list of indexes that are going to be rebuilt based on the > + * list of relation Oids given by caller. For each element in given list, > + * If the relkind of given relation Oid is a table, all its valid indexes > + * will be rebuilt, including its associated toast table indexes. If > + * relkind is an index, this index itself will be rebuilt. The locks taken > + * parent relations and involved indexes are kept until this transaction > + * is committed to protect against schema changes that might occur until > + * the session lock is taken on each relation. > + */ > + switch (get_rel_relkind(relationOid)) > + { > + case RELKIND_RELATION: > + { > + /* > + * In the case of a relation, find all its indexes > + * including toast indexes. > + */ > + Relation heapRelation = heap_open(relationOid, > + ShareUpdateExclusiveLock); > + > + /* Track this relation for session locks */ > + parentRelationIds = lappend_oid(parentRelationIds, relationOid); > + > + /* Relation on which is based index cannot be shared */ > + if (heapRelation->rd_rel->relisshared) > + ereport(ERROR, > + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > + errmsg("concurrent reindex is not supported for shared relations"))); > + > + /* Add all the valid indexes of relation to list */ > + foreach(lc2, RelationGetIndexList(heapRelation)) Hm. This means we will not notice having about-to-be dropped indexes around. Which seems safe because locks will prevent that anyway... > + default: > + /* nothing to do */ > + break; Shouldn't we error out? > + foreach(lc, indexIds) > + { > + Relation indexRel; > + Oid indOid = lfirst_oid(lc); > + Oid concurrentOid = lfirst_oid(lc2); > + bool primary; > + > + /* Move to next concurrent item */ > + lc2 = lnext(lc2); forboth() > + /* > + * Phase 3 of REINDEX CONCURRENTLY > + * > + * During this phase the concurrent indexes catch up with the INSERT that > + * might have occurred in the parent table and are marked as valid once done. > + * > + * We once again wait until no transaction can have the table open with > + * the index marked as read-only for updates. Each index validation is done > + * with a separate transaction to avoid opening transaction for an > + * unnecessary too long time. > + */ Maybe I am being dumb because I have the feeling I said differently in the past, but why do we not need a WaitForMultipleVirtualLocks() here? The comment seems to say we need to do so. > + /* > + * Perform a scan of each concurrent index with the heap, then insert > + * any missing index entries. > + */ > + foreach(lc, concurrentIndexIds) > + { > + Oid indOid = lfirst_oid(lc); > + Oid relOid; > + > + /* Open separate transaction to validate index */ > + StartTransactionCommand(); > + > + /* Get the parent relation Oid */ > + relOid = IndexGetRelation(indOid, false); > + > + /* > + * Take the reference snapshot that will be used for the concurrent indexes > + * validation. > + */ > + snapshot = RegisterSnapshot(GetTransactionSnapshot()); > + PushActiveSnapshot(snapshot); > + > + /* Validate index, which might be a toast */ > + validate_index(relOid, indOid, snapshot); > + > + /* > + * This concurrent index is now valid as they contain all the tuples > + * necessary. 
However, it might not have taken into account deleted tuples > + * before the reference snapshot was taken, so we need to wait for the > + * transactions that might have older snapshots than ours. > + */ > + WaitForOldSnapshots(snapshot); > + > + /* > + * Concurrent index can now be marked as valid -- update pg_index > + * entries. > + */ > + index_set_state_flags(indOid, INDEX_CREATE_SET_VALID); > + > + /* > + * The pg_index update will cause backends to update its entries for the > + * concurrent index but it is necessary to do the same thing for cache. > + */ > + CacheInvalidateRelcacheByRelid(relOid); > + > + /* we can now do away with our active snapshot */ > + PopActiveSnapshot(); > + > + /* And we can remove the validating snapshot too */ > + UnregisterSnapshot(snapshot); > + > + /* Commit this transaction to make the concurrent index valid */ > + CommitTransactionCommand(); > + } > + /* > + * Phase 5 of REINDEX CONCURRENTLY > + * > + * The concurrent indexes now hold the old relfilenode of the other indexes > + * transactions that might use them. Each operation is performed with a > + * separate transaction. > + */ > + > + /* Now mark the concurrent indexes as not ready */ > + foreach(lc, concurrentIndexIds) > + { > + Oid indOid = lfirst_oid(lc); > + Oid relOid; > + > + StartTransactionCommand(); > + relOid = IndexGetRelation(indOid, false); > + > + /* > + * Finish the index invalidation and set it as dead. It is not > + * necessary to wait for virtual locks on the parent relation as it > + * is already sure that this session holds sufficient locks.s > + */ tiny typo (lock.s) > + /* > + * Phase 6 of REINDEX CONCURRENTLY > + * > + * Drop the concurrent indexes. This needs to be done through > + * performDeletion or related dependencies will not be dropped for the old > + * indexes. The internal mechanism of DROP INDEX CONCURRENTLY is not used > + * as here the indexes are already considered as dead and invalid, so they > + * will not be used by other backends. > + */ > + foreach(lc, concurrentIndexIds) > + { > + Oid indexOid = lfirst_oid(lc); > + > + /* Start transaction to drop this index */ > + StartTransactionCommand(); > + > + /* Get fresh snapshot for next step */ > + PushActiveSnapshot(GetTransactionSnapshot()); > + > + /* > + * Open transaction if necessary, for the first index treated its > + * transaction has been already opened previously. > + */ > + index_concurrent_drop(indexOid); > + > + /* > + * For the last index to be treated, do not commit transaction yet. > + * This will be done once all the locks on indexes and parent relations > + * are released. > + */ Hm. This doesn't seem to commit the last transaction at all right now? Not sure why UnlockRelationIdForSession needs to be run in a transaction anyway? > + if (indexOid != llast_oid(concurrentIndexIds)) > + { > + /* We can do away with our snapshot */ > + PopActiveSnapshot(); > + > + /* Commit this transaction to make the update visible. */ > + CommitTransactionCommand(); > + } > + } > + > + /* > + * Last thing to do is release the session-level lock on the parent table > + * and the indexes of table. > + */ > + foreach(lc, relationLocks) > + { > + LockRelId lockRel = * (LockRelId *) lfirst(lc); > + UnlockRelationIdForSession(&lockRel, ShareUpdateExclusiveLock); > + } > + > + return true; > +} > + > + > + /* > + * Check the case of a system index that might have been invalidated by a > + * failed concurrent process and allow its drop. > + */ This is only possible for toast indexes right now, right? 
If so, the comment should mention that. > + if (IsSystemClass(classform) && > + relkind == RELKIND_INDEX) > + { > + HeapTuple locTuple; > + Form_pg_index indexform; > + bool indisvalid; > + > + locTuple = SearchSysCache1(INDEXRELID, ObjectIdGetDatum(state->heapOid)); > + if (!HeapTupleIsValid(locTuple)) > + { > + ReleaseSysCache(tuple); > + return; > + } > + > + indexform = (Form_pg_index) GETSTRUCT(locTuple); > + indisvalid = indexform->indisvalid; > + ReleaseSysCache(locTuple); > + > + /* Leave if index entry is not valid */ > + if (!indisvalid) > + { > + ReleaseSysCache(tuple); > + return; > + } > + } > + Ok, thats what I have for now... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Thanks for the review. All your comments are addressed and updated patches are attached.
Please see below for the details, and if you find anything else just let me know.
Michael
On Tue, Mar 5, 2013 at 6:27 PM, Andres Freund <andres@2ndquadrant.com> wrote:
Have you benchmarked the toastrelidx removal stuff in any way? If not,
that's fine, but if yes I'd be interested.
No, I haven't. Is it really that easily measurable? I don't think so, but I would also be interested in seeing such results.
On 2013-03-04 22:33:53 +0900, Michael Paquier wrote:
> + ListCell *lc;
> + int count = 0;
I find count a confusing name for a loop iteration variable... i, idxno,
or ...
That's mostly a matter of personal style... but done for all the functions I modified in this file.
> + if (toastrel->rd_indexvalid == 0)
> + RelationGetIndexList(toastrel);
Hm, I think we should move this into a macro, this is cropping up at
more and more places.
This is not necessary: RelationGetIndexList does a similar check at its top, so I simply removed all those caller-side checks.
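To illustrate (rough sketch only, not the actual patch hunk; toastrel stands in for whatever relation the caller holds):

    /* Before: callers guarded the call themselves. */
    if (toastrel->rd_indexvalid == 0)
        (void) RelationGetIndexList(toastrel);

    /*
     * After: rely on RelationGetIndexList performing the equivalent check
     * internally and simply use the list it returns.
     */
    List   *indexlist = RelationGetIndexList(toastrel);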
> + for (count = 0; count < num_indexes; count++)
> + index_insert(toastidxs[count], t_values, t_isnull,
> + &(toasttup->t_self),
> + toastrel,
> + toastidxs[count]->rd_index->indisunique ?
> + UNIQUE_CHECK_YES : UNIQUE_CHECK_NO);
The indisunique check looks like a copy & pasto to me, albeit not
yours...
Yes, it is normally the same for all the indexes, but it looks more robust to me to keep it as it is. So, unchanged.
> + /*
> + * We actually use only the first index but taking a lock on all is
> + * necessary.
> + */
Hm, is it guaranteed that the first index is valid?
Not at all. Fixed: if all the indexes are invalid, an error is now returned.
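Roughly, the new check has this shape (illustrative sketch only, variable names assumed, not the exact patch code):

    /* Scan the toast indexes and refuse to continue if none of them is valid. */
    bool        found_valid = false;
    int         i;

    for (i = 0; i < num_indexes; i++)
    {
        if (toastidxs[i]->rd_index->indisvalid)
        {
            found_valid = true;
            break;
        }
    }

    if (!found_valid)
        elog(ERROR, "no valid index found for toast relation with Oid %u",
             RelationGetRelid(toastrel));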
> + * If we're swapping two toast tables by content, do the same for all of
> + * their indexes. The swap can actually be safely done only if all the indexes
> + * have valid Oids.
What's an index without a valid oid?
That's a good question... I re-read the code and it didn't make any sense, so I switched to a check for an empty index list on both relations.
> + /* Open relations */
> + toastRel1 = heap_open(relform1->reltoastrelid, RowExclusiveLock);
> + toastRel2 = heap_open(relform2->reltoastrelid, RowExclusiveLock);
Shouldn't those be Access Exclusive Locks?
Yeah, that seems better for this swap.
> + /* Obtain index list if necessary */
> + if (toastRel1->rd_indexvalid == 0)
> + RelationGetIndexList(toastRel1);
> + if (toastRel2->rd_indexvalid == 0)
> + RelationGetIndexList(toastRel2);
> +
> + /* Check if the swap is possible for all the toast indexes */
So there's no error being thrown if this turns out not to be possible?
There were no errors in the former code path either... This should fail silently, no?
> + if (count == 0)
> + snprintf(NewToastName, NAMEDATALEN, "pg_toast_%u_index",
> + OIDOldHeap);
> + else
> + snprintf(NewToastName, NAMEDATALEN, "pg_toast_%u_index_cct%d",
> + OIDOldHeap, count);
> + RenameRelationInternal(lfirst_oid(lc),
> + NewToastName);
> + count++;
> + }
Hm. It seems wrong that this layer needs to know about _cct.
Any other ideas? For the time being I removed _cct and only added a suffix based on the index number...
> /*
> - * Calculate total on-disk size of a TOAST relation, including its index.
> + * Calculate total on-disk size of a TOAST relation, including its indexes.
> * Must not be applied to non-TOAST relations.
> */
> static int64
> @@ -340,8 +340,8 @@ calculate_toast_table_size(Oid toastrelid)
> {
> ...
> + /* Size is evaluated based on the first index available */
Uh. Why? Imo all indexes should be counted.
They are! Only the comment was incorrect. Fixed.
> -#define CATALOG_VERSION_NO 201302181
> +#define CATALOG_VERSION_NO 20130219
Think you forgot a digit here ;)
Fixed.
/*
* This case is currently only supported during a concurrent index
* rebuild, but there is no way to ask for it in the grammar otherwise
* anyway.
*/
Or similar.
Makes sense. Thanks.
> + ReleaseSysCache(constTuple);
> + }
Very, very nitpicky, but I find "constTuple" to be confusing, I thought
at first it meant that the tuple shouldn't be modified or something.
Made that clear.
> + /*
> + * Index is considered as a constraint if it is PRIMARY KEY or EXCLUSION.
> + */
> + isconstraint = indexRelation->rd_index->indisprimary ||
> + indexRelation->rd_index->indisexclusion;
unique constraints aren't mattering here?
No, they are not. Unique indexes are not counted as constraints in index_create. Previous versions of the patch did count them, but there are issues with unique indexes using expressions.
> +/*
> + * index_concurrent_swap
> + *
> + * Replace old index by old index in a concurrent context. For the time being
> + * what is done here is switching the relation relfilenode of the indexes. If
> + * extra operations are necessary during a concurrent swap, processing should
> + * be added here. AccessExclusiveLock is taken on the index relations that are
> + * swapped until the end of the transaction where this function is called.
> + */
> +void
> +index_concurrent_swap(Oid newIndexOid, Oid oldIndexOid)
> +{
> + Relation oldIndexRel, newIndexRel, pg_class;
> + HeapTuple oldIndexTuple, newIndexTuple;
> + Form_pg_class oldIndexForm, newIndexForm;
> + Oid tmpnode;
> +
> + /*
> + * Take an exclusive lock on the old and new index before swapping them.
> + */
> + oldIndexRel = relation_open(oldIndexOid, AccessExclusiveLock);
> + newIndexRel = relation_open(newIndexOid, AccessExclusiveLock);
> +
> + /* Now swap relfilenode of those indexes */
Any chance to reuse swap_relation_files here? Not sure whether it would
be beneficial given that it is more generic and normally works on a
relation level...
Hmm, I am not sure. The current approach is sufficient to my mind.
We probably should remove the fsm of the index altogether after this?
The freespace map? Not sure it is necessary here. Isn't it going to be removed with the relation anyway?
> + /* The lock taken previously is not released until the end of transaction */
> + relation_close(oldIndexRel, NoLock);
> + relation_close(newIndexRel, NoLock);
It might be worthwile adding a heap_freetuple here for (old,
new)IndexTuple, just to spare the reader the thinking whether it needs
to be done.
Indeed, I forgot some cleanup here. Fixed.
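The added cleanup is essentially of this shape (sketch only):

    /* Free the copied pg_class tuples once they are no longer needed. */
    heap_freetuple(oldIndexTuple);
    heap_freetuple(newIndexTuple);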
> +/*
> + * index_concurrent_drop
> + */
"or dependencies of the index would not get dropped"?
Fixed.
> +void
> +index_concurrent_drop(Oid indexOid)
> +{
> + Oid constraintOid = get_index_constraint(indexOid);
> + ObjectAddress object;
> + Form_pg_index indexForm;
> + Relation pg_index;
> + HeapTuple indexTuple;
> + bool indislive;
> +
> + /*
> + * Check that the index dropped here is not alive, it might be used by
> + * other backends in this case.
> + */
> + pg_index = heap_open(IndexRelationId, RowExclusiveLock);
> +
> + indexTuple = SearchSysCacheCopy1(INDEXRELID,
> + ObjectIdGetDatum(indexOid));
> + if (!HeapTupleIsValid(indexTuple))
> + elog(ERROR, "cache lookup failed for index %u", indexOid);
> + indexForm = (Form_pg_index) GETSTRUCT(indexTuple);
> + indislive = indexForm->indislive;
> +
> + /* Clean up */
> + heap_close(pg_index, RowExclusiveLock);
> +
> + /* Leave if index is still alive */
> + if (indislive)
> + return;
This seems like a confusing path? Why is it valid to get here with a
valid index and why is it ok to silently ignore that case?
I added that because of a comment in one of the past reviews. Personally, I think it makes more sense to remove it for clarity.
> + case RELKIND_RELATION:
> + {
> + /*
> + * In the case of a relation, find all its indexes
> + * including toast indexes.
> + */
> + Relation heapRelation = heap_open(relationOid,
> + ShareUpdateExclusiveLock);
Hm. This means we will not notice having about-to-be dropped indexes
around. Which seems safe because locks will prevent that anyway...
I think that's OK as-is.
> + default:
> + /* nothing to do */
> + break;
Shouldn't we error out?
I don't think so. For example, what if the relation is a materialized view? REINDEX DATABASE could then end in an error because a materialized view is listed as a relation to reindex. I prefer having this path fail silently and simply leave if there are no indexes.
> + foreach(lc, indexIds)
> + {
> + Relation indexRel;
> + Oid indOid = lfirst_oid(lc);
> + Oid concurrentOid = lfirst_oid(lc2);
> + bool primary;
> +
> + /* Move to next concurrent item */
> + lc2 = lnext(lc2);
forboth()
Oh, I didn't know this trick. Thanks.
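For the archives, forboth() from pg_list.h walks two lists in lockstep, so the manual lnext() juggling goes away; the rewritten loop ends up roughly like this (sketch only, simplified from the patch):

    ListCell   *lc;
    ListCell   *lc2;

    forboth(lc, indexIds, lc2, concurrentIndexIds)
    {
        Oid         indOid = lfirst_oid(lc);
        Oid         concurrentOid = lfirst_oid(lc2);

        /* ... rebuild the pair (indOid, concurrentOid) here ... */
    }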
> + /*
> + * Phase 3 of REINDEX CONCURRENTLY
> + *
> + * During this phase the concurrent indexes catch up with the INSERT that
> + * might have occurred in the parent table and are marked as valid once done.
> + *
> + * We once again wait until no transaction can have the table open with
> + * the index marked as read-only for updates. Each index validation is done
> + * with a separate transaction to avoid opening transaction for an
> + * unnecessary too long time.
> + */
Maybe I am being dumb because I have the feeling I said differently in
the past, but why do we not need a WaitForMultipleVirtualLocks() here?
The comment seems to say we need to do so.
Yes, you said the contrary in a previous review. The purpose of this function is to gather the locks first and then wait for everything at once, to reduce possible conflicts.
> + /*
> + * Finish the index invalidation and set it as dead. It is not
> + * necessary to wait for virtual locks on the parent relation as it
> + * is already sure that this session holds sufficient locks.s
> + */
tiny typo (lock.s)
Fixed.
> + /*
> + * Phase 6 of REINDEX CONCURRENTLY
> + *
> + * Drop the concurrent indexes. This needs to be done through
> + * performDeletion or related dependencies will not be dropped for the old
> + * indexes. The internal mechanism of DROP INDEX CONCURRENTLY is not used
> + * as here the indexes are already considered as dead and invalid, so they
> + * will not be used by other backends.
> + */
> + foreach(lc, concurrentIndexIds)
> + {
> + Oid indexOid = lfirst_oid(lc);
> +
> + /* Start transaction to drop this index */
> + StartTransactionCommand();
> +
> + /* Get fresh snapshot for next step */
> + PushActiveSnapshot(GetTransactionSnapshot());
> +
> + /*
> + * Open transaction if necessary, for the first index treated its
> + * transaction has been already opened previously.
> + */
> + index_concurrent_drop(indexOid);
> +
> + /*
> + * For the last index to be treated, do not commit transaction yet.
> + * This will be done once all the locks on indexes and parent relations
> + * are released.
> + */
Hm. This doesn't seem to commit the last transaction at all right now?
It is better like this. The end of the process needs to be done inside a transaction, so not committing the last drop immediately makes sense, no?
Not sure why UnlockRelationIdForSession needs to be run in a transaction
anyway?
Even in the case of CREATE INDEX CONCURRENTLY, UnlockRelationIdForSession is run inside a transaction block.
> + /*
> + * Check the case of a system index that might have been invalidated by a
> + * failed concurrent process and allow its drop.
> + */
This is only possible for toast indexes right now, right? If so, the
comment should mention that.
Yes, fixed. I mentioned that in the comment.
Michael
On Tue, Mar 5, 2013 at 10:35 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> Thanks for the review. All your comments are addressed and updated patches
> are attached.

I got the compile warnings:

tuptoaster.c:1539: warning: format '%s' expects type 'char *', but argument 3 has type 'Oid'
tuptoaster.c:1539: warning: too many arguments for format

The patch doesn't handle the index on the materialized view correctly.

=# CREATE TABLE hoge (i int);
CREATE TABLE
=# CREATE MATERIALIZED VIEW hogeview AS SELECT * FROM hoge;
SELECT 0
=# CREATE INDEX hogeview_idx ON hogeview(i);
CREATE INDEX
=# REINDEX TABLE hogeview;
REINDEX
=# REINDEX TABLE CONCURRENTLY hogeview;
NOTICE: table "hogeview" has no indexes
REINDEX

Regards,

--
Fujii Masao
On 2013-03-05 22:35:16 +0900, Michael Paquier wrote: > Thanks for the review. All your comments are addressed and updated patches > are attached. > Please see below for the details, and if you find anything else just let me > know. > > On Tue, Mar 5, 2013 at 6:27 PM, Andres Freund <andres@2ndquadrant.com>wrote: > > > Have you benchmarked the toastrelidx removal stuff in any way? If not, > > thats fine, but if yes I'd be interested. > > > No I haven't. Is it really that easily measurable? I think not, but me too > I'd be interested in looking at such results. I don't think its really measurable, at least not for modifications. But istm that the onus to proof that to some degree is upon the patch. > > + if (toastrel->rd_indexvalid == 0) > > > + RelationGetIndexList(toastrel); > > > > Hm, I think we should move this into a macro, this is cropping up at > > more and more places. > > > This is not necessary. RelationGetIndexList does a check similar at its > top, so I simply removed all those checks. Well, in some of those cases a function call might be noticeable (probably only in the toast fetch path). Thats why I suggested putting the above in a macro... > > > > + for (count = 0; count < num_indexes; count++) > > > + index_insert(toastidxs[count], t_values, t_isnull, > > > + &(toasttup->t_self), > > > + toastrel, > > > + > > toastidxs[count]->rd_index->indisunique ? > > > + UNIQUE_CHECK_YES : > > UNIQUE_CHECK_NO); > > > > The indisunique check looks like a copy & pasto to me, albeit not > > yours... > > > Yes it is the same for all the indexes normally, but it looks more solid to > me to do that as it is. So unchanged. Hm, if the toast indexes aren't unique anymore loads of stuff would be broken. Anyway, not your "fault". > > > > > + /* Obtain index list if necessary */ > > > + if (toastRel1->rd_indexvalid == 0) > > > + RelationGetIndexList(toastRel1); > > > + if (toastRel2->rd_indexvalid == 0) > > > + RelationGetIndexList(toastRel2); > > > + > > > + /* Check if the swap is possible for all the toast indexes > > */ > > > > So there's no error being thrown if this turns out not to be possible? > > > There are no errors also in the former process... This should fail > silently, no? Not sure what you mean by "former process"? So far I don't see any reason why it would be a good idea to fail silently. We end up with corrupt data if the swap is silently not performed. > > > + if (count == 0) > > > + snprintf(NewToastName, > > NAMEDATALEN, "pg_toast_%u_index", > > > + OIDOldHeap); > > > + else > > > + snprintf(NewToastName, > > NAMEDATALEN, "pg_toast_%u_index_cct%d", > > > + OIDOldHeap, > > count); > > > + RenameRelationInternal(lfirst_oid(lc), > > > + > > NewToastName); > > > + count++; > > > + } > > > > Hm. It seems wrong that this layer needs to know about _cct. > > > Any other idea? For the time being I removed cct and added only a suffix > based on the index number... Hm. It seems like throwing an error would be sufficient, that path is only entered for shared catalogs, right? Having multiple toast indexes would be a bug. > > > + /* > > > + * Index is considered as a constraint if it is PRIMARY KEY or > > EXCLUSION. > > > + */ > > > + isconstraint = indexRelation->rd_index->indisprimary || > > > + indexRelation->rd_index->indisexclusion; > > > > unique constraints aren't mattering here? > > > No they are not. Unique indexes are not counted as constraints in the case > of index_create. Previous versions of the patch did that but there are > issues with unique indexes using expressions. Hm. 
index_create's comment says:* isconstraint: index is owned by PRIMARY KEY, UNIQUE, or EXCLUSION constraint There are unique indexes that are constraints and some that are not. Looking at ->indisunique is not sufficient to determine whether its one or not. > > We probably should remove the fsm of the index altogether after this? > > > The freespace map? Not sure it is necessary here. Isn't it going to be > removed with the relation anyway? I had a thinko here, forgot what I said. I thought the freespacemap would be the one from the old index, but htats clearly bogus. Comes from writing reviews after having to leave home at 5 in the morning to catch a plane ;) > > > +void > > > +index_concurrent_drop(Oid indexOid) > > > +{ > > > + Oid constraintOid = > > get_index_constraint(indexOid); > > > + ObjectAddress object; > > > + Form_pg_index indexForm; > > > + Relation pg_index; > > > + HeapTuple indexTuple; > > > + bool indislive; > > > + > > > + /* > > > + * Check that the index dropped here is not alive, it might be > > used by > > > + * other backends in this case. > > > + */ > > > + pg_index = heap_open(IndexRelationId, RowExclusiveLock); > > > + > > > + indexTuple = SearchSysCacheCopy1(INDEXRELID, > > > + > > ObjectIdGetDatum(indexOid)); > > > + if (!HeapTupleIsValid(indexTuple)) > > > + elog(ERROR, "cache lookup failed for index %u", indexOid); > > > + indexForm = (Form_pg_index) GETSTRUCT(indexTuple); > > > + indislive = indexForm->indislive; > > > + > > > + /* Clean up */ > > > + heap_close(pg_index, RowExclusiveLock); > > > + > > > + /* Leave if index is still alive */ > > > + if (indislive) > > > + return; > > > > This seems like a confusing path? Why is it valid to get here with a > > valid index and why is it ok to silently ignore that case? > > > I added that because of a comment of one of the past reviews. Personally I > think it makes more sense to remove that for clarity. Imo it should be an elog(ERROR) or an Assert(). > > > + case RELKIND_RELATION: > > > + { > > > + /* > > > + * In the case of a relation, find all its > > indexes > > > + * including toast indexes. > > > + */ > > > + Relation heapRelation = > > heap_open(relationOid, > > > + > > ShareUpdateExclusiveLock); > > > > Hm. This means we will not notice having about-to-be dropped indexes > > around. Which seems safe because locks will prevent that anyway... > > > I think that's OK as-is. Yes. Just thinking out loud. > > + default: > > > + /* nothing to do */ > > > + break; > > > > Shouldn't we error out? > > > Don't think so. For example what if the relation is a matview? For REINDEX > DATABASE this could finish as an error because a materialized view is > listed as a relation to reindex. I prefer having this path failing silently > and leave if there are no indexes. Imo default fallthroughs makes it harder to adjust code. And afaik its legal to add indexes to materialized views which kinda proofs my point. And if that path is reached for plain views, sequences or toast tables its an error. > > > + /* > > > + * Phase 3 of REINDEX CONCURRENTLY > > > + * > > > + * During this phase the concurrent indexes catch up with the > > INSERT that > > > + * might have occurred in the parent table and are marked as valid > > once done. > > > + * > > > + * We once again wait until no transaction can have the table open > > with > > > + * the index marked as read-only for updates. Each index > > validation is done > > > + * with a separate transaction to avoid opening transaction for an > > > + * unnecessary too long time. 
> > > + */ > > > > Maybe I am being dumb because I have the feeling I said differently in > > the past, but why do we not need a WaitForMultipleVirtualLocks() here? > > The comment seems to say we need to do so. > > > Yes you said the contrary in a previous review. The purpose of this > function is to first gather the locks and then wait for everything at once > to reduce possible conflicts. you say: + * We once again wait until no transaction can have the table open with + * the index marked as read-only for updates. Each index validation is done + * with a separate transaction to avoid opening transaction for an + * unnecessary too long time. Which doesn't seem to be done? I read back and afaics I only referred to CacheInvalidateRelcacheByRelid not being necessary in this phase. Which I think is correct. Anyway, if I claimed otherwise, I think I was wrong: The reason - I think - we need to wait here is that otherwise its not guaranteed that all other backends see the index with ->isready set. Which means they might add tuples which are invisible to the mvcc snapshot passed to validate_index() (just created beforehand) which are not yet added to the new index because those backends think the index is not ready yet. Any flaws in that logic? ... Yes, reading the comments of validate_index() and the old implementation seems to make my point. > > > + /* > > > + * Phase 6 of REINDEX CONCURRENTLY > > > + * > > > + * Drop the concurrent indexes. This needs to be done through > > > + * performDeletion or related dependencies will not be dropped for > > the old > > > + * indexes. The internal mechanism of DROP INDEX CONCURRENTLY is > > not used > > > + * as here the indexes are already considered as dead and invalid, > > so they > > > + * will not be used by other backends. > > > + */ > > > + foreach(lc, concurrentIndexIds) > > > + { > > > + Oid indexOid = lfirst_oid(lc); > > > + > > > + /* Start transaction to drop this index */ > > > + StartTransactionCommand(); > > > + > > > + /* Get fresh snapshot for next step */ > > > + PushActiveSnapshot(GetTransactionSnapshot()); > > > + > > > + /* > > > + * Open transaction if necessary, for the first index > > treated its > > > + * transaction has been already opened previously. > > > + */ > > > + index_concurrent_drop(indexOid); > > > + > > > + /* > > > + * For the last index to be treated, do not commit > > transaction yet. > > > + * This will be done once all the locks on indexes and > > parent relations > > > + * are released. > > > + */ > > > > Hm. This doesn't seem to commit the last transaction at all right now? > > > It is better like this. The end of the process needs to be done inside a > transaction, so not committing immediately the last drop makes sense, no? I pretty much dislike this. If we need to leave a transaction open (why?), that should happen a function layer above. > > > > Not sure why UnlockRelationIdForSession needs to be run in a transaction > > anyway? > > > Even in the case of CREATE INDEX CONCURRENTLY, UnlockRelationIdForSession > is run inside a transaction block. I have no problem of doing so, I just dislike the way thats done in the loop. You can just open a new one if its required, a transaction is cheap, especially if it doesn't even acquire an xid. Looking good. I'll do some actual testing instead of just reviewing now... Greetings, Andres Freund
On Tue, Mar 5, 2013 at 11:22 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
On Tue, Mar 5, 2013 at 10:35 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> Thanks for the review. All your comments are addressed and updated patches
> are attached.
I got the compile warnings:
tuptoaster.c:1539: warning: format '%s' expects type 'char *', but
argument 3 has type 'Oid'
tuptoaster.c:1539: warning: too many arguments for format
Fixed. Thanks for catching that.
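For reference, the fix is just a matter of printing the Oid with %u instead of %s; the offending call ends up shaped roughly like this (illustrative, not the exact line from the patch):

    /* Oids are printed with %u. */
    elog(ERROR, "no valid index found for toast relation with Oid %u",
         RelationGetRelid(toastrel));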
The patch doesn't handle the index on the materialized view correctly.
Hehe... I didn't know that materialized views could have indexes...
I fixed it, will send updated patch once I am done with Andres' comments.
Michael
Please find attached an updated patch realigned with your comments. You can find my answers inline...
The only thing that still needs clarification is the comment about UNIQUE_CHECK_YES/UNIQUE_CHECK_NO. All the other things are corrected or adapted to what you wanted. I am also now including tests for matviews.
Michael
On Wed, Mar 6, 2013 at 1:49 AM, Andres Freund <andres@2ndquadrant.com> wrote:
On 2013-03-05 22:35:16 +0900, Michael Paquier wrote:
> > > + for (count = 0; count < num_indexes; count++)
> > > + index_insert(toastidxs[count], t_values, t_isnull,
> > > + &(toasttup->t_self),
> > > + toastrel,
> > > +
> > toastidxs[count]->rd_index->indisunique ?
> > > + UNIQUE_CHECK_YES :
> > UNIQUE_CHECK_NO);
> >
> > The indisunique check looks like a copy & pasto to me, albeit not
> > yours...
> >
> Yes it is the same for all the indexes normally, but it looks more solid to
> me to do that as it is. So unchanged.
Hm, if the toast indexes aren't unique anymore loads of stuff would be
broken. Anyway, not your "fault".
I honestly cannot understand where you are going here. Could you be more explicit? Why could this be a problem? Without my patch, a similar check is used for toast indexes.
> >
> > > + /* Obtain index list if necessary */
> > > + if (toastRel1->rd_indexvalid == 0)
> > > + RelationGetIndexList(toastRel1);
> > > + if (toastRel2->rd_indexvalid == 0)
> > > + RelationGetIndexList(toastRel2);
> > > +
> > > + /* Check if the swap is possible for all the toast indexes
> > */
> >
> > So there's no error being thrown if this turns out not to be possible?
> >
> There are no errors also in the former process... This should fail
> silently, no?
Not sure what you mean by "former process"? So far I don't see any
reason why it would be a good idea to fail silently. We end up with
corrupt data if the swap is silently not performed.
OK, I added an error and a check on the length of rd_indexlist to make this more robust.
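The added check is roughly of this shape (sketch with assumed variable names, not the patch text):

    /* Refuse to swap silently if the two toast index lists do not match. */
    if (list_length(indexlist1) != list_length(indexlist2))
        elog(ERROR, "cannot swap toast files: mismatched number of indexes (%d and %d)",
             list_length(indexlist1), list_length(indexlist2));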
> > > + if (count == 0)
> > > + snprintf(NewToastName,
> > NAMEDATALEN, "pg_toast_%u_index",
> > > + OIDOldHeap);
> > > + else
> > > + snprintf(NewToastName,
> > NAMEDATALEN, "pg_toast_%u_index_cct%d",
> > > + OIDOldHeap,
> > count);
> > > + RenameRelationInternal(lfirst_oid(lc),
> > > +
> > NewToastName);
> > > + count++;
> > > + }
> >
> > Hm. It seems wrong that this layer needs to know about _cct.
> >
> Any other idea? For the time being I removed cct and added only a suffix
> based on the index number...
Hm. It seems like throwing an error would be sufficient, that path is
only entered for shared catalogs, right? Having multiple toast indexes
would be a bug.
Don't think so. Even if those APIs are currently used only for catalog tables, I do not believe that this function has been designed to be used only with shared catalogs. Removing the _cct suffix makes sense though...
> > > + /*
> > > + * Index is considered as a constraint if it is PRIMARY KEY or
> > EXCLUSION.
> > > + */
> > > + isconstraint = indexRelation->rd_index->indisprimary ||
> > > + indexRelation->rd_index->indisexclusion;
> >
> > unique constraints aren't mattering here?
> >
> No they are not. Unique indexes are not counted as constraints in the case
> of index_create. Previous versions of the patch did that but there are
> issues with unique indexes using expressions.
Hm. index_create's comment says:
* isconstraint: index is owned by PRIMARY KEY, UNIQUE, or EXCLUSION constraint
There are unique indexes that are constraints and some that are
not. Looking at ->indisunique is not sufficient to determine whether its
one or not.
Hmm... OK. I changed that to use a method based on get_index_constraint for the given index: if the constraint Oid is invalid, the index has no constraint, and its concurrent entry is consequently not created as a constraint index. It is more stable this way.
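So the test now boils down to something like this (illustrative sketch only):

    /* An index is constraint-backed only if a pg_constraint entry owns it. */
    Oid         constraintOid = get_index_constraint(indOid);
    bool        isconstraint = OidIsValid(constraintOid);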
> > > +void
> > > +index_concurrent_drop(Oid indexOid)
> > > +{
> > > + Oid constraintOid =
> > get_index_constraint(indexOid);
> > > + ObjectAddress object;
> > > + Form_pg_index indexForm;
> > > + Relation pg_index;
> > > + HeapTuple indexTuple;
> > > + bool indislive;
> > > +
> > > + /*
> > > + * Check that the index dropped here is not alive, it might be
> > used by
> > > + * other backends in this case.
> > > + */
> > > + pg_index = heap_open(IndexRelationId, RowExclusiveLock);
> > > +
> > > + indexTuple = SearchSysCacheCopy1(INDEXRELID,
> > > +
> > ObjectIdGetDatum(indexOid));
> > > + if (!HeapTupleIsValid(indexTuple))
> > > + elog(ERROR, "cache lookup failed for index %u", indexOid);
> > > + indexForm = (Form_pg_index) GETSTRUCT(indexTuple);
> > > + indislive = indexForm->indislive;
> > > +
> > > + /* Clean up */
> > > + heap_close(pg_index, RowExclusiveLock);
> > > +
> > > + /* Leave if index is still alive */
> > > + if (indislive)
> > > + return;
> >
> > This seems like a confusing path? Why is it valid to get here with a
> > valid index and why is it ok to silently ignore that case?
> >
> I added that because of a comment of one of the past reviews. Personally I
> think it makes more sense to remove that for clarity.
Imo it should be an elog(ERROR) or an Assert().
Assert. Added.
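Roughly, the silent return becomes:

    /* A live index must never reach this point. */
    Assert(!indexForm->indislive);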
> > + default:
> > > + /* nothing to do */
> > > + break;
> >
> > Shouldn't we error out?
> >
> Don't think so. For example what if the relation is a matview? For REINDEX
> DATABASE this could finish as an error because a materialized view is
> listed as a relation to reindex. I prefer having this path failing silently
> and leave if there are no indexes.
Imo default fallthroughs make it harder to adjust code. And afaik it's
legal to add indexes to materialized views, which kinda proves my point.
And if that path is reached for plain views, sequences or toast tables
it's an error.
Added an error message. Matviews are now correctly handled (per the report from Masao).
> > > + /*
> > > + * Phase 3 of REINDEX CONCURRENTLY
> > > + *
> > > + * During this phase the concurrent indexes catch up with the
> > INSERT that
> > > + * might have occurred in the parent table and are marked as valid
> > once done.
> > > + *
> > > + * We once again wait until no transaction can have the table open
> > with
> > > + * the index marked as read-only for updates. Each index
> > validation is done
> > > + * with a separate transaction to avoid opening transaction for an
> > > + * unnecessary too long time.
> > > + */
> >
> > Maybe I am being dumb because I have the feeling I said differently in
> > the past, but why do we not need a WaitForMultipleVirtualLocks() here?
> > The comment seems to say we need to do so.
> >
> Yes you said the contrary in a previous review. The purpose of this
> function is to first gather the locks and then wait for everything at once
> to reduce possible conflicts.
you say:
+ * We once again wait until no transaction can have the table open with
+ * the index marked as read-only for updates. Each index validation is done
+ * with a separate transaction to avoid opening transaction for an
+ * unnecessary too long time.
Which doesn't seem to be done?
I read back and afaics I only referred to CacheInvalidateRelcacheByRelid
not being necessary in this phase. Which I think is correct.
Regarding CacheInvalidateRelcacheByRelid at phase 3, I think that it is needed. If we don't use it, the pg_index entries will be updated but not the cache, which is incorrect.
Anyway, if I claimed otherwise, I think I was wrong:
The reason - I think - we need to wait here is that otherwise its not
guaranteed that all other backends see the index with ->isready
set. Which means they might add tuples which are invisible to the mvcc
snapshot passed to validate_index() (just created beforehand) which are
not yet added to the new index because those backends think the index is
not ready yet.
Any flaws in that logic?
None that I can think of. In consequence, and I think we will agree on that, I am removing WaitForMultipleVirtualLocks and adding a WaitForVirtualLock on the parent relation for EACH index before building and validating it.
> It is better like this. The end of the process needs to be done inside a
> transaction, so not committing immediately the last drop makes sense, no?
I pretty much dislike this. If we need to leave a transaction open
(why?), that should happen a function layer above.
Changed as requested.
> > Not sure why UnlockRelationIdForSession needs to be run in a transaction
> > anyway?
> Even in the case of CREATE INDEX CONCURRENTLY, UnlockRelationIdForSession
> is run inside a transaction block.
I have no problem with doing so, I just dislike the way that's done in the
loop. You can just open a new one if it's required, a transaction is
cheap, especially if it doesn't even acquire an xid.
OK. The end of the process is now done in a separate transaction, and the unlocking is done outside of the transaction block...
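The tail of the function now looks roughly like this (sketch only, simplified; whether a fresh transaction is opened afterwards for the caller is left out):

    /* Commit the transaction of the last index drop like the others. */
    PopActiveSnapshot();
    CommitTransactionCommand();

    /* Release the session-level locks outside of the transaction block. */
    foreach(lc, relationLocks)
    {
        LockRelId   lockRel = *(LockRelId *) lfirst(lc);

        UnlockRelationIdForSession(&lockRel, ShareUpdateExclusiveLock);
    }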
Michael
On 2013-03-06 13:21:27 +0900, Michael Paquier wrote: > Please find attached updated patch realigned with your comments. You can > find my answers inline... > The only thing that needs clarification is the comment about > UNIQUE_CHECK_YES/UNIQUE_CHECK_NO. Except that all the other things are > corrected or adapted to what you wanted. I am also including now tests for > matviews. > > On Wed, Mar 6, 2013 at 1:49 AM, Andres Freund <andres@2ndquadrant.com>wrote: > > > On 2013-03-05 22:35:16 +0900, Michael Paquier wrote: > > > > > > + for (count = 0; count < num_indexes; count++) > > > > > + index_insert(toastidxs[count], t_values, > > t_isnull, > > > > > + &(toasttup->t_self), > > > > > + toastrel, > > > > > + > > > > toastidxs[count]->rd_index->indisunique ? > > > > > + UNIQUE_CHECK_YES : > > > > UNIQUE_CHECK_NO); > > > > > > > > The indisunique check looks like a copy & pasto to me, albeit not > > > > yours... > > > > > > > Yes it is the same for all the indexes normally, but it looks more solid > > to > > > me to do that as it is. So unchanged. > > > > Hm, if the toast indexes aren't unique anymore loads of stuff would be > > broken. Anyway, not your "fault". > > > I definitely cannot understand where you are going here. Could you be more > explicit? Why could this be a problem? Without my patch a similar check is > used for toast indexes. There's no problem. I just dislike the pointless check which caters for a situation that doesn't exist... Forget it, sorry. > > > > > + if (count == 0) > > > > > + snprintf(NewToastName, > > > > NAMEDATALEN, "pg_toast_%u_index", > > > > > + OIDOldHeap); > > > > > + else > > > > > + snprintf(NewToastName, > > > > NAMEDATALEN, "pg_toast_%u_index_cct%d", > > > > > + OIDOldHeap, > > > > count); > > > > > + RenameRelationInternal(lfirst_oid(lc), > > > > > + > > > > NewToastName); > > > > > + count++; > > > > > + } > > > > > > > > Hm. It seems wrong that this layer needs to know about _cct. > > > > > > > Any other idea? For the time being I removed cct and added only a suffix > > > based on the index number... > > > > Hm. It seems like throwing an error would be sufficient, that path is > > only entered for shared catalogs, right? Having multiple toast indexes > > would be a bug. > > > Don't think so. Even if now those APIs are used only for catalog tables, I > do not believe that this function has been designed to be used only with > shared catalogs. Removing the cct suffix makes sense though... Forget what I said. > > > > > + /* > > > > > + * Index is considered as a constraint if it is PRIMARY KEY or > > > > EXCLUSION. > > > > > + */ > > > > > + isconstraint = indexRelation->rd_index->indisprimary || > > > > > + indexRelation->rd_index->indisexclusion; > > > > > > > > unique constraints aren't mattering here? > > > > > > > No they are not. Unique indexes are not counted as constraints in the > > case > > > of index_create. Previous versions of the patch did that but there are > > > issues with unique indexes using expressions. > > > > Hm. index_create's comment says: > > * isconstraint: index is owned by PRIMARY KEY, UNIQUE, or EXCLUSION > > constraint > > > > There are unique indexes that are constraints and some that are > > not. Looking at ->indisunique is not sufficient to determine whether its > > one or not. > > > Hum... OK. I changed that using a method based on get_index_constraint for > a given index. So if the constraint Oid is invalid, it means that this > index has no constraints and its concurrent entry won't create an index in > consequence. 
It is more stable this way. Sounds good. Just to make that clear: To get a unique index without constraint: CREATE TABLE table_u(id int, data int); CREATE UNIQUE INDEX table_u__data ON table_u(data); To get a constraint: ALTER TABLE table_u ADD CONSTRAINT table_u__id_unique UNIQUE(id); > > > > > + /* > > > > > + * Phase 3 of REINDEX CONCURRENTLY > > > > > + * > > > > > + * During this phase the concurrent indexes catch up with the > > > > INSERT that > > > > > + * might have occurred in the parent table and are marked as > > valid > > > > once done. > > > > > + * > > > > > + * We once again wait until no transaction can have the table > > open > > > > with > > > > > + * the index marked as read-only for updates. Each index > > > > validation is done > > > > > + * with a separate transaction to avoid opening transaction > > for an > > > > > + * unnecessary too long time. > > > > > + */ > > > > > > > > Maybe I am being dumb because I have the feeling I said differently in > > > > the past, but why do we not need a WaitForMultipleVirtualLocks() here? > > > > The comment seems to say we need to do so. > > > > > > > Yes you said the contrary in a previous review. The purpose of this > > > function is to first gather the locks and then wait for everything at > > once > > > to reduce possible conflicts. > > > > you say: > > > > + * We once again wait until no transaction can have the table open > > with > > + * the index marked as read-only for updates. Each index > > validation is done > > + * with a separate transaction to avoid opening transaction for an > > + * unnecessary too long time. > > > > Which doesn't seem to be done? > > > > I read back and afaics I only referred to CacheInvalidateRelcacheByRelid > > not being necessary in this phase. Which I think is correct. > > > Regarding CacheInvalidateRelcacheByRelid at phase 3, I think that it is > needed. If we don't use it the pg_index entries will be updated but not the > cache, what is incorrect. A heap_update will cause cache invalidations to be sent. > Anyway, if I claimed otherwise, I think I was wrong: > > > > The reason - I think - we need to wait here is that otherwise its not > > guaranteed that all other backends see the index with ->isready > > set. Which means they might add tuples which are invisible to the mvcc > > snapshot passed to validate_index() (just created beforehand) which are > > not yet added to the new index because those backends think the index is > > not ready yet. > > Any flaws in that logic? > > > Not that I think. In consequence, and I think we will agree on that: I am > removing WaitForMultipleVirtualLocks and add a WaitForVirtualLock on the > parent relation for EACH index before building and validating it. I have the feeling we are talking past each other. Unless I miss something *there is no* WaitForMultipleVirtualLocks between phase 2 and 3. But one WaitForMultipleVirtualLocks for all would be totally sufficient. 20130305_2_reindex_concurrently_v17.patch: + /* we can do away with our snapshot */ + PopActiveSnapshot(); + + /* + * Commit this transaction to make the indisready update visible for + * concurrent index. + */ + CommitTransactionCommand(); + } + + + /* + * Phase 3 of REINDEX CONCURRENTLY + * + * During this phase the concurrent indexes catch up with the INSERT that + * might have occurred in the parent table and are marked as valid once done. + * + * We once again wait until no transaction can have the table open with + * the index marked as read-only for updates. 
Each index validation is done + * with a separate transaction to avoid opening transaction for an + * unnecessary too long time. + */ + + /* + * Perform a scan of each concurrent index with the heap, then insert + * any missing index entries. + */ + foreach(lc, concurrentIndexIds) + { + Oid indOid = lfirst_oid(lc); + Oid relOid; Thanks! Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
OK. Patches updated... Please see attached.
With all the work done on those patches, I suppose this is close to being something clean...
Michael
On Wed, Mar 6, 2013 at 5:50 PM, Andres Freund <andres@2ndquadrant.com> wrote:
On 2013-03-06 13:21:27 +0900, Michael Paquier wrote:
> Hum... OK. I changed that using a method based on get_index_constraint for
> a given index. So if the constraint Oid is invalid, it means that this
> index has no constraints and its concurrent entry won't create an index in
> consequence. It is more stable this way.
Sounds good. Just to make that clear:
To get a unique index without constraint:
CREATE TABLE table_u(id int, data int);
CREATE UNIQUE INDEX table_u__data ON table_u(data);
To get a constraint:
ALTER TABLE table_u ADD CONSTRAINT table_u__id_unique UNIQUE(id);
OK no problem. Thanks for the clarification.
> > > > > + /*
> > > > > + * Phase 3 of REINDEX CONCURRENTLY
> > > > > + *
> > > > > + * During this phase the concurrent indexes catch up with the
> > > > INSERT that
> > > > > + * might have occurred in the parent table and are marked as
> > valid
> > > > once done.
> > > > > + *
> > > > > + * We once again wait until no transaction can have the table
> > open
> > > > with
> > > > > + * the index marked as read-only for updates. Each index
> > > > validation is done
> > > > > + * with a separate transaction to avoid opening transaction
> > for an
> > > > > + * unnecessary too long time.
> > > > > + */
> > > >
> > > > Maybe I am being dumb because I have the feeling I said differently in
> > > > the past, but why do we not need a WaitForMultipleVirtualLocks() here?
> > > > The comment seems to say we need to do so.
> > > >
> > > Yes you said the contrary in a previous review. The purpose of this
> > > function is to first gather the locks and then wait for everything at
> > once
> > > to reduce possible conflicts.
> >
> > you say:
> >
> > + * We once again wait until no transaction can have the table open
> > with
> > + * the index marked as read-only for updates. Each index
> > validation is done
> > + * with a separate transaction to avoid opening transaction for an
> > + * unnecessary too long time.
> >
> > Which doesn't seem to be done?
> >
> > I read back and afaics I only referred to CacheInvalidateRelcacheByRelid
> > not being necessary in this phase. Which I think is correct.
> >
> Regarding CacheInvalidateRelcacheByRelid at phase 3, I think that it is
> needed. If we don't use it the pg_index entries will be updated but not the
> cache, what is incorrect.
A heap_update will cause cache invalidations to be sent.
OK, removed it.
> Anyway, if I claimed otherwise, I think I was wrong:
> >
> > The reason - I think - we need to wait here is that otherwise its not
> > guaranteed that all other backends see the index with ->isready
> > set. Which means they might add tuples which are invisible to the mvcc
> > snapshot passed to validate_index() (just created beforehand) which are
> > not yet added to the new index because those backends think the index is
> > not ready yet.
> > Any flaws in that logic?
> >
> Not that I think. In consequence, and I think we will agree on that: I am
> removing WaitForMultipleVirtualLocks and add a WaitForVirtualLock on the
> parent relation for EACH index before building and validating it.
I have the feeling we are talking past each other. Unless I miss
something *there is no* WaitForMultipleVirtualLocks between phase 2 and
3. But one WaitForMultipleVirtualLocks for all would be totally
sufficient.
OK, sorry for the confusion. I added a call to WaitForMultipleVirtualLocks also before phase 3.
Honestly, I am still not very comfortable with the fact that the ShareLock wait on the parent relation is done outside each index transaction for build and validation... Changed as requested though...
Michael
On 2013-03-06 20:59:37 +0900, Michael Paquier wrote:
> OK. Patches updated... Please see attached.
> With all the work done on those patches, I suppose this is close to being
> something clean...

Yes, it's looking good. There are loads of improvements possible but
those can very well be made incrementally.

> > I have the feeling we are talking past each other. Unless I miss
> > something *there is no* WaitForMultipleVirtualLocks between phase 2 and
> > 3. But one WaitForMultipleVirtualLocks for all would be totally
> > sufficient.
> >
> OK, sorry for the confusion. I added a call to WaitForMultipleVirtualLocks
> also before phase 3.
> Honestly, I am still not very comfortable with the fact that the ShareLock
> wait on parent relation is done outside each index transaction for build
> and validation... Changed as requested though...

Could you detail your concerns a bit? I tried to think it through
multiple times now and I still can't see a problem. The lock only
ensures that nobody has the relation open with the old index definition
in mind...

Andres

--
Andres Freund    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Mar 6, 2013 at 9:09 PM, Andres Freund <andres@2ndquadrant.com> wrote:
On 2013-03-06 20:59:37 +0900, Michael Paquier wrote:
> OK. Patches updated... Please see attached.
> With all the work done on those patches, I suppose this is close to being
> something clean...
Yes, it's looking good. There are loads of improvements possible but
those can very well be made incrementally.
> > I have the feeling we are talking past each other. Unless I miss
> > something *there is no* WaitForMultipleVirtualLocks between phase 2 and
> > 3. But one WaitForMultipleVirtualLocks for all would be totally
> > sufficient.
> >
> OK, sorry for the confusion. I added a call to WaitForMultipleVirtualLocks
> also before phase 3.
> Honestly, I am still not very comfortable with the fact that the ShareLock
> wait on parent relation is done outside each index transaction for build
> and validation... Changed as requested though...
multiple times now and I still can't see a problem. The lock only
ensures that nobody has the relation open with the old index definition
in mind...
I am making a comparison with CREATE INDEX CONCURRENTLY where the ShareLock wait is made inside the build and validation transactions. Was there any particular reason why CREATE INDEX CONCURRENTLY wait is done inside a transaction block?
That's my only concern.
Michael
On 2013-03-06 21:19:57 +0900, Michael Paquier wrote:
> On Wed, Mar 6, 2013 at 9:09 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > On 2013-03-06 20:59:37 +0900, Michael Paquier wrote:
> > > OK. Patches updated... Please see attached.
> > > With all the work done on those patches, I suppose this is close to being something clean...
> >
> > Yes, its looking good. There are loads of improvements possible but those can very well be made incrementally.
> >
> > > > I have the feeling we are talking past each other. Unless I miss something *there is no* WaitForMultipleVirtualLocks between phase 2 and 3. But one WaitForMultipleVirtualLocks for all would be totally sufficient.
> > >
> > > OK, sorry for the confusion. I added a call to WaitForMultipleVirtualLocks also before phase 3.
> > > Honestly, I am still not very comfortable with the fact that the ShareLock wait on parent relation is done outside each index transaction for build and validation... Changed as requested though...
> >
> > Could you detail your concerns a bit? I tried to think it through multiple times now and I still can't see a problem. The lock only ensures that nobody has the relation open with the old index definition in mind...
>
> I am making a comparison with CREATE INDEX CONCURRENTLY where the ShareLock wait is made inside the build and validation transactions. Was there any particular reason why CREATE INDEX CONCURRENTLY wait is done inside a transaction block?
> That's my only concern.

Well, it needs to be executed in a transaction because it needs a valid resource owner and a previous CommitTransactionCommand() will leave that at NULL. And there is no reason in the single-index case of CREATE INDEX CONCURRENTLY to do it in a separate transaction.

Greetings,

Andres Freund
On Wed, Mar 6, 2013 at 8:59 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> OK. Patches updated... Please see attached.

I found odd behavior. After I made REINDEX CONCURRENTLY fail twice, I found that the index which was not marked as INVALID remained unexpectedly.

=# CREATE TABLE hoge (i int primary key);
CREATE TABLE
=# INSERT INTO hoge VALUES (generate_series(1,10));
INSERT 0 10
=# SET statement_timeout TO '1s';
SET
=# REINDEX TABLE CONCURRENTLY hoge;
ERROR: canceling statement due to statement timeout
=# \d hoge
     Table "public.hoge"
 Column |  Type   | Modifiers
--------+---------+-----------
 i      | integer | not null
Indexes:
    "hoge_pkey" PRIMARY KEY, btree (i)
    "hoge_pkey_cct" PRIMARY KEY, btree (i) INVALID

=# REINDEX TABLE CONCURRENTLY hoge;
ERROR: canceling statement due to statement timeout
=# \d hoge
     Table "public.hoge"
 Column |  Type   | Modifiers
--------+---------+-----------
 i      | integer | not null
Indexes:
    "hoge_pkey" PRIMARY KEY, btree (i)
    "hoge_pkey_cct" PRIMARY KEY, btree (i) INVALID
    "hoge_pkey_cct1" PRIMARY KEY, btree (i) INVALID
    "hoge_pkey_cct_cct" PRIMARY KEY, btree (i)

+ The recommended recovery method in such cases is to drop the concurrent
+ index and try again to perform <command>REINDEX CONCURRENTLY</>.

If an invalid index depends on the constraint like primary key, "drop the concurrent index" cannot actually drop the index. In this case, you need to issue "alter table ... drop constraint ..." to recover the situation. I think this information should be documented.

Regards,

Fujii Masao
On 2013-03-07 02:09:49 +0900, Fujii Masao wrote:
> On Wed, Mar 6, 2013 at 8:59 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> > OK. Patches updated... Please see attached.
>
> I found odd behavior. After I made REINDEX CONCURRENTLY fail twice, I found that the index which was not marked as INVALID remained unexpectedly.

Thats to be expected. Indexes need to be valid *before* we can drop the old one. So if you abort in the right moment you will see those and thats imo fine.

> =# CREATE TABLE hoge (i int primary key);
> CREATE TABLE
> =# INSERT INTO hoge VALUES (generate_series(1,10));
> INSERT 0 10
> =# SET statement_timeout TO '1s';
> SET
> =# REINDEX TABLE CONCURRENTLY hoge;
> ERROR: canceling statement due to statement timeout
> =# \d hoge
>      Table "public.hoge"
>  Column |  Type   | Modifiers
> --------+---------+-----------
>  i      | integer | not null
> Indexes:
>     "hoge_pkey" PRIMARY KEY, btree (i)
>     "hoge_pkey_cct" PRIMARY KEY, btree (i) INVALID
>
> =# REINDEX TABLE CONCURRENTLY hoge;
> ERROR: canceling statement due to statement timeout
> =# \d hoge
>      Table "public.hoge"
>  Column |  Type   | Modifiers
> --------+---------+-----------
>  i      | integer | not null
> Indexes:
>     "hoge_pkey" PRIMARY KEY, btree (i)
>     "hoge_pkey_cct" PRIMARY KEY, btree (i) INVALID
>     "hoge_pkey_cct1" PRIMARY KEY, btree (i) INVALID
>     "hoge_pkey_cct_cct" PRIMARY KEY, btree (i)

Huh, why did that go through? It should have errored out?

> + The recommended recovery method in such cases is to drop the concurrent
> + index and try again to perform <command>REINDEX CONCURRENTLY</>.
>
> If an invalid index depends on the constraint like primary key, "drop the concurrent index" cannot actually drop the index. In this case, you need to issue "alter table ... drop constraint ..." to recover the situation. I think this information should be documented.

I think we just shouldn't set ->isprimary on the temporary indexes. Now we switch only the relfilenodes and not the whole index, that should be perfectly fine.

Greetings,

Andres Freund
On Thu, Mar 7, 2013 at 2:17 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> Indexes:
>>     "hoge_pkey" PRIMARY KEY, btree (i)
>>     "hoge_pkey_cct" PRIMARY KEY, btree (i) INVALID
>>     "hoge_pkey_cct1" PRIMARY KEY, btree (i) INVALID
>>     "hoge_pkey_cct_cct" PRIMARY KEY, btree (i)
>
> Huh, why did that go through? It should have errored out?

I'm not sure why. Anyway hoge_pkey_cct_cct should not appear or should be marked as invalid, I think.

>> + The recommended recovery method in such cases is to drop the concurrent
>> + index and try again to perform <command>REINDEX CONCURRENTLY</>.
>>
>> If an invalid index depends on the constraint like primary key, "drop the concurrent index" cannot actually drop the index. In this case, you need to issue "alter table ... drop constraint ..." to recover the situation. I think this information should be documented.
>
> I think we just shouldn't set ->isprimary on the temporary indexes. Now we switch only the relfilenodes and not the whole index, that should be perfectly fine.

Sounds good. But, what about other constraint case like unique constraint? Those other cases also can be resolved by not setting ->isprimary?

Regards,

Fujii Masao
On 2013-03-07 02:34:54 +0900, Fujii Masao wrote:
> On Thu, Mar 7, 2013 at 2:17 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> Indexes:
> >>     "hoge_pkey" PRIMARY KEY, btree (i)
> >>     "hoge_pkey_cct" PRIMARY KEY, btree (i) INVALID
> >>     "hoge_pkey_cct1" PRIMARY KEY, btree (i) INVALID
> >>     "hoge_pkey_cct_cct" PRIMARY KEY, btree (i)
> >
> > Huh, why did that go through? It should have errored out?
>
> I'm not sure why. Anyway hoge_pkey_cct_cct should not appear or should be marked as invalid, I think.

Hm. Yea. I am still not sure yet why hoge_pkey_cct_cct sprung into existence, but that hoge_pkey_cct1 springs into existence makes sense.

I see a problem here: there is a moment between phase 3 and 4 where both the old and the new indexes are valid and ready. Thats not good because if we abort in that moment we essentially have doubled the amount of indexes. Options:
a) we live with it
b) we only mark the new index as valid within phase 4. That should be fine I think?
c) we invent some other state to mark indexes that are in-progress to replace another one.

I guess b) seems fine?

> >> + The recommended recovery method in such cases is to drop the concurrent
> >> + index and try again to perform <command>REINDEX CONCURRENTLY</>.
> >>
> >> If an invalid index depends on the constraint like primary key, "drop the concurrent index" cannot actually drop the index. In this case, you need to issue "alter table ... drop constraint ..." to recover the situation. I think this information should be documented.
> >
> > I think we just shouldn't set ->isprimary on the temporary indexes. Now we switch only the relfilenodes and not the whole index, that should be perfectly fine.
>
> Sounds good. But, what about other constraint case like unique constraint? Those other cases also can be resolved by not setting ->isprimary?

Unique indexes can exist without a constraint attached, so thats fine. I need to read a bit more code whether its safe to unset it, although indisexclusion, indimmediate might be more important.

Greetings,

Andres Freund
On Thu, Mar 7, 2013 at 2:09 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
On Wed, Mar 6, 2013 at 8:59 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> OK. Patches updated... Please see attached.
I found odd behavior. After I made REINDEX CONCURRENTLY fail twice,
I found that the index which was not marked as INVALID remained unexpectedly.
=# CREATE TABLE hoge (i int primary key);
CREATE TABLE
=# INSERT INTO hoge VALUES (generate_series(1,10));
INSERT 0 10
=# SET statement_timeout TO '1s';
SET
=# REINDEX TABLE CONCURRENTLY hoge;
ERROR: canceling statement due to statement timeout
=# \d hoge
Table "public.hoge"
Column | Type | Modifiers
--------+---------+-----------
i | integer | not null
Indexes:
"hoge_pkey" PRIMARY KEY, btree (i)
"hoge_pkey_cct" PRIMARY KEY, btree (i) INVALID
=# REINDEX TABLE CONCURRENTLY hoge;
ERROR: canceling statement due to statement timeout
=# \d hoge
Table "public.hoge"
Column | Type | Modifiers
--------+---------+-----------
i | integer | not null
Indexes:
"hoge_pkey" PRIMARY KEY, btree (i)
"hoge_pkey_cct" PRIMARY KEY, btree (i) INVALID
"hoge_pkey_cct1" PRIMARY KEY, btree (i) INVALID
"hoge_pkey_cct_cct" PRIMARY KEY, btree (i)
Invalid indexes cannot be reindexed concurrently and are simply bypassed during process, so _cct_cct has no reason to exist. For example here is what I get with a relation having an invalid index:
ioltas=# \d aa
Table "public.aa"
Column | Type | Modifiers
--------+---------+-----------
a | integer |
Indexes:
"aap" btree (a)
"aap_cct" btree (a) INVALID
ioltas=# reindex table concurrently aa;
WARNING: cannot reindex concurrently invalid index "public.aap_cct", skipping
REINDEX
ioltas=# \d aa
Table "public.aa"
Column | Type | Modifiers
--------+---------+-----------
a | integer |
Indexes:
"aap" btree (a)
"aap_cct" btree (a) INVALID
ioltas=# reindex table concurrently aa;
WARNING: cannot reindex concurrently invalid index "public.aap_cct", skipping
REINDEX
+ The recommended recovery method in such cases is to drop the concurrent
+ index and try again to perform <command>REINDEX CONCURRENTLY</>.
If an invalid index depends on the constraint like primary key, "drop the concurrent
index" cannot actually drop the index. In this case, you need to issue "alter table
... drop constraint ..." to recover the situation. I think this information should be
documented.
You are right. I'll add a note in the documentation about that. Personally I find it more instinctive to use DROP CONSTRAINT for a primary key as the image I have of a concurrent index is a twin of the index it rebuilds.
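For illustration, the recovery path described here would look something like the following, reusing the example table from this thread and assuming the leftover concurrent index is attached to a constraint of the same name (a sketch, not output from the patch):

=# ALTER TABLE hoge DROP CONSTRAINT hoge_pkey_cct;
=# REINDEX TABLE CONCURRENTLY hoge;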
Michael
On Thu, Mar 7, 2013 at 2:34 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
On Thu, Mar 7, 2013 at 2:17 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> Indexes:
>> "hoge_pkey" PRIMARY KEY, btree (i)
>> "hoge_pkey_cct" PRIMARY KEY, btree (i) INVALID
>> "hoge_pkey_cct1" PRIMARY KEY, btree (i) INVALID
>> "hoge_pkey_cct_cct" PRIMARY KEY, btree (i)
>
> Huh, why did that go through? It should have errored out?
I'm not sure why. Anyway hoge_pkey_cct_cct should not appear or should
be marked as invalid, I think.
CHECK_FOR_INTERRUPTS calls were not added at each phase and they are needed in case the process is interrupted by the user. This has been mentioned in a past review but it was missing, so it might have slipped out during a refactoring or smth.
Btw, I am surprised to see that this *_cct_cct index has been created knowing that hoge_pkey_cct is invalid. I tried with the latest version of the patch and even the patch attached but couldn't reproduce it.
> >> + The recommended recovery method in such cases is to drop the concurrent
>> + index and try again to perform <command>REINDEX CONCURRENTLY</>.
>>
>> If an invalid index depends on the constraint like primary key, "drop
>> the concurrent
>> index" cannot actually drop the index. In this case, you need to issue
>> "alter table
>> ... drop constraint ..." to recover the situation. I think this
>> informataion should be
>> documented.
>
> I think we just shouldn't set ->isprimary on the temporary indexes. Now
> we switch only the relfilenodes and not the whole index, that should be
> perfectly fine.
Sounds good. But, what about other constraint case like unique constraint?
Those other cases also can be resolved by not setting ->isprimary?
We should stick with the concurrent index being a twin of the index it rebuilds for consistency.
Also, I think that it is important from the session viewpoint to perform a swap with 2 valid indexes. If the process fails just before swapping indexes user might want to do that himself and drop the old index, then use the concurrent one.
Other opinions welcome.
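For instance, with hypothetical index names, the manual path just described would be something along these lines (a sketch, not something produced by the patch):

=# DROP INDEX ind;                       -- drop the old index
=# ALTER INDEX ind_cct RENAME TO ind;    -- keep using the already-built concurrent index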
--
On 2013-03-07 05:26:31 +0900, Michael Paquier wrote:
> On Thu, Mar 7, 2013 at 2:34 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
> > On Thu, Mar 7, 2013 at 2:17 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > >> Indexes:
> > >>     "hoge_pkey" PRIMARY KEY, btree (i)
> > >>     "hoge_pkey_cct" PRIMARY KEY, btree (i) INVALID
> > >>     "hoge_pkey_cct1" PRIMARY KEY, btree (i) INVALID
> > >>     "hoge_pkey_cct_cct" PRIMARY KEY, btree (i)
> > >
> > > Huh, why did that go through? It should have errored out?
> >
> > I'm not sure why. Anyway hoge_pkey_cct_cct should not appear or should be marked as invalid, I think.
>
> CHECK_FOR_INTERRUPTS were not added at each phase and they are needed in case process is interrupted by user. This has been mentioned in a past review but it was missing, so it might have slipped out during a refactoring or smth. Btw, I am surprised to see that this *_cct_cct index has been created knowing that hoge_pkey_cct is invalid. I tried with the latest version of the patch and even the patch attached but couldn't reproduce it.

The strange thing about "hoge_pkey_cct_cct" is that it seems to imply that an invalid index was reindexed concurrently?

But I don't see how it could happen either. Fujii, can you reproduce it?

> > >> + The recommended recovery method in such cases is to drop the concurrent
> > >> + index and try again to perform <command>REINDEX CONCURRENTLY</>.
> > >>
> > >> If an invalid index depends on the constraint like primary key, "drop the concurrent index" cannot actually drop the index. In this case, you need to issue "alter table ... drop constraint ..." to recover the situation. I think this information should be documented.
> > >
> > > I think we just shouldn't set ->isprimary on the temporary indexes. Now we switch only the relfilenodes and not the whole index, that should be perfectly fine.
> >
> > Sounds good. But, what about other constraint case like unique constraint? Those other cases also can be resolved by not setting ->isprimary?
>
> We should stick with the concurrent index being a twin of the index it rebuilds for consistency.

I don't think its legal. We cannot simply have two indexes with 'indisprimary'. Especially not if both are valid. Also, there will be no pg_constraint row that refers to it which violates very valid expectations that both users and pg may have.

> Also, I think that it is important from the session viewpoint to perform a swap with 2 valid indexes. If the process fails just before swapping indexes user might want to do that himself and drop the old index, then use the concurrent one.

The most likely outcome will be to rerun REINDEX CONCURRENTLY. Which will then reindex one more index since it now has the old valid index and the new valid index. Also, I don't think its fair game to expose indexes that used to belong to a constraint without a constraint supporting it as valid indexes.

Greetings,

Andres Freund
On Thu, Mar 7, 2013 at 7:19 AM, Andres Freund <andres@2ndquadrant.com> wrote:
On 2013-03-07 05:26:31 +0900, Michael Paquier wrote:
> On Thu, Mar 7, 2013 at 2:34 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
>
> > On Thu, Mar 7, 2013 at 2:17 AM, Andres Freund <andres@2ndquadrant.com>
> > wrote:
> > >> Indexes:
> > >> "hoge_pkey" PRIMARY KEY, btree (i)
> > >> "hoge_pkey_cct" PRIMARY KEY, btree (i) INVALID
> > >> "hoge_pkey_cct1" PRIMARY KEY, btree (i) INVALID
> > >> "hoge_pkey_cct_cct" PRIMARY KEY, btree (i)
> > >
> > > Huh, why did that go through? It should have errored out?
> >
> > I'm not sure why. Anyway hoge_pkey_cct_cct should not appear or should
> > be marked as invalid, I think.
> >
> CHECK_FOR_INTERRUPTS were not added at each phase and they are needed in
> case process is interrupted by user. This has been mentioned in a pas
> review but it was missing, so it might have slipped out during a
> refactoring or smth. Btw, I am surprised to see that this *_cct_cct index
> has been created knowing that hoge_pkey_cct is invalid. I tried with the
> latest version of the patch and even the patch attached but couldn't
> reproduce it.
The strange thing about "hoge_pkey_cct_cct" is that it seems to imply
that an invalid index was reindexed concurrently?
But I don't see how it could happen either. Fujii, can you reproduce it?
Curious about that also.
> >> + The recommended recovery method in such cases is to drop the
> > concurrent
> > >> + index and try again to perform <command>REINDEX CONCURRENTLY</>.
> > >>
> > >> If an invalid index depends on the constraint like primary key, "drop
> > >> the concurrent
> > >> index" cannot actually drop the index. In this case, you need to issue
> > >> "alter table
> > >> ... drop constraint ..." to recover the situation. I think this
> > >> informataion should be
> > >> documented.
> > >
> > > I think we just shouldn't set ->isprimary on the temporary indexes. Now
> > > we switch only the relfilenodes and not the whole index, that should be
> > > perfectly fine.
> >
> > Sounds good. But, what about other constraint case like unique constraint?
> > Those other cases also can be resolved by not setting ->isprimary?
> >
> We should stick with the concurrent index being a twin of the index it
> rebuilds for consistency.
I don't think its legal. We cannot simply have two indexes with
'indisprimary'. Especially not if both are valid.
Also, there will be no pg_constraint row that refers to it which
violates very valid expectations that both users and pg may have.
So what to do with that?
Mark the concurrent index as valid, then validate it and finally mark it as invalid inside the same transaction at phase 4?
That's moving 2 lines of code...
Michael
On Thu, Mar 7, 2013 at 9:48 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
On Thu, Mar 7, 2013 at 7:19 AM, Andres Freund <andres@2ndquadrant.com> wrote:
On 2013-03-07 05:26:31 +0900, Michael Paquier wrote:
> On Thu, Mar 7, 2013 at 2:34 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
>
> > On Thu, Mar 7, 2013 at 2:17 AM, Andres Freund <andres@2ndquadrant.com>
> > wrote:
> > >> Indexes:
> > >> "hoge_pkey" PRIMARY KEY, btree (i)
> > >> "hoge_pkey_cct" PRIMARY KEY, btree (i) INVALID
> > >> "hoge_pkey_cct1" PRIMARY KEY, btree (i) INVALID
> > >> "hoge_pkey_cct_cct" PRIMARY KEY, btree (i)
> > >
> > > Huh, why did that go through? It should have errored out?
> >
> > I'm not sure why. Anyway hoge_pkey_cct_cct should not appear or should
> > be marked as invalid, I think.
> >
> CHECK_FOR_INTERRUPTS were not added at each phase and they are needed in
> case process is interrupted by user. This has been mentioned in a pas
> review but it was missing, so it might have slipped out during a
> refactoring or smth. Btw, I am surprised to see that this *_cct_cct index
> has been created knowing that hoge_pkey_cct is invalid. I tried with the
> latest version of the patch and even the patch attached but couldn't
> reproduce it.
The strange thing about "hoge_pkey_cct_cct" is that it seems to imply
that an invalid index was reindexed concurrently?
But I don't see how it could happen either. Fujii, can you reproduce it?
Curious about that also.
> >> + The recommended recovery method in such cases is to drop the
> > concurrent
> > >> + index and try again to perform <command>REINDEX CONCURRENTLY</>.
> > >>
> > >> If an invalid index depends on the constraint like primary key, "drop
> > >> the concurrent
> > >> index" cannot actually drop the index. In this case, you need to issue
> > >> "alter table
> > >> ... drop constraint ..." to recover the situation. I think this
> > >> informataion should be
> > >> documented.
> > >
> > > I think we just shouldn't set ->isprimary on the temporary indexes. Now
> > > we switch only the relfilenodes and not the whole index, that should be
> > > perfectly fine.
> >
> > Sounds good. But, what about other constraint case like unique constraint?
> > Those other cases also can be resolved by not setting ->isprimary?
> >
> We should stick with the concurrent index being a twin of the index it
> rebuilds for consistency.
I don't think its legal. We cannot simply have two indexes with
'indisprimary'. Especially not if both are valid.
Also, there will be no pg_constraint row that refers to it which
violates very valid expectations that both users and pg may have.
So what to do with that?
Mark the concurrent index as valid, then validate it and finally mark it as invalid inside the same transaction at phase 4?
That's moving 2 lines of code...
Sorry phase 4 is the swap phase. Validation happens at phase 3.
Michael
On Thu, Mar 7, 2013 at 7:19 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> The strange thing about "hoge_pkey_cct_cct" is that it seems to imply that an invalid index was reindexed concurrently?
>
> But I don't see how it could happen either. Fujii, can you reproduce it?

Yes, I can even with the latest version of the patch. The test case to reproduce it is:

(Session 1)
CREATE TABLE hoge (i int primary key);
INSERT INTO hoge VALUES (generate_series(1,10));

(Session 2)
BEGIN;
SELECT * FROM hoge;
(keep this session as it is)

(Session 1)
SET statement_timeout TO '1s';
REINDEX TABLE CONCURRENTLY hoge;
\d hoge
REINDEX TABLE CONCURRENTLY hoge;
\d hoge

Regards,

Fujii Masao
On 2013-03-07 09:58:58 +0900, Michael Paquier wrote:
> >> + The recommended recovery method in such cases is to drop the concurrent
> >> + index and try again to perform <command>REINDEX CONCURRENTLY</>.
> >>
> >> If an invalid index depends on the constraint like primary key, "drop the concurrent index" cannot actually drop the index. In this case, you need to issue "alter table ... drop constraint ..." to recover the situation. I think this information should be documented.
> >
> > I think we just shouldn't set ->isprimary on the temporary indexes. Now we switch only the relfilenodes and not the whole index, that should be perfectly fine.
>
> > Sounds good. But, what about other constraint case like unique constraint? Those other cases also can be resolved by not setting ->isprimary?
>
> > We should stick with the concurrent index being a twin of the index it rebuilds for consistency.
>
> > I don't think its legal. We cannot simply have two indexes with 'indisprimary'. Especially not if both are valid.
> > Also, there will be no pg_constraint row that refers to it which violates very valid expectations that both users and pg may have.
>
> So what to do with that?
> Mark the concurrent index as valid, then validate it and finally mark it as invalid inside the same transaction at phase 4?
> That's moving 2 lines of code...
>
> Sorry phase 4 is the swap phase. Validation happens at phase 3.

Why do you want to temporarily mark it as valid? I don't see any requirement that it is set to that during validate_index() (which imo is badly named, but...). I'd just set it to valid in the same transaction that does the swap.

Greetings,

Andres Freund
On Fri, Mar 8, 2013 at 1:41 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
On Thu, Mar 7, 2013 at 7:19 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> The strange think about "hoge_pkey_cct_cct" is that it seems to imply
> that an invalid index was reindexed concurrently?
>
> But I don't see how it could happen either. Fujii, can you reproduce it?
Yes, I can even with the latest version of the patch. The test case to
reproduce it is:
(Session 1)
CREATE TABLE hoge (i int primary key);
INSERT INTO hoge VALUES (generate_series(1,10));

(Session 2)
BEGIN;
SELECT * FROM hoge;
(keep this session as it is)
(Session 1)
SET statement_timeout TO '1s';
REINDEX TABLE CONCURRENTLY hoge;
\d hoge
REINDEX TABLE CONCURRENTLY hoge;
\d hoge
I fixed this problem in the patch attached. It was caused by 2 things:
- The concurrent index was seen as valid from other backends between phases 3 and 4. So the concurrent index is made valid at phase 4, then swap is done and finally marked as invalid. So it remains invalid seen from the other sessions.
- index_set_state_flags used heap_inplace_update, which is not completely safe at swapping phase, so I had to extend it a bit to use a safe simple_heap_update at swap phase.
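For what it's worth, the way those flags look from another session can be observed with a simple catalog query like the one below (just an observation query on the example table used upthread, not something added by the patch):

=# SELECT indexrelid::regclass, indisvalid, indisready, indislive
     FROM pg_index WHERE indrelid = 'hoge'::regclass;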
Regards,
Michael
On Fri, Mar 8, 2013 at 10:00 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Fri, Mar 8, 2013 at 1:41 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> On Thu, Mar 7, 2013 at 7:19 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> > The strange thing about "hoge_pkey_cct_cct" is that it seems to imply that an invalid index was reindexed concurrently?
>> >
>> > But I don't see how it could happen either. Fujii, can you reproduce it?
>>
>> Yes, I can even with the latest version of the patch. The test case to reproduce it is:
>>
>> (Session 1)
>> CREATE TABLE hoge (i int primary key);
>> INSERT INTO hoge VALUES (generate_series(1,10));
>>
>> (Session 2)
>> BEGIN;
>> SELECT * FROM hoge;
>> (keep this session as it is)
>>
>> (Session 1)
>> SET statement_timeout TO '1s';
>> REINDEX TABLE CONCURRENTLY hoge;
>> \d hoge
>> REINDEX TABLE CONCURRENTLY hoge;
>> \d hoge
>
> I fixed this problem in the patch attached. It was caused by 2 things:
> - The concurrent index was seen as valid from other backend between phases 3 and 4. So the concurrent index is made valid at phase 4, then swap is done and finally marked as invalid. So it remains invalid seen from the other sessions.
> - index_set_state_flags used heap_inplace_update, which is not completely safe at swapping phase, so I had to extend it a bit to use a safe simple_heap_update at swap phase.

Thanks!

+ <para>
+ Concurrent indexes based on a <literal>PRIMARY KEY</> or an <literal>
+ EXCLUSION</> constraint need to be dropped with <literal>ALTER TABLE

Typo: s/EXCLUSION/EXCLUDE

I encountered a segmentation fault when I ran REINDEX CONCURRENTLY. The test case to reproduce the segmentation fault is:

1. Install btree_gist
2. Run btree_gist's regression test (i.e., make installcheck)
3. Log in contrib_regression database after the regression test
4. Execute REINDEX TABLE CONCURRENTLY moneytmp

Regards,

Fujii Masao
On Sat, Mar 9, 2013 at 1:37 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
+ <para>
+ Concurrent indexes based on a <literal>PRIMARY KEY</> or an <literal>
+ EXCLUSION</> constraint need to be dropped with <literal>ALTER TABLE
Typo: s/EXCLUSION/EXCLUDE
Thanks. This is corrected.
I encountered a segmentation fault when I ran REINDEX CONCURRENTLY.
The test case to reproduce the segmentation fault is:
1. Install btree_gist
2. Run btree_gist's regression test (i.e., make installcheck)
3. Log in contrib_regression database after the regression test
4. Execute REINDEX TABLE CONCURRENTLY moneytmp
Oops. I simply forgot to take into account the case of system attributes when building column names in index_concurrent_create. Fixed in new version attached.
Regards,
--
Michael
On Sat, Mar 9, 2013 at 1:31 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Sat, Mar 9, 2013 at 1:37 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> + <para>
>> + Concurrent indexes based on a <literal>PRIMARY KEY</> or an <literal>
>> + EXCLUSION</> constraint need to be dropped with <literal>ALTER TABLE
>>
>> Typo: s/EXCLUSION/EXCLUDE
>
> Thanks. This is corrected.
>
>> I encountered a segmentation fault when I ran REINDEX CONCURRENTLY. The test case to reproduce the segmentation fault is:
>>
>> 1. Install btree_gist
>> 2. Run btree_gist's regression test (i.e., make installcheck)
>> 3. Log in contrib_regression database after the regression test
>> 4. Execute REINDEX TABLE CONCURRENTLY moneytmp
>
> Oops. I simply forgot to take into account the case of system attributes when building column names in index_concurrent_create. Fixed in new version attached.

Thanks for updating the patch!

I found the problem that the patch changed the behavior of ALTER TABLE SET TABLESPACE so that it moves also the index on the specified table to new tablespace. Per the document of ALTER TABLE, this is not right behavior.

I think that it's worth adding new option for concurrent rebuilding into reindexdb command. It's better to implement this separately from core patch, though.

You need to add the description of locking of REINDEX CONCURRENTLY into mvcc.sgml, I think.

+ Rebuild a table concurrently:
+
+<programlisting>
+REINDEX TABLE CONCURRENTLY my_broken_table;

Obviously REINDEX cannot rebuild a table ;)

Regards,

Fujii Masao
On Fri, Mar 8, 2013 at 1:46 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> Why do you want to temporarily mark it as valid? I don't see any requirement that it is set to that during validate_index() (which imo is badly named, but...).
> I'd just set it to valid in the same transaction that does the swap.

+1. I cannot realize yet why isprimary flag needs to be set even in the invalid index. In current patch, we can easily get into the inconsistent situation, i.e., a table having more than one primary key indexes.

Regards,

Fujii Masao
On Sun, Mar 10, 2013 at 3:48 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
> Thanks for updating the patch!

- "SELECT reltoastidxid "
- "FROM info_rels i JOIN pg_catalog.pg_class c "
- " ON i.reloid = c.oid"));
+ "SELECT indexrelid "
+ "FROM info_rels i "
+ " JOIN pg_catalog.pg_class c "
+ " ON i.reloid = c.oid "
+ " JOIN pg_catalog.pg_index p "
+ " ON i.reloid = p.indrelid "
+ "WHERE p.indexrelid >= %u ", FirstNormalObjectId));

This new SQL doesn't seem to be right. Old one doesn't pick up any indexes other than toast index, but new one seems to do.

Regards,

Fujii Masao
On Sun, Mar 10, 2013 at 4:50 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
On Sun, Mar 10, 2013 at 3:48 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
> Thanks for updating the patch!
- "SELECT reltoastidxid "
- "FROM info_rels i JOIN pg_catalog.pg_class c "
- " ON i.reloid = c.oid"));
+ "SELECT indexrelid "
+ "FROM info_rels i "
+ " JOIN pg_catalog.pg_class c "
+ " ON i.reloid = c.oid "
+ " JOIN pg_catalog.pg_index p "
+ " ON i.reloid = p.indrelid "
+ "WHERE p.indexrelid >= %u ", FirstNormalObjectId));
This new SQL doesn't seem to be right. Old one doesn't pick up any indexes
other than toast index, but new one seems to do.
Indeed, it was selecting all indexes...
I replaced it by this query reducing the selection of indexes for toast relations:
- "SELECT reltoastidxid "
- "FROM info_rels i JOIN pg_catalog.pg_class c "
- " ON i.reloid = c.oid"));
+ "SELECT indexrelid "
+ "FROM pg_index "
+ "WHERE indrelid IN (SELECT reltoastrelid "
+ " FROM pg_class "
+ " WHERE oid >= %u "
+ " AND reltoastrelid != %u)",
+ FirstNormalObjectId, InvalidOid));
Will send patch soon...
Michael
Please find attached updated version. I also corrected the problem of the query in pg_upgrade when fetching Oids of indexes of toast relation.
On Sun, Mar 10, 2013 at 3:48 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
I found the problem that the patch changed the behavior of
ALTER TABLE SET TABLESPACE so that it moves also
the index on the specified table to new tablespace. Per the
document of ALTER TABLE, this is not right behavior.
Oops. Fixed in the patch attached. The bug was in the reltoastidxid patch, not REINDEX CONCURRENTLY core.
I think that it's worth adding new option for concurrent rebuilding
into reindexdb command. It's better to implement this separately
from core patch, though.
Yeah, agreed. It is not that much complicated. And this should be done after this patch is finished.
You need to add the description of locking of REINDEX CONCURRENTLY
into mvcc.sgml, I think.
OK, I added some reference to that in the docs. I also added a paragraph about the lock used during process.
+ Rebuild a table concurrently:
+
+<programlisting>
+REINDEX TABLE CONCURRENTLY my_broken_table;
OK... OK... Documentation should be polished more... I changed this paragraph a bit to mention that read and write operations can be performed on the table in this case.
--
I have been working on improving the code of the 2 patches:
1) reltoastidxid removal:
- Improvement of mechanism in tuptoaster.c to fetch the first valid index for toast value deletion and fetch
- Added a macro called RelationGetIndexListIfValid that avoids recompiling the index list with list_copy as RelationGetIndexList does. Not using a macro resulted in increased shared memory usage when multiple toast values were added inside the same query (stuff like "insert into tab values (generate_series(1,1000), '2k_long_text')")
- Fix a bug with pg_dump and binary upgrade. One valid index is necessary for a given toast relation.
2) reindex concurrently:
- correction of some comments
- fix for index_concurrent_set_dead where the process did not wait until other backends released their lock on the parent relation
- addition of an error message in index_concurrent_drop when an attempt is made to drop a live index. Dropping a live index with only a ShareUpdate lock is dangerous
I am also planning to test the potential performance impact of the patch removing reltoastidxid with scripts of the type attached. I don't really know if it can be quantified but I'll give it a try with some methods (not yet completely defined).
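For reference, the kind of toast-heavy statement mentioned above would look roughly like the following (table name and payload size are only illustrative, with repeat() standing in for the 2k-long text):

=# CREATE TABLE tab (id int, val text);
=# INSERT INTO tab SELECT generate_series(1, 1000), repeat('x', 2048);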
--
Michael
On Wed, Mar 13, 2013 at 9:04 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> I have been working on improving the code of the 2 patches:

I found pg_dump dumps even the invalid index. But pg_dump should ignore the invalid index? This problem exists even without REINDEX CONCURRENTLY patch. So we might need to implement the bugfix patch separately rather than including the bugfix code in your patches. Probably the backport would be required. Thought?

We should add the concurrent reindex option into reindexdb command? This can be really separate patch, though.

Regards,

Fujii Masao
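As a side note, leftover invalid indexes like the ones discussed here can be spotted with a catalog query along these lines, independently of the patches in this thread:

=# SELECT indexrelid::regclass AS invalid_index, indrelid::regclass AS parent_rel
     FROM pg_index WHERE NOT indisvalid;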
On 2013/03/17, at 0:35, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Wed, Mar 13, 2013 at 9:04 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
>> I have been working on improving the code of the 2 patches:
>
> I found pg_dump dumps even the invalid index. But pg_dump should ignore the invalid index?
> This problem exists even without REINDEX CONCURRENTLY patch. So we might need to implement the bugfix patch separately rather than including the bugfix code in your patches.
> Probably the backport would be required. Thought?

Hum... Indeed, they shouldn't be included... Perhaps this is already known?

> We should add the concurrent reindex option into reindexdb command? This can be really separate patch, though.

Yes, they definitely should be separated for simplicity. Btw, those patches seem trivial, I'll send them.

Michael
Please find attached the patches wanted:
- 20130317_reindexdb_concurrently.patch, adding an option -c/--concurrently to reindexdb
Note that I added an error inside reindexdb for options "-s -c" as REINDEX CONCURRENTLY does not support SYSTEM.
- 20130317_dump_only_valid_index.patch, a 1-line patch that makes pg_dump not take a dump of invalid indexes. This patch can be backpatched to 9.0.
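With the first patch applied, the command-line usage would presumably look like "reindexdb --concurrently --table my_table my_db" (the --concurrently switch coming from the patch, not from stock reindexdb; as noted above, combining it with -s/--system is rejected).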
On Sun, Mar 17, 2013 at 3:31 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
On 2013/03/17, at 0:35, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Wed, Mar 13, 2013 at 9:04 PM, Michael Paquier
> I found pg_dump dumps even the invalid index. But pg_dump should
> ignore the invalid index?
> This problem exists even without REINDEX CONCURRENTLY patch. So we might need to
> implement the bugfix patch separately rather than including the bugfix
> code in your patches.
> Probably the backport would be required. Thought?
Hum... Indeed, they shouldn't be included... Perhaps this is already known?
Note that there have been some recent discussions about that. This *problem* also concerned pg_upgrade.
http://www.postgresql.org/message-id/20121207141236.GB4699@alvh.no-ip.org
Michael
On Sun, Mar 17, 2013 at 9:24 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> Please find attached the patches wanted:
> - 20130317_dump_only_valid_index.patch, a 1-line patch that makes pg_dump not take a dump of invalid indexes. This patch can be backpatched to 9.0.

Don't indisready and indislive need to be checked?

The patch seems to change pg_dump so that it ignores an invalid index only when the remote server version >= 9.0. But why not when the remote server version < 9.0?

I think that you should start new thread to get much attention about this patch if there is not enough feedback.

> Note that there have been some recent discussions about that. This *problem* also concerned pg_upgrade.
> http://www.postgresql.org/message-id/20121207141236.GB4699@alvh.no-ip.org

What's the conclusion of this discussion? pg_dump --binary-upgrade also should ignore an invalid index? pg_upgrade needs to be changed together?

Regards,

Fujii Masao
On Wed, Mar 13, 2013 at 9:04 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> I have been working on improving the code of the 2 patches:
> 1) reltoastidxid removal:
<snip>
> - Fix a bug with pg_dump and binary upgrade. One valid index is necessary for a given toast relation.

Is this bugfix related to the following?

 appendPQExpBuffer(upgrade_query,
- "SELECT c.reltoastrelid, t.reltoastidxid "
+ "SELECT c.reltoastrelid, t.indexrelid "
  "FROM pg_catalog.pg_class c LEFT JOIN "
- "pg_catalog.pg_class t ON (c.reltoastrelid = t.oid) "
- "WHERE c.oid = '%u'::pg_catalog.oid;",
+ "pg_catalog.pg_index t ON (c.reltoastrelid = t.indrelid) "
+ "WHERE c.oid = '%u'::pg_catalog.oid AND t.indisvalid "
+ "LIMIT 1",

Don't indisready and indislive need to be checked?

Why is LIMIT 1 required? The toast table can have more than one toast indexes?

Regards,

Fujii Masao
On Tue, Mar 19, 2013 at 3:03 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
On Sun, Mar 17, 2013 at 9:24 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> Please find attached the patches wanted:
> - 20130317_dump_only_valid_index.patch, a 1-line patch that makes pg_dump
> not take a dump of invalid indexes. This patch can be backpatched to 9.0.
Don't indisready and indislive need to be checked?
The patch seems to change pg_dump so that it ignores an invalid index only
when the remote server version >= 9.0. But why not when the remote server
version < 9.0?
I think that you should start new thread to get much attention about this patch
if there is not enough feedback.
Yeah... Will send a message about that...
> Note that there have been some recent discussions about that. This *problem*
> also concerned pg_upgrade.
> http://www.postgresql.org/message-id/20121207141236.GB4699@alvh.no-ip.org
What's the conclusion of this discussion? pg_dump --binary-upgrade also should
ignore an invalid index? pg_upgrade needs to be changed together?
The conclusion is that pg_dump should not need to include invalid indexes if it is
to create them as valid index during restore. However I haven't seen any patch...
Michael
On Tue, Mar 19, 2013 at 3:24 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
On Wed, Mar 13, 2013 at 9:04 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> I have been working on improving the code of the 2 patches:
> 1) reltoastidxid removal:
<snip>
> - Fix a bug with pg_dump and binary upgrade. One valid index is necessary
> for a given toast relation.
Is this bugfix related to the following?
appendPQExpBuffer(upgrade_query,
- "SELECT c.reltoastrelid, t.reltoastidxid "
+ "SELECT c.reltoastrelid, t.indexrelid "
"FROM pg_catalog.pg_class c LEFT JOIN "
- "pg_catalog.pg_class t ON (c.reltoastrelid = t.oid) "
- "WHERE c.oid = '%u'::pg_catalog.oid;",
+ "pg_catalog.pg_index t ON (c.reltoastrelid = t.indrelid) "
+ "WHERE c.oid = '%u'::pg_catalog.oid AND t.indisvalid "
+ "LIMIT 1",
Yes.
Don't indisready and indislive need to be checked?
An index is valid if it is already ready and live. We could add such checks for safety but I don't think it is necessary.
Why is LIMIT 1 required? The toast table can have more than one toast indexes?
It cannot have more than one VALID index, so as long as a check on indisvalid is there, there is no need to worry about a LIMIT condition. I only thought of that as a safeguard. The same thing applies to the addition of a condition based on indislive and indisready.
Michael
On Tue, Mar 19, 2013 at 8:54 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
On Tue, Mar 19, 2013 at 3:03 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
On Sun, Mar 17, 2013 at 9:24 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> Please find attached the patches wanted:
> - 20130317_dump_only_valid_index.patch, a 1-line patch that makes pg_dump
> not take a dump of invalid indexes. This patch can be backpatched to 9.0.
Don't indisready and indislive need to be checked?
The patch seems to change pg_dump so that it ignores an invalid index only
when the remote server version >= 9.0. But why not when the remote server
version < 9.0?
I think that you should start new thread to get much attention about this patch
if there is not enough feedback.
Yeah... Will send a message about that...
> Note that there have been some recent discussions about that. This *problem*
> also concerned pg_upgrade.
> http://www.postgresql.org/message-id/20121207141236.GB4699@alvh.no-ip.org
What's the conclusion of this discussion? pg_dump --binary-upgrade also should
ignore an invalid index? pg_upgrade needs to be changed together?
The conclusion is that pg_dump should not need to include invalid indexes if it is
to create them as valid index during restore. However I haven't seen any patch...
The fix has been done inside pg_upgrade:
http://momjian.us/main/blogs/pgblog/2012.html#December_14_2012
Nothing has been done for pg_dump.
Michael
Is someone planning to provide additional feedback about this patch at some point?
Thanks,
--
Michael
Hi,
Please find new patches realigned with HEAD. There were conflicts with commits done recently.
Thanks,
--
Michael
On 2013-03-22 07:38:36 +0900, Michael Paquier wrote:
> Is someone planning to provide additional feedback about this patch at some point?

Yes, now that I have returned from my holidays - or well, am returning from them, I do plan to. But it should probably get some implementation level review from somebody but Fujii and me...

Greetings,

Andres Freund
On Sat, Mar 23, 2013 at 10:20 PM, Andres Freund <andres@2ndquadrant.com> wrote:
On 2013-03-22 07:38:36 +0900, Michael Paquier wrote:
> Is someone planning to provide additional feedback about this patch at some
> point?
Yes, now that I have returned from my holidays - or well, am returning
from them, I do plan to. But it should probably get some implementation
level review from somebody but Fujii and me...
Yeah, it would be good to have an extra pair of fresh eyes looking at those patches.
Thanks,
--
On Sun, Mar 24, 2013 at 12:37 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Sat, Mar 23, 2013 at 10:20 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> On 2013-03-22 07:38:36 +0900, Michael Paquier wrote:
>> > Is someone planning to provide additional feedback about this patch at some point?
>>
>> Yes, now that I have returned from my holidays - or well, am returning from them, I do plan to. But it should probably get some implementation level review from somebody but Fujii and me...
>
> Yeah, it would be good to have an extra pair of fresh eyes looking at those patches.

Probably I don't have enough time to review the patch thoroughly. It's quite helpful if someone becomes another reviewer of this patch.

> Please find new patches realigned with HEAD. There were conflicts with commits done recently.

ISTM you failed to make the patches from your repository. 20130323_1_toastindex_v7.patch contains all the changes of 20130323_2_reindex_concurrently_v25.patch

Regards,

Fujii Masao
On Wed, Mar 27, 2013 at 3:05 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
ISTM you failed to make the patches from your repository.
20130323_1_toastindex_v7.patch contains all the changes of
20130323_2_reindex_concurrently_v25.patch
Oops, sorry I haven't noticed.
Please find correct versions attached (realigned with latest head at the same time).
Michael
On Wed, Mar 27, 2013 at 8:26 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Wed, Mar 27, 2013 at 3:05 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> ISTM you failed to make the patches from your repository.
>> 20130323_1_toastindex_v7.patch contains all the changes of 20130323_2_reindex_concurrently_v25.patch
>
> Oops, sorry I haven't noticed.
> Please find correct versions attached (realigned with latest head at the same time).

Thanks!

- reltoastidxid = rel->rd_rel->reltoastidxid;
+ /* Fetch the list of indexes on toast relation if necessary */
+ if (OidIsValid(reltoastrelid))
+ {
+     Relation toastRel = relation_open(reltoastrelid, lockmode);
+     RelationGetIndexList(toastRel);
+     reltoastidxids = list_copy(toastRel->rd_indexlist);
+     relation_close(toastRel, NoLock);

list_copy() seems not to be required here. We can just set reltoastidxids to the return list of RelationGetIndexList().

Since we call relation_open() with lockmode, ISTM that we should also call relation_close() with the same lockmode instead of NoLock. No?

- if (OidIsValid(reltoastidxid))
-     ATExecSetTableSpace(reltoastidxid, newTableSpace, lockmode);
+ foreach(lc, reltoastidxids)
+ {
+     Oid idxid = lfirst_oid(lc);
+     if (OidIsValid(idxid))
+         ATExecSetTableSpace(idxid, newTableSpace, lockmode);

Since idxid is the pg_index.indexrelid, ISTM it should never be invalid. If this is true, the check of OidIsValid(idxid) is not required.

Regards,

Fujii Masao
Thanks for the comments. Please find updated patches attached.
Michael
On Thu, Mar 28, 2013 at 3:12 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
- reltoastidxid = rel->rd_rel->reltoastidxid;
+ /* Fetch the list of indexes on toast relation if necessary */
+ if (OidIsValid(reltoastrelid))
+ {
+ Relation toastRel = relation_open(reltoastrelid, lockmode);
+ RelationGetIndexList(toastRel);
+ reltoastidxids = list_copy(toastRel->rd_indexlist);
+ relation_close(toastRel, NoLock);
list_copy() seems not to be required here. We can just set reltoastidxids to
the return list of RelationGetIndexList().
Good catch. I thought that I took care of such things in previous versions at
all the places.
Since we call relation_open() with lockmode, ISTM that we should also call
relation_close() with the same lockmode instead of NoLock. No?
Agreed on that.
- if (OidIsValid(reltoastidxid))
- ATExecSetTableSpace(reltoastidxid, newTableSpace, lockmode);
+ foreach(lc, reltoastidxids)
+ {
+ Oid idxid = lfirst_oid(lc);
+ if (OidIsValid(idxid))
+ ATExecSetTableSpace(idxid, newTableSpace, lockmode);
Since idxid is the pg_index.indexrelid, ISTM it should never be invalid.
If this is true, the check of OidIsValid(idxid) is not required.
Indeed...
Michael
On 2013-03-28 10:18:45 +0900, Michael Paquier wrote:
> On Thu, Mar 28, 2013 at 3:12 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
> > Since we call relation_open() with lockmode, ISTM that we should also call relation_close() with the same lockmode instead of NoLock. No?
>
> Agreed on that.

That doesn't really hold true generally, its often sensible to hold the lock till the end of the transaction, which is what not specifying a lock at close implies.

Greetings,

Andres Freund
On 2013-03-19 08:57:31 +0900, Michael Paquier wrote:
> On Tue, Mar 19, 2013 at 3:24 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
> > On Wed, Mar 13, 2013 at 9:04 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> > > I have been working on improving the code of the 2 patches:
> > > 1) reltoastidxid removal:
> > <snip>
> > > - Fix a bug with pg_dump and binary upgrade. One valid index is necessary for a given toast relation.
> >
> > Is this bugfix related to the following?
> >
> >  appendPQExpBuffer(upgrade_query,
> > - "SELECT c.reltoastrelid, t.reltoastidxid "
> > + "SELECT c.reltoastrelid, t.indexrelid "
> >   "FROM pg_catalog.pg_class c LEFT JOIN "
> > - "pg_catalog.pg_class t ON (c.reltoastrelid = t.oid) "
> > - "WHERE c.oid = '%u'::pg_catalog.oid;",
> > + "pg_catalog.pg_index t ON (c.reltoastrelid = t.indrelid) "
> > + "WHERE c.oid = '%u'::pg_catalog.oid AND t.indisvalid "
> > + "LIMIT 1",
>
> Yes.
>
> > Don't indisready and indislive need to be checked?
>
> An index is valid if it is already ready and live. We could add such checks for safety but I don't think it is necessary.

Note that thats not true for 9.2. live && !ready represents isdead there, since the need for that was only recognized after the release.

Greetings,

Andres Freund
On Thu, Mar 28, 2013 at 10:34 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2013-03-28 10:18:45 +0900, Michael Paquier wrote:
>> On Thu, Mar 28, 2013 at 3:12 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> > Since we call relation_open() with lockmode, ISTM that we should also call relation_close() with the same lockmode instead of NoLock. No?
>>
>> Agreed on that.
>
> That doesn't really hold true generally, its often sensible to hold the lock till the end of the transaction, which is what not specifying a lock at close implies.

You're right. Even if we release the lock there, the lock is taken again soon and held till the end of the transaction. There is no need to release the lock there.

Regards,

Fujii Masao
Hi,
I moved this patch to the next commit fest.
Thanks,
--
Michael
Hi all,
Please find attached the latest versions of REINDEX CONCURRENTLY for the 1st commit fest of 9.4:
- 20130606_1_remove_reltoastidxid_v9.patch, removing reltoastidxid, to allow a toast relation to have multiple indexes running in parallel (extra indexes could be created by a REINDEX CONCURRENTLY processed)
- 20130606_2_reindex_concurrently_v26.patch, correcting some comments and fixed a lock in index_concurrent_create on an index relation not released at the end of a transaction
Those patches have been generated with context diffs...
--
Michael
On Thu, Jun 6, 2013 at 1:29 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> Hi all,
>
> Please find attached the latest versions of REINDEX CONCURRENTLY for the 1st commit fest of 9.4:
> - 20130606_1_remove_reltoastidxid_v9.patch, removing reltoastidxid, to allow a toast relation to have multiple indexes running in parallel (extra indexes could be created by a REINDEX CONCURRENTLY processed)
> - 20130606_2_reindex_concurrently_v26.patch, correcting some comments and fixed a lock in index_concurrent_create on an index relation not released at the end of a transaction

Could you let me know how this patch has something to do with MVCC catalog access patch? Should we wait for MVCC catalog access patch to be committed before starting to review this patch?

Regards,

Fujii Masao
On 2013-06-17 04:20:03 +0900, Fujii Masao wrote: > On Thu, Jun 6, 2013 at 1:29 PM, Michael Paquier > <michael.paquier@gmail.com> wrote: > > Hi all, > > > > Please find attached the latest versions of REINDEX CONCURRENTLY for the 1st > > commit fest of 9.4: > > - 20130606_1_remove_reltoastidxid_v9.patch, removing reltoastidxid, to allow > > a toast relation to have multiple indexes running in parallel (extra indexes > > could be created by a REINDEX CONCURRENTLY processed) > > - 20130606_2_reindex_concurrently_v26.patch, correcting some comments and > > fixed a lock in index_concurrent_create on an index relation not released at > > the end of a transaction > > Could you let me know how this patch has something to do with MVCC catalog > access patch? Should we wait for MVCC catalog access patch to be committed > before starting to review this patch? I wondered the same. The MVCC catalog patch, if applied, would make it possible to make the actual relfilenode swap concurrently instead of requiring to take access exlusive locks which obviously is way nicer. On the other hand, that function is only a really small part of this patch, so it seems quite possible to make another pass at it before relying on mvcc catalog scans. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Jun 17, 2013 at 5:23 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-06-17 04:20:03 +0900, Fujii Masao wrote:
> > On Thu, Jun 6, 2013 at 1:29 PM, Michael Paquier
> > <michael.paquier@gmail.com> wrote:
> > > Hi all,
> > >
> > > Please find attached the latest versions of REINDEX CONCURRENTLY for the 1st
> > > commit fest of 9.4:
> > > - 20130606_1_remove_reltoastidxid_v9.patch, removing reltoastidxid, to allow
> > > a toast relation to have multiple indexes running in parallel (extra indexes
> > > could be created by a REINDEX CONCURRENTLY processed)
> > > - 20130606_2_reindex_concurrently_v26.patch, correcting some comments and
> > > fixed a lock in index_concurrent_create on an index relation not released at
> > > the end of a transaction
> >
> > Could you let me know how this patch has something to do with MVCC catalog
> > access patch? Should we wait for MVCC catalog access patch to be committed
> > before starting to review this patch?
> I wondered the same. The MVCC catalog patch, if applied, would make it
> possible to make the actual relfilenode swap concurrently instead of
> requiring to take access exlusive locks which obviously is way nicer. On
> the other hand, that function is only a really small part of this patch,
> so it seems quite possible to make another pass at it before relying on
> mvcc catalog scans.
As mentioned by Andres, the only thing that the MVCC catalog patch can improve here
is the index swap phase (index_concurrent_swap:index.c) where the relfilenodes of the
old and new indexes are exchanged. Currently an AccessExclusiveLock is taken on the 2 relations
being swapped; with the MVCC catalog access we could lower that to a
ShareUpdateExclusiveLock, I think.
Also, with the MVCC catalog patch in, we could add some isolation tests for
REINDEX CONCURRENTLY (there were some tests in one of the previous versions),
which is currently not possible due to the exclusive lock taken at the swap phase.
Btw, those are minor things in the patch, so I think that it would be better to not wait
for the MVCC catalog patch. Even if you think that it would be better to wait for it,
you could even begin with the 1st patch allowing a toast relation to have multiple
indexes (removal of reltoastidxid) which does not depend at all on it.
Thanks,
Michael
On 6/17/13 8:23 AM, Michael Paquier wrote: > As mentionned by Andres, the only thing that the MVCC catalog patch can > improve here > is the index swap phase (index_concurrent_swap:index.c) where the > relfilenode of the > old and new indexes are exchanged. Now an AccessExclusiveLock is taken > on the 2 relations > being swap, we could leverage that to ShareUpdateExclusiveLock with the > MVCC catalog > access I think. Without getting rid of the AccessExclusiveLock, REINDEX CONCURRENTLY is not really concurrent, at least not concurrent to the standard set by CREATE and DROP INDEX CONCURRENTLY.
On 2013-06-17 09:12:12 -0400, Peter Eisentraut wrote: > On 6/17/13 8:23 AM, Michael Paquier wrote: > > As mentionned by Andres, the only thing that the MVCC catalog patch can > > improve here > > is the index swap phase (index_concurrent_swap:index.c) where the > > relfilenode of the > > old and new indexes are exchanged. Now an AccessExclusiveLock is taken > > on the 2 relations > > being swap, we could leverage that to ShareUpdateExclusiveLock with the > > MVCC catalog > > access I think. > > Without getting rid of the AccessExclusiveLock, REINDEX CONCURRENTLY is > not really concurrent, at least not concurrent to the standard set by > CREATE and DROP INDEX CONCURRENTLY. Well, it still does the main body of work in a concurrent fashion, so I still don't see how that argument holds that much water. But anyway, the argument was only whether we could continue reviewing before the mvcc stuff goes in, not whether it can get committed before. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 6/17/13 9:19 AM, Andres Freund wrote: >> Without getting rid of the AccessExclusiveLock, REINDEX CONCURRENTLY is >> not really concurrent, at least not concurrent to the standard set by >> CREATE and DROP INDEX CONCURRENTLY. > > Well, it still does the main body of work in a concurrent fashion, so I > still don't see how that argument holds that much water. The reason we added DROP INDEX CONCURRENTLY is so that you don't get stuck in a lock situation like long-running-transaction <- DROP INDEX <- everything else If we accepted REINDEX CONCURRENTLY as currently proposed, then it would have the same problem. I don't think we should accept a REINDEX CONCURRENTLY implementation that is worse in that respect than a manual CREATE INDEX CONCURRENTLY + DROP INDEX CONCURRENTLY combination.
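To spell out the queueing hazard described above, here is an illustration (object names are made up, not from the patch): a plain DROP INDEX requests an AccessExclusiveLock, waits behind the long-running transaction, and every later query on the table then queues behind that waiting request.
-- Session 1: long-running transaction holding a lock on the table
BEGIN;
SELECT count(*) FROM tab;        -- keeps an AccessShareLock until commit

-- Session 2: blocked behind session 1, waiting for AccessExclusiveLock
DROP INDEX tab_idx;

-- Session 3: even a simple read now queues behind session 2's lock request
SELECT * FROM tab WHERE id = 1;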
On 2013-06-17 11:03:35 -0400, Peter Eisentraut wrote: > On 6/17/13 9:19 AM, Andres Freund wrote: > >> Without getting rid of the AccessExclusiveLock, REINDEX CONCURRENTLY is > >> not really concurrent, at least not concurrent to the standard set by > >> CREATE and DROP INDEX CONCURRENTLY. > > > > Well, it still does the main body of work in a concurrent fashion, so I > > still don't see how that argument holds that much water. > > The reason we added DROP INDEX CONCURRENTLY is so that you don't get > stuck in a lock situation like > > long-running-transaction <- DROP INDEX <- everything else > > If we accepted REINDEX CONCURRENTLY as currently proposed, then it would > have the same problem. > > I don't think we should accept a REINDEX CONCURRENTLY implementation > that is worse in that respect than a manual CREATE INDEX CONCURRENTLY + > DROP INDEX CONCURRENTLY combination. Well, it can do lots stuff that DROP/CREATE CONCURRENTLY can't: * reindex primary keys * reindex keys referenced by foreign keys * reindex exclusion constraints * reindex toast tables * do all that for a whole database so I don't think that comparison is fair. Having it would have made several previous point releases far less painful (e.g. 9.1.6/9.2.1). But anyway, the as I said "the argument was only whether we could continue reviewing before the mvcc stuff goes in, not whether it can get committed before.". I don't think we a have need to decide whether REINDEX CONCURRENTLY can go in with the short exclusive lock unless we find unresolveable problems with the mvcc patch. Which I very, very much hope not to be the case. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
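As one concrete example of the limitations listed above, an index backing a constraint cannot be dropped concurrently at all, so a CREATE INDEX CONCURRENTLY plus DROP INDEX CONCURRENTLY combination cannot rebuild a primary key. The snippet below is illustrative only (names are made up, error text paraphrased):
CREATE TABLE t (id int PRIMARY KEY);
DROP INDEX CONCURRENTLY t_pkey;
-- fails with something like:
-- ERROR:  cannot drop index t_pkey because constraint t_pkey on table t requires it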
On Mon, Jun 17, 2013 at 9:23 PM, Michael Paquier <michael.paquier@gmail.com> wrote: > > > > On Mon, Jun 17, 2013 at 5:23 AM, Andres Freund <andres@2ndquadrant.com> > wrote: >> >> On 2013-06-17 04:20:03 +0900, Fujii Masao wrote: >> > On Thu, Jun 6, 2013 at 1:29 PM, Michael Paquier >> > <michael.paquier@gmail.com> wrote: >> > > Hi all, >> > > >> > > Please find attached the latest versions of REINDEX CONCURRENTLY for >> > > the 1st >> > > commit fest of 9.4: >> > > - 20130606_1_remove_reltoastidxid_v9.patch, removing reltoastidxid, to >> > > allow >> > > a toast relation to have multiple indexes running in parallel (extra >> > > indexes >> > > could be created by a REINDEX CONCURRENTLY processed) >> > > - 20130606_2_reindex_concurrently_v26.patch, correcting some comments >> > > and >> > > fixed a lock in index_concurrent_create on an index relation not >> > > released at >> > > the end of a transaction >> > >> > Could you let me know how this patch has something to do with MVCC >> > catalog >> > access patch? Should we wait for MVCC catalog access patch to be >> > committed >> > before starting to review this patch? >> >> I wondered the same. The MVCC catalog patch, if applied, would make it >> possible to make the actual relfilenode swap concurrently instead of >> requiring to take access exlusive locks which obviously is way nicer. On >> the other hand, that function is only a really small part of this patch, >> so it seems quite possible to make another pass at it before relying on >> mvcc catalog scans. > > As mentionned by Andres, the only thing that the MVCC catalog patch can > improve here > is the index swap phase (index_concurrent_swap:index.c) where the > relfilenode of the > old and new indexes are exchanged. Now an AccessExclusiveLock is taken on > the 2 relations > being swap, we could leverage that to ShareUpdateExclusiveLock with the MVCC > catalog > access I think. > > Also, with the MVCC catalog patch in, we could add some isolation tests for > REINDEX CONCURRENTLY (there were some tests in one of the previous > versions), > what is currently not possible due to the exclusive lock taken at swap > phase. > > Btw, those are minor things in the patch, so I think that it would be better > to not wait > for the MVCC catalog patch. Even if you think that it would be better to > wait for it, > you could even begin with the 1st patch allowing a toast relation to have > multiple > indexes (removal of reltoastidxid) which does not depend at all on it. Here are the review comments of the removal_of_reltoastidxid patch. I've not completed the review yet, but I'd like to post the current comments before going to bed ;) *** a/src/backend/catalog/system_views.sql - pg_stat_get_blocks_fetched(X.oid) - - pg_stat_get_blocks_hit(X.oid) AS tidx_blks_read, - pg_stat_get_blocks_hit(X.oid) AS tidx_blks_hit + pg_stat_get_blocks_fetched(X.indrelid) - + pg_stat_get_blocks_hit(X.indrelid) AS tidx_blks_read, + pg_stat_get_blocks_hit(X.indrelid) AS tidx_blks_hit ISTM that X.indrelid indicates the TOAST table not the TOAST index. Shouldn't we use X.indexrelid instead of X.indrelid? You changed some SQLs because of removal of reltoastidxid. Could you check that the original SQL and changed one return the same value, again? doc/src/sgml/diskusage.sgml > There will be one index on the > <acronym>TOAST</> table, if present. I'm not sure if multiple indexes on TOAST table are viewable by a user. If it's viewable, we need to correct the above description. 
doc/src/sgml/monitoring.sgml > <entry><structfield>tidx_blks_read</></entry> > <entry><type>bigint</></entry> > <entry>Number of disk blocks read from this table's TOAST table index (if any)</entry> > </row> > <row> > <entry><structfield>tidx_blks_hit</></entry> > <entry><type>bigint</></entry> > <entry>Number of buffer hits in this table's TOAST table index (if any)</entry> For the same reason as the above, we need to change "index" to "indexes" in these descriptions? *** a/src/bin/pg_dump/pg_dump.c + "SELECT c.reltoastrelid, t.indexrelid " "FROM pg_catalog.pg_class c LEFT JOIN" - "pg_catalog.pg_class t ON (c.reltoastrelid = t.oid) " - "WHERE c.oid = '%u'::pg_catalog.oid;", + "pg_catalog.pg_index t ON (c.reltoastrelid = t.indrelid) " + "WHERE c.oid = '%u'::pg_catalog.oid AND t.indisvalid " + "LIMIT 1", Is there the case where TOAST table has more than one *valid* indexes? If yes, is it really okay to choose just one index by using LIMIT 1? If no, i.e., TOAST table should have only one valid index, we should get rid of LIMIT 1 and check that only one row is returned from this query. Fortunately, ISTM this check has been already done by the subsequent call of ExecuteSqlQueryForSingleRow(). Thought? Regards, -- Fujii Masao
> Well, it can do lots stuff that DROP/CREATE CONCURRENTLY can't: > * reindex primary keys > * reindex keys referenced by foreign keys > * reindex exclusion constraints > * reindex toast tables > * do all that for a whole database > so I don't think that comparison is fair. Having it would have made > several previous point releases far less painful (e.g. 9.1.6/9.2.1). FWIW, I have a client who needs this implementation enough that we're backporting it to 9.1 for them. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On 2013-06-17 12:52:36 -0700, Josh Berkus wrote: > > > Well, it can do lots stuff that DROP/CREATE CONCURRENTLY can't: > > * reindex primary keys > > * reindex keys referenced by foreign keys > > * reindex exclusion constraints > > * reindex toast tables > > * do all that for a whole database > > so I don't think that comparison is fair. Having it would have made > > several previous point releases far less painful (e.g. 9.1.6/9.2.1). > > FWIW, I have a client who needs this implementation enough that we're > backporting it to 9.1 for them. Wait. What? Unless you break catalog compatibility that's not safely possible using this implementation. Greetings, Andres Freund PS: Josh, minor thing, but could you please not trim the CC list, at least when I am on it? -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund wrote: > PS: Josh, minor thing, but could you please not trim the CC list, at > least when I am on it? Yes, it's annoying. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 06/17/2013 01:40 PM, Alvaro Herrera wrote: > Andres Freund wrote: > >> PS: Josh, minor thing, but could you please not trim the CC list, at >> least when I am on it? > > Yes, it's annoying. I also get private comments from people who don't want me to cc them when they are already on the list. I can't satisfy everyone. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On 2013-06-17 13:46:07 -0700, Josh Berkus wrote: > On 06/17/2013 01:40 PM, Alvaro Herrera wrote: > > Andres Freund wrote: > > > >> PS: Josh, minor thing, but could you please not trim the CC list, at > >> least when I am on it? > > > > Yes, it's annoying. > > I also get private comments from people who don't want me to cc them > when they are already on the list. I can't satisfy everyone. Given that nobody but you trims the CC list I don't find that a convincing argument. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
An updated patch for the toast part is attached. On Tue, Jun 18, 2013 at 3:26 AM, Fujii Masao <masao.fujii@gmail.com> wrote: > Here are the review comments of the removal_of_reltoastidxid patch. > I've not completed the review yet, but I'd like to post the current comments > before going to bed ;) > > *** a/src/backend/catalog/system_views.sql > - pg_stat_get_blocks_fetched(X.oid) - > - pg_stat_get_blocks_hit(X.oid) AS tidx_blks_read, > - pg_stat_get_blocks_hit(X.oid) AS tidx_blks_hit > + pg_stat_get_blocks_fetched(X.indrelid) - > + pg_stat_get_blocks_hit(X.indrelid) AS tidx_blks_read, > + pg_stat_get_blocks_hit(X.indrelid) AS tidx_blks_hit > > ISTM that X.indrelid indicates the TOAST table not the TOAST index. > Shouldn't we use X.indexrelid instead of X.indrelid? Indeed good catch! We need in this case the statistics on the index and here I used the table OID. Btw, I also noticed that as multiple indexes may be involved for a given toast relation, it makes sense to actually calculate tidx_blks_read and tidx_blks_hit as the sum of all stats of the indexes. > You changed some SQLs because of removal of reltoastidxid. > Could you check that the original SQL and changed one return > the same value, again? Sure, here are some results I am getting for pg_statio_all_tables with a simple example to get stats on a table that has a toast relation. With patch (after correcting to indexrelid and defining stats as a sum): ioltas=# select relname, toast_blks_hit, tidx_blks_read from pg_statio_all_tables where relname ='aa'; relname | toast_blks_hit | tidx_blks_read ---------+----------------+---------------- aa | 433313 | 829 (1 row) With master: relname | toast_blks_hit | tidx_blks_read ---------+----------------+---------------- aa | 433313 | 829 (1 row) So the results are the same. > > doc/src/sgml/diskusage.sgml >> There will be one index on the >> <acronym>TOAST</> table, if present. > > I'm not sure if multiple indexes on TOAST table are viewable by a user. > If it's viewable, we need to correct the above description. AFAIK, toast indexes are not directly visible to the user. ioltas=# \d aa Table "public.aa" Column | Type | Modifiers --------+---------+----------- a | integer | b | text | ioltas=# select l.relname from pg_class c join pg_class l on (c.reltoastrelid = l.oid) where c.relname = 'aa'; relname ---------------- pg_toast_16386 (1 row) However you can still query the schema pg_toast to get details about a toast relation. ioltas=# \d pg_toast.pg_toast_16386_index Index "pg_toast.pg_toast_16386_index" Column | Type | Definition -----------+---------+------------ chunk_id | oid | chunk_id chunk_seq | integer | chunk_seq primary key, btree, for table "pg_toast.pg_toast_16386" > > doc/src/sgml/monitoring.sgml >> <entry><structfield>tidx_blks_read</></entry> >> <entry><type>bigint</></entry> >> <entry>Number of disk blocks read from this table's TOAST table index (if any)</entry> >> </row> >> <row> >> <entry><structfield>tidx_blks_hit</></entry> >> <entry><type>bigint</></entry> >> <entry>Number of buffer hits in this table's TOAST table index (if any)</entry> > > For the same reason as the above, we need to change "index" to "indexes" > in these descriptions? Yes it makes sense. Changed it this way. After some more search with grep, I haven't noticed any other places where it would be necessary to correct the docs. 
> > *** a/src/bin/pg_dump/pg_dump.c > + "SELECT c.reltoastrelid, t.indexrelid " > "FROM pg_catalog.pg_class c LEFT JOIN " > - "pg_catalog.pg_class t ON (c.reltoastrelid = t.oid) " > - "WHERE c.oid = '%u'::pg_catalog.oid;", > + "pg_catalog.pg_index t ON (c.reltoastrelid = t.indrelid) " > + "WHERE c.oid = '%u'::pg_catalog.oid AND t.indisvalid " > + "LIMIT 1", > > Is there the case where TOAST table has more than one *valid* indexes? I just rechecked the patch and is answer is no. The concurrent index is set as valid inside the same transaction as swap. So only the backend performing the swap will be able to see two valid toast indexes at the same time. > If yes, is it really okay to choose just one index by using LIMIT 1? > If no, i.e., TOAST table should have only one valid index, we should get rid > of LIMIT 1 and check that only one row is returned from this query. > Fortunately, ISTM this check has been already done by the subsequent > call of ExecuteSqlQueryForSingleRow(). Thought? Hum, this is debatable, but for simplicity of pg_dump code, let's remove it this LIMIT clause and rely on the assumption that a toast relation can only have one valid index at a given moment. -- Michael
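For reference, the per-table aggregation described earlier in this message (summing the stats of all indexes of the TOAST table) can be sketched in plain SQL roughly as follows; this is only an illustration of the idea, not the exact system_views.sql definition from the patch:
SELECT c.oid::regclass AS relname,
       sum(pg_stat_get_blocks_fetched(x.indexrelid) -
           pg_stat_get_blocks_hit(x.indexrelid)) AS tidx_blks_read,
       sum(pg_stat_get_blocks_hit(x.indexrelid)) AS tidx_blks_hit
FROM pg_class c
LEFT JOIN pg_index x ON x.indrelid = c.reltoastrelid
WHERE c.relkind = 'r'
GROUP BY c.oid;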
Hi, On 2013-06-18 10:53:25 +0900, Michael Paquier wrote: > diff --git a/contrib/pg_upgrade/info.c b/contrib/pg_upgrade/info.c > index c381f11..3a6342c 100644 > --- a/contrib/pg_upgrade/info.c > +++ b/contrib/pg_upgrade/info.c > @@ -321,12 +321,17 @@ get_rel_infos(ClusterInfo *cluster, DbInfo *dbinfo) > "INSERT INTO info_rels " > "SELECT reltoastrelid " > "FROM info_rels i JOIN pg_catalog.pg_class c " > - " ON i.reloid = c.oid")); > + " ON i.reloid = c.oid " > + " AND c.reltoastrelid != %u", InvalidOid)); > PQclear(executeQueryOrDie(conn, > "INSERT INTO info_rels " > - "SELECT reltoastidxid " > - "FROM info_rels i JOIN pg_catalog.pg_class c " > - " ON i.reloid = c.oid")); > + "SELECT indexrelid " > + "FROM pg_index " > + "WHERE indrelid IN (SELECT reltoastrelid " > + " FROM pg_class " > + " WHERE oid >= %u " > + " AND reltoastrelid != %u)", > + FirstNormalObjectId, InvalidOid)); What's the idea behind the >= here? I think we should ignore the invalid indexes in that SELECT? > @@ -1392,19 +1390,62 @@ swap_relation_files(Oid r1, Oid r2, bool target_is_pg_class, > } > > /* > - * If we're swapping two toast tables by content, do the same for their > - * indexes. > + * If we're swapping two toast tables by content, do the same for all of > + * their indexes. The swap can actually be safely done only if the > + * relations have indexes. > */ > if (swap_toast_by_content && > - relform1->reltoastidxid && relform2->reltoastidxid) > - swap_relation_files(relform1->reltoastidxid, > - relform2->reltoastidxid, > - target_is_pg_class, > - swap_toast_by_content, > - is_internal, > - InvalidTransactionId, > - InvalidMultiXactId, > - mapped_tables); > + relform1->reltoastrelid && > + relform2->reltoastrelid) > + { > + Relation toastRel1, toastRel2; > + > + /* Open relations */ > + toastRel1 = heap_open(relform1->reltoastrelid, AccessExclusiveLock); > + toastRel2 = heap_open(relform2->reltoastrelid, AccessExclusiveLock); > + > + /* Obtain index list */ > + RelationGetIndexList(toastRel1); > + RelationGetIndexList(toastRel2); > + > + /* Check if the swap is possible for all the toast indexes */ > + if (list_length(toastRel1->rd_indexlist) == 1 && > + list_length(toastRel2->rd_indexlist) == 1) > + { > + ListCell *lc1, *lc2; > + > + /* Now swap each couple */ > + lc2 = list_head(toastRel2->rd_indexlist); > + foreach(lc1, toastRel1->rd_indexlist) > + { > + Oid indexOid1 = lfirst_oid(lc1); > + Oid indexOid2 = lfirst_oid(lc2); > + swap_relation_files(indexOid1, > + indexOid2, > + target_is_pg_class, > + swap_toast_by_content, > + is_internal, > + InvalidTransactionId, > + InvalidMultiXactId, > + mapped_tables); > + lc2 = lnext(lc2); > + } Why are you iterating over the indexlists after checking they are both of length == 1? Looks like the code would be noticeably shorter without that. > + } > + else > + { > + /* > + * As this code path is only taken by shared catalogs, who cannot > + * have multiple indexes on their toast relation, simply return > + * an error. > + */ > + ereport(ERROR, > + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > + errmsg("cannot swap relation files of a shared catalog with multiple indexes on toast relation"))); > + } > + Absolutely minor thing, using an elog() seems to be better here since that uses the appropriate error code for some codepath that's not expected to be executed. > /* Clean up. 
*/ > heap_freetuple(reltup1); > @@ -1529,12 +1570,13 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap, > if (OidIsValid(newrel->rd_rel->reltoastrelid)) > { > Relation toastrel; > - Oid toastidx; > char NewToastName[NAMEDATALEN]; > + ListCell *lc; > + int count = 0; > > toastrel = relation_open(newrel->rd_rel->reltoastrelid, > AccessShareLock); > - toastidx = toastrel->rd_rel->reltoastidxid; > + RelationGetIndexList(toastrel); > relation_close(toastrel, AccessShareLock); > > /* rename the toast table ... */ > @@ -1543,11 +1585,23 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap, > RenameRelationInternal(newrel->rd_rel->reltoastrelid, > NewToastName, true); > > - /* ... and its index too */ > - snprintf(NewToastName, NAMEDATALEN, "pg_toast_%u_index", > - OIDOldHeap); > - RenameRelationInternal(toastidx, > - NewToastName, true); > + /* ... and its indexes too */ > + foreach(lc, toastrel->rd_indexlist) > + { > + /* > + * The first index keeps the former toast name and the > + * following entries have a suffix appended. > + */ > + if (count == 0) > + snprintf(NewToastName, NAMEDATALEN, "pg_toast_%u_index", > + OIDOldHeap); > + else > + snprintf(NewToastName, NAMEDATALEN, "pg_toast_%u_index_%d", > + OIDOldHeap, count); > + RenameRelationInternal(lfirst_oid(lc), > + NewToastName, true); > + count++; > + } > } > relation_close(newrel, NoLock); > } Is it actually possible to get here with multiple toast indexes? > diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c > index ec956ad..ac42389 100644 > --- a/src/bin/pg_dump/pg_dump.c > +++ b/src/bin/pg_dump/pg_dump.c > @@ -2781,16 +2781,16 @@ binary_upgrade_set_pg_class_oids(Archive *fout, > Oid pg_class_reltoastidxid; > > appendPQExpBuffer(upgrade_query, > - "SELECT c.reltoastrelid, t.reltoastidxid " > + "SELECT c.reltoastrelid, t.indexrelid " > "FROM pg_catalog.pg_class c LEFT JOIN " > - "pg_catalog.pg_class t ON (c.reltoastrelid = t.oid) " > - "WHERE c.oid = '%u'::pg_catalog.oid;", > + "pg_catalog.pg_index t ON (c.reltoastrelid = t.indrelid) " > + "WHERE c.oid = '%u'::pg_catalog.oid AND t.indisvalid;", > pg_class_oid); This possibly needs a version qualification due to querying indisalid. How far back do we support pg_upgrade? > diff --git a/src/include/utils/relcache.h b/src/include/utils/relcache.h > index 8ac2549..31309ed 100644 > --- a/src/include/utils/relcache.h > +++ b/src/include/utils/relcache.h > @@ -29,6 +29,16 @@ typedef struct RelationData *Relation; > typedef Relation *RelationPtr; > > /* > + * RelationGetIndexListIfValid > + * Get index list of relation without recomputing it. > + */ > +#define RelationGetIndexListIfValid(rel) \ > +do { \ > + if (rel->rd_indexvalid == 0) \ > + RelationGetIndexList(rel); \ > +} while(0) Isn't this function misnamed and should be RelationGetIndexListIfInValid? Going to do some performance tests now. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Hi, On 2013-06-18 11:35:10 +0200, Andres Freund wrote: > Going to do some performance tests now. Ok, so ran the worst case load I could think of and didn't notice any relevant performance changes. The test I ran was: CREATE TABLE test_toast(id serial primary key, data text); ALTER TABLE test_toast ALTER COLUMN data SET STORAGE external; INSERT INTO test_toast(data) SELECT repeat('a', 8000) FROM generate_series(1, 200000); VACUUM FREEZE test_toast; And then with that: \setrandom id 1 200000 SELECT id, substring(data, 1, 10) FROM test_toast WHERE id = :id; Which should really stress the potentially added overhead since we're doing many toast accesses, but always only fetch one chunk. One other thing: Your latest patch forgot to adjust rules.out, so make check didn't pass... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
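As a side note (not part of the original test), the TOAST index traffic generated by this workload should also be visible through the statistics views touched by the first patch, which gives a quick sanity check that the reads really go through the TOAST index:
SELECT relname, toast_blks_read, toast_blks_hit,
       tidx_blks_read, tidx_blks_hit
FROM pg_statio_user_tables
WHERE relname = 'test_toast';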
On Tue, Jun 18, 2013 at 10:53 AM, Michael Paquier <michael.paquier@gmail.com> wrote: > An updated patch for the toast part is attached. > > On Tue, Jun 18, 2013 at 3:26 AM, Fujii Masao <masao.fujii@gmail.com> wrote: >> Here are the review comments of the removal_of_reltoastidxid patch. >> I've not completed the review yet, but I'd like to post the current comments >> before going to bed ;) >> >> *** a/src/backend/catalog/system_views.sql >> - pg_stat_get_blocks_fetched(X.oid) - >> - pg_stat_get_blocks_hit(X.oid) AS tidx_blks_read, >> - pg_stat_get_blocks_hit(X.oid) AS tidx_blks_hit >> + pg_stat_get_blocks_fetched(X.indrelid) - >> + pg_stat_get_blocks_hit(X.indrelid) AS tidx_blks_read, >> + pg_stat_get_blocks_hit(X.indrelid) AS tidx_blks_hit >> >> ISTM that X.indrelid indicates the TOAST table not the TOAST index. >> Shouldn't we use X.indexrelid instead of X.indrelid? > Indeed good catch! We need in this case the statistics on the index > and here I used the table OID. Btw, I also noticed that as multiple > indexes may be involved for a given toast relation, it makes sense to > actually calculate tidx_blks_read and tidx_blks_hit as the sum of all > stats of the indexes. Yep. You seem to need to change X.indexrelid to X.indrelid in GROUP clause. Otherwise, you may get two rows of the same table from pg_statio_all_tables. >> doc/src/sgml/diskusage.sgml >>> There will be one index on the >>> <acronym>TOAST</> table, if present. + table (see <xref linkend="storage-toast">). There will be one valid index + on the <acronym>TOAST</> table, if present. There also might be indexes When I used gdb and tracked the code path of concurrent reindex patch, I found it's possible that more than one *valid* toast indexes appear. Those multiple valid toast indexes are viewable, for example, from pg_indexes. I'm not sure whether this is the bug of concurrent reindex patch. But if it's not, you seem to need to change the above description again. >> *** a/src/bin/pg_dump/pg_dump.c >> + "SELECT c.reltoastrelid, t.indexrelid " >> "FROM pg_catalog.pg_class c LEFT JOIN " >> - "pg_catalog.pg_class t ON (c.reltoastrelid = t.oid) " >> - "WHERE c.oid = '%u'::pg_catalog.oid;", >> + "pg_catalog.pg_index t ON (c.reltoastrelid = t.indrelid) " >> + "WHERE c.oid = '%u'::pg_catalog.oid AND t.indisvalid " >> + "LIMIT 1", >> >> Is there the case where TOAST table has more than one *valid* indexes? > I just rechecked the patch and is answer is no. The concurrent index > is set as valid inside the same transaction as swap. So only the > backend performing the swap will be able to see two valid toast > indexes at the same time. According to my quick gdb testing, this seems not to be true.... Regards, -- Fujii Masao
On Tue, Jun 18, 2013 at 9:54 PM, Andres Freund <andres@2ndquadrant.com> wrote: > Hi, > > On 2013-06-18 11:35:10 +0200, Andres Freund wrote: >> Going to do some performance tests now. > > Ok, so ran the worst case load I could think of and didn't notice > any relevant performance changes. > > The test I ran was: > > CREATE TABLE test_toast(id serial primary key, data text); > ALTER TABLE test_toast ALTER COLUMN data SET STORAGE external; > INSERT INTO test_toast(data) SELECT repeat('a', 8000) FROM generate_series(1, 200000); > VACUUM FREEZE test_toast; > > And then with that: > \setrandom id 1 200000 > SELECT id, substring(data, 1, 10) FROM test_toast WHERE id = :id; > > Which should really stress the potentially added overhead since we're > doing many toast accesses, but always only fetch one chunk. Sounds really good! Regards, -- Fujii Masao
On Wed, Jun 19, 2013 at 12:36 AM, Fujii Masao <masao.fujii@gmail.com> wrote: > On Tue, Jun 18, 2013 at 10:53 AM, Michael Paquier > <michael.paquier@gmail.com> wrote: >> An updated patch for the toast part is attached. >> >> On Tue, Jun 18, 2013 at 3:26 AM, Fujii Masao <masao.fujii@gmail.com> wrote: >>> Here are the review comments of the removal_of_reltoastidxid patch. >>> I've not completed the review yet, but I'd like to post the current comments >>> before going to bed ;) >>> >>> *** a/src/backend/catalog/system_views.sql >>> - pg_stat_get_blocks_fetched(X.oid) - >>> - pg_stat_get_blocks_hit(X.oid) AS tidx_blks_read, >>> - pg_stat_get_blocks_hit(X.oid) AS tidx_blks_hit >>> + pg_stat_get_blocks_fetched(X.indrelid) - >>> + pg_stat_get_blocks_hit(X.indrelid) AS tidx_blks_read, >>> + pg_stat_get_blocks_hit(X.indrelid) AS tidx_blks_hit >>> >>> ISTM that X.indrelid indicates the TOAST table not the TOAST index. >>> Shouldn't we use X.indexrelid instead of X.indrelid? >> Indeed good catch! We need in this case the statistics on the index >> and here I used the table OID. Btw, I also noticed that as multiple >> indexes may be involved for a given toast relation, it makes sense to >> actually calculate tidx_blks_read and tidx_blks_hit as the sum of all >> stats of the indexes. > > Yep. You seem to need to change X.indexrelid to X.indrelid in GROUP clause. > Otherwise, you may get two rows of the same table from pg_statio_all_tables. I changed it a little bit in a different way in my latest patch by adding a sum on all the indexes when getting tidx_blks stats. >>> doc/src/sgml/diskusage.sgml >>>> There will be one index on the >>>> <acronym>TOAST</> table, if present. > > + table (see <xref linkend="storage-toast">). There will be one valid index > + on the <acronym>TOAST</> table, if present. There also might be indexes > > When I used gdb and tracked the code path of concurrent reindex patch, > I found it's possible that more than one *valid* toast indexes appear. Those > multiple valid toast indexes are viewable, for example, from pg_indexes. > I'm not sure whether this is the bug of concurrent reindex patch. But > if it's not, > you seem to need to change the above description again. Not sure about that. The latest code is made such as only one valid index is present on the toast relation at the same time. > >>> *** a/src/bin/pg_dump/pg_dump.c >>> + "SELECT c.reltoastrelid, t.indexrelid " >>> "FROM pg_catalog.pg_class c LEFT JOIN " >>> - "pg_catalog.pg_class t ON (c.reltoastrelid = t.oid) " >>> - "WHERE c.oid = '%u'::pg_catalog.oid;", >>> + "pg_catalog.pg_index t ON (c.reltoastrelid = t.indrelid) " >>> + "WHERE c.oid = '%u'::pg_catalog.oid AND t.indisvalid " >>> + "LIMIT 1", >>> >>> Is there the case where TOAST table has more than one *valid* indexes? >> I just rechecked the patch and is answer is no. The concurrent index >> is set as valid inside the same transaction as swap. So only the >> backend performing the swap will be able to see two valid toast >> indexes at the same time. > > According to my quick gdb testing, this seems not to be true.... Well, I have to disagree. I am not able to reproduce it. Which version did you use? Here is what I get with the latest version of REINDEX CONCURRENTLY patch... 
I checked with the following process:
1) Create this table:
CREATE TABLE aa (a int, b text);
ALTER TABLE aa ALTER COLUMN b SET STORAGE EXTERNAL;
2) Create session 1 and take a breakpoint on ReindexRelationConcurrently:indexcmds.c
3) Launch REINDEX TABLE CONCURRENTLY aa
4) With a 2nd session, go through all the phases of the process and scan the validity of toast indexes with the following:
ioltas=# select pg_class.relname, indisvalid, indisready from pg_class, pg_index where pg_class.reltoastrelid = pg_index.indrelid and pg_class.relname = 'aa';
 relname | indisvalid | indisready
---------+------------+------------
 aa      | t          | t
 aa      | f          | t
(2 rows)
When scanning all the phases with the 2nd psql session (the concurrent index creation, build, validation, swap, and drop of the concurrent index), I saw at no moment that indisvalid was set to true for the two indexes at the same time. indisready was of course changed to prepare the concurrent index to be ready for inserts, but that was all and this is part of the process.
--
Michael
Please find an updated patch. The regression test rules has been updated, and all the comments are addressed. On Tue, Jun 18, 2013 at 6:35 PM, Andres Freund <andres@2ndquadrant.com> wrote: > Hi, > > On 2013-06-18 10:53:25 +0900, Michael Paquier wrote: >> diff --git a/contrib/pg_upgrade/info.c b/contrib/pg_upgrade/info.c >> index c381f11..3a6342c 100644 >> --- a/contrib/pg_upgrade/info.c >> +++ b/contrib/pg_upgrade/info.c >> @@ -321,12 +321,17 @@ get_rel_infos(ClusterInfo *cluster, DbInfo *dbinfo) >> "INSERT INTO info_rels " >> "SELECT reltoastrelid " >> "FROM info_rels i JOIN pg_catalog.pg_class c " >> - " ON i.reloid = c.oid")); >> + " ON i.reloid = c.oid " >> + " AND c.reltoastrelid != %u", InvalidOid)); >> PQclear(executeQueryOrDie(conn, >> "INSERT INTO info_rels " >> - "SELECT reltoastidxid " >> - "FROM info_rels i JOIN pg_catalog.pg_class c " >> - " ON i.reloid = c.oid")); >> + "SELECT indexrelid " >> + "FROM pg_index " >> + "WHERE indrelid IN (SELECT reltoastrelid " >> + " FROM pg_class " >> + " WHERE oid >= %u " >> + " AND reltoastrelid != %u)", >> + FirstNormalObjectId, InvalidOid)); > > What's the idea behind the >= here? It is here to avoid fetching the toast relations of system tables. But I see your point, the inner query fetching the toast OIDs should do a join on the exising info_rels and not try to do a join on a plain pg_index, so changed this way. > I think we should ignore the invalid indexes in that SELECT? Yes indeed, it doesn't make sense to grab invalid toast indexes. Changed this way. >> @@ -1392,19 +1390,62 @@ swap_relation_files(Oid r1, Oid r2, bool target_is_pg_class, >> } >> >> /* >> - * If we're swapping two toast tables by content, do the same for their >> - * indexes. >> + * If we're swapping two toast tables by content, do the same for all of >> + * their indexes. The swap can actually be safely done only if the >> + * relations have indexes. >> */ >> if (swap_toast_by_content && >> - relform1->reltoastidxid && relform2->reltoastidxid) >> - swap_relation_files(relform1->reltoastidxid, >> - relform2->reltoastidxid, >> - target_is_pg_class, >> - swap_toast_by_content, >> - is_internal, >> - InvalidTransactionId, >> - InvalidMultiXactId, >> - mapped_tables); >> + relform1->reltoastrelid && >> + relform2->reltoastrelid) >> + { >> + Relation toastRel1, toastRel2; >> + >> + /* Open relations */ >> + toastRel1 = heap_open(relform1->reltoastrelid, AccessExclusiveLock); >> + toastRel2 = heap_open(relform2->reltoastrelid, AccessExclusiveLock); >> + >> + /* Obtain index list */ >> + RelationGetIndexList(toastRel1); >> + RelationGetIndexList(toastRel2); >> + >> + /* Check if the swap is possible for all the toast indexes */ >> + if (list_length(toastRel1->rd_indexlist) == 1 && >> + list_length(toastRel2->rd_indexlist) == 1) >> + { >> + ListCell *lc1, *lc2; >> + >> + /* Now swap each couple */ >> + lc2 = list_head(toastRel2->rd_indexlist); >> + foreach(lc1, toastRel1->rd_indexlist) >> + { >> + Oid indexOid1 = lfirst_oid(lc1); >> + Oid indexOid2 = lfirst_oid(lc2); >> + swap_relation_files(indexOid1, >> + indexOid2, >> + target_is_pg_class, >> + swap_toast_by_content, >> + is_internal, >> + InvalidTransactionId, >> + InvalidMultiXactId, >> + mapped_tables); >> + lc2 = lnext(lc2); >> + } > > Why are you iterating over the indexlists after checking they are both > of length == 1? Looks like the code would be noticeably shorter without > that. OK. Modified this way. 
>> + } >> + else >> + { >> + /* >> + * As this code path is only taken by shared catalogs, who cannot >> + * have multiple indexes on their toast relation, simply return >> + * an error. >> + */ >> + ereport(ERROR, >> + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), >> + errmsg("cannot swap relation files of a shared catalog with multiple indexes ontoast relation"))); >> + } >> + > > Absolutely minor thing, using an elog() seems to be better here since > that uses the appropriate error code for some codepath that's not > expected to be executed. OK. Modified this way. > >> /* Clean up. */ >> heap_freetuple(reltup1); >> @@ -1529,12 +1570,13 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap, >> if (OidIsValid(newrel->rd_rel->reltoastrelid)) >> { >> Relation toastrel; >> - Oid toastidx; >> char NewToastName[NAMEDATALEN]; >> + ListCell *lc; >> + int count = 0; >> >> toastrel = relation_open(newrel->rd_rel->reltoastrelid, >> AccessShareLock); >> - toastidx = toastrel->rd_rel->reltoastidxid; >> + RelationGetIndexList(toastrel); >> relation_close(toastrel, AccessShareLock); >> >> /* rename the toast table ... */ >> @@ -1543,11 +1585,23 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap, >> RenameRelationInternal(newrel->rd_rel->reltoastrelid, >> NewToastName, true); >> >> - /* ... and its index too */ >> - snprintf(NewToastName, NAMEDATALEN, "pg_toast_%u_index", >> - OIDOldHeap); >> - RenameRelationInternal(toastidx, >> - NewToastName, true); >> + /* ... and its indexes too */ >> + foreach(lc, toastrel->rd_indexlist) >> + { >> + /* >> + * The first index keeps the former toast name and the >> + * following entries have a suffix appended. >> + */ >> + if (count == 0) >> + snprintf(NewToastName, NAMEDATALEN, "pg_toast_%u_index", >> + OIDOldHeap); >> + else >> + snprintf(NewToastName, NAMEDATALEN, "pg_toast_%u_index_%d", >> + OIDOldHeap, count); >> + RenameRelationInternal(lfirst_oid(lc), >> + NewToastName, true); >> + count++; >> + } >> } >> relation_close(newrel, NoLock); >> } > > Is it actually possible to get here with multiple toast indexes? Actually it is possible. finish_heap_swap is also called for example in ALTER TABLE where rewriting the table (phase 3), so I think it is better to protect this code path this way. >> diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c >> index ec956ad..ac42389 100644 >> --- a/src/bin/pg_dump/pg_dump.c >> +++ b/src/bin/pg_dump/pg_dump.c >> @@ -2781,16 +2781,16 @@ binary_upgrade_set_pg_class_oids(Archive *fout, >> Oid pg_class_reltoastidxid; >> >> appendPQExpBuffer(upgrade_query, >> - "SELECT c.reltoastrelid, t.reltoastidxid " >> + "SELECT c.reltoastrelid, t.indexrelid " >> "FROM pg_catalog.pg_class c LEFT JOIN " >> - "pg_catalog.pg_class t ON (c.reltoastrelid = t.oid) " >> - "WHERE c.oid = '%u'::pg_catalog.oid;", >> + "pg_catalog.pg_index t ON (c.reltoastrelid = t.indrelid) " >> + "WHERE c.oid = '%u'::pg_catalog.oid AND t.indisvalid;", >> pg_class_oid); > > This possibly needs a version qualification due to querying > indisvalid. How far back do we support pg_upgrade? By having a look at the docs, pg_upgrade has been added in 9.0 and support upgrades for version >= 8.3.X. indisvalid has been added in 8.2 so we are fine. 
> >> diff --git a/src/include/utils/relcache.h b/src/include/utils/relcache.h >> index 8ac2549..31309ed 100644 >> --- a/src/include/utils/relcache.h >> +++ b/src/include/utils/relcache.h >> @@ -29,6 +29,16 @@ typedef struct RelationData *Relation; >> typedef Relation *RelationPtr; >> >> /* >> + * RelationGetIndexListIfValid >> + * Get index list of relation without recomputing it. >> + */ >> +#define RelationGetIndexListIfValid(rel) \ >> +do { \ >> + if (rel->rd_indexvalid == 0) \ >> + RelationGetIndexList(rel); \ >> +} while(0) > > Isn't this function misnamed and should be > RelationGetIndexListIfInValid? When naming that; I had more in mind: "get the list of indexes if it is already there". It looks more intuitive to my mind. -- Michael
On 2013-06-19 09:55:24 +0900, Michael Paquier wrote: > Please find an updated patch. The regression test rules has been > updated, and all the comments are addressed. > > On Tue, Jun 18, 2013 at 6:35 PM, Andres Freund <andres@2ndquadrant.com> wrote: > > Hi, > > > > On 2013-06-18 10:53:25 +0900, Michael Paquier wrote: > >> diff --git a/contrib/pg_upgrade/info.c b/contrib/pg_upgrade/info.c > >> index c381f11..3a6342c 100644 > >> --- a/contrib/pg_upgrade/info.c > >> +++ b/contrib/pg_upgrade/info.c > >> @@ -321,12 +321,17 @@ get_rel_infos(ClusterInfo *cluster, DbInfo *dbinfo) > >> "INSERT INTO info_rels " > >> "SELECT reltoastrelid " > >> "FROM info_rels i JOIN pg_catalog.pg_class c " > >> - " ON i.reloid = c.oid")); > >> + " ON i.reloid = c.oid " > >> + " AND c.reltoastrelid != %u", InvalidOid)); > >> PQclear(executeQueryOrDie(conn, > >> "INSERT INTO info_rels " > >> - "SELECT reltoastidxid " > >> - "FROM info_rels i JOIN pg_catalog.pg_class c " > >> - " ON i.reloid = c.oid")); > >> + "SELECT indexrelid " > >> + "FROM pg_index " > >> + "WHERE indrelid IN (SELECT reltoastrelid " > >> + " FROM pg_class " > >> + " WHERE oid >= %u " > >> + " AND reltoastrelid != %u)", > >> + FirstNormalObjectId, InvalidOid)); > > > > What's the idea behind the >= here? > It is here to avoid fetching the toast relations of system tables. But > I see your point, the inner query fetching the toast OIDs should do a > join on the exising info_rels and not try to do a join on a plain > pg_index, so changed this way. I'd also rather not introduce knowledge about FirstNormalObjectId into client applications... But you fixed it already. > >> /* Clean up. */ > >> heap_freetuple(reltup1); > >> @@ -1529,12 +1570,13 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap, > >> if (OidIsValid(newrel->rd_rel->reltoastrelid)) > >> { > >> Relation toastrel; > >> - Oid toastidx; > >> char NewToastName[NAMEDATALEN]; > >> + ListCell *lc; > >> + int count = 0; > >> > >> toastrel = relation_open(newrel->rd_rel->reltoastrelid, > >> AccessShareLock); > >> - toastidx = toastrel->rd_rel->reltoastidxid; > >> + RelationGetIndexList(toastrel); > >> relation_close(toastrel, AccessShareLock); > >> > >> /* rename the toast table ... */ > >> @@ -1543,11 +1585,23 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap, > >> RenameRelationInternal(newrel->rd_rel->reltoastrelid, > >> NewToastName, true); > >> > >> - /* ... and its index too */ > >> - snprintf(NewToastName, NAMEDATALEN, "pg_toast_%u_index", > >> - OIDOldHeap); > >> - RenameRelationInternal(toastidx, > >> - NewToastName, true); > >> + /* ... and its indexes too */ > >> + foreach(lc, toastrel->rd_indexlist) > >> + { > >> + /* > >> + * The first index keeps the former toast name and the > >> + * following entries have a suffix appended. > >> + */ > >> + if (count == 0) > >> + snprintf(NewToastName, NAMEDATALEN, "pg_toast_%u_index", > >> + OIDOldHeap); > >> + else > >> + snprintf(NewToastName, NAMEDATALEN, "pg_toast_%u_index_%d", > >> + OIDOldHeap, count); > >> + RenameRelationInternal(lfirst_oid(lc), > >> + NewToastName, true); > >> + count++; > >> + } > >> } > >> relation_close(newrel, NoLock); > >> } > > > > Is it actually possible to get here with multiple toast indexes? > Actually it is possible. finish_heap_swap is also called for example > in ALTER TABLE where rewriting the table (phase 3), so I think it is > better to protect this code path this way. But why would we copy invalid toast indexes over to the new relation? 
Shouldn't the new relation have been freshly built in the previous steps? > >> diff --git a/src/include/utils/relcache.h b/src/include/utils/relcache.h > >> index 8ac2549..31309ed 100644 > >> --- a/src/include/utils/relcache.h > >> +++ b/src/include/utils/relcache.h > >> @@ -29,6 +29,16 @@ typedef struct RelationData *Relation; > >> typedef Relation *RelationPtr; > >> > >> /* > >> + * RelationGetIndexListIfValid > >> + * Get index list of relation without recomputing it. > >> + */ > >> +#define RelationGetIndexListIfValid(rel) \ > >> +do { \ > >> + if (rel->rd_indexvalid == 0) \ > >> + RelationGetIndexList(rel); \ > >> +} while(0) > > > > Isn't this function misnamed and should be > > RelationGetIndexListIfInValid? > When naming that; I had more in mind: "get the list of indexes if it > is already there". It looks more intuitive to my mind. I can't follow. RelationGetIndexListIfValid() doesn't return anything. And it doesn't do anything if the list is already valid. It only does something iff the list currently is invalid. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Jun 21, 2013 at 6:19 PM, Andres Freund <andres@2ndquadrant.com> wrote: > On 2013-06-19 09:55:24 +0900, Michael Paquier wrote: >> >> /* Clean up. */ >> >> heap_freetuple(reltup1); >> >> @@ -1529,12 +1570,13 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap, >> >> if (OidIsValid(newrel->rd_rel->reltoastrelid)) >> >> { >> >> Relation toastrel; >> >> - Oid toastidx; >> >> char NewToastName[NAMEDATALEN]; >> >> + ListCell *lc; >> >> + int count = 0; >> >> >> >> toastrel = relation_open(newrel->rd_rel->reltoastrelid, >> >> AccessShareLock); >> >> - toastidx = toastrel->rd_rel->reltoastidxid; >> >> + RelationGetIndexList(toastrel); >> >> relation_close(toastrel, AccessShareLock); >> >> >> >> /* rename the toast table ... */ >> >> @@ -1543,11 +1585,23 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap, >> >> RenameRelationInternal(newrel->rd_rel->reltoastrelid, >> >> NewToastName, true); >> >> >> >> - /* ... and its index too */ >> >> - snprintf(NewToastName, NAMEDATALEN, "pg_toast_%u_index", >> >> - OIDOldHeap); >> >> - RenameRelationInternal(toastidx, >> >> - NewToastName, true); >> >> + /* ... and its indexes too */ >> >> + foreach(lc, toastrel->rd_indexlist) >> >> + { >> >> + /* >> >> + * The first index keeps the former toast name and the >> >> + * following entries have a suffix appended. >> >> + */ >> >> + if (count == 0) >> >> + snprintf(NewToastName, NAMEDATALEN, "pg_toast_%u_index", >> >> + OIDOldHeap); >> >> + else >> >> + snprintf(NewToastName, NAMEDATALEN, "pg_toast_%u_index_%d", >> >> + OIDOldHeap, count); >> >> + RenameRelationInternal(lfirst_oid(lc), >> >> + NewToastName, true); >> >> + count++; >> >> + } >> >> } >> >> relation_close(newrel, NoLock); >> >> } >> > >> > Is it actually possible to get here with multiple toast indexes? >> Actually it is possible. finish_heap_swap is also called for example >> in ALTER TABLE where rewriting the table (phase 3), so I think it is >> better to protect this code path this way. > > But why would we copy invalid toast indexes over to the new relation? > Shouldn't the new relation have been freshly built in the previous > steps? What do you think about that? Using only the first valid index would be enough? > >> >> diff --git a/src/include/utils/relcache.h b/src/include/utils/relcache.h >> >> index 8ac2549..31309ed 100644 >> >> --- a/src/include/utils/relcache.h >> >> +++ b/src/include/utils/relcache.h >> >> @@ -29,6 +29,16 @@ typedef struct RelationData *Relation; >> >> typedef Relation *RelationPtr; >> >> >> >> /* >> >> + * RelationGetIndexListIfValid >> >> + * Get index list of relation without recomputing it. >> >> + */ >> >> +#define RelationGetIndexListIfValid(rel) \ >> >> +do { \ >> >> + if (rel->rd_indexvalid == 0) \ >> >> + RelationGetIndexList(rel); \ >> >> +} while(0) >> > >> > Isn't this function misnamed and should be >> > RelationGetIndexListIfInValid? >> When naming that; I had more in mind: "get the list of indexes if it >> is already there". It looks more intuitive to my mind. > > I can't follow. RelationGetIndexListIfValid() doesn't return > anything. And it doesn't do anything if the list is already valid. It > only does something iff the list currently is invalid. In this case RelationGetIndexListIfInvalid? -- Michael
On 2013-06-21 20:54:34 +0900, Michael Paquier wrote: > On Fri, Jun 21, 2013 at 6:19 PM, Andres Freund <andres@2ndquadrant.com> wrote: > > On 2013-06-19 09:55:24 +0900, Michael Paquier wrote: > >> >> @@ -1529,12 +1570,13 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap, > >> > Is it actually possible to get here with multiple toast indexes? > >> Actually it is possible. finish_heap_swap is also called for example > >> in ALTER TABLE where rewriting the table (phase 3), so I think it is > >> better to protect this code path this way. > > > > But why would we copy invalid toast indexes over to the new relation? > > Shouldn't the new relation have been freshly built in the previous > > steps? > What do you think about that? Using only the first valid index would be enough? What I am thinking about is the following: When we rewrite a relation, we build a completely new toast relation. Which will only have one index, right? So I don't see how this could could be correct if we deal with multiple indexes. In fact, the current patch's swap_relation_files throws an error if there are multiple ones around. > >> >> diff --git a/src/include/utils/relcache.h b/src/include/utils/relcache.h > >> >> index 8ac2549..31309ed 100644 > >> >> --- a/src/include/utils/relcache.h > >> >> +++ b/src/include/utils/relcache.h > >> >> @@ -29,6 +29,16 @@ typedef struct RelationData *Relation; > >> >> typedef Relation *RelationPtr; > >> >> > >> >> /* > >> >> + * RelationGetIndexListIfValid > >> >> + * Get index list of relation without recomputing it. > >> >> + */ > >> >> +#define RelationGetIndexListIfValid(rel) \ > >> >> +do { \ > >> >> + if (rel->rd_indexvalid == 0) \ > >> >> + RelationGetIndexList(rel); \ > >> >> +} while(0) > >> > > >> > Isn't this function misnamed and should be > >> > RelationGetIndexListIfInValid? > >> When naming that; I had more in mind: "get the list of indexes if it > >> is already there". It looks more intuitive to my mind. > > > > I can't follow. RelationGetIndexListIfValid() doesn't return > > anything. And it doesn't do anything if the list is already valid. It > > only does something iff the list currently is invalid. > In this case RelationGetIndexListIfInvalid? Yep. Suggested that above ;). Maybe RelationFetchIndexListIfInvalid()? Hm. Looking at how this is currently used - I am afraid it's not correct... the reason RelationGetIndexList() returns a copy is that cache invalidations will throw away that list. And you do index_open() while iterating over it which will accept invalidation messages. Mybe it's better to try using RelationGetIndexList directly and measure whether that has a measurable impact= Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
OK let's finalize this patch first. I'll try to send an updated patch within today. On Fri, Jun 21, 2013 at 10:47 PM, Andres Freund <andres@2ndquadrant.com> wrote: > On 2013-06-21 20:54:34 +0900, Michael Paquier wrote: >> On Fri, Jun 21, 2013 at 6:19 PM, Andres Freund <andres@2ndquadrant.com> wrote: >> > On 2013-06-19 09:55:24 +0900, Michael Paquier wrote: >> >> >> @@ -1529,12 +1570,13 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap, >> >> > Is it actually possible to get here with multiple toast indexes? >> >> Actually it is possible. finish_heap_swap is also called for example >> >> in ALTER TABLE where rewriting the table (phase 3), so I think it is >> >> better to protect this code path this way. >> > >> > But why would we copy invalid toast indexes over to the new relation? >> > Shouldn't the new relation have been freshly built in the previous >> > steps? >> What do you think about that? Using only the first valid index would be enough? > > What I am thinking about is the following: When we rewrite a relation, > we build a completely new toast relation. Which will only have one > index, right? So I don't see how this could could be correct if we deal > with multiple indexes. In fact, the current patch's swap_relation_files > throws an error if there are multiple ones around. Yes, OK. Let me have a look at the code of CLUSTER more in details before giving a precise answer, but I'll try to remove that renaming part. Btw, I'd like to add an assertion in the code at least to prevent wrong use of this code path. >> >> >> diff --git a/src/include/utils/relcache.h b/src/include/utils/relcache.h >> >> >> index 8ac2549..31309ed 100644 >> >> >> --- a/src/include/utils/relcache.h >> >> >> +++ b/src/include/utils/relcache.h >> >> >> @@ -29,6 +29,16 @@ typedef struct RelationData *Relation; >> >> >> typedef Relation *RelationPtr; >> >> >> >> >> >> /* >> >> >> + * RelationGetIndexListIfValid >> >> >> + * Get index list of relation without recomputing it. >> >> >> + */ >> >> >> +#define RelationGetIndexListIfValid(rel) \ >> >> >> +do { \ >> >> >> + if (rel->rd_indexvalid == 0) \ >> >> >> + RelationGetIndexList(rel); \ >> >> >> +} while(0) >> >> > >> >> > Isn't this function misnamed and should be >> >> > RelationGetIndexListIfInValid? >> >> When naming that; I had more in mind: "get the list of indexes if it >> >> is already there". It looks more intuitive to my mind. >> > >> > I can't follow. RelationGetIndexListIfValid() doesn't return >> > anything. And it doesn't do anything if the list is already valid. It >> > only does something iff the list currently is invalid. >> In this case RelationGetIndexListIfInvalid? > > Yep. Suggested that above ;). Maybe RelationFetchIndexListIfInvalid()? > > Hm. Looking at how this is currently used - I am afraid it's not > correct... the reason RelationGetIndexList() returns a copy is that > cache invalidations will throw away that list. And you do index_open() > while iterating over it which will accept invalidation messages. > Mybe it's better to try using RelationGetIndexList directly and measure > whether that has a measurable impact= Yes, I was wondering about potential memory leak that list_copy could introduce in tuptoaster.c when doing a bulk insert, that's the only reason why I added this macro. -- Michael
On Fri, Jun 21, 2013 at 10:47 PM, Andres Freund <andres@2ndquadrant.com> wrote: > Hm. Looking at how this is currently used - I am afraid it's not > correct... the reason RelationGetIndexList() returns a copy is that > cache invalidations will throw away that list. And you do index_open() > while iterating over it which will accept invalidation messages. > Mybe it's better to try using RelationGetIndexList directly and measure > whether that has a measurable impact= By looking at the comments of RelationGetIndexList:relcache.c, actually the method of the patch is correct because in the event of a shared cache invalidation, rd_indexvalid is set to 0 when the index list is reset, so the index list would get recomputed even in the case of shared mem invalidation. -- Michael
OK, please find attached a new patch for the toast part. IMHO, the patch is now in pretty good shape... but I cannot judge for others. On Fri, Jun 21, 2013 at 10:47 PM, Andres Freund <andres@2ndquadrant.com> wrote: > On 2013-06-21 20:54:34 +0900, Michael Paquier wrote: >> On Fri, Jun 21, 2013 at 6:19 PM, Andres Freund <andres@2ndquadrant.com> wrote: >> > On 2013-06-19 09:55:24 +0900, Michael Paquier wrote: >> >> >> @@ -1529,12 +1570,13 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap, >> >> > Is it actually possible to get here with multiple toast indexes? >> >> Actually it is possible. finish_heap_swap is also called for example >> >> in ALTER TABLE where rewriting the table (phase 3), so I think it is >> >> better to protect this code path this way. >> > >> > But why would we copy invalid toast indexes over to the new relation? >> > Shouldn't the new relation have been freshly built in the previous >> > steps? >> What do you think about that? Using only the first valid index would be enough? > > What I am thinking about is the following: When we rewrite a relation, > we build a completely new toast relation. Which will only have one > index, right? So I don't see how this could could be correct if we deal > with multiple indexes. In fact, the current patch's swap_relation_files > throws an error if there are multiple ones around. I have reworked the code in cluster.c and made the changes more consistent knowing that a given toast relation should only have one valid index. This minimizes the modifications where relfilenode is swapped for toast indexes, as the swap is now done only on the single valid index that a toast relation has. Also, I removed the error that was triggered in previous versions when a toast relation had more than one index. >> >> >> diff --git a/src/include/utils/relcache.h b/src/include/utils/relcache.h >> >> >> index 8ac2549..31309ed 100644 >> >> >> --- a/src/include/utils/relcache.h >> >> >> +++ b/src/include/utils/relcache.h >> >> >> @@ -29,6 +29,16 @@ typedef struct RelationData *Relation; >> >> >> typedef Relation *RelationPtr; >> >> >> >> >> >> /* >> >> >> + * RelationGetIndexListIfValid >> >> >> + * Get index list of relation without recomputing it. >> >> >> + */ >> >> >> +#define RelationGetIndexListIfValid(rel) \ >> >> >> +do { \ >> >> >> + if (rel->rd_indexvalid == 0) \ >> >> >> + RelationGetIndexList(rel); \ >> >> >> +} while(0) >> >> > >> >> > Isn't this function misnamed and should be >> >> > RelationGetIndexListIfInValid? >> >> When naming that; I had more in mind: "get the list of indexes if it >> >> is already there". It looks more intuitive to my mind. >> > >> > I can't follow. RelationGetIndexListIfValid() doesn't return >> > anything. And it doesn't do anything if the list is already valid. It >> > only does something iff the list currently is invalid. >> In this case RelationGetIndexListIfInvalid? > > Yep. Suggested that above ;). Maybe RelationFetchIndexListIfInvalid()? Changed the function name this way. Also, I quickly ran the performance test that Andres sent previously on my MBA and I couldn't notice any difference in performance. master branch + patch: tps = 2034.339242 (including connections establishing) tps = 2034.406515 (excluding connections establishing) master branch: tps = 2083.172009 (including connections establishing) tps = 2083.237669 (excluding connections establishing) Thanks, -- Michael
Attachment
On 2013-06-22 12:50:52 +0900, Michael Paquier wrote: > On Fri, Jun 21, 2013 at 10:47 PM, Andres Freund <andres@2ndquadrant.com> wrote: > > Hm. Looking at how this is currently used - I am afraid it's not > > correct... the reason RelationGetIndexList() returns a copy is that > > cache invalidations will throw away that list. And you do index_open() > > while iterating over it which will accept invalidation messages. > > Mybe it's better to try using RelationGetIndexList directly and measure > > whether that has a measurable impact= > By looking at the comments of RelationGetIndexList:relcache.c, > actually the method of the patch is correct because in the event of a > shared cache invalidation, rd_indexvalid is set to 0 when the index > list is reset, so the index list would get recomputed even in the case > of shared mem invalidation. The problem I see is something else. Consider code like the following: RelationFetchIndexListIfInvalid(toastrel); foreach(lc, toastrel->rd_indexlist) toastidxs[i++] = index_open(lfirst_oid(lc), RowExclusiveLock); index_open calls relation_open calls LockRelationOid which does: if (res != LOCKACQUIRE_ALREADY_HELD) AcceptInvalidationMessages(); So, what might happen is that you open the first index, which accepts an invalidation message which in turn might delete the indexlist. Which means we would likely read invalid memory if there are two indexes. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
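To make the hazard Andres describes concrete, here is a minimal sketch; it is an illustration for this discussion, not code from the patch, and the two function names are invented. The first variant walks the relcache's own rd_indexlist, so the first index_open() may accept an invalidation message that rebuilds the relcache entry and frees the list still being walked; the second variant works on the private copy returned by RelationGetIndexList().

    #include "postgres.h"
    #include "access/genam.h"
    #include "nodes/pg_list.h"
    #include "storage/lock.h"
    #include "utils/rel.h"
    #include "utils/relcache.h"

    /* UNSAFE: iterates the relcache's own rd_indexlist, as the patch's macro
     * encourages.  Opening an index can accept invalidations that free it. */
    static void
    walk_toast_indexes_unsafe(Relation toastrel)
    {
        ListCell   *lc;

        if (toastrel->rd_indexvalid == 0)
            list_free(RelationGetIndexList(toastrel)); /* just populate rd_indexlist */

        foreach(lc, toastrel->rd_indexlist)
            index_close(index_open(lfirst_oid(lc), RowExclusiveLock),
                        RowExclusiveLock);
    }

    /* SAFE: RelationGetIndexList() hands back a copy owned by the caller, so
     * invalidations accepted while opening indexes cannot free it. */
    static void
    walk_toast_indexes_safe(Relation toastrel)
    {
        List       *indexlist = RelationGetIndexList(toastrel);
        ListCell   *lc;

        foreach(lc, indexlist)
            index_close(index_open(lfirst_oid(lc), RowExclusiveLock),
                        RowExclusiveLock);
        list_free(indexlist);
    }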
On Sat, Jun 22, 2013 at 10:34 PM, Andres Freund <andres@2ndquadrant.com> wrote: > On 2013-06-22 12:50:52 +0900, Michael Paquier wrote: >> On Fri, Jun 21, 2013 at 10:47 PM, Andres Freund <andres@2ndquadrant.com> wrote: >> > Hm. Looking at how this is currently used - I am afraid it's not >> > correct... the reason RelationGetIndexList() returns a copy is that >> > cache invalidations will throw away that list. And you do index_open() >> > while iterating over it which will accept invalidation messages. >> > Mybe it's better to try using RelationGetIndexList directly and measure >> > whether that has a measurable impact= >> By looking at the comments of RelationGetIndexList:relcache.c, >> actually the method of the patch is correct because in the event of a >> shared cache invalidation, rd_indexvalid is set to 0 when the index >> list is reset, so the index list would get recomputed even in the case >> of shared mem invalidation. > > The problem I see is something else. Consider code like the following: > > RelationFetchIndexListIfInvalid(toastrel); > foreach(lc, toastrel->rd_indexlist) > toastidxs[i++] = index_open(lfirst_oid(lc), RowExclusiveLock); > > index_open calls relation_open calls LockRelationOid which does: > if (res != LOCKACQUIRE_ALREADY_HELD) > AcceptInvalidationMessages(); > > So, what might happen is that you open the first index, which accepts an > invalidation message which in turn might delete the indexlist. Which > means we would likely read invalid memory if there are two indexes. And I imagine that you have the same problem even with RelationGetIndexList, not only RelationGetIndexListIfInvalid, because this would appear as long as you try to open more than 1 index with an index list. -- Michael
On 2013-06-22 22:45:26 +0900, Michael Paquier wrote: > On Sat, Jun 22, 2013 at 10:34 PM, Andres Freund <andres@2ndquadrant.com> wrote: > > On 2013-06-22 12:50:52 +0900, Michael Paquier wrote: > >> By looking at the comments of RelationGetIndexList:relcache.c, > >> actually the method of the patch is correct because in the event of a > >> shared cache invalidation, rd_indexvalid is set to 0 when the index > >> list is reset, so the index list would get recomputed even in the case > >> of shared mem invalidation. > > > > The problem I see is something else. Consider code like the following: > > > > RelationFetchIndexListIfInvalid(toastrel); > > foreach(lc, toastrel->rd_indexlist) > > toastidxs[i++] = index_open(lfirst_oid(lc), RowExclusiveLock); > > > > index_open calls relation_open calls LockRelationOid which does: > > if (res != LOCKACQUIRE_ALREADY_HELD) > > AcceptInvalidationMessages(); > > > > So, what might happen is that you open the first index, which accepts an > > invalidation message which in turn might delete the indexlist. Which > > means we would likely read invalid memory if there are two indexes. > And I imagine that you have the same problem even with > RelationGetIndexList, not only RelationGetIndexListIfInvalid, because > this would appear as long as you try to open more than 1 index with an > index list. No. RelationGetIndexList() returns a copy of the list for exactly that reason. The danger is not to see an outdated list - we should be protected by locks against that - but looking at uninitialized or reused memory. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund escribió: > On 2013-06-22 22:45:26 +0900, Michael Paquier wrote: > > And I imagine that you have the same problem even with > > RelationGetIndexList, not only RelationGetIndexListIfInvalid, because > > this would appear as long as you try to open more than 1 index with an > > index list. > > No. RelationGetIndexList() returns a copy of the list for exactly that > reason. The danger is not to see an outdated list - we should be > protected by locks against that - but looking at uninitialized or reused > memory. Are we doing this only to save some palloc traffic? Could we do this by, say, teaching list_copy() to have a special case for lists of ints and oids that allocates all the cells in a single palloc chunk? (This has the obvious problem that list_free no longer works, of course. But I think that specific problem can be easily fixed. Not sure if it causes more breakage elsewhere.) Alternatively, I guess we could grab an uncopied list, then copy the items individually into a locally allocated array, avoiding list_copy. We'd need to iterate differently than with foreach(). -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
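As an illustration of the second alternative Álvaro mentions (grab the uncopied list and copy its members into a locally allocated array before anything can accept invalidation messages), a rough sketch follows; the helper name toast_collect_index_oids() and its calling convention are invented for this example.

    #include "postgres.h"
    #include "nodes/pg_list.h"
    #include "utils/rel.h"
    #include "utils/relcache.h"

    /* Copy the toast relation's index OIDs into a plain array so that a later
     * cache invalidation cannot free what the caller iterates over.  The
     * caller walks the array rather than a List, then pfree()s it. */
    static Oid *
    toast_collect_index_oids(Relation toastrel, int *num_indexes)
    {
        ListCell   *lc;
        Oid        *oids;
        int         i = 0;

        /* Make sure rd_indexlist is populated; discard the copy we don't use. */
        if (toastrel->rd_indexvalid == 0)
            list_free(RelationGetIndexList(toastrel));

        *num_indexes = list_length(toastrel->rd_indexlist);
        oids = (Oid *) palloc(*num_indexes * sizeof(Oid));

        /* Nothing here calls index_open(), so no invalidation messages are
         * accepted while rd_indexlist is being read. */
        foreach(lc, toastrel->rd_indexlist)
            oids[i++] = lfirst_oid(lc);

        return oids;
    }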
OK. Please find an updated patch for the toast part. On Sat, Jun 22, 2013 at 10:48 PM, Andres Freund <andres@2ndquadrant.com> wrote: > On 2013-06-22 22:45:26 +0900, Michael Paquier wrote: >> On Sat, Jun 22, 2013 at 10:34 PM, Andres Freund <andres@2ndquadrant.com> wrote: >> > On 2013-06-22 12:50:52 +0900, Michael Paquier wrote: >> >> By looking at the comments of RelationGetIndexList:relcache.c, >> >> actually the method of the patch is correct because in the event of a >> >> shared cache invalidation, rd_indexvalid is set to 0 when the index >> >> list is reset, so the index list would get recomputed even in the case >> >> of shared mem invalidation. >> > >> > The problem I see is something else. Consider code like the following: >> > >> > RelationFetchIndexListIfInvalid(toastrel); >> > foreach(lc, toastrel->rd_indexlist) >> > toastidxs[i++] = index_open(lfirst_oid(lc), RowExclusiveLock); >> > >> > index_open calls relation_open calls LockRelationOid which does: >> > if (res != LOCKACQUIRE_ALREADY_HELD) >> > AcceptInvalidationMessages(); >> > >> > So, what might happen is that you open the first index, which accepts an >> > invalidation message which in turn might delete the indexlist. Which >> > means we would likely read invalid memory if there are two indexes. >> And I imagine that you have the same problem even with >> RelationGetIndexList, not only RelationGetIndexListIfInvalid, because >> this would appear as long as you try to open more than 1 index with an >> index list. > > No. RelationGetIndexList() returns a copy of the list for exactly that > reason. The danger is not to see an outdated list - we should be > protected by locks against that - but looking at uninitialized or reused > memory. OK, so I removed RelationGetIndexListIfInvalid (such things could be an optimization for another patch) and replaced it by calls to RelationGetIndexList to get a copy of rd_indexlist in a local list variable, list free'd when it is not necessary anymore. It looks that there is nothing left for this patch, no? -- Michael
Attachment
On Wed, Jun 19, 2013 at 9:50 AM, Michael Paquier <michael.paquier@gmail.com> wrote: > On Wed, Jun 19, 2013 at 12:36 AM, Fujii Masao <masao.fujii@gmail.com> wrote: >> On Tue, Jun 18, 2013 at 10:53 AM, Michael Paquier >> <michael.paquier@gmail.com> wrote: >>> An updated patch for the toast part is attached. >>> >>> On Tue, Jun 18, 2013 at 3:26 AM, Fujii Masao <masao.fujii@gmail.com> wrote: >>>> Here are the review comments of the removal_of_reltoastidxid patch. >>>> I've not completed the review yet, but I'd like to post the current comments >>>> before going to bed ;) >>>> >>>> *** a/src/backend/catalog/system_views.sql >>>> - pg_stat_get_blocks_fetched(X.oid) - >>>> - pg_stat_get_blocks_hit(X.oid) AS tidx_blks_read, >>>> - pg_stat_get_blocks_hit(X.oid) AS tidx_blks_hit >>>> + pg_stat_get_blocks_fetched(X.indrelid) - >>>> + pg_stat_get_blocks_hit(X.indrelid) AS tidx_blks_read, >>>> + pg_stat_get_blocks_hit(X.indrelid) AS tidx_blks_hit >>>> >>>> ISTM that X.indrelid indicates the TOAST table not the TOAST index. >>>> Shouldn't we use X.indexrelid instead of X.indrelid? >>> Indeed good catch! We need in this case the statistics on the index >>> and here I used the table OID. Btw, I also noticed that as multiple >>> indexes may be involved for a given toast relation, it makes sense to >>> actually calculate tidx_blks_read and tidx_blks_hit as the sum of all >>> stats of the indexes. >> >> Yep. You seem to need to change X.indexrelid to X.indrelid in GROUP clause. >> Otherwise, you may get two rows of the same table from pg_statio_all_tables. > I changed it a little bit in a different way in my latest patch by > adding a sum on all the indexes when getting tidx_blks stats. > >>>> doc/src/sgml/diskusage.sgml >>>>> There will be one index on the >>>>> <acronym>TOAST</> table, if present. >> >> + table (see <xref linkend="storage-toast">). There will be one valid index >> + on the <acronym>TOAST</> table, if present. There also might be indexes >> >> When I used gdb and tracked the code path of concurrent reindex patch, >> I found it's possible that more than one *valid* toast indexes appear. Those >> multiple valid toast indexes are viewable, for example, from pg_indexes. >> I'm not sure whether this is the bug of concurrent reindex patch. But >> if it's not, >> you seem to need to change the above description again. > Not sure about that. The latest code is made such as only one valid > index is present on the toast relation at the same time. > >> >>>> *** a/src/bin/pg_dump/pg_dump.c >>>> + "SELECT c.reltoastrelid, t.indexrelid " >>>> "FROM pg_catalog.pg_class c LEFT JOIN " >>>> - "pg_catalog.pg_class t ON (c.reltoastrelid = t.oid) " >>>> - "WHERE c.oid = '%u'::pg_catalog.oid;", >>>> + "pg_catalog.pg_index t ON (c.reltoastrelid = t.indrelid) " >>>> + "WHERE c.oid = '%u'::pg_catalog.oid AND t.indisvalid " >>>> + "LIMIT 1", >>>> >>>> Is there the case where TOAST table has more than one *valid* indexes? >>> I just rechecked the patch and is answer is no. The concurrent index >>> is set as valid inside the same transaction as swap. So only the >>> backend performing the swap will be able to see two valid toast >>> indexes at the same time. >> >> According to my quick gdb testing, this seems not to be true.... > Well, I have to disagree. I am not able to reproduce it. Which version > did you use? Here is what I get with the latest version of REINDEX > CONCURRENTLY patch... I checked with the following process: Sorry. This is my mistake. Regards, -- Fujii Masao
On Sun, Jun 23, 2013 at 3:34 PM, Michael Paquier <michael.paquier@gmail.com> wrote: > OK. Please find an updated patch for the toast part. > > On Sat, Jun 22, 2013 at 10:48 PM, Andres Freund <andres@2ndquadrant.com> wrote: >> On 2013-06-22 22:45:26 +0900, Michael Paquier wrote: >>> On Sat, Jun 22, 2013 at 10:34 PM, Andres Freund <andres@2ndquadrant.com> wrote: >>> > On 2013-06-22 12:50:52 +0900, Michael Paquier wrote: >>> >> By looking at the comments of RelationGetIndexList:relcache.c, >>> >> actually the method of the patch is correct because in the event of a >>> >> shared cache invalidation, rd_indexvalid is set to 0 when the index >>> >> list is reset, so the index list would get recomputed even in the case >>> >> of shared mem invalidation. >>> > >>> > The problem I see is something else. Consider code like the following: >>> > >>> > RelationFetchIndexListIfInvalid(toastrel); >>> > foreach(lc, toastrel->rd_indexlist) >>> > toastidxs[i++] = index_open(lfirst_oid(lc), RowExclusiveLock); >>> > >>> > index_open calls relation_open calls LockRelationOid which does: >>> > if (res != LOCKACQUIRE_ALREADY_HELD) >>> > AcceptInvalidationMessages(); >>> > >>> > So, what might happen is that you open the first index, which accepts an >>> > invalidation message which in turn might delete the indexlist. Which >>> > means we would likely read invalid memory if there are two indexes. >>> And I imagine that you have the same problem even with >>> RelationGetIndexList, not only RelationGetIndexListIfInvalid, because >>> this would appear as long as you try to open more than 1 index with an >>> index list. >> >> No. RelationGetIndexList() returns a copy of the list for exactly that >> reason. The danger is not to see an outdated list - we should be >> protected by locks against that - but looking at uninitialized or reused >> memory. > OK, so I removed RelationGetIndexListIfInvalid (such things could be > an optimization for another patch) and replaced it by calls to > RelationGetIndexList to get a copy of rd_indexlist in a local list > variable, list free'd when it is not necessary anymore. > > It looks that there is nothing left for this patch, no? Compile error ;) gcc -O0 -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -g -I../../../src/include -c -o index.o index.c index.c: In function 'index_constraint_create': index.c:1257: error: too many arguments to function 'index_update_stats' index.c: At top level: index.c:1785: error: conflicting types for 'index_update_stats' index.c:106: error: previous declaration of 'index_update_stats' was here index.c: In function 'index_update_stats': index.c:1881: error: 'FormData_pg_class' has no member named 'reltoastidxid' index.c:1883: error: 'FormData_pg_class' has no member named 'reltoastidxid' make[3]: *** [index.o] Error 1 make[2]: *** [catalog-recursive] Error 2 make[1]: *** [install-backend-recurse] Error 2 make: *** [install-src-recurse] Error 2 Regards, -- Fujii Masao
On Mon, Jun 24, 2013 at 7:22 AM, Fujii Masao <masao.fujii@gmail.com> wrote: > Compile error ;) It looks like filterdiff did not work correctly when generating the latest patch with context diffs; I cannot apply it cleanly either. This is perhaps due to a wrong manipulation on my part. Please try the attached patch, which has been generated as raw git output. It applies correctly with git apply; I just checked. -- Michael
Attachment
On 2013-06-24 07:46:34 +0900, Michael Paquier wrote: > On Mon, Jun 24, 2013 at 7:22 AM, Fujii Masao <masao.fujii@gmail.com> wrote: > > Compile error ;) > It looks like filterdiff did not work correctly when generating the > latest patch with context diffs, I cannot apply it cleanly wither. > This is perhaps due to a wrong manipulation from me. Please try the > attached that has been generated as a raw git output. It works > correctly with a git apply. I just checked. Did you check whether that introduces a performance regression? > /* ---------- > + * toast_get_valid_index > + * > + * Get the valid index of given toast relation. A toast relation can only > + * have one valid index at the same time. The lock taken on the index > + * relations is released at the end of this function call. > + */ > +Oid > +toast_get_valid_index(Oid toastoid, LOCKMODE lock) > +{ > + ListCell *lc; > + List *indexlist; > + int num_indexes, i = 0; > + Oid validIndexOid; > + Relation validIndexRel; > + Relation *toastidxs; > + Relation toastrel; > + > + /* Get the index list of relation */ > + toastrel = heap_open(toastoid, lock); > + indexlist = RelationGetIndexList(toastrel); > + num_indexes = list_length(indexlist); > + > + /* Open all the index relations */ > + toastidxs = (Relation *) palloc(num_indexes * sizeof(Relation)); > + foreach(lc, indexlist) > + toastidxs[i++] = index_open(lfirst_oid(lc), lock); > + > + /* Fetch valid toast index */ > + validIndexRel = toast_index_fetch_valid(toastidxs, num_indexes); > + validIndexOid = RelationGetRelid(validIndexRel); > + > + /* Close all the index relations */ > + for (i = 0; i < num_indexes; i++) > + index_close(toastidxs[i], lock); > + pfree(toastidxs); > + list_free(indexlist); > + > + heap_close(toastrel, lock); > + return validIndexOid; > +} Just to make sure, could you check we've found a valid index? > static bool > -toastrel_valueid_exists(Relation toastrel, Oid valueid) > +toastrel_valueid_exists(Relation toastrel, Oid valueid, LOCKMODE lockmode) > { > bool result = false; > ScanKeyData toastkey; > SysScanDesc toastscan; > + int i = 0; > + int num_indexes; > + Relation *toastidxs; > + Relation validtoastidx; > + ListCell *lc; > + List *indexlist; > + > + /* Ensure that the list of indexes of toast relation is computed */ > + indexlist = RelationGetIndexList(toastrel); > + num_indexes = list_length(indexlist); > + > + /* Open each index relation necessary */ > + toastidxs = (Relation *) palloc(num_indexes * sizeof(Relation)); > + foreach(lc, indexlist) > + toastidxs[i++] = index_open(lfirst_oid(lc), lockmode); > + > + /* Fetch a valid index relation */ > + validtoastidx = toast_index_fetch_valid(toastidxs, num_indexes); Those 10 lines are repeated multiple times, in different functions. Maybe move them into toast_index_fetch_valid and rename that to Relation * toast_open_indexes(Relation toastrel, LOCKMODE mode, size_t *numindexes, size_t valididx); That way we also wouldn't fetch/copy the indexlist twice in some functions. > + /* Clean up */ > + for (i = 0; i < num_indexes; i++) > + index_close(toastidxs[i], lockmode); > + list_free(indexlist); > + pfree(toastidxs); The indexlist could already be freed inside the function proposed above... 
> diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c > index 8294b29..2b777da 100644 > --- a/src/backend/commands/tablecmds.c > +++ b/src/backend/commands/tablecmds.c > @@ -8782,7 +8783,13 @@ ATExecSetTableSpace(Oid tableOid, Oid newTableSpace, LOCKMODE lockmode) > errmsg("cannot move temporary tables of other sessions"))); > > + foreach(lc, reltoastidxids) > + { > + Oid toastidxid = lfirst_oid(lc); > + if (OidIsValid(toastidxid)) > + ATExecSetTableSpace(toastidxid, newTableSpace, lockmode); > + } Copy & pasted OidIsValid(), shouldn't be necessary anymore. Otherwise I think there's not really much left to be done. Fujii? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes: > Otherwise I think there's not really much left to be done. Fujii? Well, other than the fact that we've not got MVCC catalog scans yet. regards, tom lane
On 2013-06-24 09:57:24 -0400, Tom Lane wrote: > Andres Freund <andres@2ndquadrant.com> writes: > > Otherwise I think there's not really much left to be done. Fujii? > > Well, other than the fact that we've not got MVCC catalog scans yet. That statement was only about about the patch dealing the removal of reltoastidxid. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Jun 24, 2013 at 7:39 PM, Andres Freund <andres@2ndquadrant.com> wrote: > On 2013-06-24 07:46:34 +0900, Michael Paquier wrote: >> On Mon, Jun 24, 2013 at 7:22 AM, Fujii Masao <masao.fujii@gmail.com> wrote: >> > Compile error ;) >> It looks like filterdiff did not work correctly when generating the >> latest patch with context diffs, I cannot apply it cleanly wither. >> This is perhaps due to a wrong manipulation from me. Please try the >> attached that has been generated as a raw git output. It works >> correctly with a git apply. I just checked. > > Did you check whether that introduces a performance regression? > > >> /* ---------- >> + * toast_get_valid_index >> + * >> + * Get the valid index of given toast relation. A toast relation can only >> + * have one valid index at the same time. The lock taken on the index >> + * relations is released at the end of this function call. >> + */ >> +Oid >> +toast_get_valid_index(Oid toastoid, LOCKMODE lock) >> +{ >> + ListCell *lc; >> + List *indexlist; >> + int num_indexes, i = 0; >> + Oid validIndexOid; >> + Relation validIndexRel; >> + Relation *toastidxs; >> + Relation toastrel; >> + >> + /* Get the index list of relation */ >> + toastrel = heap_open(toastoid, lock); >> + indexlist = RelationGetIndexList(toastrel); >> + num_indexes = list_length(indexlist); >> + >> + /* Open all the index relations */ >> + toastidxs = (Relation *) palloc(num_indexes * sizeof(Relation)); >> + foreach(lc, indexlist) >> + toastidxs[i++] = index_open(lfirst_oid(lc), lock); >> + >> + /* Fetch valid toast index */ >> + validIndexRel = toast_index_fetch_valid(toastidxs, num_indexes); >> + validIndexOid = RelationGetRelid(validIndexRel); >> + >> + /* Close all the index relations */ >> + for (i = 0; i < num_indexes; i++) >> + index_close(toastidxs[i], lock); >> + pfree(toastidxs); >> + list_free(indexlist); >> + >> + heap_close(toastrel, lock); >> + return validIndexOid; >> +} > > Just to make sure, could you check we've found a valid index? > >> static bool >> -toastrel_valueid_exists(Relation toastrel, Oid valueid) >> +toastrel_valueid_exists(Relation toastrel, Oid valueid, LOCKMODE lockmode) >> { >> bool result = false; >> ScanKeyData toastkey; >> SysScanDesc toastscan; >> + int i = 0; >> + int num_indexes; >> + Relation *toastidxs; >> + Relation validtoastidx; >> + ListCell *lc; >> + List *indexlist; >> + >> + /* Ensure that the list of indexes of toast relation is computed */ >> + indexlist = RelationGetIndexList(toastrel); >> + num_indexes = list_length(indexlist); >> + >> + /* Open each index relation necessary */ >> + toastidxs = (Relation *) palloc(num_indexes * sizeof(Relation)); >> + foreach(lc, indexlist) >> + toastidxs[i++] = index_open(lfirst_oid(lc), lockmode); >> + >> + /* Fetch a valid index relation */ >> + validtoastidx = toast_index_fetch_valid(toastidxs, num_indexes); > > Those 10 lines are repeated multiple times, in different > functions. Maybe move them into toast_index_fetch_valid and rename that > to > Relation * > toast_open_indexes(Relation toastrel, LOCKMODE mode, size_t *numindexes, size_t valididx); > > That way we also wouldn't fetch/copy the indexlist twice in some > functions. > >> + /* Clean up */ >> + for (i = 0; i < num_indexes; i++) >> + index_close(toastidxs[i], lockmode); >> + list_free(indexlist); >> + pfree(toastidxs); > > The indexlist could already be freed inside the function proposed > above... 
> > >> diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c >> index 8294b29..2b777da 100644 >> --- a/src/backend/commands/tablecmds.c >> +++ b/src/backend/commands/tablecmds.c >> @@ -8782,7 +8783,13 @@ ATExecSetTableSpace(Oid tableOid, Oid newTableSpace, LOCKMODE lockmode) >> errmsg("cannot move temporary tables of other sessions"))); >> > >> + foreach(lc, reltoastidxids) >> + { >> + Oid toastidxid = lfirst_oid(lc); >> + if (OidIsValid(toastidxid)) >> + ATExecSetTableSpace(toastidxid, newTableSpace, lockmode); >> + } > > Copy & pasted OidIsValid(), shouldn't be necessary anymore. > > > Otherwise I think there's not really much left to be done. Fujii? Yep, will check. Regards, -- Fujii Masao
On Mon, Jun 24, 2013 at 11:06 PM, Andres Freund <andres@2ndquadrant.com> wrote: > On 2013-06-24 09:57:24 -0400, Tom Lane wrote: >> Andres Freund <andres@2ndquadrant.com> writes: >> > Otherwise I think there's not really much left to be done. Fujii? >> >> Well, other than the fact that we've not got MVCC catalog scans yet. > > That statement was only about about the patch dealing the removal of > reltoastidxid. Partially my mistake. It is not that obvious just based on the name of this thread, so I should have moved the review of this particular patch to another thread. -- Michael
Patch updated according to comments. On Mon, Jun 24, 2013 at 7:39 PM, Andres Freund <andres@2ndquadrant.com> wrote: > On 2013-06-24 07:46:34 +0900, Michael Paquier wrote: >> On Mon, Jun 24, 2013 at 7:22 AM, Fujii Masao <masao.fujii@gmail.com> wrote: >> > Compile error ;) >> It looks like filterdiff did not work correctly when generating the >> latest patch with context diffs, I cannot apply it cleanly wither. >> This is perhaps due to a wrong manipulation from me. Please try the >> attached that has been generated as a raw git output. It works >> correctly with a git apply. I just checked. > > Did you check whether that introduces a performance regression? I don't notice any difference. Here are some results on one of my boxes with a single client using your previous test case. master: tps = 1753.374740 (including connections establishing) tps = 1753.505288 (excluding connections establishing) master + patch: tps = 1738.354976 (including connections establishing) tps = 1738.482424 (excluding connections establishing) >> /* ---------- >> + * toast_get_valid_index >> + * >> + * Get the valid index of given toast relation. A toast relation can only >> + * have one valid index at the same time. The lock taken on the index >> + * relations is released at the end of this function call. >> + */ >> +Oid >> +toast_get_valid_index(Oid toastoid, LOCKMODE lock) >> +{ >> + ListCell *lc; >> + List *indexlist; >> + int num_indexes, i = 0; >> + Oid validIndexOid; >> + Relation validIndexRel; >> + Relation *toastidxs; >> + Relation toastrel; >> + >> + /* Get the index list of relation */ >> + toastrel = heap_open(toastoid, lock); >> + indexlist = RelationGetIndexList(toastrel); >> + num_indexes = list_length(indexlist); >> + >> + /* Open all the index relations */ >> + toastidxs = (Relation *) palloc(num_indexes * sizeof(Relation)); >> + foreach(lc, indexlist) >> + toastidxs[i++] = index_open(lfirst_oid(lc), lock); >> + >> + /* Fetch valid toast index */ >> + validIndexRel = toast_index_fetch_valid(toastidxs, num_indexes); >> + validIndexOid = RelationGetRelid(validIndexRel); >> + >> + /* Close all the index relations */ >> + for (i = 0; i < num_indexes; i++) >> + index_close(toastidxs[i], lock); >> + pfree(toastidxs); >> + list_free(indexlist); >> + >> + heap_close(toastrel, lock); >> + return validIndexOid; >> +} > > Just to make sure, could you check we've found a valid index? Added an elog(ERROR) if valid index is not found. > >> static bool >> -toastrel_valueid_exists(Relation toastrel, Oid valueid) >> +toastrel_valueid_exists(Relation toastrel, Oid valueid, LOCKMODE lockmode) >> { >> bool result = false; >> ScanKeyData toastkey; >> SysScanDesc toastscan; >> + int i = 0; >> + int num_indexes; >> + Relation *toastidxs; >> + Relation validtoastidx; >> + ListCell *lc; >> + List *indexlist; >> + >> + /* Ensure that the list of indexes of toast relation is computed */ >> + indexlist = RelationGetIndexList(toastrel); >> + num_indexes = list_length(indexlist); >> + >> + /* Open each index relation necessary */ >> + toastidxs = (Relation *) palloc(num_indexes * sizeof(Relation)); >> + foreach(lc, indexlist) >> + toastidxs[i++] = index_open(lfirst_oid(lc), lockmode); >> + >> + /* Fetch a valid index relation */ >> + validtoastidx = toast_index_fetch_valid(toastidxs, num_indexes); > > Those 10 lines are repeated multiple times, in different > functions. 
Maybe move them into toast_index_fetch_valid and rename that > to > Relation * > toast_open_indexes(Relation toastrel, LOCKMODE mode, size_t *numindexes, size_t valididx); > > That way we also wouldn't fetch/copy the indexlist twice in some > functions. Good suggestion, this makes the code cleaner. However I didn't use exactly what you suggested: static int toast_open_indexes(Relation toastrel, LOCKMODE lock, Relation **toastidxs, int *num_indexes); static void toast_close_indexes(Relation *toastidxs, int num_indexes, LOCKMODE lock); toast_open_indexes returns the position of valid index in the array of toast indexes. This looked clearer to me when coding. > >> + /* Clean up */ >> + for (i = 0; i < num_indexes; i++) >> + index_close(toastidxs[i], lockmode); >> + list_free(indexlist); >> + pfree(toastidxs); > > The indexlist could already be freed inside the function proposed > above... Done. >> diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c >> index 8294b29..2b777da 100644 >> --- a/src/backend/commands/tablecmds.c >> +++ b/src/backend/commands/tablecmds.c >> @@ -8782,7 +8783,13 @@ ATExecSetTableSpace(Oid tableOid, Oid newTableSpace, LOCKMODE lockmode) >> errmsg("cannot move temporary tables of other sessions"))); >> > >> + foreach(lc, reltoastidxids) >> + { >> + Oid toastidxid = lfirst_oid(lc); >> + if (OidIsValid(toastidxid)) >> + ATExecSetTableSpace(toastidxid, newTableSpace, lockmode); >> + } > > Copy & pasted OidIsValid(), shouldn't be necessary anymore. Yep, indeed. If there are no indexes list would be simply empty. Thanks for your patience. -- Michael
Attachment
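As an illustration of the calling convention described in the message above, a caller in tuptoaster.c would use the two static helpers roughly as sketched below; the surrounding function and comments are invented, only the helper prototypes come from the patch.

    /* Sketch only: assumes it lives in tuptoaster.c next to the static helpers
     * toast_open_indexes() and toast_close_indexes() described above. */
    static void
    example_toast_index_usage(Relation toastrel)
    {
        Relation   *toastidxs;
        int         num_indexes;
        int         validIndex;

        /* Open every index of the toast relation; the return value is the
         * position of the single valid index in the array. */
        validIndex = toast_open_indexes(toastrel, RowExclusiveLock,
                                        &toastidxs, &num_indexes);

        /*
         * ... scan using toastidxs[validIndex]; on insertion, loop over all
         * num_indexes entries so that indexes being rebuilt concurrently
         * (ready but not yet valid) are kept up to date ...
         */

        toast_close_indexes(toastidxs, num_indexes, RowExclusiveLock);
    }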
On Tue, Jun 25, 2013 at 8:15 AM, Michael Paquier <michael.paquier@gmail.com> wrote: > Patch updated according to comments. Thanks for updating the patch! When I ran VACUUM FULL, I got the following error. ERROR: attempt to apply a mapping to unmapped relation 16404 STATEMENT: vacuum full; Could you let me clear why toast_save_datum needs to update even invalid toast index? It's required only for REINDEX CONCURRENTLY? @@ -1573,7 +1648,7 @@ toastid_valueid_exists(Oid toastrelid, Oid valueid) toastrel = heap_open(toastrelid, AccessShareLock); - result = toastrel_valueid_exists(toastrel, valueid); + result = toastrel_valueid_exists(toastrel, valueid, AccessShareLock); toastid_valueid_exists() is used only in toast_save_datum(). So we should use RowExclusiveLock here rather than AccessShareLock? + * toast_open_indexes + * + * Get an array of index relations associated to the given toast relation + * and return as well the position of the valid index used by the toast + * relation in this array. It is the responsability of the caller of this Typo: responsibility toast_open_indexes(Relation toastrel, + LOCKMODE lock, + Relation **toastidxs, + int *num_indexes) +{ + int i = 0; + int res = 0; + bool found = false; + List *indexlist; + ListCell *lc; + + /* Get index list of relation */ + indexlist = RelationGetIndexList(toastrel); What about adding the assertion which checks that the return value of RelationGetIndexList() is not NIL? When I ran pg_upgrade for the upgrade from 9.2 to HEAD (with patch), I got the following error. Without the patch, that succeeded. command: "/dav/reindex/bin/pg_dump" --host "/dav/reindex" --port 50432 --username "postgres" --schema-only --quote-all-identifiers --binary-upgrade --format=custom --file="pg_upgrade_dump_12270.custom" "postgres" >> "pg_upgrade_dump_12270.log" 2>&1 pg_dump: query returned 0 rows instead of one: SELECT c.reltoastrelid, t.indexrelid FROM pg_catalog.pg_class c LEFT JOIN pg_catalog.pg_index t ON (c.reltoastrelid = t.indrelid) WHERE c.oid = '16390'::pg_catalog.oid AND t.indisvalid; Regards, -- Fujii Masao
On Wed, Jun 26, 2013 at 1:06 AM, Fujii Masao <masao.fujii@gmail.com> wrote: > Thanks for updating the patch! And thanks for taking time to look at that. I updated the patch according to your comments, except for the VACUUM FULL problem. Please see patch attached and below for more details. > When I ran VACUUM FULL, I got the following error. > > ERROR: attempt to apply a mapping to unmapped relation 16404 > STATEMENT: vacuum full; This can be reproduced when doing a vacuum full on pg_proc, pg_shdescription or pg_db_role_setting for example, or relations that have no relfilenode (mapped catalogs), and a toast relation. I still have no idea what is happening here but I am looking at it. As this patch removes reltoastidxid, could that removal have effect on the relation mapping of mapped catalogs? Does someone have an idea? > Could you let me clear why toast_save_datum needs to update even invalid toast > index? It's required only for REINDEX CONCURRENTLY? Because an invalid index might be marked as indisready, so ready to receive inserts. Yes this is a requirement for REINDEX CONCURRENTLY, and in a more general way a requirement for a relation that includes in rd_indexlist indexes that are live, ready but not valid. Just based on this remark I spotted a bug in my patch for tuptoaster.c where we could insert a new index tuple entry in toast_save_datum for an index live but not ready. Fixed that by adding an additional check to the flag indisready before calling index_insert. > @@ -1573,7 +1648,7 @@ toastid_valueid_exists(Oid toastrelid, Oid valueid) > > toastrel = heap_open(toastrelid, AccessShareLock); > > - result = toastrel_valueid_exists(toastrel, valueid); > + result = toastrel_valueid_exists(toastrel, valueid, AccessShareLock); > > toastid_valueid_exists() is used only in toast_save_datum(). So we should use > RowExclusiveLock here rather than AccessShareLock? Makes sense. > + * toast_open_indexes > + * > + * Get an array of index relations associated to the given toast relation > + * and return as well the position of the valid index used by the toast > + * relation in this array. It is the responsability of the caller of this > > Typo: responsibility Done. > toast_open_indexes(Relation toastrel, > + LOCKMODE lock, > + Relation **toastidxs, > + int *num_indexes) > +{ > + int i = 0; > + int res = 0; > + bool found = false; > + List *indexlist; > + ListCell *lc; > + > + /* Get index list of relation */ > + indexlist = RelationGetIndexList(toastrel); > > What about adding the assertion which checks that the return value > of RelationGetIndexList() is not NIL? Done. > When I ran pg_upgrade for the upgrade from 9.2 to HEAD (with patch), > I got the following error. Without the patch, that succeeded. > > command: "/dav/reindex/bin/pg_dump" --host "/dav/reindex" --port 50432 > --username "postgres" --schema-only --quote-all-identifiers > --binary-upgrade --format=custom > --file="pg_upgrade_dump_12270.custom" "postgres" >> > "pg_upgrade_dump_12270.log" 2>&1 > pg_dump: query returned 0 rows instead of one: SELECT c.reltoastrelid, > t.indexrelid FROM pg_catalog.pg_class c LEFT JOIN pg_catalog.pg_index > t ON (c.reltoastrelid = t.indrelid) WHERE c.oid = > '16390'::pg_catalog.oid AND t.indisvalid; This issue is reproducible easily by having more than 1 table using toast indexes in the cluster to be upgraded. The error was on pg_dump side when trying to do a binary upgrade. 
In order to fix that, I changed the code of binary_upgrade_set_pg_class_oids() in pg_dump.c to fetch the index associated with a toast relation only if there is a toast relation. This adds one extra query for each relation that has a toast table, but makes the code clearer. Note that I checked pg_upgrade down to 8.4... -- Michael
Attachment
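To illustrate the indisready check Michael describes above (toast_save_datum must insert into every index that is ready to receive inserts, but skip an index that is merely live), here is a rough sketch of the shape such a loop can take; the variable names mimic tuptoaster.c but this is not the patch's exact code.

    #include "postgres.h"
    #include "access/genam.h"
    #include "access/htup.h"
    #include "catalog/pg_index.h"
    #include "utils/rel.h"

    /* Sketch: insert a freshly formed toast tuple into every index flagged
     * indisready.  An index that is live but not yet ready (as can happen
     * while REINDEX CONCURRENTLY is running) must not receive inserts. */
    static void
    toast_insert_into_ready_indexes(Relation *toastidxs, int num_indexes,
                                    Datum *t_values, bool *t_isnull,
                                    HeapTuple toasttup, Relation toastrel)
    {
        int         i;

        for (i = 0; i < num_indexes; i++)
        {
            Form_pg_index indform = toastidxs[i]->rd_index;

            if (!indform->indisready)
                continue;

            index_insert(toastidxs[i], t_values, t_isnull,
                         &(toasttup->t_self), toastrel,
                         indform->indisunique ?
                         UNIQUE_CHECK_YES : UNIQUE_CHECK_NO);
        }
    }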
On 2013-06-28 16:30:16 +0900, Michael Paquier wrote: > > When I ran VACUUM FULL, I got the following error. > > > > ERROR: attempt to apply a mapping to unmapped relation 16404 > > STATEMENT: vacuum full; > This can be reproduced when doing a vacuum full on pg_proc, > pg_shdescription or pg_db_role_setting for example, or relations that > have no relfilenode (mapped catalogs), and a toast relation. I still > have no idea what is happening here but I am looking at it. As this > patch removes reltoastidxid, could that removal have effect on the > relation mapping of mapped catalogs? Does someone have an idea? I'd guess you broke "swap_toast_by_content" case in cluster.c? We cannot change the oid of a mapped relation (including indexes) since pg_class in other databases wouldn't get the news. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Jun 28, 2013 at 4:52 PM, Andres Freund <andres@2ndquadrant.com> wrote: > On 2013-06-28 16:30:16 +0900, Michael Paquier wrote: >> > When I ran VACUUM FULL, I got the following error. >> > >> > ERROR: attempt to apply a mapping to unmapped relation 16404 >> > STATEMENT: vacuum full; >> This can be reproduced when doing a vacuum full on pg_proc, >> pg_shdescription or pg_db_role_setting for example, or relations that >> have no relfilenode (mapped catalogs), and a toast relation. I still >> have no idea what is happening here but I am looking at it. As this >> patch removes reltoastidxid, could that removal have effect on the >> relation mapping of mapped catalogs? Does someone have an idea? > > I'd guess you broke "swap_toast_by_content" case in cluster.c? We cannot > change the oid of a mapped relation (including indexes) since pg_class > in other databases wouldn't get the news. Yeah, I thought that something was broken in swap_relation_files, but after comparing the code path taken by my code with master, and the different function calls, I can't find any difference. I suspect there is something wrong in tuptoaster.c with the way toast index relations are opened in order to get the OIDs to be swapped... but so far I have found nothing; I am just not sure... -- Michael
Hi all, Please find attached an updated version of the patch removing reltoastidxid (with and w/o context diffs), patch fixing the vacuum full issue. With this fix, all the comments are addressed. On Fri, Jun 28, 2013 at 5:07 PM, Michael Paquier <michael.paquier@gmail.com> wrote: > On Fri, Jun 28, 2013 at 4:52 PM, Andres Freund <andres@2ndquadrant.com> wrote: >> On 2013-06-28 16:30:16 +0900, Michael Paquier wrote: >>> > When I ran VACUUM FULL, I got the following error. >>> > >>> > ERROR: attempt to apply a mapping to unmapped relation 16404 >>> > STATEMENT: vacuum full; >>> This can be reproduced when doing a vacuum full on pg_proc, >>> pg_shdescription or pg_db_role_setting for example, or relations that >>> have no relfilenode (mapped catalogs), and a toast relation. I still >>> have no idea what is happening here but I am looking at it. As this >>> patch removes reltoastidxid, could that removal have effect on the >>> relation mapping of mapped catalogs? Does someone have an idea? >> >> I'd guess you broke "swap_toast_by_content" case in cluster.c? We cannot >> change the oid of a mapped relation (including indexes) since pg_class >> in other databases wouldn't get the news. > Yeah, I thought that something was broken in swap_relation_files, but > after comparing the code path taken by my code and master, and the > different function calls I can't find any difference. I'm assuming > that there is something wrong in tuptoaster.c in the fact of opening > toast index relations in order to get the Oids to be swapped... But so > far nothing I am just not sure... The error was indeed in swap_relation_files, when trying to swap toast indexes. The code path doing the toast index swap was taken not for toast relations but for their parent relations, creating weird behavior for mapped catalogs at relation cache level it seems. Regards, -- Michael
Attachment
On Mon, Jul 1, 2013 at 9:31 AM, Michael Paquier <michael.paquier@gmail.com> wrote: > Hi all, > > Please find attached an updated version of the patch removing > reltoastidxid (with and w/o context diffs), patch fixing the vacuum > full issue. With this fix, all the comments are addressed. Thanks for updating the patch! I have one question related to VACUUM FULL problem. What happens if we run VACUUM FULL when there is an invalid toast index? The invalid toast index is rebuilt and marked as valid, i.e., there can be multiple valid toast indexes? Regards, -- Fujii Masao
On Tue, Jul 2, 2013 at 7:36 AM, Fujii Masao <masao.fujii@gmail.com> wrote: > On Mon, Jul 1, 2013 at 9:31 AM, Michael Paquier > <michael.paquier@gmail.com> wrote: >> Hi all, >> >> Please find attached an updated version of the patch removing >> reltoastidxid (with and w/o context diffs), patch fixing the vacuum >> full issue. With this fix, all the comments are addressed. > > Thanks for updating the patch! > > I have one question related to VACUUM FULL problem. What happens > if we run VACUUM FULL when there is an invalid toast index? The invalid > toast index is rebuilt and marked as valid, i.e., there can be multiple valid > toast indexes? The invalid toast indexes are not rebuilt. With the design of this patch, toast relations can only have one valid index at the same time, and this is also the path taken by REINDEX CONCURRENTLY for toast relations. This process is managed by this code in cluster.c; only the valid index of a toast relation is taken into account when rebuilding relations:

***************
*** 1393,1410 ****
  swap_relation_files(Oid r1, Oid r2, bool target_is_pg_class,

  	/*
  	 * If we're swapping two toast tables by content, do the same for their
! 	 * indexes.
  	 */
  	if (swap_toast_by_content &&
! 		relform1->reltoastidxid && relform2->reltoastidxid)
! 		swap_relation_files(relform1->reltoastidxid,
! 							relform2->reltoastidxid,
  							target_is_pg_class,
  							swap_toast_by_content,
  							is_internal,
  							InvalidTransactionId,
  							InvalidMultiXactId,
  							mapped_tables);

  	/* Clean up. */
  	heap_freetuple(reltup1);
--- 1392,1421 ----

  	/*
  	 * If we're swapping two toast tables by content, do the same for their
! 	 * valid index. The swap can actually be safely done only if the relations
! 	 * have indexes.
  	 */
  	if (swap_toast_by_content &&
! 		relform1->relkind == RELKIND_TOASTVALUE &&
! 		relform2->relkind == RELKIND_TOASTVALUE)
! 	{
! 		Oid			toastIndex1, toastIndex2;
!
! 		/* Get valid index for each relation */
! 		toastIndex1 = toast_get_valid_index(r1,
! 											AccessExclusiveLock);
! 		toastIndex2 = toast_get_valid_index(r2,
! 											AccessExclusiveLock);
!
! 		swap_relation_files(toastIndex1,
! 							toastIndex2,
  							target_is_pg_class,
  							swap_toast_by_content,
  							is_internal,
  							InvalidTransactionId,
  							InvalidMultiXactId,
  							mapped_tables);
+ 	}

Regards, -- Michael
On Fri, Jun 28, 2013 at 4:30 PM, Michael Paquier <michael.paquier@gmail.com> wrote: > On Wed, Jun 26, 2013 at 1:06 AM, Fujii Masao <masao.fujii@gmail.com> wrote: >> Thanks for updating the patch! > And thanks for taking time to look at that. I updated the patch > according to your comments, except for the VACUUM FULL problem. Please > see patch attached and below for more details. > >> When I ran VACUUM FULL, I got the following error. >> >> ERROR: attempt to apply a mapping to unmapped relation 16404 >> STATEMENT: vacuum full; > This can be reproduced when doing a vacuum full on pg_proc, > pg_shdescription or pg_db_role_setting for example, or relations that > have no relfilenode (mapped catalogs), and a toast relation. I still > have no idea what is happening here but I am looking at it. As this > patch removes reltoastidxid, could that removal have effect on the > relation mapping of mapped catalogs? Does someone have an idea? > >> Could you let me clear why toast_save_datum needs to update even invalid toast >> index? It's required only for REINDEX CONCURRENTLY? > Because an invalid index might be marked as indisready, so ready to > receive inserts. Yes this is a requirement for REINDEX CONCURRENTLY, > and in a more general way a requirement for a relation that includes > in rd_indexlist indexes that are live, ready but not valid. Just based > on this remark I spotted a bug in my patch for tuptoaster.c where we > could insert a new index tuple entry in toast_save_datum for an index > live but not ready. Fixed that by adding an additional check to the > flag indisready before calling index_insert. > >> @@ -1573,7 +1648,7 @@ toastid_valueid_exists(Oid toastrelid, Oid valueid) >> >> toastrel = heap_open(toastrelid, AccessShareLock); >> >> - result = toastrel_valueid_exists(toastrel, valueid); >> + result = toastrel_valueid_exists(toastrel, valueid, AccessShareLock); >> >> toastid_valueid_exists() is used only in toast_save_datum(). So we should use >> RowExclusiveLock here rather than AccessShareLock? > Makes sense. > >> + * toast_open_indexes >> + * >> + * Get an array of index relations associated to the given toast relation >> + * and return as well the position of the valid index used by the toast >> + * relation in this array. It is the responsability of the caller of this >> >> Typo: responsibility > Done. > >> toast_open_indexes(Relation toastrel, >> + LOCKMODE lock, >> + Relation **toastidxs, >> + int *num_indexes) >> +{ >> + int i = 0; >> + int res = 0; >> + bool found = false; >> + List *indexlist; >> + ListCell *lc; >> + >> + /* Get index list of relation */ >> + indexlist = RelationGetIndexList(toastrel); >> >> What about adding the assertion which checks that the return value >> of RelationGetIndexList() is not NIL? > Done. > >> When I ran pg_upgrade for the upgrade from 9.2 to HEAD (with patch), >> I got the following error. Without the patch, that succeeded. 
>> >> command: "/dav/reindex/bin/pg_dump" --host "/dav/reindex" --port 50432 >> --username "postgres" --schema-only --quote-all-identifiers >> --binary-upgrade --format=custom >> --file="pg_upgrade_dump_12270.custom" "postgres" >> >> "pg_upgrade_dump_12270.log" 2>&1 >> pg_dump: query returned 0 rows instead of one: SELECT c.reltoastrelid, >> t.indexrelid FROM pg_catalog.pg_class c LEFT JOIN pg_catalog.pg_index >> t ON (c.reltoastrelid = t.indrelid) WHERE c.oid = >> '16390'::pg_catalog.oid AND t.indisvalid; > This issue is reproducible easily by having more than 1 table using > toast indexes in the cluster to be upgraded. The error was on pg_dump > side when trying to do a binary upgrade. In order to fix that, I > changed the code binary_upgrade_set_pg_class_oids:pg_dump.c to fetch > the index associated to a toast relation only if there is a toast > relation. This adds one extra step in the process for each having a > toast relation, but makes the code clearer. Note that I checked > pg_upgrade down to 8.4... Why did you remove the check of indisvalid from the --binary-upgrade SQL? Without this check, if there is the invalid toast index, more than one rows are returned and ExecuteSqlQueryForSingleRow() would cause the error. + foreach(lc, indexlist) + *toastidxs[i++] = index_open(lfirst_oid(lc), lock); *toastidxs[i++] should be (*toastidxs)[i++]. Otherwise, segmentation fault can happen. For now I've not found any other big problem except the above. Regards, -- Fujii Masao
On Wed, Jul 3, 2013 at 5:22 AM, Fujii Masao <masao.fujii@gmail.com> wrote: > Why did you remove the check of indisvalid from the --binary-upgrade SQL? > Without this check, if there is the invalid toast index, more than one rows are > returned and ExecuteSqlQueryForSingleRow() would cause the error. > > + foreach(lc, indexlist) > + *toastidxs[i++] = index_open(lfirst_oid(lc), lock); > > *toastidxs[i++] should be (*toastidxs)[i++]. Otherwise, segmentation fault can > happen. > > For now I've not found any other big problem except the above. OK cool, updated version attached. If you guys think that the attached version is fine (only the reltoastidxid removal part), perhaps it would be worth committing it as Robert also committed the MVCC catalog patch today. So we would be able to focus on the core feature asap with the 2nd patch, and the removal of AccessExclusiveLock at the swap step. Regards, -- Michael
Attachment
On Wed, Jul 3, 2013 at 5:43 AM, Michael Paquier <michael.paquier@gmail.com> wrote: > On Wed, Jul 3, 2013 at 5:22 AM, Fujii Masao <masao.fujii@gmail.com> wrote: >> Why did you remove the check of indisvalid from the --binary-upgrade SQL? >> Without this check, if there is the invalid toast index, more than one rows are >> returned and ExecuteSqlQueryForSingleRow() would cause the error. >> >> + foreach(lc, indexlist) >> + *toastidxs[i++] = index_open(lfirst_oid(lc), lock); >> >> *toastidxs[i++] should be (*toastidxs)[i++]. Otherwise, segmentation fault can >> happen. >> >> For now I've not found any other big problem except the above.

system_views.sql
- GROUP BY C.oid, N.nspname, C.relname, T.oid, X.oid;
+ GROUP BY C.oid, N.nspname, C.relname, T.oid, X.indexrelid;

I found another problem. X.indexrelid should be X.indrelid. Otherwise, when there is an invalid toast index, more than one row is returned for the same relation. > OK cool, updated version attached. If you guys think that the attached > version is fine (only the reltoasyidxid removal part), perhaps it > would be worth committing it as Robert also committed the MVCC catalog > patch today. So we would be able to focus on the core feature asap > with the 2nd patch, and the removal of AccessExclusiveLock at swap > step. Yep, will do. Maybe today. Regards, -- Fujii Masao
Updated version of this patch attached. At the same time I changed toastrel_valueid_exists back to its former shape by removing the extra LOCKMODE argument I had added to pass a lock down to toast_open_indexes and toast_close_indexes, as only RowExclusiveLock is used in all those places. On Wed, Jul 3, 2013 at 5:51 AM, Fujii Masao <masao.fujii@gmail.com> wrote: > On Wed, Jul 3, 2013 at 5:43 AM, Michael Paquier > <michael.paquier@gmail.com> wrote: >> On Wed, Jul 3, 2013 at 5:22 AM, Fujii Masao <masao.fujii@gmail.com> wrote: >>> Why did you remove the check of indisvalid from the --binary-upgrade SQL? >>> Without this check, if there is the invalid toast index, more than one rows are >>> returned and ExecuteSqlQueryForSingleRow() would cause the error. >>> >>> + foreach(lc, indexlist) >>> + *toastidxs[i++] = index_open(lfirst_oid(lc), lock); >>> >>> *toastidxs[i++] should be (*toastidxs)[i++]. Otherwise, segmentation fault can >>> happen. >>> >>> For now I've not found any other big problem except the above. > > system_views.sql > - GROUP BY C.oid, N.nspname, C.relname, T.oid, X.oid; > + GROUP BY C.oid, N.nspname, C.relname, T.oid, X.indexrelid; > > I found another problem. X.indexrelid should be X.indrelid. Otherwise, when > there is the invalid toast index, more than one rows are returned for the same > relation. Indeed, fixed. > >> OK cool, updated version attached. If you guys think that the attached >> version is fine (only the reltoasyidxid removal part), perhaps it >> would be worth committing it as Robert also committed the MVCC catalog >> patch today. So we would be able to focus on the core feature asap >> with the 2nd patch, and the removal of AccessExclusiveLock at swap >> step. > >> Yep, will do. Maybe today. I also double-checked with gdb and the REINDEX CONCURRENTLY patch applied on top of the attached patch that the new code paths introduced in tuptoaster.c are fine. Regards, -- Michael
Attachment
On 2013-07-03 10:03:26 +0900, Michael Paquier wrote: > +static int > +toast_open_indexes(Relation toastrel, > + LOCKMODE lock, > + Relation **toastidxs, > + int *num_indexes) > + /* > + * Free index list, not necessary as relations are opened and a valid index > + * has been found. > + */ > + list_free(indexlist); Missing "anymore" or such. > index 9ee9ea2..23e0373 100644 > --- a/src/bin/pg_dump/pg_dump.c > +++ b/src/bin/pg_dump/pg_dump.c > @@ -2778,10 +2778,9 @@ binary_upgrade_set_pg_class_oids(Archive *fout, > PQExpBuffer upgrade_query = createPQExpBuffer(); > PGresult *upgrade_res; > Oid pg_class_reltoastrelid; > - Oid pg_class_reltoastidxid; > > appendPQExpBuffer(upgrade_query, > - "SELECT c.reltoastrelid, t.reltoastidxid " > + "SELECT c.reltoastrelid " > "FROM pg_catalog.pg_class c LEFT JOIN " > "pg_catalog.pg_class t ON (c.reltoastrelid = t.oid) " > "WHERE c.oid = '%u'::pg_catalog.oid;", > @@ -2790,7 +2789,6 @@ binary_upgrade_set_pg_class_oids(Archive *fout, > upgrade_res = ExecuteSqlQueryForSingleRow(fout, upgrade_query->data); > > pg_class_reltoastrelid = atooid(PQgetvalue(upgrade_res, 0, PQfnumber(upgrade_res, "reltoastrelid"))); > - pg_class_reltoastidxid = atooid(PQgetvalue(upgrade_res, 0, PQfnumber(upgrade_res, "reltoastidxid"))); > > appendPQExpBuffer(upgrade_buffer, > "\n-- For binary upgrade, must preserve pg_class oids\n"); > @@ -2803,6 +2801,10 @@ binary_upgrade_set_pg_class_oids(Archive *fout, > /* only tables have toast tables, not indexes */ > if (OidIsValid(pg_class_reltoastrelid)) > { > + PQExpBuffer index_query = createPQExpBuffer(); > + PGresult *index_res; > + Oid indexrelid; > + > /* > * One complexity is that the table definition might not require > * the creation of a TOAST table, and the TOAST table might have > @@ -2816,10 +2818,23 @@ binary_upgrade_set_pg_class_oids(Archive *fout, > "SELECT binary_upgrade.set_next_toast_pg_class_oid('%u'::pg_catalog.oid);\n", > pg_class_reltoastrelid); > > - /* every toast table has an index */ > + /* Every toast table has one valid index, so fetch it first... */ > + appendPQExpBuffer(index_query, > + "SELECT c.indexrelid " > + "FROM pg_catalog.pg_index c " > + "WHERE c.indrelid = %u AND c.indisvalid;", > + pg_class_reltoastrelid); > + index_res = ExecuteSqlQueryForSingleRow(fout, index_query->data); > + indexrelid = atooid(PQgetvalue(index_res, 0, > + PQfnumber(index_res, "indexrelid"))); > + > + /* Then set it */ > appendPQExpBuffer(upgrade_buffer, > "SELECT binary_upgrade.set_next_index_pg_class_oid('%u'::pg_catalog.oid);\n", > - pg_class_reltoastidxid); > + indexrelid); > + > + PQclear(index_res); > + destroyPQExpBuffer(index_query); Wouldn't it make more sense to fetch the toast index oid in the query ontop instead of making a query for every relation? Looking good! Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Jul 3, 2013 at 11:16 PM, Andres Freund <andres@2ndquadrant.com> wrote: > On 2013-07-03 10:03:26 +0900, Michael Paquier wrote: >> index 9ee9ea2..23e0373 100644 >> --- a/src/bin/pg_dump/pg_dump.c >> +++ b/src/bin/pg_dump/pg_dump.c >> @@ -2778,10 +2778,9 @@ binary_upgrade_set_pg_class_oids(Archive *fout, >> PQExpBuffer upgrade_query = createPQExpBuffer(); >> PGresult *upgrade_res; >> Oid pg_class_reltoastrelid; >> - Oid pg_class_reltoastidxid; >> >> appendPQExpBuffer(upgrade_query, >> - "SELECT c.reltoastrelid, t.reltoastidxid " >> + "SELECT c.reltoastrelid " >> "FROM pg_catalog.pg_class c LEFT JOIN " >> "pg_catalog.pg_class t ON (c.reltoastrelid = t.oid) " >> "WHERE c.oid = '%u'::pg_catalog.oid;", >> @@ -2790,7 +2789,6 @@ binary_upgrade_set_pg_class_oids(Archive *fout, >> upgrade_res = ExecuteSqlQueryForSingleRow(fout, upgrade_query->data); >> >> pg_class_reltoastrelid = atooid(PQgetvalue(upgrade_res, 0, PQfnumber(upgrade_res, "reltoastrelid"))); >> - pg_class_reltoastidxid = atooid(PQgetvalue(upgrade_res, 0, PQfnumber(upgrade_res, "reltoastidxid"))); >> >> appendPQExpBuffer(upgrade_buffer, >> "\n-- For binary upgrade, must preserve pg_class oids\n"); >> @@ -2803,6 +2801,10 @@ binary_upgrade_set_pg_class_oids(Archive *fout, >> /* only tables have toast tables, not indexes */ >> if (OidIsValid(pg_class_reltoastrelid)) >> { >> + PQExpBuffer index_query = createPQExpBuffer(); >> + PGresult *index_res; >> + Oid indexrelid; >> + >> /* >> * One complexity is that the table definition might not require >> * the creation of a TOAST table, and the TOAST table might have >> @@ -2816,10 +2818,23 @@ binary_upgrade_set_pg_class_oids(Archive *fout, >> "SELECT binary_upgrade.set_next_toast_pg_class_oid('%u'::pg_catalog.oid);\n", >> pg_class_reltoastrelid); >> >> - /* every toast table has an index */ >> + /* Every toast table has one valid index, so fetch it first... */ >> + appendPQExpBuffer(index_query, >> + "SELECT c.indexrelid " >> + "FROM pg_catalog.pg_index c " >> + "WHERE c.indrelid = %u AND c.indisvalid;", >> + pg_class_reltoastrelid); >> + index_res = ExecuteSqlQueryForSingleRow(fout, index_query->data); >> + indexrelid = atooid(PQgetvalue(index_res, 0, >> + PQfnumber(index_res, "indexrelid"))); >> + >> + /* Then set it */ >> appendPQExpBuffer(upgrade_buffer, >> "SELECT binary_upgrade.set_next_index_pg_class_oid('%u'::pg_catalog.oid);\n", >> - pg_class_reltoastidxid); >> + indexrelid); >> + >> + PQclear(index_res); >> + destroyPQExpBuffer(index_query); > > Wouldn't it make more sense to fetch the toast index oid in the query > ontop instead of making a query for every relation? With something like a CASE condition in the upper query for reltoastrelid? This code path is not only taken by indexes but also by tables. So I thought that it was cleaner and more readable to fetch the index OID only if necessary as a separate query. Regards, -- Michael
On 2013-07-04 02:32:32 +0900, Michael Paquier wrote: > > Wouldn't it make more sense to fetch the toast index oid in the query > > ontop instead of making a query for every relation? > With something like a CASE condition in the upper query for > reltoastrelid? This code path is not only taken by indexes but also by > tables. So I thought that it was cleaner and more readable to fetch > the index OID only if necessary as a separate query. A left join should do the trick? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
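As a sketch of what Andres is suggesting here, the binary-upgrade query in pg_dump could fetch the valid toast index OID in the same round trip with a LEFT JOIN, roughly as below; the column aliases and surrounding variable names are illustrative, not the committed text:

    appendPQExpBuffer(upgrade_query,
                      "SELECT c.reltoastrelid, i.indexrelid "
                      "FROM pg_catalog.pg_class c LEFT JOIN "
                      "pg_catalog.pg_index i "
                      "ON (c.reltoastrelid = i.indrelid AND i.indisvalid) "
                      "WHERE c.oid = '%u'::pg_catalog.oid;",
                      pg_class_oid);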
On Thu, Jul 4, 2013 at 2:36 AM, Andres Freund <andres@2ndquadrant.com> wrote: > On 2013-07-04 02:32:32 +0900, Michael Paquier wrote: >> > Wouldn't it make more sense to fetch the toast index oid in the query >> > ontop instead of making a query for every relation? +1 I changed the query that way. Updated version of the patch attached. Also I updated the rules.out because Michael changed the system_views.sql. Otherwise, the regression test would fail. Will commit this patch. Regards, -- Fujii Masao
Attachment
On Thu, Jul 4, 2013 at 2:41 AM, Fujii Masao <masao.fujii@gmail.com> wrote: > On Thu, Jul 4, 2013 at 2:36 AM, Andres Freund <andres@2ndquadrant.com> wrote: >> On 2013-07-04 02:32:32 +0900, Michael Paquier wrote: >>> > Wouldn't it make more sense to fetch the toast index oid in the query >>> > ontop instead of making a query for every relation? > > +1 > I changed the query that way. Updated version of the patch attached. > > Also I updated the rules.out because Michael changed the system_views.sql. > Otherwise, the regression test would fail. > > Will commit this patch. Committed. So, let's get to REINDEX CONCURRENTLY patch! Regards, -- Fujii Masao
On Thu, Jul 4, 2013 at 3:26 AM, Fujii Masao <masao.fujii@gmail.com> wrote: > On Thu, Jul 4, 2013 at 2:41 AM, Fujii Masao <masao.fujii@gmail.com> wrote: >> On Thu, Jul 4, 2013 at 2:36 AM, Andres Freund <andres@2ndquadrant.com> wrote: >>> On 2013-07-04 02:32:32 +0900, Michael Paquier wrote: >>>> > Wouldn't it make more sense to fetch the toast index oid in the query >>>> > ontop instead of making a query for every relation? >> >> +1 >> I changed the query that way. Updated version of the patch attached. >> >> Also I updated the rules.out because Michael changed the system_views.sql. >> Otherwise, the regression test would fail. >> >> Will commit this patch. > > Committed. So, let's get to REINDEX CONCURRENTLY patch! Thanks for the hard work! I'll work on something based on MVCC catalogs, so at least lock will be lowered at swap phase and isolation tests will be added. -- Michael
Hi, I noticed some errors in the comments of the patch committed. Please find attached a patch to correct that. Regards, -- Michael
Attachment
On Thu, Jul 4, 2013 at 3:38 PM, Michael Paquier <michael.paquier@gmail.com> wrote: > Hi, > > I noticed some errors in the comments of the patch committed. Please > find attached a patch to correct that. Committed. Thanks! Regards, -- Fujii Masao
Hi all, Please find attached the patch using MVCC catalogs. I have split the previous core patch into 3 pieces to facilitate the review and reduce the size of the main patch, as the previous core patch contained a lot of code refactoring. 0) 20130705_0_procarray.patch, this patch adds a set of generic APIs in procarray.c that can be used to wait for snapshots older than a given xmin, or to wait for some virtual locks. This code has been taken from CREATE/DROP INDEX CONCURRENTLY, and I think that this set of APIs could be used for the implementation of other concurrent DDLs. 1) 20130705_1_index_conc_struct.patch, this patch refactors CREATE/DROP INDEX CONCURRENTLY a bit to create 2 generic APIs for the build of a concurrent index, and for the step where it is set as dead. 2) 20130705_2_reindex_concurrently_v28.patch, with the core feature. I have added some stuff here: - isolation tests (perhaps it would be better to make the DML actions last longer in those tests?) - reduction of the lock used at the swap phase from AccessExclusiveLock to ShareUpdateExclusiveLock, and a wait for old snapshots at the end of the swap phase, before its commit, to be sure that no transactions will use the old relfilenode that has been swapped out - doc update - simplified some APIs, like the removal of index_concurrent_clear_valid - fixed a bug where it was not possible to reindex a toast relation concurrently Patch 1 depends on 0; patch 2 depends on 1 and 0. Patch 0 can be applied directly on master. The first two patches are pretty simple; patch 0 could even be quickly reviewed and approved to provide some more infrastructure that could possibly be used by some other patches around, like REFRESH CONCURRENTLY... I have also done some tests with the set of patches: - Manual testing, and checked that the process went smoothly by taking some manual checkpoints during each phase of REINDEX CONCURRENTLY - Ran make check for regression and isolation tests - Ran make installcheck, and then REINDEX DATABASE CONCURRENTLY on the regression database that remained on the server Regards, -- Michael
Attachment
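For reference, the snapshot-waiting API described in 0) can be pictured as a variant of the loop CREATE INDEX CONCURRENTLY already runs in DefineIndex(); the sketch below is simplified (it omits the re-checks the real code performs between waits), and the function name and exact signature are assumptions based on the description above:

    /*
     * Wait until no transaction whose snapshot could be older than
     * limitXmin is still running.  Autovacuum workers are excluded.
     */
    static void
    WaitForOldSnapshots(TransactionId limitXmin)
    {
        int                    i;
        int                    n_old_snapshots;
        VirtualTransactionId  *old_snapshots;

        old_snapshots = GetCurrentVirtualXIDs(limitXmin, true, false,
                                              PROC_IS_AUTOVACUUM | PROC_IN_VACUUM,
                                              &n_old_snapshots);

        for (i = 0; i < n_old_snapshots; i++)
        {
            if (!VirtualTransactionIdIsValid(old_snapshots[i]))
                continue;                           /* transaction already gone */
            VirtualXactLock(old_snapshots[i], true);    /* block until it ends */
        }
    }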
Hi all, I am resending the patches after Fujii-san noticed a bug allowing to even drop valid toast indexes with the latest code... While looking at that, I found a couple of other bugs: - two bugs, now fixed, with the code path added in tablecmds.c to allow the manual drop of invalid toast indexes: -- Even a user having no permission on the parent toast table could drop an invalid toast index -- A lock on the parent toast relation was not taken as it is the case for all the indexes dropped with DROP INDEX - Trying to reindex concurrently a mapped catalog leads to an error. As they have no relfilenode, I think it makes sense to block reindex concurrently in this case, so I modified the core patch in this sense. Regards, On Fri, Jul 5, 2013 at 1:47 PM, Michael Paquier <michael.paquier@gmail.com> wrote: > Hi all, > > Please find attached the patch using MVCC catalogs. I have split the > previous core patch into 3 pieces to facilitate the review and reduce > the size of the main patch as the previous core patch contained a lot > of code refactoring. > 0) 20130705_0_procarray.patch, this patch adds a set of generic APIs > in procarray.c that can be used to wait for snapshots older than a > given xmin, or to wait for some virtual locks. This code has been > taken from CREATE/DROP INDEX CONCURRENTLY, and I think that this set > of APIs could be used for the implementation os other concurrent DDLs. > 1) 20130705_1_index_conc_struct.patch, this patch refactors a bit > CREATE/DROP INDEX CONCURRENTLY to create 2 generic APIs for the build > of a concurrent index, and the step where it is set as dead. > 2) 20130705_2_reindex_concurrently_v28.patch, with the core feature. I > have added some stuff here: > - isolation tests, (perhaps it would be better to make the DML actions > last longer in those tests?) > - reduction of the lock used at swap phase from AccessExclusiveLock to > ShareUpdateExclusiveLock, and added a wait before commit of swap phase > for old snapshots at the end of swap phase to be sure that no > transactions will use the old relfilenode that has been swapped after > commit > - doc update > - simplified some APIs, like the removal of index_concurrent_clear_valid > - fixed a bug where it was not possible to reindex concurrently a toast relation > Patch 1 depends on 0, Patch 2 depends on 1 and 0. Patch 0 can be > applied directly on master. > > The two first patches are pretty simple, patch 0 could even be quickly > reviewed and approved to provide some more infrastructure that could > be possibly used by some other patches around, like REFRESH > CONCURRENTLY... > > I have also done some tests with the set of patches: > - Manual testing, and checked that process went smoothly by taking > some manual checkpoints during each phase of REINDEX CONCURRENTLY > - Ran make check for regression and isolation tests > - Ran make installcheck, and then REINDEX DATABASE CONCURRENTLY on the > database regression that remained on server > > Regards, > -- > Michael -- Michael
Attachment
On Thu, Jul 11, 2013 at 5:11 PM, Michael Paquier <michael.paquier@gmail.com> wrote: > I am resending the patches after Fujii-san noticed a bug allowing to > even drop valid toast indexes with the latest code... While looking at > that, I found a couple of other bugs: > - two bugs, now fixed, with the code path added in tablecmds.c to > allow the manual drop of invalid toast indexes: > -- Even a user having no permission on the parent toast table could > drop an invalid toast index > -- A lock on the parent toast relation was not taken as it is the case > for all the indexes dropped with DROP INDEX > - Trying to reindex concurrently a mapped catalog leads to an error. > As they have no relfilenode, I think it makes sense to block reindex > concurrently in this case, so I modified the core patch in this sense. This patch status has been changed to returned with feedback. -- Michael
Hi, I have been working a little bit more on this patch for the next commit fest. Compared to the previous version, I have removed the part of the code where the process running REINDEX CONCURRENTLY was waiting, at the validation and swap phases, for transactions holding a snapshot older than its own snapshot xmin. At the validation phase, there was a risk that the newly-validated index might not contain tuples deleted before the snapshot used for validation was taken. I tried to break the code in this area by playing with multiple sessions but couldn't. Feel free to try the code and break it if you can! At the swap phase, the process running REINDEX CONCURRENTLY needed to wait for transactions that might still have needed the older index information being swapped. As the swap phase is done with an MVCC snapshot, this is not necessary anymore. With this code removed, I am no longer seeing with this patch the deadlocks that could occur when other sessions tried to take a ShareUpdateExclusiveLock on the relation, with an ANALYZE for example. So multiple backends can run REINDEX CONCURRENTLY or ANALYZE commands in parallel without risk of deadlock; processes will just wait for locks as long as necessary. Regards, -- Michael
Attachment
On 2013-08-27 15:34:22 +0900, Michael Paquier wrote: > I have been working a little bit more on this patch for the next > commit fest. Compared to the previous version, I have removed the part > of the code where process running REINDEX CONCURRENTLY was waiting for > transactions holding a snapshot older than the snapshot xmin of > process running REINDEX CONCURRENTLY at the validation and swap phase. > At the validation phase, there was a risk that the newly-validated > index might not contain deleted tuples before the snapshot used for > validation was taken. I tried to break the code in this area by > playing with multiple sessions but couldn't. Feel free to try the code > and break it if you can! Hm. Do you have any justifications for removing those waits besides "I couldn't break it"? The logic for the concurrent indexing is pretty intricate and we've got it wrong a couple of times without noticing bugs for a long while, so I am really uncomfortable with statements like this. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Aug 27, 2013 at 11:09 PM, Andres Freund <andres@2ndquadrant.com> wrote: > On 2013-08-27 15:34:22 +0900, Michael Paquier wrote: >> I have been working a little bit more on this patch for the next >> commit fest. Compared to the previous version, I have removed the part >> of the code where process running REINDEX CONCURRENTLY was waiting for >> transactions holding a snapshot older than the snapshot xmin of >> process running REINDEX CONCURRENTLY at the validation and swap phase. >> At the validation phase, there was a risk that the newly-validated >> index might not contain deleted tuples before the snapshot used for >> validation was taken. I tried to break the code in this area by >> playing with multiple sessions but couldn't. Feel free to try the code >> and break it if you can! > > Hm. Do you have any justifications for removing those waits besides "I > couldn't break it"? The logic for the concurrent indexing is pretty > intricate and we've got it wrong a couple of times without noticing bugs > for a long while, so I am really uncomfortable with statements like this. Note that the waits on relation locks are not removed, only the wait phases involving old snapshots. During the swap phase, the process waited for transactions with snapshots older than the one taken by the transaction doing the swap, as they might still hold the old index information. I think that we can get rid of it thanks to the MVCC snapshots, as other backends are now able to see which index information is the correct one to fetch. After the validation of the new index, the index has all the necessary tuples; however, it might not have taken into account tuples that were deleted before the reference snapshot was taken. But in the case of REINDEX CONCURRENTLY the validated index is not marked as valid as it is in CREATE INDEX CONCURRENTLY; the transaction doing the validation is simply committed. This index is considered valid only after the swap phase, when the relfilenodes are switched. I am sure you will find some flaws in this reasoning though :). Of course, not having been able to break this code with my picky tests using targeted breakpoints does not mean that it will not fail in a given scenario, just that I could not break it yet. Note also that removing those wait phases has the advantage of removing the risk of deadlocks when an ANALYZE is run in parallel with REINDEX CONCURRENTLY, as was the case in the previous versions of the patch (reproducible when waiting for the old snapshots if a session takes ShareUpdateExclusiveLock on the same relation in parallel). -- Michael
On 2013-08-28 13:58:08 +0900, Michael Paquier wrote: > On Tue, Aug 27, 2013 at 11:09 PM, Andres Freund <andres@2ndquadrant.com> wrote: > > On 2013-08-27 15:34:22 +0900, Michael Paquier wrote: > >> I have been working a little bit more on this patch for the next > >> commit fest. Compared to the previous version, I have removed the part > >> of the code where process running REINDEX CONCURRENTLY was waiting for > >> transactions holding a snapshot older than the snapshot xmin of > >> process running REINDEX CONCURRENTLY at the validation and swap phase. > >> At the validation phase, there was a risk that the newly-validated > >> index might not contain deleted tuples before the snapshot used for > >> validation was taken. I tried to break the code in this area by > >> playing with multiple sessions but couldn't. Feel free to try the code > >> and break it if you can! > > > > Hm. Do you have any justifications for removing those waits besides "I > > couldn't break it"? The logic for the concurrent indexing is pretty > > intricate and we've got it wrong a couple of times without noticing bugs > > for a long while, so I am really uncomfortable with statements like this. > Note that the waits on relation locks are not removed, only the wait > phases involving old snapshots. > > During swap phase, process was waiting for transactions with older > snapshots than the one taken by transaction doing the swap as they > might hold the old index information. I think that we can get rid of > it thanks to the MVCC snapshots as other backends are now able to see > what is the correct index information to fetch. I don't see MVCC snapshots guaranteeing that. The only thing changed due to them is that other backends see a self consistent picture of the catalog (i.e. not either, neither or both versions of a tuple as earlier). It's still can be out of date. And we rely on those not being out of date. I need to look into the patch for more details. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Aug 28, 2013 at 9:02 AM, Andres Freund <andres@2ndquadrant.com> wrote: >> During swap phase, process was waiting for transactions with older >> snapshots than the one taken by transaction doing the swap as they >> might hold the old index information. I think that we can get rid of >> it thanks to the MVCC snapshots as other backends are now able to see >> what is the correct index information to fetch. > > I don't see MVCC snapshots guaranteeing that. The only thing changed due > to them is that other backends see a self consistent picture of the > catalog (i.e. not either, neither or both versions of a tuple as > earlier). It's still can be out of date. And we rely on those not being > out of date. > > I need to look into the patch for more details. I agree with Andres. The only way in which the MVCC catalog snapshot patch helps is that you can now do a transactional update on a system catalog table without fearing that other backends will see the row as nonexistent or duplicated. They will see exactly one version of the row, just as you would naturally expect. However, a backend's syscaches can still contain old versions of rows, and they can still cache older versions of some tuples and newer versions of other tuples. Those caches only get reloaded when shared-invalidation messages are processed, and that only happens when the backend acquires a lock on a new relation. I have been of the opinion for some time now that the shared-invalidation code is not a particularly good design for much of what we need. Waiting for an old snapshot is often a proxy for waiting long enough that we can be sure every other backend will process the shared-invalidation message before it next uses any of the cached data that will be invalidated by that message. However, it would be better to be able to send invalidation messages in some way that causes them to processed more eagerly by other backends, and that provides some more specific feedback on whether or not they have actually been processed. Then we could send the invalidation messages, wait just until everyone confirms that they have been seen, which should hopefully happen quickly, and then proceed. This would probably lead to much shorter waits. Or maybe we should have individual backends process invalidations more frequently, and try to set things up so that once an invalidation is sent, the sending backend is immediately guaranteed that it will be processed soon enough, and thus it doesn't need to wait at all. This is all pie in the sky, though. I don't have a clear idea how to design something that's an improvement over the (rather intricate) system we have today. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2013-08-29 10:39:09 -0400, Robert Haas wrote: > I have been of the opinion for some time now that the > shared-invalidation code is not a particularly good design for much of > what we need. Waiting for an old snapshot is often a proxy for > waiting long enough that we can be sure every other backend will > process the shared-invalidation message before it next uses any of the > cached data that will be invalidated by that message. However, it > would be better to be able to send invalidation messages in some way > that causes them to processed more eagerly by other backends, and that > provides some more specific feedback on whether or not they have > actually been processed. Then we could send the invalidation > messages, wait just until everyone confirms that they have been seen, > which should hopefully happen quickly, and then proceed. Actually, the shared inval code already has that knowledge, doesn't it? ISTM all we'd need is to have a "sequence number" of SI entries which has to be queryable. Then one can simply wait till all backends have consumed up to that id; we would keep track of the furthest-behind backend in shmem. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
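To make the sequence-number idea concrete, the wait could look something like the sketch below; all of these helper names are hypothetical (today's sinvaladt.c keeps per-backend read positions internally but does not expose them this way):

    /* Hypothetical sketch only: none of these SI* helpers exist as such. */
    int64   target = SIGetInsertPosition();     /* queue position just after our message */

    while (SIGetMinimumReadPosition() < target)
    {
        SINotifyLaggingBackends(target);        /* nudge backends that are behind */
        pg_usleep(10 * 1000L);                  /* 10ms */
        CHECK_FOR_INTERRUPTS();
    }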
Hi, Looking at this version of the patch now: 1) comment for "Phase 4 of REINDEX CONCURRENTLY" ends with an incomplete sentence. 2) I don't think the drop algorithm used now is correct. Your index_concurrent_set_dead() sets both indisvalid = false and indislive = false at the same time. It does so after doing a WaitForVirtualLocks() - but that's not sufficient. Between waiting and setting indisvalid = false another transaction could start which then would start using that index. Which will not get updated anymore by other concurrent backends because of inislive = false. You really need to follow index_drop's lead here and first unset indisvalid then wait till nobody can use the index for querying anymore and only then unset indislive. 3) I am not sure if the swap algorithm used now actually is correct either. We have mvcc snapshots now, right, but we're still potentially taking separate snapshot for individual relcache lookups. What's stopping us from temporarily ending up with two relcache entries with the same relfilenode? Previously you swapped names - I think that might end up being easier, because having names temporarily confused isn't as bad as two indexes manipulating the same file. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Sep 16, 2013 at 10:38 AM, Andres Freund <andres@2ndquadrant.com> wrote: > On 2013-08-29 10:39:09 -0400, Robert Haas wrote: >> I have been of the opinion for some time now that the >> shared-invalidation code is not a particularly good design for much of >> what we need. Waiting for an old snapshot is often a proxy for >> waiting long enough that we can be sure every other backend will >> process the shared-invalidation message before it next uses any of the >> cached data that will be invalidated by that message. However, it >> would be better to be able to send invalidation messages in some way >> that causes them to processed more eagerly by other backends, and that >> provides some more specific feedback on whether or not they have >> actually been processed. Then we could send the invalidation >> messages, wait just until everyone confirms that they have been seen, >> which should hopefully happen quickly, and then proceed. > > Actually, the shared inval code already has that knowledge, doesn't it? > ISTM all we'd need is have a "sequence number" of SI entries which has > to be queryable. Then one can simply wait till all backends have > consumed up to that id which we keep track of the furthest back backend > in shmem. In theory, yes, but in practice, there are a few difficulties. 1. We're not in a huge hurry to ensure that sinval notifications are delivered in a timely fashion. We know that sinval resets are bad, so if a backend is getting close to needing a sinval reset, we kick it in an attempt to get it to AcceptInvalidationMessages(). But if the sinval queue isn't filling up, there's no upper bound on the amount of time that can pass before a particular sinval is read. Therefore, the amount of time that passes before an idle backend is forced to drain the sinval queue can vary widely, from a fraction of a second to minutes, hours, or days. So it's kind of unappealing to think about making user-visible behavior dependent on how long it ends up taking. 2. Every time we add a new kind of sinval message, we increase the frequency of sinval resets, and those are bad. So any notifications that we choose to send this way had better be pretty low-volume. Considering the foregoing points, it's unclear to me whether we should try to improve sinval incrementally or replace it with something completely new. I'm sure that the above-mentioned problems are solvable, but I'm not sure how hairy it will be. On the other hand, designing something new could be pretty hairy, too. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2013-09-17 16:34:37 -0400, Robert Haas wrote: > On Mon, Sep 16, 2013 at 10:38 AM, Andres Freund <andres@2ndquadrant.com> wrote: > > Actually, the shared inval code already has that knowledge, doesn't it? > > ISTM all we'd need is have a "sequence number" of SI entries which has > > to be queryable. Then one can simply wait till all backends have > > consumed up to that id which we keep track of the furthest back backend > > in shmem. > > In theory, yes, but in practice, there are a few difficulties. Agreed ;) > 1. We're not in a huge hurry to ensure that sinval notifications are > delivered in a timely fashion. We know that sinval resets are bad, so > if a backend is getting close to needing a sinval reset, we kick it in > an attempt to get it to AcceptInvalidationMessages(). But if the > sinval queue isn't filling up, there's no upper bound on the amount of > time that can pass before a particular sinval is read. Therefore, the > amount of time that passes before an idle backend is forced to drain > the sinval queue can vary widely, from a fraction of a second to > minutes, hours, or days. So it's kind of unappealing to think about > making user-visible behavior dependent on how long it ends up taking. Well, when we're signalling it's certainly faster than waiting for the other's snapshot to vanish which can take ages for normal backends. And we can signal when we wait for consumption without too many problems. Also, I think in most of the usecases we can simply not wait for any of the idle backends, those don't use the old definition anyway. > 2. Every time we add a new kind of sinval message, we increase the > frequency of sinval resets, and those are bad. So any notifications > that we choose to send this way had better be pretty low-volume. In pretty much all the cases where I can see the need for something like that, we already send sinval messages, so we should be able to piggbyback on those. > Considering the foregoing points, it's unclear to me whether we should > try to improve sinval incrementally or replace it with something > completely new. I'm sure that the above-mentioned problems are > solvable, but I'm not sure how hairy it will be. On the other hand, > designing something new could be pretty hairy, too. I am pretty sure there's quite a bit to improve around sinvals but I think any replacement would look surprisingly similar to what we have. So I think doing it incrementally is more realistic. And I am certainly scared by the thought of having to replace it without breaking corner cases all over. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Sep 17, 2013 at 7:04 PM, Andres Freund <andres@2ndquadrant.com> wrote: >> 1. We're not in a huge hurry to ensure that sinval notifications are >> delivered in a timely fashion. We know that sinval resets are bad, so >> if a backend is getting close to needing a sinval reset, we kick it in >> an attempt to get it to AcceptInvalidationMessages(). But if the >> sinval queue isn't filling up, there's no upper bound on the amount of >> time that can pass before a particular sinval is read. Therefore, the >> amount of time that passes before an idle backend is forced to drain >> the sinval queue can vary widely, from a fraction of a second to >> minutes, hours, or days. So it's kind of unappealing to think about >> making user-visible behavior dependent on how long it ends up taking. > > Well, when we're signalling it's certainly faster than waiting for the > other's snapshot to vanish which can take ages for normal backends. And > we can signal when we wait for consumption without too many > problems. > Also, I think in most of the usecases we can simply not wait for any of > the idle backends, those don't use the old definition anyway. Possibly. It would need some thought, though. > I am pretty sure there's quite a bit to improve around sinvals but I > think any replacement would look surprisingly similar to what we > have. So I think doing it incrementally is more realistic. > And I am certainly scared by the thought of having to replace it without > breaking corner cases all over. I guess I was more thinking that we might want some parallel mechanism with somewhat different semantics. But that might be a bad idea anyway. On the flip side, if I had any clear idea how to adapt the current mechanism to suck less, I would have done it already. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi, Sorry for the late reply; I am coming back to poke at this patch a bit. One of the things I am still unhappy with in this patch is the potential for deadlocks that can come up when, for example, another backend kicks off another operation taking ShareUpdateExclusiveLock (ANALYZE or another REINDEX CONCURRENTLY) on the same relation as the one being reindexed concurrently. This can happen because we need to wait at the index validation phase, as the process might not have taken into account tuples deleted before the reference snapshot was taken. I played a little bit with a version of the code that does no old-snapshot waiting, but even though I couldn't break it directly, concurrent backends sometimes fetched incorrect tuples from the heap. I unfortunately have no clear solution for that... except making REINDEX CONCURRENTLY fail when validating the concurrent index, with a clear error message not referring to any deadlock, giving priority to other processes such as ANALYZE, or other backends ready to kick off another REINDEX CONCURRENTLY... Any ideas here are welcome; the attached patch implements what is mentioned here. On Tue, Sep 17, 2013 at 12:37 AM, Andres Freund <andres@2ndquadrant.com> wrote: > Looking at this version of the patch now: > 1) comment for "Phase 4 of REINDEX CONCURRENTLY" ends with an incomplete > sentence. Oops, thanks. > 2) I don't think the drop algorithm used now is correct. Your > index_concurrent_set_dead() sets both indisvalid = false and indislive = > false at the same time. It does so after doing a WaitForVirtualLocks() - > but that's not sufficient. Between waiting and setting indisvalid = > false another transaction could start which then would start using that > index. Which will not get updated anymore by other concurrent backends > because of inislive = false. > You really need to follow index_drop's lead here and first unset > indisvalid then wait till nobody can use the index for querying anymore > and only then unset indislive. Sorry, I do not follow you here. index_concurrent_set_dead calls index_set_state_flags, which sets indislive and *indisready* to false, not indisvalid. The concurrent index never uses indisvalid = true so it can never be called by another backend for a read query. The drop algorithm is made to be consistent with DROP INDEX CONCURRENTLY btw. > 3) I am not sure if the swap algorithm used now actually is correct > either. We have mvcc snapshots now, right, but we're still potentially taking separate snapshot for individual relcache lookups. What's > stopping us from temporarily ending up with two relcache entries with > the same relfilenode? > Previously you swapped names - I think that might end up being easier, > because having names temporarily confused isn't as bad as two indexes > manipulating the same file. Actually, performing the swap operation with names proves to be more difficult than it looks, as it requires a moment where both the old and new indexes are marked as valid for all the backends. The only reason for that is that index_set_state_flags assumes that a given xact has not yet done any transactional update when it is called, limiting the number of state flags that can be changed inside a transaction to one. This is a safe method IMO, and we shouldn't break that.
Also, as far as I understood, this is something that we *want* to avoid: a REINDEX CONCURRENTLY process that fails could otherwise end up with twice the number of valid indexes for a given relation if it is run on a table (or with two valid indexes if run on a single index). This is also a requirement for toast indexes, where the new code assumes that a toast relation can only have one single valid index at a time. For those reasons the relfilenode approach is better. Regards, -- Michael
Attachment
On 2013-09-26 12:13:30 +0900, Michael Paquier wrote: > > 2) I don't think the drop algorithm used now is correct. Your > > index_concurrent_set_dead() sets both indisvalid = false and indislive = > > false at the same time. It does so after doing a WaitForVirtualLocks() - > > but that's not sufficient. Between waiting and setting indisvalid = > > false another transaction could start which then would start using that > > index. Which will not get updated anymore by other concurrent backends > > because of inislive = false. > > You really need to follow index_drop's lead here and first unset > > indisvalid then wait till nobody can use the index for querying anymore > > and only then unset indislive. > Sorry, I do not follow you here. index_concurrent_set_dead calls > index_set_state_flags that sets indislive and *indisready* to false, > not indisvalid. The concurrent index never uses indisvalid = true so > it can never be called by another backend for a read query. The drop > algorithm is made to be consistent with DROP INDEX CONCURRENTLY btw. That makes it even worse... You can do the concurrent drop only in the following steps: 1) set indisvalid = false, no future relcache lookups will have it as valid 2) now wait for all transactions that potentially still can use the index for *querying* to finish. During that indisready*must* be true, otherwise the index will have outdated contents. 3) Mark the index as indislive = false, indisready = false. Anything using a newer relcache entry will now not update theindex. 4) Wait till all potential updaters of the index have finished. 5) Drop the index. With the patch's current scheme concurrent queries that use plans using the old index will get wrong results (at least in read committed) because concurrent writers will not update it anymore since it's marked indisready = false. This isn't a problem of the *new* index, it's a problem of the *old* one. Am I missing something? > > 3) I am not sure if the swap algorithm used now actually is correct > > either. We have mvcc snapshots now, right, but we're still potentially > > taking separate snapshot for individual relcache lookups. What's > > stopping us from temporarily ending up with two relcache entries with > > the same relfilenode? > > Previously you swapped names - I think that might end up being easier, > > because having names temporarily confused isn't as bad as two indexes > > manipulating the same file. > Actually, performing swap operation with names proves to be more > difficult than it looks as it makes necessary a moment where both the > old and new indexes are marked as valid for all the backends. But that doesn't make the current method correct, does it? > The only > reason for that is that index_set_state_flag assumes that a given xact > has not yet done any transactional update when it is called, forcing > to one the number of state flag that can be changed inside a > transaction. This is a safe method IMO, and we shouldn't break that. Part of that reasoning comes from the non-mvcc snapshot days, so it's not really up to date anymore. Even if you don't want to go through all that logic - which I'd understand quite well - you can just do it like: 1) start with: old index: valid, ready, live; new index: invalid, ready, live 2) one transaction: switch names from real_name => tmp_name, new_name => real_name 3) one transaction: mark real_name (which is the rebuilt index) as valid, and new_name as invalid Now, if we fail in the midst of 3, we'd have two indexes marked as valid. 
But that's unavoidable as long as you don't want to use transactions. Alternatively you could pass in a flag to use transactional updates; that should now be safe. At least, unless the old index still has "indexcheckxmin = true" with an xmin that's not old enough. But in that case we cannot do the concurrent reindex at all, I think, since we rely on the old index being fully valid. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
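Spelled out, the ordering Andres is asking for mirrors index_drop()'s concurrent path; the sketch below is simplified (transaction, locking and cache-invalidation details omitted), and WaitForVirtualLocks stands in for whichever wait primitive the patch ends up using, with arguments that are assumptions:

    /* Step 1: new plans must stop choosing the old index. */
    index_set_state_flags(oldIndexId, INDEX_DROP_CLEAR_VALID);
    CommitTransactionCommand();
    StartTransactionCommand();

    /*
     * Step 2: wait for every transaction that could still have a plan using
     * the old index.  It must stay indisready = true until this point so
     * that concurrent writers keep maintaining it.
     */
    WaitForVirtualLocks(heaplocktag, AccessExclusiveLock);

    /* Step 3: nobody can query it anymore, so stop maintaining it. */
    index_set_state_flags(oldIndexId, INDEX_DROP_SET_DEAD);
    CommitTransactionCommand();
    StartTransactionCommand();

    /* Step 4: wait out the last potential writers, then drop it for real. */
    WaitForVirtualLocks(heaplocktag, AccessExclusiveLock);
    /* ... performDeletion() on the old index ... */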
On Thu, Sep 26, 2013 at 7:34 PM, Andres Freund <andres@2ndquadrant.com> wrote: > On 2013-09-26 12:13:30 +0900, Michael Paquier wrote: >> > 2) I don't think the drop algorithm used now is correct. Your >> > index_concurrent_set_dead() sets both indisvalid = false and indislive = >> > false at the same time. It does so after doing a WaitForVirtualLocks() - >> > but that's not sufficient. Between waiting and setting indisvalid = >> > false another transaction could start which then would start using that >> > index. Which will not get updated anymore by other concurrent backends >> > because of inislive = false. >> > You really need to follow index_drop's lead here and first unset >> > indisvalid then wait till nobody can use the index for querying anymore >> > and only then unset indislive. > >> Sorry, I do not follow you here. index_concurrent_set_dead calls >> index_set_state_flags that sets indislive and *indisready* to false, >> not indisvalid. The concurrent index never uses indisvalid = true so >> it can never be called by another backend for a read query. The drop >> algorithm is made to be consistent with DROP INDEX CONCURRENTLY btw. > > That makes it even worse... You can do the concurrent drop only in the > following steps: > 1) set indisvalid = false, no future relcache lookups will have it as valid indisvalid is never set to true for the concurrent index. Swap is done with concurrent index having indisvalid = false and former index with indisvalid = true. The concurrent index is validated with index_validate in a transaction before swap transaction. -- Michael
On 2013-09-26 20:40:40 +0900, Michael Paquier wrote: > On Thu, Sep 26, 2013 at 7:34 PM, Andres Freund <andres@2ndquadrant.com> wrote: > > On 2013-09-26 12:13:30 +0900, Michael Paquier wrote: > >> > 2) I don't think the drop algorithm used now is correct. Your > >> > index_concurrent_set_dead() sets both indisvalid = false and indislive = > >> > false at the same time. It does so after doing a WaitForVirtualLocks() - > >> > but that's not sufficient. Between waiting and setting indisvalid = > >> > false another transaction could start which then would start using that > >> > index. Which will not get updated anymore by other concurrent backends > >> > because of inislive = false. > >> > You really need to follow index_drop's lead here and first unset > >> > indisvalid then wait till nobody can use the index for querying anymore > >> > and only then unset indislive. > > > >> Sorry, I do not follow you here. index_concurrent_set_dead calls > >> index_set_state_flags that sets indislive and *indisready* to false, > >> not indisvalid. The concurrent index never uses indisvalid = true so > >> it can never be called by another backend for a read query. The drop > >> algorithm is made to be consistent with DROP INDEX CONCURRENTLY btw. > > > > That makes it even worse... You can do the concurrent drop only in the > > following steps: > > 1) set indisvalid = false, no future relcache lookups will have it as valid > indisvalid is never set to true for the concurrent index. Swap is done > with concurrent index having indisvalid = false and former index with > indisvalid = true. The concurrent index is validated with > index_validate in a transaction before swap transaction. Yes. I've described how it *has* to be done, not how it's done. The current method of going straight to isready = false for the original index will result in wrong results because it's not updated anymore while it's still being used. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Sep 26, 2013 at 8:43 PM, Andres Freund <andres@2ndquadrant.com> wrote: > On 2013-09-26 20:40:40 +0900, Michael Paquier wrote: >> On Thu, Sep 26, 2013 at 7:34 PM, Andres Freund <andres@2ndquadrant.com> wrote: >> > On 2013-09-26 12:13:30 +0900, Michael Paquier wrote: >> >> > 2) I don't think the drop algorithm used now is correct. Your >> >> > index_concurrent_set_dead() sets both indisvalid = false and indislive = >> >> > false at the same time. It does so after doing a WaitForVirtualLocks() - >> >> > but that's not sufficient. Between waiting and setting indisvalid = >> >> > false another transaction could start which then would start using that >> >> > index. Which will not get updated anymore by other concurrent backends >> >> > because of inislive = false. >> >> > You really need to follow index_drop's lead here and first unset >> >> > indisvalid then wait till nobody can use the index for querying anymore >> >> > and only then unset indislive. >> > >> >> Sorry, I do not follow you here. index_concurrent_set_dead calls >> >> index_set_state_flags that sets indislive and *indisready* to false, >> >> not indisvalid. The concurrent index never uses indisvalid = true so >> >> it can never be called by another backend for a read query. The drop >> >> algorithm is made to be consistent with DROP INDEX CONCURRENTLY btw. >> > >> > That makes it even worse... You can do the concurrent drop only in the >> > following steps: >> > 1) set indisvalid = false, no future relcache lookups will have it as valid > >> indisvalid is never set to true for the concurrent index. Swap is done >> with concurrent index having indisvalid = false and former index with >> indisvalid = true. The concurrent index is validated with >> index_validate in a transaction before swap transaction. > > Yes. I've described how it *has* to be done, not how it's done. > > The current method of going straight to isready = false for the original > index will result in wrong results because it's not updated anymore > while it's still being used. The index being dropped at the end of process is not the former index, but the concurrent index. The index used after REINDEX CONCURRENTLY is the old index but with the new relfilenode. Am I lacking of caffeine? It looks so... -- Michael
On 2013-09-26 20:47:33 +0900, Michael Paquier wrote: > On Thu, Sep 26, 2013 at 8:43 PM, Andres Freund <andres@2ndquadrant.com> wrote: > > On 2013-09-26 20:40:40 +0900, Michael Paquier wrote: > >> On Thu, Sep 26, 2013 at 7:34 PM, Andres Freund <andres@2ndquadrant.com> wrote: > >> > On 2013-09-26 12:13:30 +0900, Michael Paquier wrote: > >> >> > 2) I don't think the drop algorithm used now is correct. Your > >> >> > index_concurrent_set_dead() sets both indisvalid = false and indislive = > >> >> > false at the same time. It does so after doing a WaitForVirtualLocks() - > >> >> > but that's not sufficient. Between waiting and setting indisvalid = > >> >> > false another transaction could start which then would start using that > >> >> > index. Which will not get updated anymore by other concurrent backends > >> >> > because of inislive = false. > >> >> > You really need to follow index_drop's lead here and first unset > >> >> > indisvalid then wait till nobody can use the index for querying anymore > >> >> > and only then unset indislive. > >> > > >> >> Sorry, I do not follow you here. index_concurrent_set_dead calls > >> >> index_set_state_flags that sets indislive and *indisready* to false, > >> >> not indisvalid. The concurrent index never uses indisvalid = true so > >> >> it can never be called by another backend for a read query. The drop > >> >> algorithm is made to be consistent with DROP INDEX CONCURRENTLY btw. > >> > > >> > That makes it even worse... You can do the concurrent drop only in the > >> > following steps: > >> > 1) set indisvalid = false, no future relcache lookups will have it as valid > > > >> indisvalid is never set to true for the concurrent index. Swap is done > >> with concurrent index having indisvalid = false and former index with > >> indisvalid = true. The concurrent index is validated with > >> index_validate in a transaction before swap transaction. > > > > Yes. I've described how it *has* to be done, not how it's done. > > > > The current method of going straight to isready = false for the original > > index will result in wrong results because it's not updated anymore > > while it's still being used. > The index being dropped at the end of process is not the former index, > but the concurrent index. The index used after REINDEX CONCURRENTLY is > the old index but with the new relfilenode. That's not relevant unless I miss something. After phase 4 both indexes are valid (although only the old one is flagged as such), but due to the switching of the relfilenodes backends could have either of both open, depending on the time they built the relcache entry. Right? Then you go ahead and mark the old index - which still might be used! - as dead in phase 5. Which means other backends might (again, depending on the time they have built the relcache entry) not update it anymore. In read committed we very well might go ahead and use the index with the same plan as before, but with a new snapshot. Which now will miss entries. Am I misunderstanding the algorithm you're using? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Sep 26, 2013 at 8:56 PM, Andres Freund <andres@2ndquadrant.com> wrote: > On 2013-09-26 20:47:33 +0900, Michael Paquier wrote: >> On Thu, Sep 26, 2013 at 8:43 PM, Andres Freund <andres@2ndquadrant.com> wrote: >> > On 2013-09-26 20:40:40 +0900, Michael Paquier wrote: >> >> On Thu, Sep 26, 2013 at 7:34 PM, Andres Freund <andres@2ndquadrant.com> wrote: >> >> > On 2013-09-26 12:13:30 +0900, Michael Paquier wrote: >> >> >> > 2) I don't think the drop algorithm used now is correct. Your >> >> >> > index_concurrent_set_dead() sets both indisvalid = false and indislive = >> >> >> > false at the same time. It does so after doing a WaitForVirtualLocks() - >> >> >> > but that's not sufficient. Between waiting and setting indisvalid = >> >> >> > false another transaction could start which then would start using that >> >> >> > index. Which will not get updated anymore by other concurrent backends >> >> >> > because of inislive = false. >> >> >> > You really need to follow index_drop's lead here and first unset >> >> >> > indisvalid then wait till nobody can use the index for querying anymore >> >> >> > and only then unset indislive. >> >> > >> >> >> Sorry, I do not follow you here. index_concurrent_set_dead calls >> >> >> index_set_state_flags that sets indislive and *indisready* to false, >> >> >> not indisvalid. The concurrent index never uses indisvalid = true so >> >> >> it can never be called by another backend for a read query. The drop >> >> >> algorithm is made to be consistent with DROP INDEX CONCURRENTLY btw. >> >> > >> >> > That makes it even worse... You can do the concurrent drop only in the >> >> > following steps: >> >> > 1) set indisvalid = false, no future relcache lookups will have it as valid >> > >> >> indisvalid is never set to true for the concurrent index. Swap is done >> >> with concurrent index having indisvalid = false and former index with >> >> indisvalid = true. The concurrent index is validated with >> >> index_validate in a transaction before swap transaction. >> > >> > Yes. I've described how it *has* to be done, not how it's done. >> > >> > The current method of going straight to isready = false for the original >> > index will result in wrong results because it's not updated anymore >> > while it's still being used. > >> The index being dropped at the end of process is not the former index, >> but the concurrent index. The index used after REINDEX CONCURRENTLY is >> the old index but with the new relfilenode. > > That's not relevant unless I miss something. > > After phase 4 both indexes are valid (although only the old one is > flagged as such), but due to the switching of the relfilenodes backends > could have either of both open, depending on the time they built the > relcache entry. Right? > Then you go ahead and mark the old index - which still might be used! - > as dead in phase 5. Which means other backends might (again, depending > on the time they have built the relcache entry) not update it > anymore. In read committed we very well might go ahead and use the index > with the same plan as before, but with a new snapshot. Which now will > miss entries. In this case, doing a call to WaitForOldSnapshots after the swap phase is enough. It was included in past versions of the patch but removed in the last 2 versions. Btw, taking the problem from another viewpoint... This feature has now 3 patches, the 2 first patches doing only code refactoring. Could it be possible to have a look at those ones first? 
Straight-forward things should go first, simplifying the core feature evaluation. Regards, -- Michael
On 2013-09-27 05:41:26 +0900, Michael Paquier wrote: > In this case, doing a call to WaitForOldSnapshots after the swap phase > is enough. It was included in past versions of the patch but removed > in the last 2 versions. I don't think it is. I really, really suggest following the protocol used by index_drop down to the t and document every *slight* deviation carefully. We've had more than one bug in index_drop's concurrent feature. > Btw, taking the problem from another viewpoint... This feature has now > 3 patches, the 2 first patches doing only code refactoring. Could it > be possible to have a look at those ones first? Straight-forward > things should go first, simplifying the core feature evaluation. I haven't looked at them in detail, but they looked good on a quick pass. I'll make another pass, but that won't be before, say, Tuesday. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Michael Paquier escribió: > Btw, taking the problem from another viewpoint... This feature has now > 3 patches, the 2 first patches doing only code refactoring. Could it > be possible to have a look at those ones first? Straight-forward > things should go first, simplifying the core feature evaluation. I have pushed the first half of the first patch for now, revising it somewhat: I renamed the functions and put them in lmgr.c instead of procarray.c. I think the second half of that first patch (WaitForOldSnapshots) should be in index.c, not procarray.c either. I didn't look at the actual code in there. I already shipped Michael fixed versions of the remaining patches adjusting them to the changed API. I expect him to post them here. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Oct 2, 2013 at 6:06 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > I have pushed the first half of the first patch for now, revising it > somewhat: I renamed the functions and put them in lmgr.c instead of > procarray.c. Great, thanks. > I think the second half of that first patch (WaitForOldSnapshots) should > be in index.c, not procarray.c either. I didn't look at the actual code > in there. That's indexcmds.c in this case, not index.c. > I already shipped Michael fixed versions of the remaining patches > adjusting them to the changed API. I expect him to post them here. And here they are attached, with the following changes: - in 0002, WaitForOldSnapshots is renamed to WaitForOlderSnapshots. This sounds better... - in 0003, it looks like there was an error in obtaining the parent table Oid when calling index_concurrent_heap. I believe that the lock that needs to be taken for RangeVarGetRelid is not NoLock but ShareUpdateExclusiveLock, so I changed it this way. I also added some more comments at the top of each function for clarity. - in 0004, the patch is updated to reflect the API changes done in 0002 and 0003. Each patch applied with its parents compiles, has no warnings AFAIK and passes regression/isolation tests. Working on 0004 by the end of the CF seems out of reach IMO, so I'd suggest focusing on 0002 and 0003 now, and I can put some time into finalizing them for this CF. I think that we should perhaps split 0003 into 2 pieces, with one patch for the introduction of index_concurrent_build, and another for index_concurrent_set_dead. Comments are welcome about that though, and if people agree on that I'll do it once 0002 is finalized. Regards, -- Michael
Attachment
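The RangeVarGetRelid fix mentioned for 0003 amounts to resolving the parent relation's OID while taking the same lock level that the rest of the concurrent processing relies on, along these lines (variable names are illustrative):

    /*
     * Take ShareUpdateExclusiveLock while resolving the name so the parent
     * relation cannot be dropped or rewritten underneath us; NoLock would
     * leave a window for exactly that.
     */
    Oid     heapOid = RangeVarGetRelid(relation, ShareUpdateExclusiveLock, false);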
Marking this patch as "returned with feedback", as I will not be able to work on it by the 15th of October. It would have been great to get the infrastructure patches 0002 and 0003 committed to minimize the work on the core patch, but that is not the case. I am also attaching a patch fixing some comments of index_drop, as mentioned by Andres in another thread, so that it doesn't get lost in the flow. Thanks to all for the involvement. Regards, -- Michael
Attachment
On 2013-10-02 13:16:06 +0900, Michael Paquier wrote: > Each patch applied with its parents compiles, has no warnings AFAIK > and passes regression/isolation tests. Working on 0004 by the end of > the CF seems out of the way IMO, so I'd suggest focusing on 0002 and > 0003 now, and I can put some time to finalize them for this CF. I > think that we should perhaps split 0003 into 2 pieces, with one patch > for the introduction of index_concurrent_build, and another for > index_concurrent_set_dead. Comments are welcome about that though, and > if people agree on that I'll do it once 0002 is finalized. FWIW I don't think splitting off index_concurrent_build is worthwhile... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services