Thread: create index concurrently - duplicate index to reduce time without an index
Hi,

We want to reindex the database behind a production service without interrupting the service.

I had an idea for creating the index with a new name then dropping the existing index and renaming the new one - and it seems to work and would reduce the time without an index to be minimal. I tried:

psql -d ICAT -c 'create unique index concurrently tmp_idx_objt_access1
on R_OBJT_ACCESS (object_id,user_id);'
# would check if that worked before proceeding #
psql -d ICAT -c 'drop index idx_objt_access1;'
psql -d ICAT -c 'alter index tmp_idx_objt_access1 rename to idx_objt_access1;'

But then a colleague pointed out that maybe the name of the index is not meaningful and this might not be any use.

Can any experts confirm the validity of this approach or shoot it down?

Thanks,
Gareth
Re: create index concurrently - duplicate index to reduce time without an index
From
Szymon Guz
Date:
2010/6/1 <Gareth.Williams@csiro.au>
Hi,
We want to reindex the database behind a production service without interrupting the service.
I had an idea for creating the index with a new name then dropping the existing index and renaming the new one - and it seems to work and would reduce the time without an index to be minimal. I tried:
psql -d ICAT -c 'create unique index concurrently tmp_idx_objt_access1
on R_OBJT_ACCESS (object_id,user_id);'
# would check if that worked before proceeding #
psql -d ICAT -c 'drop index idx_objt_access1;'
psql -d ICAT -c 'alter index tmp_idx_objt_access1 rename to idx_objt_access1;'
But then a colleague pointed out that maybe the name of the index is not meaningful and this might not be any use.
Can any experts confirm the validity of this approach or shoot it down?
The index name is not used when planning query execution (what matters is the index definition), but of course it is useful to have a logical naming convention, which usually helps in quickly understanding the database schema. Name the index whatever you want; the database really doesn't care about the names.
regards
Szymon Guz
Re: create index concurrently - duplicate index to reduce time without an index
From
Szymon Guz
Date:
2010/6/1 <Gareth.Williams@csiro.au>
> From: Szymon Guz [mailto:mabewlun@gmail.com]
-snip-

Thanks Szymon,

>> 2010/6/1 <Gareth.Williams@csiro.au>
>> Hi,
>> We want to reindex the database behind a production service without interrupting the service.
>> I had an idea for creating the index with a new name then dropping the existing index and renaming the new one - and it seems to work and would reduce the time without an index to be minimal. I tried:
psql -d ICAT -c 'create unique index concurrently tmp_idx_objt_access1
on R_OBJT_ACCESS (object_id,user_id);'
# would check if that worked before proceeding #
psql -d ICAT -c 'drop index idx_objt_access1;'
psql -d ICAT -c 'alter index tmp_idx_objt_access1 rename to idx_objt_access1;'
>> But then a colleague pointed out that maybe the name of the index is not meaningful and this might not be any use.
>> Can any experts confirm the validity of this approach or shoot it down?
> The index name is not used when planning query execution (what matters is the index definition), but of course it is useful to have a logical naming convention, which usually helps in quickly understanding the database schema. Name the index whatever you want; the database really doesn't care about the names.
I was afraid that was the case. So the rest of the question is: if I have two indexes with identical definitions, what happens? I've confirmed that I can create indexes with identical definitions (except the name) without postgres complaining - and without breaking the client on my test system - but I am wary of trying it on my production system, where there is much more data (8GB) and I care about its integrity so much more.
I've just tested further and if I create two indexes with identical definitions, my queries are fast, and I can delete either of them and the queries are still fast, but with both deleted the queries are slow. And yes, renaming makes no difference but is nice for understanding the purpose of the index.
So I still haven't seen this procedure break anything, but does anyone know a reason it might be inadvisable?
When you have two indexes with exactly the same definition, that's just a waste of resources. Only one of them will be used to speed up selects, but all of them must be updated during insert/update/delete operations.
regards
Szymon Guz
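Szymon's point about redundant indexes suggests a sanity check after a shuffle like this. A sketch, in the thread's psql style, of a catalog query that lists indexes whose definitions are identical apart from the name (the ICAT database name comes from the thread; the query itself uses only the standard pg_indexes view):

```shell
# Group indexes by their definition with the index name stripped out;
# any group with more than one member is a redundant duplicate.
psql -d ICAT -c "
SELECT schemaname, tablename,
       array_agg(indexname) AS duplicate_indexes
FROM pg_indexes
GROUP BY schemaname, tablename, replace(indexdef, indexname, '')
HAVING count(*) > 1;"
```

Any index names this reports beyond the one you intend to keep can be dropped to reclaim the wasted write overhead Szymon describes.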
Re: create index concurrently - duplicate index to reduce time without an index
From
Greg Smith
Date:
Gareth.Williams@csiro.au wrote:
> So the rest of the question is, if I have two indexes with identical definitions, what happens? I've confirmed that I can create indexes with identical definitions (except name) without postgres complaining - and without breaking the client on my test system - but I am wary of trying it on my production system where there is much more data (8GB) and I care about its integrity so much more.

The database doesn't care one bit if you create a bunch of redundant indexes. So long as one of them is around to satisfy the queries that need the index to run well, you're fine.

The main thing you can't do with the index concurrently/rename shuffle you've discovered here is use that approach to concurrently rebuild an index that enforces a constraint or unique index. If your index is enforcing a PRIMARY KEY, for example, you'll discover a major roadblock were you to try this same technique to rebuild it. Those are tied into the constraint portion of the system catalogs, and manipulating them isn't so easy. Regular indexes that exist just to speed up queries, those you can rename around as you've been figuring out, without any downside.

From a general paranoia perspective, you should run an explicit ANALYZE on the underlying table after you finish the shuffle, just to make absolutely sure the right statistics are available afterwards.

--
Greg Smith 2ndQuadrant US Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com www.2ndQuadrant.us
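For readers on later versions: PostgreSQL 9.1 (which postdates this thread) added a way around the roadblock Greg describes, by attaching a freshly built unique index to an existing constraint. A sketch, assuming a primary key named objt_access_pkey (a hypothetical name; the table and columns follow the thread's example, and any FK constraints referencing the table would still block the DROP):

```shell
# Build the replacement index without blocking writes.
psql -d ICAT -c 'CREATE UNIQUE INDEX CONCURRENTLY tmp_pk_objt_access
ON R_OBJT_ACCESS (object_id, user_id);'

# Swap it in under the constraint in a single transaction; the new index
# is renamed to the constraint name automatically.
psql -d ICAT -c 'BEGIN;
ALTER TABLE R_OBJT_ACCESS DROP CONSTRAINT objt_access_pkey;
ALTER TABLE R_OBJT_ACCESS ADD CONSTRAINT objt_access_pkey
  PRIMARY KEY USING INDEX tmp_pk_objt_access;
COMMIT;'
```

PostgreSQL 12 and later make even this unnecessary: REINDEX INDEX CONCURRENTLY rebuilds constraint-backed indexes in place.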
> When you have two exactly the same index definitions, that's just a waste of resources. There will be used only one of them for speeding up selects, but all of them must be updated during insert/update/delete operations.

Thanks Szymon, I appreciate the info.

The duplication would only be for the time of the index build - which may be significant. When the new index was successfully complete we would drop the old one.

I'm concerned that just because I *can* create duplicate indexes, it doesn't mean I *should*. I'm worried it might actually break the service worse than the locking that I'm trying to avoid. Does anyone know which index will be used if there are duplicate ones and whether there is likely to be a problem?

My understanding was that the index would not be updated for inserts to the actual table - which is why one has to reindex or drop the index and create it again (with concurrently if you don't want to block reads). Am I missing something?

Perhaps I don't need to recreate the index at all - but I've been advised to reindex. See:
http://groups.google.com/group/irod-chat/browse_thread/thread/0396d48ffecfb2b0#
"On a side note, for the irods performance, an other key factor is also the iCAT database performances and in some of your tests below, it has an impact. Recomputing the index on a regular basis is something important: here, I am doing it once per week for each iCAT (cron task). Even for a catalog with millions of files, if you have ingested tens of thousands of new files, you will see an improvement when doing reindexing (for example an ils performed in 0.3 s will goes down to less than 0.1 s when reindexing has been made)."

Thanks,
Gareth
Re: create index concurrently - duplicate index to reduce time without an index
From
Alban Hertroys
Date:
On 2 Jun 2010, at 6:00, <Gareth.Williams@csiro.au> wrote:
> My understanding was that the index would not be updated for inserts to the actual table - which is why one has to reindex or drop the index and create it again (with concurrently if you don't want to block reads). Am I missing something?

That's definitely not true; indexes would be rather useless if that were the case. What doesn't get updated with inserts are the planner statistics, but you update those by executing ANALYSE (or VACUUM ANALYSE), or automatically by autovacuum.

> Perhaps I don't need to recreate the index at all - but I've been advised to reindex.
>
> See:
> http://groups.google.com/group/irod-chat/browse_thread/thread/0396d48ffecfb2b0#
> "On a side note, for the irods performance, an other key factor is also
> the iCAT database performances and in some of your tests below, it has
> an impact. Recomputing the index on a regular basis is something
> important: here, I am doing it once per week for each iCAT (cron task).
> Even for a catalog with millions of files, if you have ingested tens of
> thousands of new files, you will see an improvement when doing
> reindexing (for example an ils performed in 0.3 s will goes down to less
> than 0.1 s when reindexing has been made)."

There are cases where reindexing shows a performance improvement over just analysing a table, but the above (only inserts) shouldn't be one of them.

Alban Hertroys

--
Screwing up is an excellent way to attach something to the ceiling.
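Alban's point in the thread's psql style: the planner statistics can be refreshed (and dead row space reclaimed) without rebuilding any index. A sketch; the database and table names follow the thread's example:

```shell
# Reclaim dead-row space and refresh planner statistics for the table;
# the indexes themselves are left alone.
psql -d ICAT -c 'VACUUM ANALYZE R_OBJT_ACCESS;'
```

This is the lightweight routine maintenance step, as opposed to REINDEX or VACUUM FULL, which lock much more aggressively.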
Re: create index concurrently - duplicate index to reduce time without an index
From
Scott Marlowe
Date:
On Tue, Jun 1, 2010 at 10:00 PM, <Gareth.Williams@csiro.au> wrote:
> I'm concerned that just because I *can* create duplicate indexes, it doesn't mean I *should*.
> I'm worried it might actually break the service worse than the locking that I'm trying to avoid.
> Does anyone know which index will be used if there are duplicate ones and whether there is
> likely to be a problem?

I've done this dozens of times on a production server, midday, max load, with no real problems. create index concurrently is one of the greatest things anyone's ever done for pgsql.
Re: create index concurrently - duplicate index to reduce time without an index
From
Greg Smith
Date:
Gareth.Williams@csiro.au wrote:
> I'm concerned that just because I *can* create duplicate indexes, it doesn't mean I *should*. I'm worried it might actually break the service worse than the locking that I'm trying to avoid. Does anyone know which index will be used if there are duplicate ones and whether there is likely to be a problem?

This is a simple disk capacity question. If you can run CREATE INDEX CONCURRENTLY, and it doesn't effectively take your server down due to the disks being overrun with data to process, it's perfectly fine to do so. Only in the case where the index creation itself causes what is effectively client downtime, due to the increased load of the build, could that approach be worse than the terrible locking that comes with a non-concurrent rebuild.

--
Greg Smith 2ndQuadrant US Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com www.2ndQuadrant.us
Re: create index concurrently - duplicate index to reduce time without an index
From
Alvaro Herrera
Date:
Excerpts from Gareth.Williams's message of mar jun 01 02:44:35 -0400 2010:
> Hi,
>
> We want to reindex the database behind a production service without interrupting the service.
>
> I had an idea for creating the index with a new name then dropping the existing index and renaming the new one - and it seems to work and would reduce the time without an index to be minimal. I tried:
> psql -d ICAT -c 'create unique index concurrently tmp_idx_objt_access1
> on R_OBJT_ACCESS (object_id,user_id);'
> # would check if that worked before proceeding #
> psql -d ICAT -c 'drop index idx_objt_access1;'
> psql -d ICAT -c 'alter index tmp_idx_objt_access1 rename to idx_objt_access1;'

Note that you should wait until the new index becomes usable before deleting the old one; otherwise you could have an intermediate period during which you have no index. IIRC a concurrently created index does not become usable until the oldest transaction that was current when index creation started has finished (IOW the transaction in pg_index.indcheckxmin has gone).

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
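Alvaro's caveat can be checked from the catalog before dropping the old index: the pg_index row for a concurrently built index records whether it is ready for writes and valid for queries. A sketch, using the temporary index name from the thread's example:

```shell
# Only proceed with the drop/rename once indisready and indisvalid are both true.
psql -d ICAT -c "
SELECT indexrelid::regclass AS index_name, indisready, indisvalid
FROM pg_index
WHERE indexrelid = 'tmp_idx_objt_access1'::regclass;"
```

If CREATE INDEX CONCURRENTLY fails partway through, it leaves the index behind marked invalid; such an index must be dropped and rebuilt, which is exactly the "check if that worked before proceeding" step in Gareth's script.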
> -----Original Message-----
> From: Greg Smith [mailto:greg@2ndquadrant.com]
-snip-
> Gareth.Williams wrote:
> > So the rest of the question is, if I have two indexes with identical
> > definitions, what happens? I've confirmed that I can create indexes with
> > identical definitions (except name) without postgres complaining - and
> > without breaking the client on my test system - but I am wary of trying it
> > on my production system where there is much more data (8GB) and I care
> > about its integrity so much more.
>
> The database doesn't care one bit if you create a bunch of redundant
> indexes. So long as one of them is around to satisfy the queries that
> need the index to run well, you're fine.
>
> The main thing you can't do with the index concurrently/rename shuffle
> you've discovered here is use that approach to concurrently rebuild an
> index that enforces a constraint or unique index. If your index is
> enforcing a PRIMARY KEY for example, you'll discover a major roadblock
> were you to try this same technique to rebuild it. Those are tied into
> the constraint portion of the system catalogs and manipulating them
> isn't so easy.
>
> Regular indexes that exist just to speed up queries, those you can
> rename around as you've been figuring out without any downside. From a
> general paranoia perspective, you should run an explicit ANALYZE on the
> underlying table after you finish the shuffle, just to make absolutely
> sure the right statistics are available afterwards.

Thanks Greg, Alban and others,

This has cleared up a misunderstanding I had about why one should reindex. Re-reading the documentation http://www.postgresql.org/docs/8.4/interactive/sql-reindex.html it is clear now that reindex or recreating an index should not normally be needed - certainly not to keep an index up-to-date.
I would have guessed that VACUUM or VACUUM ANALYSE on the table that the index is associated with would have been sufficient to reclaim space for a 'bloated' index (maybe only VACUUM FULL would help). In any case we can leave reindexing or full vacuum for outages where we are interrupting service anyway.

I was heartened by the responses and tried further testing (if it could not hurt, why not try and see if it could be faster), but ran into a problem. A few times when I was trying to drop an index (before or after creating a duplicate index with 'concurrently'), the dropping of the index stalled. It seems that this was because of an existing connection:

postgres: rods ICAT 130.102.163.141(58061) idle in transaction

And new clients block. Is that expected? Should we look for 'bugs' in our client that seems to be holding a connection?

For the application we have, I'm ready to give up on this train of investigation for optimization and just vacuum analyse key tables regularly, and vacuum and maybe reindex more completely during outages - though we aim for outages to be infrequent. The database holds data representing a virtual filesystem structure with millions of files (and associated access controls, and information on underlying storage resources and replication). There is probably not much update or delete of the main data - at least compared with the total holdings and the new data/files which are regularly being added to the system.

Thanks again,
Gareth

Ps. Greg, I don't understand the issue with 'concurrently rebuild (of) an index that enforces a constraint or unique index'. I don't think I care much right at the moment, but I'm generally interested and others might be too. Would you expect the create index to fail, or to cause locking, or just transient performance degradation?
Re: create index concurrently - duplicate index to reduce time without an index
From
Alban Hertroys
Date:
> Thanks Greg, Alban and others,
>
> This has cleared up a misunderstanding I had about why one should reindex. Re-reading the documentation http://www.postgresql.org/docs/8.4/interactive/sql-reindex.html it is clear now that reindex or recreating an index should not normally be needed - certainly not to keep an index up-to-date. I would have guessed that VACUUM or VACUUM ANALYSE on the table that the index is associated with would have been sufficient to reclaim space for a 'bloated' index (maybe only VACUUM FULL would help). In any case we can leave reindexing or full vacuum for outages where we are interrupting service anyway.

VACUUM FULL actually causes bloat to indexes. It rearranges the data in the tables so that any gaps get used, but while doing that it also needs to update the indices related to those tables.

Normal VACUUM and VACUUM ANALYSE don't have this problem though; they just mark table space that's no longer in use (the transaction that deleted rows has committed) as reusable, so that later INSERTs can put their data in there. This is a bit of a simplification of what's really going on - for the details check the documentation.

Autovacuum does VACUUM ANALYSE in the background, using multiple threads in recent versions. You can (and seeing your use of the database you probably should) tune how often it vacuums tables through several configuration parameters. Of course, running ANALYSE when you _know_ data in a table has changed significantly means you don't have to wait for autovacuum to get around to analysing that table.

> I was heartened by the responses and tried further testing (if it could not hurt, why not try and see if it could be faster), but ran into a problem. A few times when I was trying to drop an index (before or after creating a duplicate index with 'concurrently'), the dropping of the index stalled. It seems that this was because of an existing connection:
> postgres: rods ICAT 130.102.163.141(58061) idle in transaction
> And new clients block.
> Is that expected? Should we look for 'bugs' in our client that seems to be holding a connection?

I'm not exactly sure why that is (I can't look into your database), but my guess is that the index is locked by a transaction. Apparently the transaction you refer to has uncommitted work that depends on the index at some point.

Keeping transactions open for a long time is usually a bad idea. You saw that you can't drop an index in use by a transaction, for example, but autovacuum runs into similar issues - it can't reclaim space until the transaction finishes, as the transaction locks things that autovacuum will want to touch. That probably means (I'm not sure it works that way, but it seems likely) that the autovacuum thread gets stuck at a lock and can't continue until the transaction holding the lock frees it.

> For the application we have, I'm ready to give up on this train of investigation for optimization and just vacuum analyse key tables regularly, and vacuum and maybe reindex more completely during outages - though we aim for outages to be infrequent. The database holds data representing a virtual filesystem structure with millions of files (and associated access controls, and information on underlying storage resources and replication). There is probably not much update or delete of the main data - at least compared with the total holdings and the new data/files which are regularly being added to the system.

In practice VACUUM FULL and REINDEX are used to reclaim disk space. That of itself doesn't look much like it'd improve performance, but using less disk space also means that data gets more tightly packed in your disk cache, for example. REINDEX can mean an index that didn't fit into RAM now does. They're both rather intrusive operations though, so it's a matter of balancing the costs and benefits. Many databases don't need to bother with VACUUM FULL or REINDEX.

> Ps.
> Greg, I don't understand the issue with 'concurrently rebuild (of) an index that enforces a constraint or unique index'. I don't think I care much right at the moment, but I'm generally interested and others might be too. Would you expect the create index to fail or to cause locking or just transient performance degradation?

I think what Greg was getting at is that there's a dependency tree between indexes and constraints: a primary key is implemented using a unique index. You can create a new (unique) index on the same columns concurrently, but you can't replace the primary key index with it, as you're not allowed to drop the index without dropping the PK constraint. If you have any FK constraints pointing to that table, you can't drop the PK constraint without also dropping the FK constraints. Quite a bit of trouble to go through to replace one index.

Alban Hertroys

--
If you can't see the forest for the trees, cut the trees and you'll see there is no forest.
Re: create index concurrently - duplicate index to reduce time without an index
From
Greg Smith
Date:
Gareth.Williams@csiro.au wrote:
> Re-reading the documentation http://www.postgresql.org/docs/8.4/interactive/sql-reindex.html it is clear now that reindex or recreating an index should not normally be needed - certainly not to keep an index up-to-date. I would have guessed that VACUUM or VACUUM ANALYSE on the table that the index is associated with would have been sufficient to reclaim space for a 'bloated' index (maybe only VACUUM FULL would help). In any case we can leave reindexing or full vacuum for outages where we are interrupting service anyway.

It is a periodic preventative maintenance operation you can expect to need occasionally, but certainly not often. Indexes maintain themselves just fine most of the time. They can get out of whack if you delete a lot of data out of them, and there are some use patterns that tend to aggravate the problems here (like tables where you're always inserting new data and deleting old), but it's certainly not something you run all the time. You should read http://wiki.postgresql.org/wiki/VACUUM_FULL to clear up when it is needed and what the better alternatives are.

> A few times when I was trying to drop an index (before or after creating a duplicate index with 'concurrently'), the dropping of the index stalled. It seems that this was because of an existing connection:
> postgres: rods ICAT 130.102.163.141(58061) idle in transaction
> And new clients block. Is that expected? Should we look for 'bugs' in our client that seems to be holding a connection?

You do need to be careful that there are no clients connected when you try this, or yes, this is expected behavior. One popular technique is to put some sort of "block access to the database" switch in the application itself, specifically to support small outages while keeping the app from going crazy. You can flip that for a few seconds around when you're doing the index switch.

> Ps.
> Greg, I don't understand the issue with 'concurrently rebuild (of) an index that enforces a constraint or unique index'. I don't think I care much right at the moment, but I'm generally interested and others might be too. Would you expect the create index to fail or to cause locking or just transient performance degradation?

The description Alban wrote covers what I was alluding to. You just can't drop an index that supports a constraint, and that has some (bad) implications for how you can rebuild it.

--
Greg Smith 2ndQuadrant US Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com www.2ndQuadrant.us
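The blocked DROP INDEX that Gareth ran into can be diagnosed from the server side before attempting the switch. A sketch that lists idle-in-transaction sessions; the state and pid columns are as named in current PostgreSQL releases (servers of this thread's era exposed current_query and procpid instead):

```shell
# Find sessions sitting idle inside an open transaction - these hold locks
# that will block DROP INDEX and queue every later client behind it.
psql -d ICAT -c "
SELECT pid, usename, client_addr, xact_start
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY xact_start;"

# If such a session is known to be safe to kill (pid 12345 is an example):
# psql -d ICAT -c 'SELECT pg_terminate_backend(12345);'
```

Checking this immediately before the drop/rename step keeps the "block access to the database" window Greg describes as short as possible.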
Re: create index concurrently - duplicate index to reduce time without an index
From
Alvaro Herrera
Date:
Excerpts from Greg Smith's message of lun jun 07 12:23:44 -0400 2010:
> It is a periodic preventative maintenance operation you can expect to
> need occasionally, but certainly not often. Indexes maintain themselves
> just fine most of the time. They can get out of whack if you delete a
> lot of data out of them and there are some use patterns that tend to
> aggravate the problems here (like tables where you're always inserting
> new data and deleting old), but it's certainly not something you run all
> the time.

Indexes on which you always insert new data and delete old can keep themselves in good shape too. The really problematic cases are those in which you insert new data and delete most, but not all, old data. Those cases would result in almost empty pages that can never be recycled (we do not have btree page merging).

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support