Thread: create index concurrently - duplicate index to reduce time without an index

Hi,

We want to reindex the database behind a production service without interrupting the service.

I had an idea for creating the index with a new name then dropping the existing index and renaming the new one - and it
seems to work and would reduce the time without an index to be minimal.  I tried:
psql -d ICAT -c 'create unique index concurrently tmp_idx_objt_access1
on R_OBJT_ACCESS (object_id,user_id);'
# would check if that worked before proceeding #
psql -d ICAT -c 'drop index idx_objt_access1;'
psql -d ICAT -c 'alter index tmp_idx_objt_access1 rename to idx_objt_access1;'
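For the "would check if that worked" step, one possible approach (a sketch; it assumes PostgreSQL 8.2 or later, where a failed CREATE INDEX CONCURRENTLY leaves behind an index marked invalid) is to inspect pg_index before dropping the old index:

```sql
-- Sketch: confirm the concurrently built index is valid before dropping
-- the old one.  A failed concurrent build leaves an invalid index behind,
-- which should be dropped and rebuilt rather than renamed into place.
SELECT c.relname, i.indisvalid
FROM pg_index i
JOIN pg_class c ON c.oid = i.indexrelid
WHERE c.relname = 'tmp_idx_objt_access1';
```

Only proceed with the drop and rename if indisvalid comes back true.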

But then a colleague pointed out that maybe the name of the index is not meaningful and this might not be any use.

Can any experts confirm the validity of this approach or shoot it down?

Thanks,

Gareth



2010/6/1 <Gareth.Williams@csiro.au>
Hi,

We want to reindex the database behind a production service without interrupting the service.

I had an idea for creating the index with a new name then dropping the existing index and renaming the new one - and it seems to work and would reduce the time without an index to be minimal.  I tried:
psql -d ICAT -c 'create unique index concurrently tmp_idx_objt_access1
on R_OBJT_ACCESS (object_id,user_id);'
# would check if that worked before proceeding #
psql -d ICAT -c 'drop index idx_objt_access1;'
psql -d ICAT -c 'alter index tmp_idx_objt_access1 rename to idx_objt_access1;'

But then a colleague pointed out that maybe the name of the index is not meaningful and this might not be any use.

Can any experts confirm the validity of this approach or shoot it down?


The index name is not used for planning query execution (the most important thing is the index definition), but of course it is important to have some logical naming convention, which usually helps in quickly understanding the database schema. Name the index as you want; the database really doesn't understand the names.

regards
Szymon Guz


2010/6/1 <Gareth.Williams@csiro.au>
> From: Szymon Guz [mailto:mabewlun@gmail.com]
-snip-
>> 2010/6/1 <Gareth.Williams@csiro.au>
>> Hi,

>> We want to reindex the database behind a production service without interrupting the service.

>> I had an idea for creating the index with a new name then dropping the existing index and renaming the new one - and it seems to work and would reduce the time without an index to be minimal.  I tried:
psql -d ICAT -c 'create unique index concurrently tmp_idx_objt_access1
on R_OBJT_ACCESS (object_id,user_id);'
# would check if that worked before proceeding #
psql -d ICAT -c 'drop index idx_objt_access1;'
psql -d ICAT -c 'alter index tmp_idx_objt_access1 rename to idx_objt_access1;'

>> But then a colleague pointed out that maybe the name of the index is not meaningful and this might not be any use.

>> Can any experts confirm the validity of this approach or shoot it down?

> The index name is not used for planning query execution (most important thing is the index definition), but of course it is important to have some logical name convention that usually helps in fast understanding database schema. Name the index as you want, database really doesn't understand the names.

Thanks Szymon,

I was afraid that was the case.  So the rest of the question is, if I have two indexes with identical definitions, what happens?  I've confirmed that I can create indexes with identical definitions (except name) without postgres complaining - and without breaking the client on my test system - but I am wary of trying it on my production system where there is much more data (8GB) and I care about its integrity so much more.

I've just tested further and if I create two indexes with identical definitions, my queries are fast, and I can delete either of them and the queries are still fast, but with both deleted the queries are slow. And yes, renaming makes no difference but is nice for understanding the purpose of the index.

So I still haven't seen this procedure break anything, but does anyone know a reason it might be inadvisable?


When you have two indexes with exactly the same definition, that's just a waste of resources. Only one of them will be used for speeding up selects, but all of them must be updated during insert/update/delete operations.

regards
Szymon Guz
 

Gareth.Williams@csiro.au wrote:
> So the rest of the question is, if I have two indexes with identical definitions, what happens?  I've confirmed that
I can create indexes with identical definitions (except name) without postgres complaining - and without breaking the
client on my test system - but I am wary of trying it on my production system where there is much more data (8GB) and I
care about its integrity so much more.
>

The database doesn't care one bit if you create a bunch of redundant
indexes.  So long as one of them is around to satisfy the queries that
need the index to run well, you're fine.

The main thing you can't do with the index concurrently/rename shuffle
you've discovered here is use that approach to concurrently rebuild an
index that enforces a constraint or unique index.  If your index is
enforcing a PRIMARY KEY for example, you'll discover a major roadblock
were you to try this same technique to rebuild it.  Those are tied into
the constraint portion of the system catalogs and manipulating them
isn't so easy.

Regular indexes that exist just to speed up queries, those you can
rename around as you've been figuring out without any downside.  From a
general paranoia perspective, you should run an explicit ANALYZE on the
underlying table after you finish the shuffle, just to make absolutely
sure the right statistics are available afterwards.
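Putting that advice together with the recipe from earlier in the thread, the full shuffle might look like the sketch below (table and index names are taken from the original example; each statement must be run separately, since CREATE INDEX CONCURRENTLY cannot run inside a transaction block):

```sql
CREATE UNIQUE INDEX CONCURRENTLY tmp_idx_objt_access1
    ON R_OBJT_ACCESS (object_id, user_id);
-- verify the build succeeded (e.g. check pg_index.indisvalid) before
-- touching the old index
DROP INDEX idx_objt_access1;
ALTER INDEX tmp_idx_objt_access1 RENAME TO idx_objt_access1;
-- refresh planner statistics on the underlying table
ANALYZE R_OBJT_ACCESS;
```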

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us


>    When you have two exactly the same index definitions, that's just a waste of resources. There will be used only
one of them for speeding up selects, but all of them must be updated during insert/update/delete operations.

Thanks Szymon,

I appreciate the info.  The duplication would only be for the time of the index build - which may be significant.  When
the new index is successfully completed we would drop the old one.

I'm concerned that just because I *can* create duplicate indexes, it doesn't mean I *should*.  I'm worried it might
actually break the service worse than the locking that I'm trying to avoid. Does anyone know which index will be used if
there are duplicate ones and whether there is likely to be a problem?

My understanding was that the index would not be updated for inserts to the actual table - which is why one has to
reindex or drop the index and create it again (with concurrently if you don't want to block reads).  Am I missing
something?


Perhaps I don't need to recreate the index at all - but I've been advised to reindex.

See:
http://groups.google.com/group/irod-chat/browse_thread/thread/0396d48ffecfb2b0#
"On a side note, for the irods performance, an other key factor is also
the iCAT database performances and in some of your tests below, it has
an impact. Recomputing the index on a regular basis is something
important: here, I am doing it once per week for each iCAT (cron task).
Even for a catalog with millions of files, if you have ingested tens of
thousands of new files, you will see an improvement when doing
reindexing (for example an ils performed in 0.3 s will goes down to less
than 0.1 s when reindexing has been made)."

Thanks,

Gareth


On 2 Jun 2010, at 6:00, <Gareth.Williams@csiro.au> <Gareth.Williams@csiro.au> wrote:
> My understanding was that the index would not be updated for inserts to the actual table - which is why one has to
reindex or drop the index and create it again (with concurrently if you don't want to block reads).  Am I missing
something?

That's definitely not true, indexes would be rather useless if that were the case. What doesn't get updated with
inserts are planner statistics, but you update those by executing ANALYSE (or VACUUM ANALYSE), or automatically by
autovacuum.

> Perhaps I don't need to recreate the index at all - but I've been advised to reindex.
>
> See:
> http://groups.google.com/group/irod-chat/browse_thread/thread/0396d48ffecfb2b0#
> "On a side note, for the irods performance, an other key factor is also
> the iCAT database performances and in some of your tests below, it has
> an impact. Recomputing the index on a regular basis is something
> important: here, I am doing it once per week for each iCAT (cron task).
> Even for a catalog with millions of files, if you have ingested tens of
> thousands of new files, you will see an improvement when doing
> reindexing (for example an ils performed in 0.3 s will goes down to less
> than 0.1 s when reindexing has been made)."


There are cases where reindexing shows a performance improvement over just analysing a table, but the above (only
inserts) shouldn't be one of them.

Alban Hertroys

--
Screwing up is an excellent way to attach something to the ceiling.





On Tue, Jun 1, 2010 at 10:00 PM,  <Gareth.Williams@csiro.au> wrote:
> I'm concerned that just because I *can* create duplicate indexes, it doesn't mean I *should*.
> I'm worried it might actually break the service worse than the locking that I'm trying to avoid.
> Does anyone know which index will be used if there are duplicate ones and whether there is
> likely to be a problem?

I've done this dozens of times on a production server, midday, max
load, with no real problems.  create index concurrently is one of the
greatest things anyone's ever done for pgsql.

Gareth.Williams@csiro.au wrote:
> I'm concerned that just because I *can* create duplicate indexes, it doesn't mean I *should*.  I'm worried it might
actually break the service worse than the locking that I'm trying to avoid. Does anyone know which index will be used if
there are duplicate ones and whether there is likely to be a problem?
>

This is a simple disk capacity question.  If you can run CREATE INDEX
CONCURRENTLY, and it doesn't effectively take your server down due to
the disks being overrun with data to process, it's perfectly fine to do
so.  Only in the case where the index creation itself causes what is
effectively client downtime, due to the increased load of the build,
could that approach be worse than the terrible locking that comes with
non-concurrent rebuild.

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us


Excerpts from Gareth.Williams's message of Tue Jun 01 02:44:35 -0400 2010:
> Hi,
>
> We want to reindex the database behind a production service without interrupting the service.
>
> I had an idea for creating the index with a new name then dropping the existing index and renaming the new one - and
it seems to work and would reduce the time without an index to be minimal.  I tried:
> psql -d ICAT -c 'create unique index concurrently tmp_idx_objt_access1
> on R_OBJT_ACCESS (object_id,user_id);'
> # would check if that worked before proceeding #
> psql -d ICAT -c 'drop index idx_objt_access1;'
> psql -d ICAT -c 'alter index tmp_idx_objt_access1 rename to idx_objt_access1;'

Note that you should wait until the new index becomes usable before
deleting the old one; otherwise you could have an intermediate period
during which you have no index.

IIRC a concurrently created index does not become usable until the
oldest transaction that was current when index creation started has
finished (IOW the transaction in pg_index.indcheckxmin has gone).
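One way to check for that condition (a sketch against the 8.3/8.4 catalogs; column names as documented there) is to look at indcheckxmin alongside indisvalid:

```sql
-- An index built concurrently may have indcheckxmin = true, meaning it
-- is not usable by queries until transactions older than the index row's
-- xmin have finished.
SELECT c.relname, i.indisvalid, i.indcheckxmin
FROM pg_index i
JOIN pg_class c ON c.oid = i.indexrelid
WHERE c.relname = 'tmp_idx_objt_access1';
```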

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

> -----Original Message-----
> From: Greg Smith [mailto:greg@2ndquadrant.com]
-snip-
>
> Gareth.Williams wrote:
> > So the rest of the question is, if I have two indexes with identical
> definitions, what happens?  I've confirmed that I can create indexes with
> identical definitions (except name) without postgres complaining - and
> without breaking the client on my test system - but I am wary of trying it
> on my production system where there is much more data (8GB) and I care
> about it's integrity so much more.
> >
>
> The database doesn't care one bit if you create a bunch of redundant
> indexes.  So long as one of them is around to satisfy the queries that
> need the index to run well, you're fine.
>
> The main thing you can't do with the index concurrently/rename shuffle
> you've discovered here is use that approach to concurrently rebuild an
> index that enforces a constraint or unique index.  If your index is
> enforcing a PRIMARY KEY for example, you'll discover a major roadblock
> were you to try this same technique to rebuild it.  Those are tied into
> the constraint portion of the system catalogs and manipulating them
> isn't so easy.
>
> Regular indexes that exist just to speed up queries, those you can
> rename around as you've been figuring out without any downside.  From a
> general paranoia perspective, you should run an explicit ANALYZE on the
> underlying table after you finish the shuffle, just to make absolutely
> sure the right statistics are available afterwards.
>
> --
> Greg Smith  2ndQuadrant US  Baltimore, MD
> PostgreSQL Training, Services and Support
> greg@2ndQuadrant.com   www.2ndQuadrant.us

Thanks Greg, Alban and others,

This has cleared up a misunderstanding I had about why one should reindex.  Re-reading the documentation
http://www.postgresql.org/docs/8.4/interactive/sql-reindex.html it is clear now that reindex or recreating an index
should not normally be needed - certainly not to keep an index up-to-date.  I would have guessed that VACUUM or VACUUM
ANALYSE on the table that the index is associated with would have been sufficient to reclaim space for a 'bloated' index
(maybe only VACUUM FULL would help).  In any case we can leave reindexing or full vacuum for outages where we are
interrupting service anyway.

I was heartened by the responses and tried further testing (if it could not hurt, why not try and see if it could be
faster), but ran into a problem.  A few times when I was trying to drop an index (before or after creating a duplicate
index with 'concurrently'), the dropping of the index stalled.  It seems that this was because of an existing connection:
postgres: rods ICAT 130.102.163.141(58061) idle in transaction
And new clients block.

Is that expected? Should we look for 'bugs' in our client that seems to be holding a connection?

For the application we have, I'm ready to give up on this train of investigation for optimization and just vacuum
analyse key tables regularly and vacuum and maybe reindex more completely during outages - though we aim for outages to
be infrequent.  The database holds data representing a virtual filesystem structure with millions of files (and
associated access controls, and information on underlying storage resources and replication).  There is probably not
much update or delete of the main data - at least compared with the total holdings and the new data/files which are
regularly being added to the system.

Thanks again,

Gareth

Ps. Greg, I don't understand the issue with 'concurrently rebuild (of) an index that enforces a constraint or unique
index'. I don't think I care much right at the moment, but I'm generally interested and others might be too. Would you
expect the create index to fail, or to cause locking, or just transient performance degradation?

> Thanks Greg, Alban and others,
>
> This has cleared up a misunderstanding I had about why one should reindex.  Re-reading the documentation
http://www.postgresql.org/docs/8.4/interactive/sql-reindex.htmlit is clear now that reindex or recreating and index
shouldnot normally be needed - certainly not to keep an index up-to-date.  I would have guessed that VACUUM or VACUUM
ANALYSEon the table that the index is associated would have been sufficient to reclaim space for a 'bloated' index
(maybeonly VACUUM FULL would help).  In any case we can leave reindexing or full vacuum for outages where we are
interruptingservice anyway. 

VACUUM FULL actually causes bloat to indexes. It rearranges the data in the tables so that any gaps get used, but while
doing that it also needs to update the indices related to those tables.

Normal VACUUM and VACUUM ANALYSE don't have this problem though, they just mark table space that's no longer in use
(the transaction that deleted the rows has committed) as reusable, so that later INSERTs can put their data in there. This
is a bit of a simplification of what's really going on - for the details check the documentation.

Autovacuum does VACUUM ANALYSE in the background, using multiple threads in recent versions. You can (and seeing your
use of the database you probably should) tune how often it vacuums tables through several configuration parameters.
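As a sketch of what that tuning can look like (per-table storage parameter syntax as of 8.4; the table name is from earlier in the thread and the values are illustrative, not recommendations):

```sql
-- Make autovacuum visit this table more eagerly than the defaults,
-- so statistics keep up with a steady stream of inserts.
ALTER TABLE R_OBJT_ACCESS SET (
    autovacuum_vacuum_scale_factor  = 0.05,
    autovacuum_analyze_scale_factor = 0.02
);
```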

Of course, running ANALYSE when you _know_ data in a table has changed significantly means you don't have to wait for
autovacuum to get around to analysing that table.

> I was heartened by the responses and tried further testing (if it could not hurt, why not try and see if it could be
faster), but ran into a problem.  A few times when I was trying to drop an index (before or after creating a duplicate
index with 'concurrently'), the dropping of the index stalled.  It seems that this was because of an existing connection:
> postgres: rods ICAT 130.102.163.141(58061) idle in transaction
> And new clients block.
>
> Is that expected? Should we look for 'bugs' in our client that seems to be holding a connection?

I'm not exactly sure why that is (I can't look into your database), but my guess is that the index is locked by a
transaction. Apparently the transaction you refer to has uncommitted work that depends on the index at some point.

Keeping transactions open for a long time is usually a bad idea.

You saw that you can't drop an index in use by a transaction for example, but autovacuum is running into similar issues
- it can't reclaim space until the transaction finishes, as the transaction locks things that autovacuum will want to
touch.
That probably means (I'm not sure it works that way, but it seems likely) that the autovacuum thread gets stuck at a
lock and can't continue until the transaction holding the lock frees it.
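To find the session holding things up, a query along these lines may help (a sketch for the 8.x pg_stat_activity view, where the columns were still called procpid and current_query):

```sql
-- List sessions sitting idle inside an open transaction; these are the
-- usual suspects when a DROP INDEX blocks waiting for a lock.
SELECT procpid, usename, query_start, current_query
FROM pg_stat_activity
WHERE current_query = '<IDLE> in transaction';
```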


> For the application we have, I'm ready to give up on this train of investigation for optimization and just vacuum
analyse key tables regularly and vacuum and maybe reindex more completely during outages - though we aim for outages to
be infrequent.  The database holds data representing a virtual filesystem structure with millions of files (and
associated access controls, and information on underlying storage resources and replication).  There is probably not
much update or delete of the main data - at least compared with the total holdings and the new data/files which are
regularly being added to the system.

In practice VACUUM FULL and REINDEX are used to reclaim disk space. That of itself doesn't look much like it'd improve
performance, but using less disk space also means that data gets more tightly packed in your disk cache, for example.
REINDEX can mean an index that didn't fit into RAM now does. They're both rather intrusive operations though, so it's a
matter of balancing the costs and benefits. Many databases don't need to bother with VACUUM FULL or REINDEX.

> Ps. Greg, I don't understand the issue with 'concurrently rebuild (of) an index that enforces a constraint or unique
index'. I don't think I care much right at the moment, but I'm generally interested and others might be too. Would you
expect the create index to fail or to cause locking or just transient performance degradation?


I think what Greg was getting at is that there's a dependency tree between indexes and constraints: A primary key is
implemented using a unique index. You can create a new (unique) index on the same columns concurrently, but you can't
replace the primary key index with it as you're not allowed to drop the index without dropping the PK constraint. If you
have any FK constraints pointing to that table, you can't drop the PK constraint without also dropping the FK
constraints.

Quite a bit of trouble to go through to replace one index.
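To make the dependency concrete, here is a small sketch (hypothetical table names) of where the rename shuffle runs aground on a primary key:

```sql
CREATE TABLE parent (id integer PRIMARY KEY);   -- creates index parent_pkey
CREATE TABLE child  (parent_id integer REFERENCES parent (id));

CREATE UNIQUE INDEX CONCURRENTLY parent_id_new ON parent (id);  -- builds fine
DROP INDEX parent_pkey;
-- fails along the lines of:
-- ERROR: cannot drop index parent_pkey because constraint parent_pkey
--        on table parent requires it
```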

Alban Hertroys

--
If you can't see the forest for the trees,
cut the trees and you'll see there is no forest.





Gareth.Williams@csiro.au wrote:
> Re-reading the documentation http://www.postgresql.org/docs/8.4/interactive/sql-reindex.html it is clear now that
reindex or recreating an index should not normally be needed - certainly not to keep an index up-to-date.  I would have
guessed that VACUUM or VACUUM ANALYSE on the table that the index is associated with would have been sufficient to reclaim
space for a 'bloated' index (maybe only VACUUM FULL would help).  In any case we can leave reindexing or full vacuum for
outages where we are interrupting service anyway.
>

It is a periodic preventative maintenance operation you can expect to
need occasionally, but certainly not often.  Indexes maintain themselves
just fine most of the time.  They can get out of whack if you delete a
lot of data out of them and there are some use patterns that tend to
aggravate the problems here (like tables where you're always inserting
new data and deleting old), but it's certainly not something you run all
the time.

You should read http://wiki.postgresql.org/wiki/VACUUM_FULL to clear up
when it is needed and what the better alternatives are.


> A few times when I was trying to drop an index (before or after creating a duplicate index with 'concurrently'), the
dropping of the index stalled.  It seems that this was because of an existing connection:
> postgres: rods ICAT 130.102.163.141(58061) idle in transaction
> And new clients block.  Is that expected? Should we look for 'bugs' in our client that seems to be holding a
connection?
>

You do need to be careful that there are no clients connected when you
try this, or yes this is expected behavior.  One popular technique is to
put some sort of "block access to the database" switch in the
application itself, specifically to support small outages while keeping
the app from going crazy.  You can flip that for a few second around
when you're doing the index switch.


> Ps. Greg, I don't understand the issue with 'concurrently rebuild (of) an index that enforces a constraint or unique
index'. I don't think I care much right at the moment, but I'm generally interested and others might be too. Would you
expect the create index to fail or to cause locking or just transient performance degradation?
>

The description Alban wrote covers what I was alluding to.  You just
can't drop an index that supports a constraint, and that has some
(bad) implications for how you can rebuild it.

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us


Excerpts from Greg Smith's message of Mon Jun 07 12:23:44 -0400 2010:

> It is a periodic preventative maintenance operation you can expect to
> need occasionally, but certainly not often.  Indexes maintain themselves
> just fine most of the time.  They can get out of whack if you delete a
> lot of data out of them and there are some use patterns that tend to
> aggravate the problems here (like tables where you're always inserting
> new data and deleting old), but it's certainly not something you run all
> the time.

Indexes on which you always insert new data and delete old can keep
themselves in good shape too.  The really problematic cases are those in
which you delete most, but not all, of the old data.  Those
cases would result in almost-empty pages that can never be recycled (we
do not have btree page merging).

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support