Thread: Invalid indexes should not consume update overhead
When an index is built concurrently and the build fails, the index is left in an invalid state. It's basically unusable for anything, but, according to the documentation:

> If a problem arises while scanning the table, such as a deadlock or
> a uniqueness violation in a unique index, the CREATE INDEX command
> will fail but leave behind an "invalid" index. This index will be
> ignored for querying purposes because it might be incomplete; however
> it *will* *still* *consume* *update* *overhead*. The psql \d command
> will report such an index as INVALID

I think this update overhead is actually wasted - there's no way to make use of it, as the only way to make the index usable again is to reindex it, or to drop and recreate it.

On the other hand, if invalid indexes had no update overhead, they could actually be useful. Please consider the following situation (taken from the real world):

- I have a very large table with a large number of indexes, on a server with a large number of CPUs.
- I need to add a new not null column, and I need to do this in a fairly small maintenance window.
- I tried to simply "alter table tablename add column columnname int not null default 0", but it did not finish in 24 hours - it updates these tens of indexes in a single process, using 100% of a single CPU - unacceptable.
- I mark all the indexes as invalid (currently I'm just saving and dropping them).
- This time adding the column takes an hour.
- I vacuum full or cluster the table, as it has by now bloated to at least 200% - this is also fast, as the indexes are not updated.
- I reindex (currently: recreate) all the indexes, but instead of calculating them on a single CPU I use all available CPUs (I can have like 40 of them on Amazon AWS RDS cheaply if I need them only for a few hours). This also finishes in an hour.
- The world is saved and everybody celebrates.
Dropping and recreating indexes is not a very safe operation - their definitions need to be saved somewhere outside the database, and if anything goes wrong it would not be easy to restore them. If it were possible to just set them invalid (feature request in passing), and if invalid indexes added no cost on updates, this would be much easier and safer.

-- 
Tomasz "Tometzky" Ostrowski
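The save/drop/recreate workaround described above can be sketched in SQL. Table, column, and index names here are hypothetical, and in practice the final recreate loop would be split across several sessions (one per CPU) rather than run sequentially as shown:

```sql
-- Save the definitions of all indexes on the table inside the database
-- itself, so they survive a crash of the client running the script.
CREATE TABLE saved_indexdefs AS
    SELECT indexname, indexdef
    FROM pg_indexes
    WHERE schemaname = 'public' AND tablename = 'bigtable';

-- Drop the indexes (today's workaround for "mark invalid").
DO $$
DECLARE r record;
BEGIN
    FOR r IN SELECT indexname FROM saved_indexdefs LOOP
        EXECUTE format('DROP INDEX %I', r.indexname);
    END LOOP;
END $$;

-- The expensive rewrite is now fast, since no indexes are maintained.
ALTER TABLE bigtable ADD COLUMN newcol int NOT NULL DEFAULT 0;
VACUUM FULL bigtable;

-- Recreate the indexes from the saved definitions (each statement
-- would ideally run in its own session, one per CPU).
DO $$
DECLARE r record;
BEGIN
    FOR r IN SELECT indexdef FROM saved_indexdefs LOOP
        EXECUTE r.indexdef;
    END LOOP;
END $$;

DROP TABLE saved_indexdefs;
```

Keeping the saved definitions in a table rather than a dump file avoids one of the failure modes discussed later in the thread: the definitions cannot get lost if the client machine crashes mid-procedure.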
On Wed, Jul 13, 2016 at 4:40 PM, Tomasz Ostrowski <tometzky+pg@ato.waw.pl> wrote:
> When an index is built concurrently and the build fails, the index is
> left in an invalid state. It's basically unusable for anything [...]
>
> I think this update overhead is actually wasted - there's no way to make
> use of it, as the only way to make the index usable again is to reindex
> it or drop and recreate.
[...]
> Dropping and recreating indexes is not a very safe operation - their
> definitions need to be saved somewhere outside the database and if
> anything goes wrong it would not be easy to restore them. If it were
> possible to just set them invalid (feature request in passing) and if
> invalid indexes added no cost on updates, then this would be much
> easier and safer.

How can you consider marking an invalid index as valid again without reindexing it? It is quite possible that in the meantime the table has been updated.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
I can't disagree with your conclusion, but I can offer a bit of perspective on how the current situation came about.

Invalid indexes are in the same state they're in while a concurrent index build is in progress. As far as other queries are concerned, they're effectively assuming the index build is still in progress and will still eventually be completed. They could maybe determine that's not the case, but then that would be an extra check for them to do in the normal case, so not necessarily a win.

The real solution, imho, is to actually clean up failed index builds when a build fails. That's what normal transactions do when they abort, after all. This was always the intention, but it looked like it was going to be a pain and was put off (i.e. I was lazy). It's probably just several layers of PG_TRY/PG_CATCH and closing the failed transactions and opening new ones.
Greg Stark <stark@mit.edu> writes:
> The real solution imho is to actually clean up failed index builds when a
> build fails. That's what normal transactions do when they abort after all.
> This was always the intention but looked like it was going to be a pain and
> was put off (ie I was lazy). It's probably just several layers of
> PG_TRY/PG_CATCH and closing the failed transactions and opening new ones.

No, that wouldn't fix it if the reason the build failed was a crash.

			regards, tom lane
On Sat, Jul 16, 2016 at 5:58 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Greg Stark <stark@mit.edu> writes:
>> The real solution imho is to actually clean up failed index builds when a
>> build fails. [...]
>
> No, that wouldn't fix it if the reason the build failed was a crash.

Could we just have crash recovery perform clean-up to cover that case?

-- 
Peter Geoghegan
On 2016-07-17 02:09, Greg Stark wrote:
> The real solution imho is to actually clean up failed index builds when
> a build fails.

That wouldn't solve my problem, which is that I need a way to disable indexes before a large update. I believe (but I'm not sure) that Oracle has this concept:

ALTER INDEX [INDEX_NAME] UNUSABLE;

Maybe, if an index is in an invalid state, an update could check which part of the table is already indexed and which part is not. Then it would only update the indexes for the already-reindexed part of the table. This way purposely invalidated indexes could be marked valid for block numbers less than 0 - that is, for no blocks at all. This might actually be a win during concurrent index creation too, as concurrent updates would not have to update the index for all updated rows. But I don't know if it's feasible from a concurrency perspective at all.

Regards,
Tomasz "Tometzky" Ostrowski
On Sun, Jul 17, 2016 at 4:41 AM, Tomasz Ostrowski <tometzky+pg@ato.waw.pl> wrote:
> That wouldn't solve my problem, which is that I need a way to disable
> indexes before large update. I believe (but I'm not sure) that Oracle has
> this concept:
> ALTER INDEX [INDEX_NAME] UNUSABLE;

I think that this must make the index unusable to the optimizer. The idea being that you can see the impact of dropping the index without actually doing so, reserving the ability to back out (mark the index usable once more rather than actually dropping it) if it turns out that the index is of some use.

If it simply made the index unusable while removing any ongoing obligation for writes to maintain the index, then what's the point in supporting this at all? You need to be able to mark it usable again.

-- 
Peter Geoghegan
On Sunday, July 17, 2016, Peter Geoghegan <pg@heroku.com> wrote:
> I think that this must make the index unusable to the optimizer. The
> idea being that you can see the impact of dropping the index without
> actually doing so, reserving the ability to back out (mark the index
> usable once more rather than actually dropping it) if it turns out
> that the index is of some use.
>
> If it simply made the index unusable while removing any ongoing
> obligation for writes to maintain the index, then what's the point in
> supporting this at all? You need to be able to mark it usable again.

For example, in SQL Server you can "alter index disable" if you are about to do a lot of bulk operations. But there is no "re-enable"; instead you have to "alter index rebuild", because, as has been said on this thread, you don't know what has changed since the disable.

Basically this is very similar to dropping and recreating indexes around bulk loads/updates.

> --
> Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-bugs

-- 
David Rader
davidr@openscg.com
On Sun, Jul 17, 2016 at 1:42 PM, Rader, David <davidr@openscg.com> wrote:
> For example, in SQL Server you can "alter index disable". If you are about
> to do a lot of bulk operations. But there is no "re-enable"; instead you
> have to "alter index rebuild" because as has been said on this thread you
> don't know what has changed since the disable.
>
> Basically this is very similar to dropping and recreating indexes around
> bulk loads/updates.

That seems pretty pointless. Why not actually drop the index, then?

The only reason I can think of is that there is value in representing that indexes should continue to have optimizer statistics (that would happen for expression indexes in Postgres) without actually paying for the ongoing maintenance of the index during write statements. Even that seems like kind of a stretch, though.

-- 
Peter Geoghegan
On Sunday, July 17, 2016, Peter Geoghegan <pg@heroku.com> wrote:
> That seems pretty pointless. Why not actually drop the index, then?
>
> The only reason I can think of is that there is value in representing
> that indexes should continue to have optimizer statistics (that would
> happen for expression indexes in Postgres) without actually paying for
> the ongoing maintenance of the index during write statements. Even
> that seems like kind of a stretch, though.

There's some DBA benefit in that the index disable also disables constraints and foreign keys that depend on the index. Instead of having to drop and recreate dependent objects, you can leave all the definitions in place but disabled. So it makes laziness easier.

Of course, then you have to be sure that your data is right when you bulk load, since the engine is not enforcing it.

-- 
David Rader
davidr@openscg.com
On Sun, Jul 17, 2016 at 4:42 PM, Rader, David <davidr@openscg.com> wrote:
> For example, in SQL Server you can "alter index disable". If you are
> about to do a lot of bulk operations. But there is no "re-enable";
> instead you have to "alter index rebuild" because as has been said on
> this thread you don't know what has changed since the disable.
>
> Basically this is very similar to dropping and recreating indexes
> around bulk loads/updates.

I would say that materially there is no difference. What would make a difference would be if it were possible to ALTER TABLE DISABLE INDEXES and then REINDEX to build them. That is, it would be different if rebuilding multiple indexes at once had a substantial advantage - like, let's say, it would scan the heap only once, building all the sort sets in parallel.
Regards, Jan

-- 
Jan Wieck
Senior Postgres Architect
http://pgblog.wi3ck.info
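There is no ALTER TABLE DISABLE INDEXES in PostgreSQL, but the "use all CPUs" rebuild from the original post can be scripted today by dispatching one saved CREATE INDEX statement per connection, so each build gets its own backend and therefore its own CPU. A minimal sketch of the dispatch pattern - the `execute` callable here is a stand-in for a real driver call (e.g. psycopg2 opening one connection per statement):

```python
from concurrent.futures import ThreadPoolExecutor

def rebuild_indexes(indexdefs, execute, workers=8):
    """Run each saved CREATE INDEX statement concurrently.

    indexdefs: statements saved from pg_indexes.indexdef before the drop.
    execute:   callable issuing one SQL statement on its own connection.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() drains the iterator, waiting for completion and
        # re-raising the first failed build, if any.
        list(pool.map(execute, indexdefs))

# Demo with a stand-in executor that just records the statements.
# A real one (assumption, not from this thread) might look like:
#   def execute(sql):
#       with psycopg2.connect(dsn) as conn:
#           conn.cursor().execute(sql)
done = []
rebuild_indexes(
    ["CREATE INDEX i1 ON t (a)", "CREATE INDEX i2 ON t (b)"],
    execute=done.append,
    workers=2,
)
print(sorted(done))
```

Threads suffice here even with the GIL, since the workers spend their time blocked on the server, which is where the CPU work happens.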
On Sun, Jul 17, 2016 at 5:06 PM, Rader, David <davidr@openscg.com> wrote:
> Of course then you have to be sure that your data is right when you bulk
> load since the engine is not enforcing it.

Correct. But that won't be different from a NOT VALID constraint (see https://www.postgresql.org/docs/9.5/static/sql-altertable.html).

Jan

-- 
Jan Wieck
Senior Postgres Architect
http://pgblog.wi3ck.info
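For readers unfamiliar with the NOT VALID feature Jan refers to: a constraint can be added without scanning the existing rows and validated later, outside the critical window. Table and constraint names below are hypothetical:

```sql
-- Add a foreign key without checking existing rows. New inserts and
-- updates are still checked; only the full-table verification scan
-- is deferred.
ALTER TABLE orders
    ADD CONSTRAINT orders_customer_fk
    FOREIGN KEY (customer_id) REFERENCES customers (id)
    NOT VALID;

-- Later, verify the pre-existing rows at a convenient time.
ALTER TABLE orders VALIDATE CONSTRAINT orders_customer_fk;
```

Note one difference from SQL Server's disabled constraints: a NOT VALID constraint is still enforced for new writes, so only already-present data can be "wrong" until validation.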
On Sunday, July 17, 2016, Jan Wieck <jan@wi3ck.info> wrote:
> I would say that materially there is no difference. What would make a
> difference would be if it were possible to ALTER TABLE DISABLE INDEXES
> and then REINDEX to build them. That is, it would be different if
> rebuilding multiple indexes at once had a substantial advantage, like
> let's say it would scan the heap only once, building all the sort sets
> in parallel.

Yes, a parallel multi-index build would provide actual benefit. Otherwise the disable/rebuild is just syntactic sugar that makes scripting bulk operations simpler.

-- 
David Rader
davidr@openscg.com
On 2016-07-17 23:06, Rader, David wrote:
> There's some DBA benefit in that the index disable also disables
> constraints and foreign keys that depend on the index. Instead of
> having to drop and recreate dependent objects you can leave all the
> definitions in place but disabled. So it makes laziness easier.
>
> Of course then you have to be sure that your data is right when you
> bulk load since the engine is not enforcing it.

To make it clear - I don't postulate disabling indexes used for data integrity. When an index can't be dropped, because for example some foreign key depends on it or the table's primary key is based on it, it should not be possible to turn it off. Also, I don't postulate turning indexes back on without doing a full reindex - I just need to be able to do this reindex for multiple indexes in parallel.

What I'd like to have isn't really different from just dropping the indexes and recreating them after the bulk update. It's just that this operation is not very safe:
- you have to save the definitions somewhere else, using for example pg_dump;
- pg_dump is often not available, or is the wrong version;
- when saving these index definitions in some temporary directory, you risk losing them in case of a failure or crash;
- if you're trying to code some application upgrade script, it's hard to tell what to do when the previous execution crashed: if you save the index definitions again, you risk overwriting your save with empty or incomplete data; if you don't, you can't be sure whether the save is from some older execution, with schema changes made since.
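An upgrade script that needs to detect debris from a crashed previous run can at least find the invalid indexes discussed in this thread directly in the system catalogs:

```sql
-- List indexes currently marked invalid (e.g. left behind by a failed
-- CREATE INDEX CONCURRENTLY), together with their definitions.
SELECT n.nspname AS schema_name,
       c.relname AS index_name,
       pg_get_indexdef(i.indexrelid) AS definition
FROM pg_index i
JOIN pg_class c     ON c.oid = i.indexrelid
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE NOT i.indisvalid;
```

This helps with detection, but not with the core complaint: an invalid index found this way still has to be reindexed or dropped and recreated, and it still incurs update overhead until then.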
It's just that it's hard to prepare this procedure for a support team in a sufficiently reliable way, so that they can do such a bulk update on their own.

And believe me - this trick works when you have a large table with a large number (like tens) of indexes. Even more so if these indexes are on unicode text data.

-- 
Regards,
Tomasz "Tometzky" Ostrowski