Thread: Invalid indexes should not consume update overhead
When an index is built concurrently and the build fails, the index is left in an invalid state. It's basically unusable for anything, but, according to the documentation:

> If a problem arises while scanning the table, such as a deadlock or
> a uniqueness violation in a unique index, the CREATE INDEX command
> will fail but leave behind an "invalid" index. This index will be
> ignored for querying purposes because it might be incomplete; however
> it *will* *still* *consume* *update* *overhead*. The psql \d command
> will report such an index as INVALID

I think this update overhead is actually wasted - there's no way to make use of it, as the only way to make the index usable again is to reindex it, or to drop and recreate it.

On the other hand, if invalid indexes had no update overhead, they could actually be useful. Please consider the following situation (taken from the real world):

- I have a very large table with a large number of indexes, on a server with a large number of CPUs.
- I need to add a new not null column, and I need to do this in a fairly small maintenance window.
- I tried to simply "alter table tablename add column columnname int not null default 0", but it did not finish in 24 hours - it updates these tens of indexes in a single process, using 100% of a single CPU - unacceptable.
- I mark all the indexes as invalid (currently I'm just saving and dropping them).
- This time adding the column takes an hour.
- I vacuum full or cluster the table, as it has by now bloated to at least 200% - this is also fast, as the indexes are not updated.
- I reindex (currently: recreate) all the indexes, but instead of calculating them on a single CPU I use all available CPUs (I can have like 40 of them on Amazon AWS RDS cheaply if I need them only for a few hours). This also finishes in an hour.
- The world is saved and everybody celebrates.
Dropping and recreating indexes is not a very safe operation - their definitions need to be saved somewhere outside the database, and if anything goes wrong it would not be easy to restore them. If it were possible to just set them invalid (feature request in passing), and if invalid indexes added no cost on updates, this would be much easier and safer.

-- 
Tomasz "Tometzky" Ostrowski
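The save/drop/recreate workaround described above can be sketched in SQL. Table, column, and index names here are hypothetical, and in practice the final recreate loop would be split across several sessions (one per CPU) rather than run sequentially as shown:

```sql
-- Save the definitions of all indexes on the table inside the database
-- itself, so they survive a crash of the client running the script.
CREATE TABLE saved_indexdefs AS
    SELECT indexname, indexdef
    FROM pg_indexes
    WHERE schemaname = 'public' AND tablename = 'bigtable';

-- Drop the indexes (today's workaround for "mark invalid").
DO $$
DECLARE r record;
BEGIN
    FOR r IN SELECT indexname FROM saved_indexdefs LOOP
        EXECUTE format('DROP INDEX %I', r.indexname);
    END LOOP;
END $$;

-- The expensive rewrite is now fast, since no indexes are maintained.
ALTER TABLE bigtable ADD COLUMN newcol int NOT NULL DEFAULT 0;
VACUUM FULL bigtable;

-- Recreate the indexes from the saved definitions (each statement
-- would ideally run in its own session, one per CPU).
DO $$
DECLARE r record;
BEGIN
    FOR r IN SELECT indexdef FROM saved_indexdefs LOOP
        EXECUTE r.indexdef;
    END LOOP;
END $$;

DROP TABLE saved_indexdefs;
```

Keeping the saved definitions in a table rather than a dump file avoids one of the failure modes discussed later in the thread: the definitions cannot get lost if the client machine crashes mid-procedure.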
On Wed, Jul 13, 2016 at 4:40 PM, Tomasz Ostrowski <tometzky+pg@ato.waw.pl> wrote:
> When an index is built concurrently and the build fails, the index is
> left in an invalid state. It's basically unusable for anything [...]
>
> I think this update overhead is actually wasted - there's no way to make
> use of it, as the only way to make the index usable again is to reindex
> it or drop and recreate.
[...]
> Dropping and recreating indexes is not a very safe operation - their
> definitions need to be saved somewhere outside the database and if
> anything goes wrong it would not be easy to restore them. If it were
> possible to just set them invalid (feature request in passing) and if
> invalid indexes added no cost on updates, then this would be much
> easier and safer.

How can you consider marking an invalid index as valid again without reindexing it? It is quite possible that in the meantime the table has been updated.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
I can't disagree with your conclusion, but I can offer a bit of perspective on how the current situation came about.

Invalid indexes are in the same state they're in while a concurrent index build is in progress. As far as other queries are concerned, they're effectively assuming the index build is still in progress and will still eventually be completed. They could maybe determine that's not the case, but then that would be an extra check for them to do in the normal case, so not necessarily a win.

The real solution, imho, is to actually clean up failed index builds when a build fails. That's what normal transactions do when they abort, after all. This was always the intention, but it looked like it was going to be a pain and was put off (i.e. I was lazy). It's probably just several layers of PG_TRY/PG_CATCH and closing the failed transactions and opening new ones.
Greg Stark <stark@mit.edu> writes:
> The real solution imho is to actually clean up failed index builds when a
> build fails. That's what normal transactions do when they abort after all.
> This was always the intention but looked like it was going to be a pain and
> was put off (ie I was lazy). It's probably just several layers of
> PG_TRY/PG_CATCH and closing the failed transactions and opening new ones.

No, that wouldn't fix it if the reason the build failed was a crash.

			regards, tom lane
On Sat, Jul 16, 2016 at 5:58 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Greg Stark <stark@mit.edu> writes:
>> The real solution imho is to actually clean up failed index builds when a
>> build fails. [...]
>
> No, that wouldn't fix it if the reason the build failed was a crash.

Could we just have crash recovery perform clean-up to cover that case?

-- 
Peter Geoghegan
On 2016-07-17 02:09, Greg Stark wrote:
> The real solution imho is to actually clean up failed index builds when
> a build fails.

That wouldn't solve my problem, which is that I need a way to disable indexes before a large update. I believe (but I'm not sure) that Oracle has this concept:

ALTER INDEX [INDEX_NAME] UNUSABLE;

Maybe, if an index is in an invalid state, an update could check which part of the table is already indexed and which part is not. Then it would only update the indexes for the already-reindexed part of the table. This way purposely invalidated indexes could be marked valid for block numbers less than 0 - that is, for no blocks at all. This might actually be a win during concurrent index creation too, as concurrent updates would not have to update the index for all updated rows. But I don't know if it's feasible from a concurrency perspective at all.

Regards,
Tomasz "Tometzky" Ostrowski
On Sun, Jul 17, 2016 at 4:41 AM, Tomasz Ostrowski <tometzky+pg@ato.waw.pl> wrote:
> That wouldn't solve my problem, which is that I need a way to disable
> indexes before large update. I believe (but I'm not sure) that Oracle has
> this concept:
> ALTER INDEX [INDEX_NAME] UNUSABLE;

I think that this must make the index unusable to the optimizer. The idea being that you can see the impact of dropping the index without actually doing so, reserving the ability to back out (mark the index usable once more rather than actually dropping it) if it turns out that the index is of some use.

If it simply made the index unusable while removing any ongoing obligation for writes to maintain the index, then what's the point in supporting this at all? You need to be able to mark it usable again.

-- 
Peter Geoghegan
On Sunday, July 17, 2016, Peter Geoghegan <pg@heroku.com> wrote:
> I think that this must make the index unusable to the optimizer. The
> idea being that you can see the impact of dropping the index without
> actually doing so, reserving the ability to back out (mark the index
> usable once more rather than actually dropping it) if it turns out
> that the index is of some use.
>
> If it simply made the index unusable while removing any ongoing
> obligation for writes to maintain the index, then what's the point in
> supporting this at all? You need to be able to mark it usable again.

For example, in SQL Server you can "alter index disable" if you are about to do a lot of bulk operations. But there is no "re-enable"; instead you have to "alter index rebuild", because, as has been said on this thread, you don't know what has changed since the disable.

Basically this is very similar to dropping and recreating indexes around bulk loads/updates.

> --
> Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-bugs

-- 
David Rader
davidr@openscg.com
On Sun, Jul 17, 2016 at 1:42 PM, Rader, David <davidr@openscg.com> wrote:
> For example, in SQL Server you can "alter index disable". If you are about
> to do a lot of bulk operations. But there is no "re-enable"; instead you
> have to "alter index rebuild" because as has been said on this thread you
> don't know what has changed since the disable.
>
> Basically this is very similar to dropping and recreating indexes around
> bulk loads/updates.

That seems pretty pointless. Why not actually drop the index, then?

The only reason I can think of is that there is value in representing that indexes should continue to have optimizer statistics (that would happen for expression indexes in Postgres) without actually paying for the ongoing maintenance of the index during write statements. Even that seems like kind of a stretch, though.

-- 
Peter Geoghegan
On Sunday, July 17, 2016, Peter Geoghegan <pg@heroku.com> wrote:
> That seems pretty pointless. Why not actually drop the index, then?
>
> The only reason I can think of is that there is value in representing
> that indexes should continue to have optimizer statistics (that would
> happen for expression indexes in Postgres) without actually paying for
> the ongoing maintenance of the index during write statements. Even
> that seems like kind of a stretch, though.

There's some DBA benefit in that the index disable also disables constraints and foreign keys that depend on the index. Instead of having to drop and recreate dependent objects, you can leave all the definitions in place but disabled. So it makes laziness easier.

Of course, then you have to be sure that your data is right when you bulk load, since the engine is not enforcing it.

-- 
David Rader
davidr@openscg.com
On Sun, Jul 17, 2016 at 4:42 PM, Rader, David <davidr@openscg.com> wrote:
> For example, in SQL Server you can "alter index disable". If you are
> about to do a lot of bulk operations. But there is no "re-enable";
> instead you have to "alter index rebuild" because as has been said on
> this thread you don't know what has changed since the disable.
>
> Basically this is very similar to dropping and recreating indexes
> around bulk loads/updates.

I would say that materially there is no difference. What would make a difference would be if it were possible to ALTER TABLE DISABLE INDEXES and then REINDEX to build them. That is, it would be different if rebuilding multiple indexes at once had a substantial advantage - like, let's say, it would scan the heap only once, building all the sort sets in parallel.
Regards, Jan

-- 
Jan Wieck
Senior Postgres Architect
http://pgblog.wi3ck.info
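There is no ALTER TABLE DISABLE INDEXES in PostgreSQL, but the "use all CPUs" rebuild from the original post can be scripted today by dispatching one saved CREATE INDEX statement per connection, so each build gets its own backend and therefore its own CPU. A minimal sketch of the dispatch pattern - the `execute` callable here is a stand-in for a real driver call (e.g. psycopg2 opening one connection per statement):

```python
from concurrent.futures import ThreadPoolExecutor

def rebuild_indexes(indexdefs, execute, workers=8):
    """Run each saved CREATE INDEX statement concurrently.

    indexdefs: statements saved from pg_indexes.indexdef before the drop.
    execute:   callable issuing one SQL statement on its own connection.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() drains the iterator, waiting for completion and
        # re-raising the first failed build, if any.
        list(pool.map(execute, indexdefs))

# Demo with a stand-in executor that just records the statements.
# A real one (assumption, not from this thread) might look like:
#   def execute(sql):
#       with psycopg2.connect(dsn) as conn:
#           conn.cursor().execute(sql)
done = []
rebuild_indexes(
    ["CREATE INDEX i1 ON t (a)", "CREATE INDEX i2 ON t (b)"],
    execute=done.append,
    workers=2,
)
print(sorted(done))
```

Threads suffice here even with the GIL, since the workers spend their time blocked on the server, which is where the CPU work happens.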
On Sun, Jul 17, 2016 at 5:06 PM, Rader, David <davidr@openscg.com> wrote:
> Of course then you have to be sure that your data is right when you bulk
> load since the engine is not enforcing it.

Correct. But that won't be different from a NOT VALID constraint (see https://www.postgresql.org/docs/9.5/static/sql-altertable.html).

Jan

-- 
Jan Wieck
Senior Postgres Architect
http://pgblog.wi3ck.info
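For readers unfamiliar with the NOT VALID feature Jan refers to: a constraint can be added without scanning the existing rows and validated later, outside the critical window. Table and constraint names below are hypothetical:

```sql
-- Add a foreign key without checking existing rows. New inserts and
-- updates are still checked; only the full-table verification scan
-- is deferred.
ALTER TABLE orders
    ADD CONSTRAINT orders_customer_fk
    FOREIGN KEY (customer_id) REFERENCES customers (id)
    NOT VALID;

-- Later, verify the pre-existing rows at a convenient time.
ALTER TABLE orders VALIDATE CONSTRAINT orders_customer_fk;
```

Note one difference from SQL Server's disabled constraints: a NOT VALID constraint is still enforced for new writes, so only already-present data can be "wrong" until validation.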
On Sunday, July 17, 2016, Jan Wieck <jan@wi3ck.info> wrote:
> I would say that materially there is no difference. What would make a
> difference would be if it were possible to ALTER TABLE DISABLE INDEXES
> and then REINDEX to build them. That is, it would be different if
> rebuilding multiple indexes at once had a substantial advantage, like
> let's say it would scan the heap only once, building all the sort sets
> in parallel.

Yes, a parallel multi-index build would provide actual benefit. Otherwise the disable/rebuild is just syntactic sugar that makes scripting bulk operations simpler.

-- 
David Rader
davidr@openscg.com
On 2016-07-17 23:06, Rader, David wrote:
> There's some DBA benefit in that the index disable also disables
> constraints and foreign keys that depend on the index. Instead of
> having to drop and recreate dependent objects you can leave all the
> definitions in place but disabled. So it makes laziness easier.
>
> Of course then you have to be sure that your data is right when you
> bulk load since the engine is not enforcing it.

To make it clear - I don't postulate disabling indexes used for data integrity. When an index can't be dropped, because for example some foreign key depends on it or the table's primary key is based on it, it should not be possible to turn it off. Also, I don't postulate turning indexes back on without doing a full reindex - I just need to be able to do this reindex for multiple indexes in parallel.

What I'd like to have isn't really different from just dropping the indexes and recreating them after the bulk update. It's just that this operation is not very safe:
- you have to save the definitions somewhere else, using for example pg_dump;
- pg_dump is often not available, or is the wrong version;
- when saving these index definitions in some temporary directory, you risk losing them in case of a failure or crash;
- if you're trying to code some application upgrade script, it's hard to tell what to do when the previous execution crashed: if you save the index definitions again, you risk overwriting your save with empty or incomplete data; if you don't, you can't be sure whether the save is from some older execution, with schema changes made since.
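An upgrade script that needs to detect debris from a crashed previous run can at least find the invalid indexes discussed in this thread directly in the system catalogs:

```sql
-- List indexes currently marked invalid (e.g. left behind by a failed
-- CREATE INDEX CONCURRENTLY), together with their definitions.
SELECT n.nspname AS schema_name,
       c.relname AS index_name,
       pg_get_indexdef(i.indexrelid) AS definition
FROM pg_index i
JOIN pg_class c     ON c.oid = i.indexrelid
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE NOT i.indisvalid;
```

This helps with detection, but not with the core complaint: an invalid index found this way still has to be reindexed or dropped and recreated, and it still incurs update overhead until then.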
It's just that it's hard to prepare this procedure for a support team in a sufficiently reliable way, so that they can do such a bulk update on their own.

And believe me - this trick works when you have a large table with a large number (like tens) of indexes. Even more so if these indexes are on unicode text data.

-- 
Regards,
Tomasz "Tometzky" Ostrowski