Thread: REINDEX backend filtering

REINDEX backend filtering

From
Julien Rouhaud
Date:
Hello,

Now that we have the infrastructure to track indexes that might be corrupted
due to changes in collation libraries, I think it would be a good idea to offer
an easy way for users to reindex all indexes that might be corrupted.

I'm attaching a POC patch as a discussion basis.  It implements a new
"COLLATION" option to reindex, with "not_current" being the only accepted
value.  Note that I didn't spent too much efforts on the grammar part yet.

So for instance you can do:

REINDEX (COLLATION 'not_current') DATABASE mydb;

The filter is also implemented so that you could cumulate multiple filters, so
it could be easy to add more filtering, for instance:

REINDEX (COLLATION 'libc', COLLATION 'not_current') DATABASE mydb;

to only rebuild indexes depending on outdated libc collations, or

REINDEX (COLLATION 'libc', VERSION 'X.Y') DATABASE mydb;

to only rebuild indexes depending on a specific version of libc.

Attachment

Re: REINDEX backend filtering

From
Michael Paquier
Date:
On Thu, Dec 03, 2020 at 05:31:43PM +0800, Julien Rouhaud wrote:
> Now that we have the infrastructure to track indexes that might be corrupted
> due to changes in collation libraries, I think it would be a good idea to offer
> an easy way for users to reindex all indexes that might be corrupted.

Yes.  It would be a good thing.

> The filter is also implemented so that you could cumulate multiple filters, so
> it could be easy to add more filtering, for instance:
>
> REINDEX (COLLATION 'libc', COLLATION 'not_current') DATABASE mydb;
>
> to only rebuild indexes depending on outdated libc collations, or
>
> REINDEX (COLLATION 'libc', VERSION 'X.Y') DATABASE mydb;
>
> to only rebuild indexes depending on a specific version of libc.

Deciding on the grammar to use depends on the use cases we would like
to satisfy.  From what I heard on this topic, the goal is to reduce
the amount of time necessary to reindex a system so as REINDEX only
works on indexes whose dependent collation versions are not known or
works on indexes in need of a collation refresh (like a reindexdb
--all --collation -j $jobs).  What would be the benefit in having more
complexity with library-dependent settings while we could take care
of the use cases that matter the most with a simple grammar?  Perhaps
"not_current" is not the best match as a keyword, we could just use
"collation" and handle that as a boolean.  As long as we don't need
new operators in the grammar rules..
--
Michael

Attachment

Re: REINDEX backend filtering

From
Julien Rouhaud
Date:
On Mon, Dec 14, 2020 at 3:45 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Thu, Dec 03, 2020 at 05:31:43PM +0800, Julien Rouhaud wrote:
> > Now that we have the infrastructure to track indexes that might be corrupted
> > due to changes in collation libraries, I think it would be a good idea to offer
> > an easy way for users to reindex all indexes that might be corrupted.
>
> Yes.  It would be a good thing.
>
> > The filter is also implemented so that you could cumulate multiple filters, so
> > it could be easy to add more filtering, for instance:
> >
> > REINDEX (COLLATION 'libc', COLLATION 'not_current') DATABASE mydb;
> >
> > to only rebuild indexes depending on outdated libc collations, or
> >
> > REINDEX (COLLATION 'libc', VERSION 'X.Y') DATABASE mydb;
> >
> > to only rebuild indexes depending on a specific version of libc.
>
> Deciding on the grammar to use depends on the use cases we would like
> to satisfy.  From what I heard on this topic, the goal is to reduce
> the amount of time necessary to reindex a system so as REINDEX only
> works on indexes whose dependent collation versions are not known or
> works on indexes in need of a collation refresh (like a reindexdb
> --all --collation -j $jobs).  What would be the benefit in having more
> complexity with library-dependent settings while we could take care
> of the use cases that matter the most with a simple grammar?  Perhaps
> "not_current" is not the best match as a keyword, we could just use
> "collation" and handle that as a boolean.  As long as we don't need
> new operators in the grammar rules..

I'm not sure what the DBA usual pattern here.  If the reindexing
runtime is really critical, I'm assuming that at least some people
will dig into library details to see what are the collations that
actually broke in the last upgrade and will want to reindex only
those, and force the version for the rest of the indexes.  And
obviously, they probably won't wait to have multiple collation
versions dependencies before taking care of that.  In that case the
filters that would matters would be one to only keep indexes with an
outdated collation version, and an additional one for a specific
collation name.  Or we could have the COLLATION keyword without
additional argument mean all outdated collations, and COLLATION
'collation_name' to specify a specific one.  This is maybe a bit ugly,
and would probably require a different approach for reindexdb.



Re: REINDEX backend filtering

From
Magnus Hagander
Date:
On Tue, Dec 15, 2020 at 12:22 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
>
> On Mon, Dec 14, 2020 at 3:45 PM Michael Paquier <michael@paquier.xyz> wrote:
> >
> > On Thu, Dec 03, 2020 at 05:31:43PM +0800, Julien Rouhaud wrote:
> > > Now that we have the infrastructure to track indexes that might be corrupted
> > > due to changes in collation libraries, I think it would be a good idea to offer
> > > an easy way for users to reindex all indexes that might be corrupted.
> >
> > Yes.  It would be a good thing.
> >
> > > The filter is also implemented so that you could cumulate multiple filters, so
> > > it could be easy to add more filtering, for instance:
> > >
> > > REINDEX (COLLATION 'libc', COLLATION 'not_current') DATABASE mydb;
> > >
> > > to only rebuild indexes depending on outdated libc collations, or
> > >
> > > REINDEX (COLLATION 'libc', VERSION 'X.Y') DATABASE mydb;
> > >
> > > to only rebuild indexes depending on a specific version of libc.
> >
> > Deciding on the grammar to use depends on the use cases we would like
> > to satisfy.  From what I heard on this topic, the goal is to reduce
> > the amount of time necessary to reindex a system so as REINDEX only
> > works on indexes whose dependent collation versions are not known or
> > works on indexes in need of a collation refresh (like a reindexdb
> > --all --collation -j $jobs).  What would be the benefit in having more
> > complexity with library-dependent settings while we could take care
> > of the use cases that matter the most with a simple grammar?  Perhaps
> > "not_current" is not the best match as a keyword, we could just use
> > "collation" and handle that as a boolean.  As long as we don't need
> > new operators in the grammar rules..
>
> I'm not sure what the DBA usual pattern here.  If the reindexing
> runtime is really critical, I'm assuming that at least some people
> will dig into library details to see what are the collations that
> actually broke in the last upgrade and will want to reindex only
> those, and force the version for the rest of the indexes.  And
> obviously, they probably won't wait to have multiple collation
> versions dependencies before taking care of that.  In that case the
> filters that would matters would be one to only keep indexes with an
> outdated collation version, and an additional one for a specific
> collation name.  Or we could have the COLLATION keyword without
> additional argument mean all outdated collations, and COLLATION
> 'collation_name' to specify a specific one.  This is maybe a bit ugly,
> and would probably require a different approach for reindexdb.

Is this really a common enough operation that we need it i the main grammar?

Having the functionality, definitely, but what if it was "just" a
function instead? So you'd do something like:
SELECT 'reindex index ' || i FROM pg_blah(some, arguments, here)
\gexec

Or even a function that returns the REINDEX commands directly (taking
a parameter to turn on/off concurrency for example).

That also seems like it would be easier to make flexible, and just as
easy to plug into reindexdb?

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/



Re: REINDEX backend filtering

From
Michael Paquier
Date:
On Tue, Dec 15, 2020 at 06:34:16PM +0100, Magnus Hagander wrote:
> Is this really a common enough operation that we need it in the main grammar?
>
> Having the functionality, definitely, but what if it was "just" a
> function instead? So you'd do something like:
> SELECT 'reindex index ' || i FROM pg_blah(some, arguments, here)
> \gexec
>
> Or even a function that returns the REINDEX commands directly (taking
> a parameter to turn on/off concurrency for example).
>
> That also seems like it would be easier to make flexible, and just as
> easy to plug into reindexdb?

Having control in the grammar to choose which index to reindex for a
table is very useful when it comes to parallel reindexing, because
it is no-brainer in terms of knowing which index to distribute to one
job or another.  In short, with this grammar you can just issue a set
of REINDEX TABLE commands that we know will not conflict with each
other.  You cannot get that level of control with REINDEX INDEX as it
may be possible that more or more commands conflict if they work on an
index of the same relation because it is required to take lock also on
the parent table.  Of course, we could decide to implement a
redistribution logic in all frontend tools that need such things, like
reindexdb, but that's not something I think we should let the client
decide of.  A backend-side filtering is IMO much simpler, less code,
and more elegant.
--
Michael

Attachment

Re: REINDEX backend filtering

From
Julien Rouhaud
Date:
On Wed, Dec 16, 2020 at 8:27 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Tue, Dec 15, 2020 at 06:34:16PM +0100, Magnus Hagander wrote:
> > Is this really a common enough operation that we need it in the main grammar?
> >
> > Having the functionality, definitely, but what if it was "just" a
> > function instead? So you'd do something like:
> > SELECT 'reindex index ' || i FROM pg_blah(some, arguments, here)
> > \gexec
> >
> > Or even a function that returns the REINDEX commands directly (taking
> > a parameter to turn on/off concurrency for example).
> >
> > That also seems like it would be easier to make flexible, and just as
> > easy to plug into reindexdb?
>
> Having control in the grammar to choose which index to reindex for a
> table is very useful when it comes to parallel reindexing, because
> it is no-brainer in terms of knowing which index to distribute to one
> job or another.  In short, with this grammar you can just issue a set
> of REINDEX TABLE commands that we know will not conflict with each
> other.  You cannot get that level of control with REINDEX INDEX as it
> may be possible that more or more commands conflict if they work on an
> index of the same relation because it is required to take lock also on
> the parent table.  Of course, we could decide to implement a
> redistribution logic in all frontend tools that need such things, like
> reindexdb, but that's not something I think we should let the client
> decide of.  A backend-side filtering is IMO much simpler, less code,
> and more elegant.

Maybe additional filtering capabilities is not something that will be
required frequently, but I'm pretty sure that reindexing only indexes
that might be corrupt is something that will be required often..  So I
agree, having all that logic in the backend makes everything easier
for users, having the choice of the tools they want to issue the query
while still having all features available.

There was a conflict with a3dc926009be8 (Refactor option handling of
CLUSTER, REINDEX and VACUUM), so rebased version attached.  No other
changes included yet.

Attachment

Re: REINDEX backend filtering

From
Julien Rouhaud
Date:
On Thu, Jan 21, 2021 at 11:12:56AM +0800, Julien Rouhaud wrote:
> 
> There was a conflict with a3dc926009be8 (Refactor option handling of
> CLUSTER, REINDEX and VACUUM), so rebased version attached.  No other
> changes included yet.

New conflict, v3 attached.

Attachment

Re: REINDEX backend filtering

From
Zhihong Yu
Date:
Hi,
For index_has_deprecated_collation(),

+   object.objectSubId = 0;

The objectSubId field is not accessed by do_check_index_has_deprecated_collation(). Does it need to be assigned ?

For RelationGetIndexListFiltered(), it seems when (options & REINDEXOPT_COLL_NOT_CURRENT) == 0, the full_list would be returned.
This can be checked prior to entering the foreach loop.

Cheers

On Sat, Feb 6, 2021 at 11:20 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
On Thu, Jan 21, 2021 at 11:12:56AM +0800, Julien Rouhaud wrote:
>
> There was a conflict with a3dc926009be8 (Refactor option handling of
> CLUSTER, REINDEX and VACUUM), so rebased version attached.  No other
> changes included yet.

New conflict, v3 attached.