Thread: Improving REINDEX for system indexes (long)

Improving REINDEX for system indexes (long)

From
Tom Lane
Date:
I've been looking at the issues involved in reindexing system tables,
and I now have what I think is a fairly defensible set of proposals.

We should whenever possible use the same reindexing technique used by
CLUSTER: assign a new relfilenode number, build the new index in that
file, and apply an ordinary heap_update operation to update the index's
pg_class row with the new relfilenode value.  This technique is fully
crash-safe and transaction-safe (meaning it can be rolled back even after
completion, in case of failure later in the same transaction).  However,
there are several pitfalls for applying it to system indexes:

1. There is a problem for a shared index, because we have no way to
update pg_class.relfilenode in all the other databases besides ours.
I see no good way around this problem.

2. There is a problem for a nailed-in-cache index, because the relcache
isn't prepared to cope with relfilenode updates for such indexes.
However, that is fixable.

3. There is a problem for an index on pg_class itself: doing heap_update
on a pg_class row must make new index entries.  We have to be careful
that the new row does appear in the updated index, while not making
entries in old-and-possibly-broken indexes.  This is doable.

4. There is a problem with updating indexes that might be used for catalog
fetches executed during the index_build process: if we try to use the
partially-rebuilt index for such a fetch we will fail.  In the case of
disaster recovery the problem doesn't exist because we'll have
IsIgnoringSystemIndexes true, but for an ordinary on-line indexing
operation there's definitely a risk.  Presently the code relies on
"deactivating indexes" to prevent this, but I think it can be done more
simply, because we only need to suppress access to the target index
locally in the reindexing backend.  (Other backends will be locked out by
means of an AccessExclusiveLock on the system catalog in question.)

In short then, we can fix things so that the new-relfilenode path can be
taken in all cases except shared indexes.  For shared indexes, we have
little choice but to truncate and rebuild the index in-place.  (There is
then no need to update its pg_class row at all.)  This is of course
horribly crash-unsafe.

The code presently uses a "deactivated indexes" flag (namely, setting
pg_class.relhasindex false) to provide some protection against a crash
midway through an in-place reindex.  However, this concept is really quite
useless for a shared index, because only one database could contain the
deactivation flag.  Accesses from any other database would still try to
use the broken index.

Based on this analysis, I feel that the "deactivated indexes" mechanism
is of no further value and should be retired.  We should instead simply
acknowledge that reindexing shared indexes is dangerous.  I propose that
it be allowed only from a standalone backend.  Specifically I think that
REINDEX INDEX and REINDEX TABLE should error out if pointed to a shared
index or table, unless running in a standalone backend; REINDEX DATABASE
should skip over the shared indexes unless running in a standalone
backend.  (I'm not convinced that either -O or -P needs to be insisted on
by REINDEX, btw, but we can discuss that separately.)  We'll have to warn
users not to try to restart the database when reindexing of a shared table
wasn't successfully completed.

Details to back up the above claims:

Part of the code needed to fix the relcache restriction on nailed indexes
is already there, but ifdef'd out; that's the part that re-reads the
index's pg_class row after receiving SI inval for it.  There are some
cases not yet handled though.  In the first place, if the nailed index
being modified is pg_class_oid_index, ScanPgRelation will try to use that
same index to load the updated row, and will fail because it'll be trying
to use the old relfilenode.  We can easily force a seqscan in that case,
however.  A more subtle problem is that if an SI buffer overflow occurs, 
RelationCacheInvalidate walks through the relation cache in a random
order.  I believe that the correct update order has to be first
pg_class_oid_index, then the other nailed indexes, and finally all other
relation cache entries.  pg_class_oid_index has to be updated first (with
the aforementioned seqscan), since it will be used by ScanPgRelation to
reread the pg_class rows for the other nailed indexes.  Only when we are
sure the nailed indexes are up-to-date can we safely rebuild other
relcache entries.

Assigning a new relfilenode to indexes of pg_class is not a big deal when
we don't have an index-corruption problem.  We simply have to make the new
heap_updated row before we start index_build (which indeed the code does
already); both old and new rows will be indexed in the new index, and all
is well.  However, if we suspect index corruption then it is important not
to try to make entries into the old indexes, because we might crash trying
to update a corrupt index.  I propose the following behavior:
REINDEX INDEX: no special action; this is not the thing to usewhen index corruption is suspected anyway.
REINDEX TABLE: while operating on pg_class, arrange for therelcache's list of known indexes of pg_class to contain only
theindexesalready reindexed.  In this way, only those indexes willbe updated by CatalogUpdateIndexes(), and so no
untrustworthyindexeswill be touched while making new pg_class rows.
 
REINDEX DATABASE: take care to process pg_class first.

To avoid catalog accesses via an index that's being rebuilt, we can simply
generalize the SetReindexProcessing() state to include the OID of the
index currently being rebuilt.  Places that presently make checks for
IsIgnoringSystemIndexes can check this as well and fall back to seqscans
when the targeted index would have been used.

One other point: IsIgnoringSystemIndexes seems presently to be taken to
mean not only that we don't *read* the system indexes, but that we don't
*write* them either.  I think this is horribly dangerous; it effectively
means that the only thing you can safely do in a backend started with -P
is REINDEX.  If you make any other changes to the system catalogs then
you've re-corrupted the indexes.  Also, you don't have any protection
against duplicate entries getting made, since the unique indexes are
disabled.  It would be a lot safer to define IsIgnoringSystemIndexes as
preventing the system indexes from being used for lookups, while still
updating the indexes on any catalog modification.  This will not affect
the safety of REINDEX and will make other operations much less prone to
pilot error.

Comments?
        regards, tom lane


Re: Improving REINDEX for system indexes (long)

From
"Hiroshi Inoue"
Date:
First it should have been discussed before your commitment or at least
it should be discussed after reversing your change.

I require you to explain me why you committed the change
with no discussion and little investigation.

I also noticed that your change for catalog/index.c   Revision 1.200 / (download) - annotate - [select for diffs] , Mon
Sep23 
00:42:48 2002 UTC (11 months, 4 weeks ago) by tgl    Changes since 1.199: +124 -161 lines   Diff to previous 1.199
Getrid of bogus use of heap_mark4update in reindex operations (cf.   recent bug report).  Fix processing of
nailed-in-cacheindexes;   it appears that REINDEX DATABASE has been broken for months :-(. 

removed an #ifndef ENABLE_REINDEX_NAILED_RELATIONS
trial stuff. I'm very surprised to see it now.

regards,
Hiroshi Inoue

> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
>
> I've been looking at the issues involved in reindexing system tables,
> and I now have what I think is a fairly defensible set of proposals.
>
> We should whenever possible use the same reindexing technique used by
> CLUSTER: assign a new relfilenode number, build the new index in that
> file, and apply an ordinary heap_update operation to update
> the index's
> pg_class row with the new relfilenode value.  This technique is fully
> crash-safe and transaction-safe (meaning it can be rolled
> back even after
> completion, in case of failure later in the same
> transaction).  However,
> there are several pitfalls for applying it to system indexes:
>
> 1. There is a problem for a shared index, because we have no way to
> update pg_class.relfilenode in all the other databases besides ours.
> I see no good way around this problem.
>
> 2. There is a problem for a nailed-in-cache index, because
> the relcache
> isn't prepared to cope with relfilenode updates for such indexes.
> However, that is fixable.
>
> 3. There is a problem for an index on pg_class itself: doing
> heap_update
> on a pg_class row must make new index entries.  We have to be careful
> that the new row does appear in the updated index, while not making
> entries in old-and-possibly-broken indexes.  This is doable.
>
> 4. There is a problem with updating indexes that might be
> used for catalog
> fetches executed during the index_build process: if we try to use the
> partially-rebuilt index for such a fetch we will fail.  In the case of
> disaster recovery the problem doesn't exist because we'll have
> IsIgnoringSystemIndexes true, but for an ordinary on-line indexing
> operation there's definitely a risk.  Presently the code relies on
> "deactivating indexes" to prevent this, but I think it can be
> done more
> simply, because we only need to suppress access to the target index
> locally in the reindexing backend.  (Other backends will be
> locked out by
> means of an AccessExclusiveLock on the system catalog in question.)
>
> In short then, we can fix things so that the new-relfilenode
> path can be
> taken in all cases except shared indexes.  For shared indexes, we have
> little choice but to truncate and rebuild the index in-place.
>  (There is
> then no need to update its pg_class row at all.)  This is of course
> horribly crash-unsafe.
>
> The code presently uses a "deactivated indexes" flag (namely, setting
> pg_class.relhasindex false) to provide some protection against a crash
> midway through an in-place reindex.  However, this concept is
> really quite
> useless for a shared index, because only one database could
> contain the
> deactivation flag.  Accesses from any other database would
> still try to
> use the broken index.
>
> Based on this analysis, I feel that the "deactivated indexes"
> mechanism
> is of no further value and should be retired.  We should
> instead simply
> acknowledge that reindexing shared indexes is dangerous.  I
> propose that
> it be allowed only from a standalone backend.  Specifically I
> think that
> REINDEX INDEX and REINDEX TABLE should error out if pointed
> to a shared
> index or table, unless running in a standalone backend;
> REINDEX DATABASE
> should skip over the shared indexes unless running in a standalone
> backend.  (I'm not convinced that either -O or -P needs to be
> insisted on
> by REINDEX, btw, but we can discuss that separately.)  We'll
> have to warn
> users not to try to restart the database when reindexing of a
> shared table
> wasn't successfully completed.
>
> Details to back up the above claims:
>
> Part of the code needed to fix the relcache restriction on
> nailed indexes
> is already there, but ifdef'd out; that's the part that re-reads the
> index's pg_class row after receiving SI inval for it.  There are some
> cases not yet handled though.  In the first place, if the nailed index
> being modified is pg_class_oid_index, ScanPgRelation will try
> to use that
> same index to load the updated row, and will fail because
> it'll be trying
> to use the old relfilenode.  We can easily force a seqscan in
> that case,
> however.  A more subtle problem is that if an SI buffer
> overflow occurs,
> RelationCacheInvalidate walks through the relation cache in a random
> order.  I believe that the correct update order has to be first
> pg_class_oid_index, then the other nailed indexes, and
> finally all other
> relation cache entries.  pg_class_oid_index has to be updated
> first (with
> the aforementioned seqscan), since it will be used by
> ScanPgRelation to
> reread the pg_class rows for the other nailed indexes.  Only
> when we are
> sure the nailed indexes are up-to-date can we safely rebuild other
> relcache entries.
>
> Assigning a new relfilenode to indexes of pg_class is not a
> big deal when
> we don't have an index-corruption problem.  We simply have to
> make the new
> heap_updated row before we start index_build (which indeed
> the code does
> already); both old and new rows will be indexed in the new
> index, and all
> is well.  However, if we suspect index corruption then it is
> important not
> to try to make entries into the old indexes, because we might
> crash trying
> to update a corrupt index.  I propose the following behavior:
>
>     REINDEX INDEX: no special action; this is not the thing to use
>     when index corruption is suspected anyway.
>
>     REINDEX TABLE: while operating on pg_class, arrange for the
>     relcache's list of known indexes of pg_class to contain only the
>     indexes already reindexed.  In this way, only those indexes will
>     be updated by CatalogUpdateIndexes(), and so no untrustworthy
>     indexes will be touched while making new pg_class rows.
>
>     REINDEX DATABASE: take care to process pg_class first.
>
> To avoid catalog accesses via an index that's being rebuilt,
> we can simply
> generalize the SetReindexProcessing() state to include the OID of the
> index currently being rebuilt.  Places that presently make checks for
> IsIgnoringSystemIndexes can check this as well and fall back
> to seqscans
> when the targeted index would have been used.
>
> One other point: IsIgnoringSystemIndexes seems presently to
> be taken to
> mean not only that we don't *read* the system indexes, but
> that we don't
> *write* them either.  I think this is horribly dangerous; it
> effectively
> means that the only thing you can safely do in a backend
> started with -P
> is REINDEX.  If you make any other changes to the system catalogs then
> you've re-corrupted the indexes.  Also, you don't have any protection
> against duplicate entries getting made, since the unique indexes are
> disabled.  It would be a lot safer to define
> IsIgnoringSystemIndexes as
> preventing the system indexes from being used for lookups, while still
> updating the indexes on any catalog modification.  This will
> not affect
> the safety of REINDEX and will make other operations much
> less prone to
> pilot error.
>
> Comments?
>
>             regards, tom lane
>



Re: Improving REINDEX for system indexes (long)

From
Tom Lane
Date:
"Hiroshi Inoue" <inoue@tpf.co.jp> writes:
> I require you to explain me why you committed the change
> with no discussion and little investigation.

If you want an apology for not having discussed it in advance, I'll
gladly offer one.  It was poorly done.

I do, however, think that the reindexing code needs work.  Can we
get on with discussing the technical issues?
        regards, tom lane


Re: Improving REINDEX for system indexes (long)

From
Kurt Roeckx
Date:
On Mon, Sep 22, 2003 at 04:56:35AM +0900, Hiroshi Inoue wrote:
> First it should have been discussed before your commitment or at least
> it should be discussed after reversing your change.
> 
> I require you to explain me why you committed the change
> with no discussion and little investigation.
> 
> I also noticed that your change for catalog/index.c
>  
>     Revision 1.200 / (download) - annotate - [select for diffs] , Mon Sep 23
> 00:42:48 2002 UTC (11 months, 4 weeks ago) by tgl 

You know you're talking about a commit of 1 year ago?


Kurt



Re: Improving REINDEX for system indexes (long)

From
Hiroshi Inoue
Date:
Kurt Roeckx wrote:
> 
> On Mon, Sep 22, 2003 at 04:56:35AM +0900, Hiroshi Inoue wrote:
> > First it should have been discussed before your commitment or at least
> > it should be discussed after reversing your change.
> >
> > I require you to explain me why you committed the change
> > with no discussion and little investigation.
> >
> > I also noticed that your change for catalog/index.c
> >
> >     Revision 1.200 / (download) - annotate - [select for diffs] , Mon Sep 23
> > 00:42:48 2002 UTC (11 months, 4 weeks ago) by tgl
> 
> You know you're talking about a commit of 1 year ago?

Yes of cource. Unfortunately I haven't had enough time to
check all the changes since long. This time I was lucky to
find a change to spoil my effort. Is it strange for me to
doubt other places ? I looked again the current code about
REINDEX command and found the above change which is closely
related to this topic. 

regards,
Hiroshi Inouehttp://www.geocities.jp/inocchichichi/psqlodbc/


Re: Improving REINDEX for system indexes (long)

From
Hiroshi Inoue
Date:
Tom Lane wrote:
> 
> "Hiroshi Inoue" <inoue@tpf.co.jp> writes:
> > I require you to explain me why you committed the change
> > with no discussion and little investigation.
> 
> If you want an apology for not having discussed it in advance, I'll
> gladly offer one.  It was poorly done.

Thanks. 
> I do, however, think that the reindexing code needs work.  Can we
> get on with discussing the technical issues?

OK but I would need some time to remember the code.

regards,
Hiroshi Inouehttp://www.geocities.jp/inocchichichi/psqlodbc/


Re: Improving REINDEX for system indexes (long)

From
Hiroshi Inoue
Date:
I've just put back your previous change, sorry.
As I already mentioned many times it must be the first thing.

Though I don't remenber my code completely yet, I would
reply to some points. 
Unfortunately REINDEX wasn't a eagerly wanted command when 
I implemented it. Though I wanted to introduce a per index
flag, I unwillingly used an existent per table flag(relhasindex)
instead. Because it was impossible to make REINDEX transaction-safe
then, such flag was needed to suppress inconsistency as less
as possible.
I also unwillingly introduced the ReindexProcessing mode(flag)
because I didn't think of other quick solutions. 

IIRC the main reason why I gave up the REINDEX functionality
on nailed relations was the difficulty of reindexing pg_class
and the handling of relcache overflow. I didn't have much time
to test it. In addtion REINDEX wasn't a recognized command
then and I found no one to discuss the situation.

Tom Lane wrote:
> 
> I've been looking at the issues involved in reindexing system tables,
> and I now have what I think is a fairly defensible set of proposals.
> 
> We should whenever possible use the same reindexing technique used by
> CLUSTER: 

REINDEX was the first command which used the pg_class.relfilenode
functionality. The pg_class.relfilenode was essentially my proposal.

> 1. There is a problem for a shared index, because we have no way to
> update pg_class.relfilenode in all the other databases besides ours.
> I see no good way around this problem.

So the current REINDEX disallows on-line reindex on shared relations.
> 2. There is a problem for a nailed-in-cache index, because the relcache
> isn't prepared to cope with relfilenode updates for such indexes.
> However, that is fixable.

My code works pretty good with nailed relations except pg_class
#if defined (ENABLE_REINDEX_NAILED_RELATIONS).

> 3. There is a problem for an index on pg_class itself: doing heap_update
> on a pg_class row must make new index entries.  We have to be careful
> that the new row does appear in the updated index, while not making
> entries in old-and-possibly-broken indexes.  This is doable.

Yes.

Sorry I have no time to continue the discussion now.

regards,
Hiroshi Inouehttp://www.geocities.jp/inocchichichi/psqlodbc/


Re: Improving REINDEX for system indexes (long)

From
Gaetano Mendola
Date:
Hiroshi Inoue wrote:
> instead. Because it was impossible to make REINDEX transaction-safe
> then, such flag was needed to suppress inconsistency as less
> as possible.

This mean that the actual REINDEX is not transaction-safe ?


Regards
Gaetano Mendola



Re: Improving REINDEX for system indexes (long)

From
"Hiroshi Inoue"
Date:
> -----Original Message-----
> From: Gaetano Mendola [mailto:mendola@bigfoot.com] 
> 
> Hiroshi Inoue wrote:
> > instead. Because it was impossible to make REINDEX transaction-safe
> > then, such flag was needed to suppress inconsistency as less
> > as possible.
> 
> This mean that the actual REINDEX is not transaction-safe ?

No.
It was not transaction-safe long time ago.

regards,
Hiroshi Inoue



Re: Improving REINDEX for system indexes (long)

From
Gaetano Mendola
Date:
Hiroshi Inoue wrote:
Gaetano Mendola [mailto:mendola@bigfoot.com]  wrote:
>>
>>Hiroshi Inoue wrote:
>>
>>>instead. Because it was impossible to make REINDEX transaction-safe
>>>then, such flag was needed to suppress inconsistency as less
>>>as possible.
>>
>>This mean that the actual REINDEX is not transaction-safe ?
> 
> 
> No.
> It was not transaction-safe long time ago.

Anyway I think there is a nasty bug somewhere, see my last posts
about "duplication primary key".


Regards
Gaetano Mendola











Re: Improving REINDEX for system indexes (long)

From
Bruce Momjian
Date:
Tom, would you summarize what REINDEX currently _doesn't_ do?

---------------------------------------------------------------------------

Tom Lane wrote:
> I've been looking at the issues involved in reindexing system tables,
> and I now have what I think is a fairly defensible set of proposals.
> 
> We should whenever possible use the same reindexing technique used by
> CLUSTER: assign a new relfilenode number, build the new index in that
> file, and apply an ordinary heap_update operation to update the index's
> pg_class row with the new relfilenode value.  This technique is fully
> crash-safe and transaction-safe (meaning it can be rolled back even after
> completion, in case of failure later in the same transaction).  However,
> there are several pitfalls for applying it to system indexes:
> 
> 1. There is a problem for a shared index, because we have no way to
> update pg_class.relfilenode in all the other databases besides ours.
> I see no good way around this problem.
> 
> 2. There is a problem for a nailed-in-cache index, because the relcache
> isn't prepared to cope with relfilenode updates for such indexes.
> However, that is fixable.
> 
> 3. There is a problem for an index on pg_class itself: doing heap_update
> on a pg_class row must make new index entries.  We have to be careful
> that the new row does appear in the updated index, while not making
> entries in old-and-possibly-broken indexes.  This is doable.
> 
> 4. There is a problem with updating indexes that might be used for catalog
> fetches executed during the index_build process: if we try to use the
> partially-rebuilt index for such a fetch we will fail.  In the case of
> disaster recovery the problem doesn't exist because we'll have
> IsIgnoringSystemIndexes true, but for an ordinary on-line indexing
> operation there's definitely a risk.  Presently the code relies on
> "deactivating indexes" to prevent this, but I think it can be done more
> simply, because we only need to suppress access to the target index
> locally in the reindexing backend.  (Other backends will be locked out by
> means of an AccessExclusiveLock on the system catalog in question.)
> 
> In short then, we can fix things so that the new-relfilenode path can be
> taken in all cases except shared indexes.  For shared indexes, we have
> little choice but to truncate and rebuild the index in-place.  (There is
> then no need to update its pg_class row at all.)  This is of course
> horribly crash-unsafe.
> 
> The code presently uses a "deactivated indexes" flag (namely, setting
> pg_class.relhasindex false) to provide some protection against a crash
> midway through an in-place reindex.  However, this concept is really quite
> useless for a shared index, because only one database could contain the
> deactivation flag.  Accesses from any other database would still try to
> use the broken index.
> 
> Based on this analysis, I feel that the "deactivated indexes" mechanism
> is of no further value and should be retired.  We should instead simply
> acknowledge that reindexing shared indexes is dangerous.  I propose that
> it be allowed only from a standalone backend.  Specifically I think that
> REINDEX INDEX and REINDEX TABLE should error out if pointed to a shared
> index or table, unless running in a standalone backend; REINDEX DATABASE
> should skip over the shared indexes unless running in a standalone
> backend.  (I'm not convinced that either -O or -P needs to be insisted on
> by REINDEX, btw, but we can discuss that separately.)  We'll have to warn
> users not to try to restart the database when reindexing of a shared table
> wasn't successfully completed.
> 
> Details to back up the above claims:
> 
> Part of the code needed to fix the relcache restriction on nailed indexes
> is already there, but ifdef'd out; that's the part that re-reads the
> index's pg_class row after receiving SI inval for it.  There are some
> cases not yet handled though.  In the first place, if the nailed index
> being modified is pg_class_oid_index, ScanPgRelation will try to use that
> same index to load the updated row, and will fail because it'll be trying
> to use the old relfilenode.  We can easily force a seqscan in that case,
> however.  A more subtle problem is that if an SI buffer overflow occurs, 
> RelationCacheInvalidate walks through the relation cache in a random
> order.  I believe that the correct update order has to be first
> pg_class_oid_index, then the other nailed indexes, and finally all other
> relation cache entries.  pg_class_oid_index has to be updated first (with
> the aforementioned seqscan), since it will be used by ScanPgRelation to
> reread the pg_class rows for the other nailed indexes.  Only when we are
> sure the nailed indexes are up-to-date can we safely rebuild other
> relcache entries.
> 
> Assigning a new relfilenode to indexes of pg_class is not a big deal when
> we don't have an index-corruption problem.  We simply have to make the new
> heap_updated row before we start index_build (which indeed the code does
> already); both old and new rows will be indexed in the new index, and all
> is well.  However, if we suspect index corruption then it is important not
> to try to make entries into the old indexes, because we might crash trying
> to update a corrupt index.  I propose the following behavior:
> 
>     REINDEX INDEX: no special action; this is not the thing to use
>     when index corruption is suspected anyway.
> 
>     REINDEX TABLE: while operating on pg_class, arrange for the
>     relcache's list of known indexes of pg_class to contain only the
>     indexes already reindexed.  In this way, only those indexes will
>     be updated by CatalogUpdateIndexes(), and so no untrustworthy
>     indexes will be touched while making new pg_class rows.
> 
>     REINDEX DATABASE: take care to process pg_class first.
> 
> To avoid catalog accesses via an index that's being rebuilt, we can simply
> generalize the SetReindexProcessing() state to include the OID of the
> index currently being rebuilt.  Places that presently make checks for
> IsIgnoringSystemIndexes can check this as well and fall back to seqscans
> when the targeted index would have been used.
> 
> One other point: IsIgnoringSystemIndexes seems presently to be taken to
> mean not only that we don't *read* the system indexes, but that we don't
> *write* them either.  I think this is horribly dangerous; it effectively
> means that the only thing you can safely do in a backend started with -P
> is REINDEX.  If you make any other changes to the system catalogs then
> you've re-corrupted the indexes.  Also, you don't have any protection
> against duplicate entries getting made, since the unique indexes are
> disabled.  It would be a lot safer to define IsIgnoringSystemIndexes as
> preventing the system indexes from being used for lookups, while still
> updating the indexes on any catalog modification.  This will not affect
> the safety of REINDEX and will make other operations much less prone to
> pilot error.
> 
> Comments?
> 
>             regards, tom lane
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 9: the planner will ignore your desire to choose an index scan if your
>       joining column's datatypes do not match
> 

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073
 


Re: Improving REINDEX for system indexes (long)

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Tom, would you summarize what REINDEX currently _doesn't_ do?

As of CVS tip I think the only deficiency is that indexes on the shared
catalogs (pg_database, pg_shadow, pg_group) have to be reindexed in
place, rather than being rebuilt with a new relfilenode as is done for
CLUSTER or TRUNCATE.  In-place reindexing isn't crash-safe, since if
you fail you're left with a half-built (effectively corrupt) index.

I don't see any way to avoid that, though, since we cannot change the
relfilenode value for a shared index.

I was toying with the notion of changing btree index build to not write
the metapage until the index is fully built; in this way, at least the
corrupted state of the index would be obvious.  (You'd get "not a btree"
failures.)
        regards, tom lane


Re: Improving REINDEX for system indexes (long)

From
Bruce Momjian
Date:
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Tom, would you summarize what REINDEX currently _doesn't_ do?
> 
> As of CVS tip I think the only deficiency is that indexes on the shared
> catalogs (pg_database, pg_shadow, pg_group) have to be reindexed in
> place, rather than being rebuilt with a new relfilenode as is done for
> CLUSTER or TRUNCATE.  In-place reindexing isn't crash-safe, since if
> you fail you're left with a half-built (effectively corrupt) index.
> 
> I don't see any way to avoid that, though, since we cannot change the
> relfilenode value for a shared index.
> 
> I was toying with the notion of changing btree index build to not write
> the metapage until the index is fully built; in this way, at least the
> corrupted state of the index would be obvious.  (You'd get "not a btree"
> failures.)

Oh, that's great. I can't imagine a lot of traffic in those shared
tables anyway.  So you implemented all the ideas in the email --- great.

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073
 


Re: Improving REINDEX for system indexes (long)

From
Alvaro Herrera
Date:
On Sat, Sep 27, 2003 at 06:37:22PM -0400, Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Tom, would you summarize what REINDEX currently _doesn't_ do?
> 
> As of CVS tip I think the only deficiency is that indexes on the shared
> catalogs (pg_database, pg_shadow, pg_group) have to be reindexed in
> place, rather than being rebuilt with a new relfilenode as is done for
> CLUSTER or TRUNCATE.  In-place reindexing isn't crash-safe, since if
> you fail you're left with a half-built (effectively corrupt) index.
> 
> I don't see any way to avoid that, though, since we cannot change the
> relfilenode value for a shared index.

What about creating a separate filenode anyway and renaming the files
afterwards?  It would not be an atomic operation anyway, but it would be
better than the current setup IMHO.

(It seems unlikely that one would have > 1GB indexes for shared
relations...)

-- 
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
"La espina, desde que nace, ya pincha" (Proverbio africano)


Re: Improving REINDEX for system indexes (long)

From
Tom Lane
Date:
Alvaro Herrera <alvherre@dcc.uchile.cl> writes:
> On Sat, Sep 27, 2003 at 06:37:22PM -0400, Tom Lane wrote:
>> I don't see any way to avoid that, though, since we cannot change the
>> relfilenode value for a shared index.

> What about creating a separate filenode anyway and renaming the files
> afterwards?  It would not be an atomic operation anyway, but it would be
> better than the current setup IMHO.

I think it would be difficult to persuade the buffer manager and storage
manager to work with this; from their point of view you'd be moving a
relation underneath them.  I doubt it's really worth the trouble; how
often do you need to reindex a shared catalog?
        regards, tom lane


Re: Improving REINDEX for system indexes (long)

From
Bruce Momjian
Date:
Tom Lane wrote:
> > What about creating a separate filenode anyway and renaming the files
> > afterwards?  It would not be an atomic operation anyway, but it would be
> > better than the current setup IMHO.
> 
> I think it would be difficult to persuade the buffer manager and storage
> manager to work with this; from their point of view you'd be moving a
> relation underneath them.  I doubt it's really worth the trouble; how
> often do you need to reindex a shared catalog?

The point I missed originally is that he is talking about shared
catalogs, not system catalogs, which work fine in CVS.

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073
 


Re: Improving REINDEX for system indexes (long)

From
Alvaro Herrera
Date:
On Sat, Sep 27, 2003 at 08:18:07PM -0400, Bruce Momjian wrote:
> Tom Lane wrote:
> > > What about creating a separate filenode anyway and renaming the files
> > > afterwards?  It would not be an atomic operation anyway, but it would be
> > > better than the current setup IMHO.
> > 
> > I think it would be difficult to persuade the buffer manager and storage
> > manager to work with this; from their point of view you'd be moving a
> > relation underneath them.  I doubt it's really worth the trouble; how
> > often do you need to reindex a shared catalog?
> 
> The point I missed originally is that he is talking about shared
> catalogs, not system catalogs, which work fine in CVS.

Well, my idea was to reduce the window of time during which the index
would be corrupt, i.e. not completely rebuilt.  This is only an issue
with shared indexes, because other system indexes do use the changing
relfilenode thingie already.

However, the main reason for reindexing is a corrupt index, so if for
some reason the new index is also corrupt (e.g. the machine crashes
midway) there's no point in having a separate index filenode anyway.

-- 
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
"The eagle never lost so much time, as
when he submitted to learn of the crow." (William Blake)