Improving REINDEX for system indexes (long) - Mailing list pgsql-hackers
| From | Tom Lane |
| --- | --- |
| Subject | Improving REINDEX for system indexes (long) |
| Msg-id | 13303.1064170501@sss.pgh.pa.us |
| Responses | Re: Improving REINDEX for system indexes (long); Re: Improving REINDEX for system indexes (long) |
| List | pgsql-hackers |
I've been looking at the issues involved in reindexing system tables, and I now have what I think is a fairly defensible set of proposals.

We should whenever possible use the same reindexing technique used by CLUSTER: assign a new relfilenode number, build the new index in that file, and apply an ordinary heap_update operation to update the index's pg_class row with the new relfilenode value. This technique is fully crash-safe and transaction-safe (meaning it can be rolled back even after completion, in case of failure later in the same transaction). However, there are several pitfalls for applying it to system indexes:

1. There is a problem for a shared index, because we have no way to update pg_class.relfilenode in all the other databases besides ours. I see no good way around this problem.

2. There is a problem for a nailed-in-cache index, because the relcache isn't prepared to cope with relfilenode updates for such indexes. However, that is fixable.

3. There is a problem for an index on pg_class itself: doing heap_update on a pg_class row must make new index entries. We have to be careful that the new row does appear in the updated index, while not making entries in old-and-possibly-broken indexes. This is doable.

4. There is a problem with updating indexes that might be used for catalog fetches executed during the index_build process: if we try to use the partially-rebuilt index for such a fetch we will fail. In the case of disaster recovery the problem doesn't exist, because we'll have IsIgnoringSystemIndexes true; but for an ordinary on-line indexing operation there's definitely a risk. Presently the code relies on "deactivating indexes" to prevent this, but I think it can be done more simply, because we only need to suppress access to the target index locally in the reindexing backend. (Other backends will be locked out by means of an AccessExclusiveLock on the system catalog in question.)
In short then, we can fix things so that the new-relfilenode path can be taken in all cases except shared indexes. For shared indexes, we have little choice but to truncate and rebuild the index in-place. (There is then no need to update its pg_class row at all.) This is of course horribly crash-unsafe.

The code presently uses a "deactivated indexes" flag (namely, setting pg_class.relhasindex false) to provide some protection against a crash midway through an in-place reindex. However, this concept is really quite useless for a shared index, because only one database could contain the deactivation flag. Accesses from any other database would still try to use the broken index.

Based on this analysis, I feel that the "deactivated indexes" mechanism is of no further value and should be retired. We should instead simply acknowledge that reindexing shared indexes is dangerous. I propose that it be allowed only from a standalone backend. Specifically, I think that REINDEX INDEX and REINDEX TABLE should error out if pointed to a shared index or table, unless running in a standalone backend; REINDEX DATABASE should skip over the shared indexes unless running in a standalone backend. (I'm not convinced that either -O or -P needs to be insisted on by REINDEX, btw, but we can discuss that separately.) We'll have to warn users not to try to restart the database when reindexing of a shared table wasn't successfully completed.

Details to back up the above claims:

Part of the code needed to fix the relcache restriction on nailed indexes is already there, but ifdef'd out; that's the part that re-reads the index's pg_class row after receiving SI inval for it. There are some cases not yet handled, though. In the first place, if the nailed index being modified is pg_class_oid_index, ScanPgRelation will try to use that same index to load the updated row, and will fail because it'll be trying to use the old relfilenode. We can easily force a seqscan in that case, however.
A more subtle problem is that if an SI buffer overflow occurs, RelationCacheInvalidate walks through the relation cache in a random order. I believe that the correct update order has to be first pg_class_oid_index, then the other nailed indexes, and finally all other relation cache entries. pg_class_oid_index has to be updated first (with the aforementioned seqscan), since it will be used by ScanPgRelation to reread the pg_class rows for the other nailed indexes. Only when we are sure the nailed indexes are up-to-date can we safely rebuild other relcache entries.

Assigning a new relfilenode to indexes of pg_class is not a big deal when we don't have an index-corruption problem. We simply have to make the new heap_updated row before we start index_build (which indeed the code does already); both old and new rows will be indexed in the new index, and all is well. However, if we suspect index corruption then it is important not to try to make entries into the old indexes, because we might crash trying to update a corrupt index. I propose the following behavior:

REINDEX INDEX: no special action; this is not the thing to use when index corruption is suspected anyway.

REINDEX TABLE: while operating on pg_class, arrange for the relcache's list of known indexes of pg_class to contain only the indexes already reindexed. In this way, only those indexes will be updated by CatalogUpdateIndexes(), and so no untrustworthy indexes will be touched while making new pg_class rows.

REINDEX DATABASE: take care to process pg_class first.

To avoid catalog accesses via an index that's being rebuilt, we can simply generalize the SetReindexProcessing() state to include the OID of the index currently being rebuilt. Places that presently make checks for IsIgnoringSystemIndexes can check this as well and fall back to seqscans when the targeted index would have been used.
One other point: IsIgnoringSystemIndexes seems presently to be taken to mean not only that we don't *read* the system indexes, but that we don't *write* them either. I think this is horribly dangerous; it effectively means that the only thing you can safely do in a backend started with -P is REINDEX. If you make any other changes to the system catalogs then you've re-corrupted the indexes. Also, you don't have any protection against duplicate entries getting made, since the unique indexes are disabled.

It would be a lot safer to define IsIgnoringSystemIndexes as preventing the system indexes from being used for lookups, while still updating the indexes on any catalog modification. This will not affect the safety of REINDEX and will make other operations much less prone to pilot error.

Comments?

			regards, tom lane