Thread: CLUSTER patch
Hello:

I think I have some more knowledge about why I am failing to build a
working CLUSTER patch.  It does not help me fix it, however.

The way the cluster is done is still the same: first cluster the heap
and swap relfilenodes, then drop the old heap; then rebuild each index,
swap relfilenodes with the old index, and drop the new one.

If I don't use the new relfilenode for anything after building it,
everything works OK.  But I can't attach the old filenode to the new
heap; it has to stay around.  I can, however, attach the new filenode
to the old heap.  But as soon as I try to do anything with it (the new,
clustered filenode) before transaction commit (building an index, say),
the local buffer manager fails an assertion (actually,
RelationNodeCacheGetRelation returns 0 for the given rnode), and the
transaction aborts.

Thus, if I comment out the lines that attach the old rnode to the new
heap and the lines that create the new indexes, the cluster works, and
I now have two tables with the same rnode, one of them named
"temp_<oid-of-the-other>".  Obviously dropping either of those renders
the other useless (no disk storage).

What I still don't understand is why the RelationNodeCache doesn't
return the buffer I'm looking for.  I added some debugging fprintf's to
the bufmgr code and the relation _is_ entered (!?) during the cluster
transaction.  Maybe somewhere the local buffer manager drops its
knowledge of the relation, or that knowledge is based on (OID + rnode)
info, which changed because the rnode changed.  I don't understand the
relcache business anyway.

I attach the old version of the patch anyway, which reconstructs the
indexes without the filenode-swapping business.  The issues noted about
the earlier patch have been corrected.  I think that if nothing comes
along to fix the problems with the newer approach and there are no
other objections, this one should be applied, since it improves the
current version anyway.

I'm leaving for winter vacation on Monday and probably won't be back
for two weeks.

-- 
Alvaro Herrera (<alvherre[a]atentus.com>)
"La vida es para el que se aventura"
Attachment
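In outline, the control flow being described is the following.  This is
a condensed sketch assembled only from the helper routines that the
patches later in this thread define (get_indexattr_list, copy_heap,
rebuildheap, swap_relfilenodes, recreate_indexattr); cluster_sketch is
a hypothetical name, and locking and error handling are elided — the
patches themselves are the authoritative version:

    void
    cluster_sketch(Oid OIDOldHeap, Oid OIDOldIndex)
    {
        char    NewHeapName[NAMEDATALEN];
        Oid     OIDNewHeap;
        List   *indexes;
        ObjectAddress object;

        /* remember every index on the old heap so it can be rebuilt */
        indexes = get_indexattr_list(OIDOldHeap);

        /* make an empty copy of the heap under a temporary name */
        snprintf(NewHeapName, NAMEDATALEN, "temp_%u", OIDOldHeap);
        OIDNewHeap = copy_heap(OIDOldHeap, NewHeapName);

        /* fill it with the old heap's tuples, in index order */
        rebuildheap(OIDNewHeap, OIDOldHeap, OIDOldIndex);
        CommandCounterIncrement();

        /* old OID now points at the sorted storage, and vice versa */
        swap_relfilenodes(OIDNewHeap, OIDOldHeap);

        /* rebuild each index the same way (create, swap, drop) */
        recreate_indexattr(OIDOldHeap, indexes);

        /* drop the temp heap, which carries the old storage with it */
        object.classId = RelOid_pg_class;
        object.objectId = OIDNewHeap;
        object.objectSubId = 0;
        performDeletion(&object, DROP_RESTRICT);
    }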
I am working on the same thing myself.  Attached is a version of the
patch that applies to current CVS.

I worked with RelationClearRelation() to clear the cache for the
relations.  It now reports "Relation still open" during index_drop(),
so something still isn't right.  I think the whole problem stems from
the fact that the relation cache caches by relnode, so when we change
it, the cache gets all confused.  The RelationClearRelation() change
handles this in a very crude way, but it seems to be in the right
direction for a solution.

---------------------------------------------------------------------------

Alvaro Herrera wrote:
> Hello:
>
> I think I have some more knowledge about why I am failing to build a
> working CLUSTER patch.  It does not help me fix it, however.
>
> The way the cluster is done is still the same: first cluster the heap
> and swap relfilenodes, then drop the old heap; then rebuild each index,
> swap relfilenodes with the old index, and drop the new one.
>
> If I don't use the new relfilenode for anything after building it,
> everything works OK.  But I can't attach the old filenode to the new
> heap; it has to stay around.  I can, however, attach the new filenode
> to the old heap.  But as soon as I try to do anything with it (the new,
> clustered filenode) before transaction commit (building an index, say),
> the local buffer manager fails an assertion (actually,
> RelationNodeCacheGetRelation returns 0 for the given rnode), and the
> transaction aborts.
>
> Thus, if I comment out the lines that attach the old rnode to the new
> heap and the lines that create the new indexes, the cluster works, and
> I now have two tables with the same rnode, one of them named
> "temp_<oid-of-the-other>".  Obviously dropping either of those renders
> the other useless (no disk storage).
>
> What I still don't understand is why the RelationNodeCache doesn't
> return the buffer I'm looking for.  I added some debugging fprintf's to
> the bufmgr code and the relation _is_ entered (!?) during the cluster
> transaction.  Maybe somewhere the local buffer manager drops its
> knowledge of the relation, or that knowledge is based on (OID + rnode)
> info, which changed because the rnode changed.  I don't understand the
> relcache business anyway.
>
> I attach the old version of the patch anyway, which reconstructs the
> indexes without the filenode-swapping business.  The issues noted about
> the earlier patch have been corrected.  I think that if nothing comes
> along to fix the problems with the newer approach and there are no
> other objections, this one should be applied, since it improves the
> current version anyway.
>
> I'm leaving for winter vacation on Monday and probably won't be back
> for two weeks.
>
> -- 
> Alvaro Herrera (<alvherre[a]atentus.com>)
> "La vida es para el que se aventura"

Content-Description: [ Attachment, skipping... ]

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
Index: src/backend/commands/cluster.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/commands/cluster.c,v
retrieving revision 1.83
diff -c -r1.83 cluster.c
*** src/backend/commands/cluster.c    12 Jul 2002 18:43:15 -0000    1.83
--- src/backend/commands/cluster.c    14 Jul 2002 06:38:44 -0000
***************
*** 15,21 ****
   *
   *
   * IDENTIFICATION
!  *    $Header: /cvsroot/pgsql/src/backend/commands/cluster.c,v 1.83 2002/07/12 18:43:15 tgl Exp $
   *
   *-------------------------------------------------------------------------
   */
--- 15,21 ----
   *
   *
   * IDENTIFICATION
!  *    $Header: /cvsroot/pgsql/src/backend/commands/cluster.c,v 1.82 2002/06/20 20:29:26 momjian Exp $
   *
   *-------------------------------------------------------------------------
   */
***************
*** 24,74 ****
  #include "access/genam.h"
  #include "access/heapam.h"
- #include "catalog/dependency.h"
  #include "catalog/heap.h"
  #include "catalog/index.h"
  #include "catalog/pg_index.h"
  #include "catalog/pg_proc.h"
  #include "commands/cluster.h"
  #include "commands/tablecmds.h"
  #include "miscadmin.h"
  #include "utils/builtins.h"
  #include "utils/lsyscache.h"
  #include "utils/syscache.h"
  
  static Oid    copy_heap(Oid OIDOldHeap, const char *NewName);
- static Oid    copy_index(Oid OIDOldIndex, Oid OIDNewHeap,
-            const char *NewIndexName);
  static void rebuildheap(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex);
  
  /*
   * cluster
   *
   * STILL TO DO:
!  *     Create a list of all the other indexes on this relation. Because
!  *     the cluster will wreck all the tids, I'll need to destroy bogus
!  *     indexes. The user will have to re-create them. Not nice, but
!  *     I'm not a nice guy. The alternative is to try some kind of post
!  *     destroy re-build. This may be possible. I'll check out what the
!  *     index create functiond want in the way of paramaters. On the other
!  *     hand, re-creating n indexes may blow out the space.
   */
  void
  cluster(RangeVar *oldrelation, char *oldindexname)
  {
      Oid            OIDOldHeap,
                  OIDOldIndex,
!                 OIDNewHeap,
!                 OIDNewIndex;
      Relation    OldHeap,
                  OldIndex;
      char        NewHeapName[NAMEDATALEN];
!     char        NewIndexName[NAMEDATALEN];
      ObjectAddress object;
  
      /*
!      * We grab exclusive access to the target rel and index for the
       * duration of the transaction.
       */
      OldHeap = heap_openrv(oldrelation, AccessExclusiveLock);
--- 24,100 ----
  #include "access/genam.h"
  #include "access/heapam.h"
  #include "catalog/heap.h"
+ #include "catalog/dependency.h"
  #include "catalog/index.h"
+ #include "catalog/indexing.h"
+ #include "catalog/catname.h"
  #include "catalog/pg_index.h"
  #include "catalog/pg_proc.h"
  #include "commands/cluster.h"
  #include "commands/tablecmds.h"
  #include "miscadmin.h"
  #include "utils/builtins.h"
+ #include "utils/fmgroids.h"
  #include "utils/lsyscache.h"
+ #include "utils/relcache.h"
  #include "utils/syscache.h"
  
+ /*
+  * We need one of these structs for each index in the relation to be
+  * clustered.  It's basically the data needed by index_create() so
+  * we can recreate the indexes after destroying the old heap.
+  */
+ typedef struct
+ {
+     char       *indexName;
+     IndexInfo  *indexInfo;
+     Oid            accessMethodOID;
+     Oid           *classOID;
+     Oid            indexOID;
+     bool        isPrimary;
+ } IndexAttrs;
  
  static Oid    copy_heap(Oid OIDOldHeap, const char *NewName);
  static void rebuildheap(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex);
+ List *get_indexattr_list (Oid OIDOldHeap);
+ void recreate_indexattr(Oid OIDOldHeap, List *indexes);
+ void swap_relfilenodes(Oid r1, Oid r2);
+ 
+ void
+ RelationClearRelation(Relation relation, bool rebuildIt);
  
  /*
   * cluster
   *
   * STILL TO DO:
!  *     Keep foreign keys, permissions and inheritance of the clustered table.
!  *
!  * We need to look at making use of the ability to write a new version of a
!  * table (or index) under a new relfilenode value, without changing the
!  * table's OID.
   */
  void
  cluster(RangeVar *oldrelation, char *oldindexname)
  {
      Oid            OIDOldHeap,
                  OIDOldIndex,
!                 OIDNewHeap;
      Relation    OldHeap,
                  OldIndex;
      char        NewHeapName[NAMEDATALEN];
!     List       *indexes;
      ObjectAddress object;
  
+     /* The games with filenodes may not be rollbackable, so
+      * disallow running this inside a transaction block.
+      * This may be a false assumption though.
+      */
+     if (IsTransactionBlock())
+         elog (ERROR, "CLUSTER: may not be called inside a transaction block");
+ 
      /*
!      * We grab exclusive access to the target relation for the
       * duration of the transaction.
       */
      OldHeap = heap_openrv(oldrelation, AccessExclusiveLock);
***************
*** 96,143 ****
      heap_close(OldHeap, NoLock);
      index_close(OldIndex);
  
!     /*
!      * Create the new heap with a temporary name.
!      */
!     snprintf(NewHeapName, NAMEDATALEN, "temp_%u", OIDOldHeap);
  
      OIDNewHeap = copy_heap(OIDOldHeap, NewHeapName);
  
      /* We do not need CommandCounterIncrement() because copy_heap did it. */
  
!     /*
!      * Copy the heap data into the new table in the desired order.
!      */
      rebuildheap(OIDNewHeap, OIDOldHeap, OIDOldIndex);
  
      /* To make the new heap's data visible. */
      CommandCounterIncrement();
  
-     /* Create new index over the tuples of the new heap. */
-     snprintf(NewIndexName, NAMEDATALEN, "temp_%u", OIDOldIndex);
!     OIDNewIndex = copy_index(OIDOldIndex, OIDNewHeap, NewIndexName);
  
!     CommandCounterIncrement();
  
!     /* Destroy old heap (along with its index) and rename new. */
      object.classId = RelOid_pg_class;
!     object.objectId = OIDOldHeap;
      object.objectSubId = 0;
  
      /* XXX better to use DROP_CASCADE here? */
      performDeletion(&object, DROP_RESTRICT);
  
-     /* performDeletion does CommandCounterIncrement at end */
- 
-     renamerel(OIDNewHeap, oldrelation->relname);
- 
-     /* This one might be unnecessary, but let's be safe. */
-     CommandCounterIncrement();
- 
-     renamerel(OIDNewIndex, oldindexname);
  }
  
  static Oid
  copy_heap(Oid OIDOldHeap, const char *NewName)
  {
--- 122,165 ----
      heap_close(OldHeap, NoLock);
      index_close(OldIndex);
  
!     /* Save the information of all indexes on the relation. */
!     indexes = get_indexattr_list(OIDOldHeap);
  
+     /* Create the new heap with a temporary name. */
+     snprintf(NewHeapName, NAMEDATALEN, "temp_%u", OIDOldHeap);
  
      OIDNewHeap = copy_heap(OIDOldHeap, NewHeapName);
  
      /* We do not need CommandCounterIncrement() because copy_heap did it. */
  
!     /* Copy the heap data into the new table in the desired order. */
      rebuildheap(OIDNewHeap, OIDOldHeap, OIDOldIndex);
  
      /* To make the new heap's data visible. */
      CommandCounterIncrement();
  
!     /* Swap the relfilenodes of the old and new heaps. */
!     swap_relfilenodes(OIDNewHeap, OIDOldHeap);
!     /* At this point, Old points to the new file, and new points to old */
! 
!     /* Recreate the indexes on the relation.  We do not need
!      * CommandCounterIncrement() because recreate_indexattr does it.
!      */
!     recreate_indexattr(OIDOldHeap, indexes);
  
!     /* Destroy the new heap, carrying the old filenode along. */
      object.classId = RelOid_pg_class;
!     object.objectId = OIDNewHeap;
      object.objectSubId = 0;
  
      /* XXX better to use DROP_CASCADE here? */
      performDeletion(&object, DROP_RESTRICT);
  
      /* performDeletion does CommandCounterIncrement at end */
  }
  
+ /* Create and initialize the new heap
+  */
  static Oid
  copy_heap(Oid OIDOldHeap, const char *NewName)
  {
***************
*** 181,223 ****
      return OIDNewHeap;
  }
  
! static Oid
! copy_index(Oid OIDOldIndex, Oid OIDNewHeap, const char *NewIndexName)
! {
!     Oid            OIDNewIndex;
!     Relation    OldIndex,
!                 NewHeap;
!     IndexInfo  *indexInfo;
! 
!     NewHeap = heap_open(OIDNewHeap, AccessExclusiveLock);
!     OldIndex = index_open(OIDOldIndex);
! 
!     /*
!      * Create a new index like the old one.  To do this I get the info
!      * from pg_index, and add a new index with a temporary name (that will
!      * be changed later).
!      */
!     indexInfo = BuildIndexInfo(OldIndex->rd_index);
! 
!     OIDNewIndex = index_create(OIDNewHeap,
!                                NewIndexName,
!                                indexInfo,
!                                OldIndex->rd_rel->relam,
!                                OldIndex->rd_index->indclass,
!                                OldIndex->rd_index->indisprimary,
!                                false,    /* XXX losing constraint status */
!                                allowSystemTableMods);
! 
!     setRelhasindex(OIDNewHeap, true,
!                    OldIndex->rd_index->indisprimary, InvalidOid);
! 
!     index_close(OldIndex);
!     heap_close(NewHeap, NoLock);
! 
!     return OIDNewIndex;
! }
! 
! static void
  rebuildheap(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex)
  {
--- 203,210 ----
      return OIDNewHeap;
  }
  
! /* Load the data into the new heap, clustered.
!  */
  static void
  rebuildheap(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex)
  {
***************
*** 260,263 ****
--- 247,415 ----
      index_close(LocalOldIndex);
      heap_close(LocalOldHeap, NoLock);
      heap_close(LocalNewHeap, NoLock);
+ }
+ 
+ /* Get the necessary info about the indexes in the relation and
+  * return a List of IndexAttrs.
+  */
+ List *
+ get_indexattr_list (Oid OIDOldHeap)
+ {
+     ScanKeyData entry;
+     HeapScanDesc scan;
+     Relation    indexRelation;
+     HeapTuple    indexTuple;
+     List       *indexes = NIL;
+     IndexAttrs *attrs;
+     HeapTuple    tuple;
+     Form_pg_index index;
+ 
+     /* Grab the index tuples by looking into RelationRelationName
+      * by the OID of the old heap.
+      */
+     indexRelation = heap_openr(IndexRelationName, AccessShareLock);
+     ScanKeyEntryInitialize(&entry, 0, Anum_pg_index_indrelid,
+                            F_OIDEQ, ObjectIdGetDatum(OIDOldHeap));
+     scan = heap_beginscan(indexRelation, SnapshotNow, 1, &entry);
+     while ((indexTuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+     {
+         index = (Form_pg_index) GETSTRUCT(indexTuple);
+ 
+         attrs = (IndexAttrs *) palloc(sizeof(IndexAttrs));
+         attrs->indexInfo = BuildIndexInfo(index);
+         attrs->isPrimary = index->indisprimary;
+         attrs->indexOID = index->indexrelid;
+ 
+         /* The opclasses are copied verbatim from the original indexes.
+          */
+         attrs->classOID = (Oid *)palloc(sizeof(Oid) *
+                             attrs->indexInfo->ii_NumIndexAttrs);
+         memcpy(attrs->classOID, index->indclass,
+                sizeof(Oid) * attrs->indexInfo->ii_NumIndexAttrs);
+ 
+         /* Name and access method of each index come from
+          * RelationRelationName.
+          */
+         tuple = SearchSysCache(RELOID,
+                                ObjectIdGetDatum(attrs->indexOID),
+                                0, 0, 0);
+         if (!HeapTupleIsValid(tuple))
+             elog(ERROR, "CLUSTER: cannot find index %u", attrs->indexOID);
+         attrs->indexName = pstrdup(NameStr(((Form_pg_class) GETSTRUCT(tuple))->relname));
+         attrs->accessMethodOID = ((Form_pg_class) GETSTRUCT(tuple))->relam;
+         ReleaseSysCache(tuple);
+ 
+         /* Cons the gathered data into the list.  We do not care about
+          * ordering, and this is more efficient than append.
+          */
+         indexes=lcons((void *)attrs, indexes);
+     }
+     heap_endscan(scan);
+     heap_close(indexRelation, NoLock);
+     return indexes;
+ }
+ 
+ /* Create new indexes and swap the filenodes with old indexes.  Then drop
+  * the new index (carrying the old heap along).
+  */
+ void
+ recreate_indexattr(Oid OIDOldHeap, List *indexes)
+ {
+     IndexAttrs *attrs;
+     List       *elem;
+     Oid            newIndexOID;
+     char        newIndexName[NAMEDATALEN];
+ 
+     foreach (elem, indexes)
+     {
+         attrs=(IndexAttrs *) lfirst(elem);
+ 
+         /* Create the new index under a temporary name */
+         snprintf(newIndexName, NAMEDATALEN, "temp_%u", attrs->indexOID);
+         newIndexOID = index_create(OIDOldHeap, newIndexName,
+                                    attrs->indexInfo, attrs->accessMethodOID,
+                                    attrs->classOID, attrs->isPrimary,
+                                    false, /* XXX losing constraint status */
+                                    allowSystemTableMods);
+         CommandCounterIncrement();
+ 
+         /* Swap the filenodes. */
+         swap_relfilenodes(newIndexOID, attrs->indexOID);
+         setRelhasindex(OIDOldHeap, true, attrs->isPrimary, InvalidOid);
+ 
+         /* I'm not sure this one is needed, but let's be safe. */
+         CommandCounterIncrement();
+ 
+         /* Drop the new index, carrying the old filenode along. */
+         index_drop(newIndexOID);
+         CommandCounterIncrement();
+ 
+         pfree(attrs->classOID);
+         pfree(attrs);
+     }
+     freeList(indexes);
+ }
+ 
+ /* Swap the relfilenodes for two given relations.
+  */
+ void
+ swap_relfilenodes(Oid r1, Oid r2)
+ {
+     /* I can probably keep RelationRelationName open in the main
+      * function and pass the Relation around so I don't have to open
+      * it avery time.
+      */
+     Relation    relRelation,
+                 irels[Num_pg_class_indices];
+     HeapTuple    reltup[2];
+     Oid            tempRFNode;
+ 
+     RelationClearRelation(RelationIdGetRelation(r1), false);
+ //    RelationClearRelation(RelationIdGetRelation(r2), false);
+ 
+     CommandCounterIncrement();
+     /* We need both RelationRelationName tuples.
+      */
+     relRelation = heap_openr(RelationRelationName, RowExclusiveLock);
+ 
+     reltup[0] = SearchSysCacheCopy(RELOID,
+                                    ObjectIdGetDatum(r1),
+                                    0, 0, 0);
+     if (!HeapTupleIsValid(reltup[0]))
+         elog(ERROR, "CLUSTER: Cannot find tuple for relation %u", r1);
+     reltup[1] = SearchSysCacheCopy(RELOID,
+                                    ObjectIdGetDatum(r2),
+                                    0, 0, 0);
+     if (!HeapTupleIsValid(reltup[1]))
+         elog(ERROR, "CLUSTER: Cannot find tuple for relation %u", r2);
+ 
+     /* Swap the filenodes */
+     tempRFNode = ((Form_pg_class) GETSTRUCT(reltup[0]))->relfilenode;
+     ((Form_pg_class) GETSTRUCT(reltup[0]))->relfilenode =
+         ((Form_pg_class) GETSTRUCT(reltup[1]))->relfilenode;
+     ((Form_pg_class) GETSTRUCT(reltup[1]))->relfilenode = tempRFNode;
+ 
+     /* Update the RelationRelationName tuples */
+     simple_heap_update(relRelation, &reltup[0]->t_self, reltup[0]);
+     simple_heap_update(relRelation, &reltup[1]->t_self, reltup[1]);
+ 
+     /* Keep system catalogs current */
+     CatalogOpenIndices(Num_pg_class_indices, Name_pg_class_indices, irels);
+     CatalogIndexInsert(irels, Num_pg_class_indices, relRelation, reltup[0]);
+     CatalogIndexInsert(irels, Num_pg_class_indices, relRelation, reltup[1]);
+     CatalogCloseIndices(Num_pg_class_indices, irels);
+ 
+     heap_close(relRelation, RowExclusiveLock);
+     heap_freetuple(reltup[0]);
+     heap_freetuple(reltup[1]);
+ 
+     CommandCounterIncrement();
+ 
+ //    bjm
+ //    RelationIdInvalidateRelationCacheByRelationId(r1);
+ //    RelationIdInvalidateRelationCacheByRelationId(r2);
+ 
+     RelationClearRelation(RelationIdGetRelation(r1), true);
+ //    RelationClearRelation(RelationIdGetRelation(r2), true);
+ 
+     CommandCounterIncrement();
  }
Index: src/backend/utils/cache/relcache.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/utils/cache/relcache.c,v
retrieving revision 1.166
diff -c -r1.166 relcache.c
*** src/backend/utils/cache/relcache.c    12 Jul 2002 18:43:18 -0000    1.166
--- src/backend/utils/cache/relcache.c    14 Jul 2002 06:38:55 -0000
***************
*** 238,249 ****
          (void *)&(RELATION->rd_id), \
          HASH_REMOVE, NULL); \
      if (idhentry == NULL) \
!         elog(WARNING, "trying to delete a reldesc that does not exist."); \
      nodentry = (RelNodeCacheEnt*)hash_search(RelationNodeCache, \
          (void *)&(RELATION->rd_node), \
          HASH_REMOVE, NULL); \
      if (nodentry == NULL) \
!         elog(WARNING, "trying to delete a reldesc that does not exist."); \
      if (IsSystemNamespace(RelationGetNamespace(RELATION))) \
      { \
          char *relname = RelationGetRelationName(RELATION); \
--- 238,249 ----
          (void *)&(RELATION->rd_id), \
          HASH_REMOVE, NULL); \
      if (idhentry == NULL) \
!         elog(WARNING, "trying to delete a reldesc (rd_id) that does not exist."); \
      nodentry = (RelNodeCacheEnt*)hash_search(RelationNodeCache, \
          (void *)&(RELATION->rd_node), \
          HASH_REMOVE, NULL); \
      if (nodentry == NULL) \
!         elog(WARNING, "trying to delete a reldesc (rd_node) that does not exist."); \
      if (IsSystemNamespace(RelationGetNamespace(RELATION))) \
      { \
          char *relname = RelationGetRelationName(RELATION); \
***************
*** 252,258 ****
              relname, \
              HASH_REMOVE, NULL); \
          if (namehentry == NULL) \
!             elog(WARNING, "trying to delete a reldesc that does not exist."); \
      } \
  } while(0)
--- 252,258 ----
              relname, \
              HASH_REMOVE, NULL); \
          if (namehentry == NULL) \
!             elog(WARNING, "trying to delete a reldesc (relname) that does not exist."); \
      } \
  } while(0)
***************
*** 276,282 ****
  
  /* non-export function prototypes */
! static void RelationClearRelation(Relation relation, bool rebuildIt);
  
  #ifdef ENABLE_REINDEX_NAILED_RELATIONS
  static void RelationReloadClassinfo(Relation relation);
--- 276,282 ----
  
  /* non-export function prototypes */
! //static void RelationClearRelation(Relation relation, bool rebuildIt);
  
  #ifdef ENABLE_REINDEX_NAILED_RELATIONS
  static void RelationReloadClassinfo(Relation relation);
***************
*** 683,689 ****
       */
      rewrite_desc = heap_openr(RewriteRelationName, AccessShareLock);
      rewrite_tupdesc = RelationGetDescr(rewrite_desc);
!     rewrite_scan = systable_beginscan(rewrite_desc,
                                        RewriteRelRulenameIndex,
                                        true, SnapshotNow,
                                        1, &key);
--- 683,689 ----
       */
      rewrite_desc = heap_openr(RewriteRelationName, AccessShareLock);
      rewrite_tupdesc = RelationGetDescr(rewrite_desc);
!     rewrite_scan = systable_beginscan(rewrite_desc,
                                        RewriteRelRulenameIndex,
                                        true, SnapshotNow,
                                        1, &key);
***************
*** 1651,1657 ****
   * (one with refcount > 0).  However, this routine just does whichever
   * it's told to do; callers must determine which they want.
   */
! static void
  RelationClearRelation(Relation relation, bool rebuildIt)
  {
      MemoryContext oldcxt;
--- 1651,1658 ----
   * (one with refcount > 0).  However, this routine just does whichever
   * it's told to do; callers must determine which they want.
   */
! //bjm
! void
  RelationClearRelation(Relation relation, bool rebuildIt)
  {
      MemoryContext oldcxt;
Alvaro Herrera <alvherre@atentus.com> writes:
> The way the cluster is done is still the same: first cluster the heap
> and swap relfilenodes, then drop the old heap; then rebuild each index,
> swap relfilenodes with the old index, and drop the new one.

I do not see anything in this patch that touches relfilenode.  Perhaps
the patch is incomplete?

> But as soon as I try to do anything with it (the new, clustered
> filenode) before transaction commit (building an index, say), the
> local buffer manager fails an assertion (actually,
> RelationNodeCacheGetRelation returns 0 for the given rnode), and the
> transaction aborts.

Hmm.  If you do the swap in the correct way (viz, update the relation's
pg_class entry and then CommandCounterIncrement) I'd expect the relcache
to respond correctly.  This does involve re-indexing the relcache entry
under a new relfilenode value, but that's not significantly different
from the case of renaming a relation.

			regards, tom lane
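In concrete terms, the ordering Tom prescribes looks roughly like the
sketch below.  It uses only calls that appear elsewhere in this thread;
relRelation and reltup[] are assumed to be set up exactly as in the
swap_relfilenodes() code quoted later, so this is an illustration of
the ordering, not a complete function:

    /* Swap the relfilenode fields of the two copied pg_class tuples. */
    tempRFNode = ((Form_pg_class) GETSTRUCT(reltup[0]))->relfilenode;
    ((Form_pg_class) GETSTRUCT(reltup[0]))->relfilenode =
        ((Form_pg_class) GETSTRUCT(reltup[1]))->relfilenode;
    ((Form_pg_class) GETSTRUCT(reltup[1]))->relfilenode = tempRFNode;

    /* Write both updated tuples back to pg_class.  (The catalog's own
     * indexes must also be updated, via CatalogIndexInsert; see the
     * patches for that part.) */
    simple_heap_update(relRelation, &reltup[0]->t_self, reltup[0]);
    simple_heap_update(relRelation, &reltup[1]->t_self, reltup[1]);

    /* Only now make the change visible; per the advice above, the
     * relcache is expected to rebuild both entries at this point. */
    CommandCounterIncrement();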
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> +    CommandCounterIncrement();
> +
> +    // bjm
> +    // RelationIdInvalidateRelationCacheByRelationId(r1);
> +    // RelationIdInvalidateRelationCacheByRelationId(r2);
> +
> +    RelationClearRelation(RelationIdGetRelation(r1), true);
> +    // RelationClearRelation(RelationIdGetRelation(r2), true);
> +
> +    CommandCounterIncrement();

Surely the above is neither necessary nor appropriate.  The relcache
should automatically rebuild its entries for these relations at
CommandCounterIncrement.  In any case I do not care for exporting
internal relcache routines to make CLUSTER work ...

			regards, tom lane
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > +    CommandCounterIncrement();
> > +
> > +    // bjm
> > +    // RelationIdInvalidateRelationCacheByRelationId(r1);
> > +    // RelationIdInvalidateRelationCacheByRelationId(r2);
> > +
> > +    RelationClearRelation(RelationIdGetRelation(r1), true);
> > +    // RelationClearRelation(RelationIdGetRelation(r2), true);
> > +
> > +    CommandCounterIncrement();
>
> Surely the above is neither necessary nor appropriate.  The relcache
> should automatically rebuild its entries for these relations at

New patch attached.  Something like this is required, or
heap_drop/index_drop will fail because it can't find the relation cache
entries for the relation.  Maybe the trick is to properly invalidate
the relation caches when pg_class is updated.  This is particularly
significant for thisxactonly relations.  I don't think we need to
handle OID changes in pg_class, but relfilenode changes are not
updated, causing problems down the road.

The attached patch actually does work, with the following warnings:

	test=> cluster i on test;
	WARNING:  trying to delete a reldesc (rd_node) that does not exist.
	WARNING:  trying to delete a reldesc (rd_node) that does not exist.
	CLUSTER

My guess is that the proper fix is elsewhere.  I looked through
relcache.c and didn't see any case where a relfilenode could be
invalidated _without_ removing the old entry first, but of course the
old entry has a different value from the new one.  My guess is that the
work is to be done there somewhere.  My cache-forget calls work because
they force new cache entries to match pg_class, and that way the other
commands can succeed.

> CommandCounterIncrement.  In any case I do not care for exporting
> internal relcache routines to make CLUSTER work ...

I need to get cluster working before I can worry about style.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

Index: src/backend/commands/cluster.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/commands/cluster.c,v
retrieving revision 1.83
diff -c -r1.83 cluster.c
*** src/backend/commands/cluster.c    12 Jul 2002 18:43:15 -0000    1.83
--- src/backend/commands/cluster.c    14 Jul 2002 17:03:37 -0000
***************
*** 15,21 ****
   *
   *
   * IDENTIFICATION
!  *    $Header: /cvsroot/pgsql/src/backend/commands/cluster.c,v 1.83 2002/07/12 18:43:15 tgl Exp $
   *
   *-------------------------------------------------------------------------
   */
--- 15,21 ----
   *
   *
   * IDENTIFICATION
!  *    $Header: /cvsroot/pgsql/src/backend/commands/cluster.c,v 1.82 2002/06/20 20:29:26 momjian Exp $
   *
   *-------------------------------------------------------------------------
   */
***************
*** 24,74 ****
  #include "access/genam.h"
  #include "access/heapam.h"
- #include "catalog/dependency.h"
  #include "catalog/heap.h"
  #include "catalog/index.h"
  #include "catalog/pg_index.h"
  #include "catalog/pg_proc.h"
  #include "commands/cluster.h"
  #include "commands/tablecmds.h"
  #include "miscadmin.h"
  #include "utils/builtins.h"
  #include "utils/lsyscache.h"
  #include "utils/syscache.h"
  
  static Oid    copy_heap(Oid OIDOldHeap, const char *NewName);
- static Oid    copy_index(Oid OIDOldIndex, Oid OIDNewHeap,
-            const char *NewIndexName);
  static void rebuildheap(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex);
  
  /*
   * cluster
   *
   * STILL TO DO:
!  *     Create a list of all the other indexes on this relation. Because
!  *     the cluster will wreck all the tids, I'll need to destroy bogus
!  *     indexes. The user will have to re-create them. Not nice, but
!  *     I'm not a nice guy. The alternative is to try some kind of post
!  *     destroy re-build. This may be possible. I'll check out what the
!  *     index create functiond want in the way of paramaters. On the other
!  *     hand, re-creating n indexes may blow out the space.
   */
  void
  cluster(RangeVar *oldrelation, char *oldindexname)
  {
      Oid            OIDOldHeap,
                  OIDOldIndex,
!                 OIDNewHeap,
!                 OIDNewIndex;
      Relation    OldHeap,
                  OldIndex;
      char        NewHeapName[NAMEDATALEN];
!     char        NewIndexName[NAMEDATALEN];
      ObjectAddress object;
  
      /*
!      * We grab exclusive access to the target rel and index for the
       * duration of the transaction.
       */
      OldHeap = heap_openrv(oldrelation, AccessExclusiveLock);
--- 24,99 ----
  #include "access/genam.h"
  #include "access/heapam.h"
  #include "catalog/heap.h"
+ #include "catalog/dependency.h"
  #include "catalog/index.h"
+ #include "catalog/indexing.h"
+ #include "catalog/catname.h"
  #include "catalog/pg_index.h"
  #include "catalog/pg_proc.h"
  #include "commands/cluster.h"
  #include "commands/tablecmds.h"
  #include "miscadmin.h"
  #include "utils/builtins.h"
+ #include "utils/fmgroids.h"
  #include "utils/lsyscache.h"
+ #include "utils/relcache.h"
  #include "utils/syscache.h"
  
+ /*
+  * We need one of these structs for each index in the relation to be
+  * clustered.  It's basically the data needed by index_create() so
+  * we can recreate the indexes after destroying the old heap.
+  */
+ typedef struct
+ {
+     char       *indexName;
+     IndexInfo  *indexInfo;
+     Oid            accessMethodOID;
+     Oid           *classOID;
+     Oid            indexOID;
+     bool        isPrimary;
+ } IndexAttrs;
  
  static Oid    copy_heap(Oid OIDOldHeap, const char *NewName);
  static void rebuildheap(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex);
+ List *get_indexattr_list (Oid OIDOldHeap);
+ void recreate_indexattr(Oid OIDOldHeap, List *indexes);
+ void swap_relfilenodes(Oid r1, Oid r2);
+ 
+ void RelationFlushRelation(Relation relation);
  
  /*
   * cluster
   *
   * STILL TO DO:
!  *     Keep foreign keys, permissions and inheritance of the clustered table.
!  *
!  * We need to look at making use of the ability to write a new version of a
!  * table (or index) under a new relfilenode value, without changing the
!  * table's OID.
   */
  void
  cluster(RangeVar *oldrelation, char *oldindexname)
  {
      Oid            OIDOldHeap,
                  OIDOldIndex,
!                 OIDNewHeap;
      Relation    OldHeap,
                  OldIndex;
      char        NewHeapName[NAMEDATALEN];
!     List       *indexes;
      ObjectAddress object;
  
+     /* The games with filenodes may not be rollbackable, so
+      * disallow running this inside a transaction block.
+      * This may be a false assumption though.
+      */
+     if (IsTransactionBlock())
+         elog (ERROR, "CLUSTER: may not be called inside a transaction block");
+ 
      /*
!      * We grab exclusive access to the target relation for the
       * duration of the transaction.
       */
      OldHeap = heap_openrv(oldrelation, AccessExclusiveLock);
***************
*** 96,143 ****
      heap_close(OldHeap, NoLock);
      index_close(OldIndex);
  
!     /*
!      * Create the new heap with a temporary name.
!      */
!     snprintf(NewHeapName, NAMEDATALEN, "temp_%u", OIDOldHeap);
  
      OIDNewHeap = copy_heap(OIDOldHeap, NewHeapName);
  
      /* We do not need CommandCounterIncrement() because copy_heap did it. */
  
!     /*
!      * Copy the heap data into the new table in the desired order.
!      */
      rebuildheap(OIDNewHeap, OIDOldHeap, OIDOldIndex);
  
      /* To make the new heap's data visible. */
      CommandCounterIncrement();
  
-     /* Create new index over the tuples of the new heap. */
-     snprintf(NewIndexName, NAMEDATALEN, "temp_%u", OIDOldIndex);
!     OIDNewIndex = copy_index(OIDOldIndex, OIDNewHeap, NewIndexName);
  
!     CommandCounterIncrement();
  
!     /* Destroy old heap (along with its index) and rename new. */
      object.classId = RelOid_pg_class;
!     object.objectId = OIDOldHeap;
      object.objectSubId = 0;
  
      /* XXX better to use DROP_CASCADE here? */
      performDeletion(&object, DROP_RESTRICT);
  
-     /* performDeletion does CommandCounterIncrement at end */
- 
-     renamerel(OIDNewHeap, oldrelation->relname);
- 
-     /* This one might be unnecessary, but let's be safe. */
-     CommandCounterIncrement();
- 
-     renamerel(OIDNewIndex, oldindexname);
  }
  
  static Oid
  copy_heap(Oid OIDOldHeap, const char *NewName)
  {
--- 121,164 ----
      heap_close(OldHeap, NoLock);
      index_close(OldIndex);
  
!     /* Save the information of all indexes on the relation. */
!     indexes = get_indexattr_list(OIDOldHeap);
  
+     /* Create the new heap with a temporary name. */
+     snprintf(NewHeapName, NAMEDATALEN, "temp_%u", OIDOldHeap);
  
      OIDNewHeap = copy_heap(OIDOldHeap, NewHeapName);
  
      /* We do not need CommandCounterIncrement() because copy_heap did it. */
  
!     /* Copy the heap data into the new table in the desired order. */
      rebuildheap(OIDNewHeap, OIDOldHeap, OIDOldIndex);
  
      /* To make the new heap's data visible. */
      CommandCounterIncrement();
  
!     /* Swap the relfilenodes of the old and new heaps. */
!     swap_relfilenodes(OIDNewHeap, OIDOldHeap);
!     /* At this point, Old points to the new file, and new points to old */
! 
!     /* Recreate the indexes on the relation.  We do not need
!      * CommandCounterIncrement() because recreate_indexattr does it.
!      */
!     recreate_indexattr(OIDOldHeap, indexes);
  
!     /* Destroy the new heap, carrying the old filenode along. */
      object.classId = RelOid_pg_class;
!     object.objectId = OIDNewHeap;
      object.objectSubId = 0;
  
      /* XXX better to use DROP_CASCADE here? */
      performDeletion(&object, DROP_RESTRICT);
  
      /* performDeletion does CommandCounterIncrement at end */
  }
  
+ /* Create and initialize the new heap
+  */
  static Oid
  copy_heap(Oid OIDOldHeap, const char *NewName)
  {
***************
*** 181,223 ****
      return OIDNewHeap;
  }
  
! static Oid
! copy_index(Oid OIDOldIndex, Oid OIDNewHeap, const char *NewIndexName)
! {
!     Oid            OIDNewIndex;
!     Relation    OldIndex,
!                 NewHeap;
!     IndexInfo  *indexInfo;
! 
!     NewHeap = heap_open(OIDNewHeap, AccessExclusiveLock);
!     OldIndex = index_open(OIDOldIndex);
! 
!     /*
!      * Create a new index like the old one.  To do this I get the info
!      * from pg_index, and add a new index with a temporary name (that will
!      * be changed later).
!      */
!     indexInfo = BuildIndexInfo(OldIndex->rd_index);
! 
!     OIDNewIndex = index_create(OIDNewHeap,
!                                NewIndexName,
!                                indexInfo,
!                                OldIndex->rd_rel->relam,
!                                OldIndex->rd_index->indclass,
!                                OldIndex->rd_index->indisprimary,
!                                false,    /* XXX losing constraint status */
!                                allowSystemTableMods);
! 
!     setRelhasindex(OIDNewHeap, true,
!                    OldIndex->rd_index->indisprimary, InvalidOid);
! 
!     index_close(OldIndex);
!     heap_close(NewHeap, NoLock);
! 
!     return OIDNewIndex;
! }
! 
! static void
  rebuildheap(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex)
  {
--- 202,209 ----
      return OIDNewHeap;
  }
  
! /* Load the data into the new heap, clustered.
!  */
  static void
  rebuildheap(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex)
  {
***************
*** 260,263 ****
--- 246,418 ----
      index_close(LocalOldIndex);
      heap_close(LocalOldHeap, NoLock);
      heap_close(LocalNewHeap, NoLock);
+ }
+ 
+ /* Get the necessary info about the indexes in the relation and
+  * return a List of IndexAttrs.
+  */
+ List *
+ get_indexattr_list (Oid OIDOldHeap)
+ {
+     ScanKeyData entry;
+     HeapScanDesc scan;
+     Relation    indexRelation;
+     HeapTuple    indexTuple;
+     List       *indexes = NIL;
+     IndexAttrs *attrs;
+     HeapTuple    tuple;
+     Form_pg_index index;
+ 
+     /* Grab the index tuples by looking into RelationRelationName
+      * by the OID of the old heap.
+      */
+     indexRelation = heap_openr(IndexRelationName, AccessShareLock);
+     ScanKeyEntryInitialize(&entry, 0, Anum_pg_index_indrelid,
+                            F_OIDEQ, ObjectIdGetDatum(OIDOldHeap));
+     scan = heap_beginscan(indexRelation, SnapshotNow, 1, &entry);
+     while ((indexTuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+     {
+         index = (Form_pg_index) GETSTRUCT(indexTuple);
+ 
+         attrs = (IndexAttrs *) palloc(sizeof(IndexAttrs));
+         attrs->indexInfo = BuildIndexInfo(index);
+         attrs->isPrimary = index->indisprimary;
+         attrs->indexOID = index->indexrelid;
+ 
+         /* The opclasses are copied verbatim from the original indexes.
+          */
+         attrs->classOID = (Oid *)palloc(sizeof(Oid) *
+                             attrs->indexInfo->ii_NumIndexAttrs);
+         memcpy(attrs->classOID, index->indclass,
+                sizeof(Oid) * attrs->indexInfo->ii_NumIndexAttrs);
+ 
+         /* Name and access method of each index come from
+          * RelationRelationName.
+          */
+         tuple = SearchSysCache(RELOID,
+                                ObjectIdGetDatum(attrs->indexOID),
+                                0, 0, 0);
+         if (!HeapTupleIsValid(tuple))
+             elog(ERROR, "CLUSTER: cannot find index %u", attrs->indexOID);
+         attrs->indexName = pstrdup(NameStr(((Form_pg_class) GETSTRUCT(tuple))->relname));
+         attrs->accessMethodOID = ((Form_pg_class) GETSTRUCT(tuple))->relam;
+         ReleaseSysCache(tuple);
+ 
+         /* Cons the gathered data into the list.  We do not care about
+          * ordering, and this is more efficient than append.
+          */
+         indexes=lcons((void *)attrs, indexes);
+     }
+     heap_endscan(scan);
+     heap_close(indexRelation, NoLock);
+     return indexes;
+ }
+ 
+ /* Create new indexes and swap the filenodes with old indexes.  Then drop
+  * the new index (carrying the old heap along).
+  */
+ void
+ recreate_indexattr(Oid OIDOldHeap, List *indexes)
+ {
+     IndexAttrs *attrs;
+     List       *elem;
+     Oid            newIndexOID;
+     char        newIndexName[NAMEDATALEN];
+ 
+     foreach (elem, indexes)
+     {
+         attrs=(IndexAttrs *) lfirst(elem);
+ 
+         /* Create the new index under a temporary name */
+         snprintf(newIndexName, NAMEDATALEN, "temp_%u", attrs->indexOID);
+         newIndexOID = index_create(OIDOldHeap, newIndexName,
+                                    attrs->indexInfo, attrs->accessMethodOID,
+                                    attrs->classOID, attrs->isPrimary,
+                                    false, /* XXX losing constraint status */
+                                    allowSystemTableMods);
+         CommandCounterIncrement();
+ 
+         /* Swap the filenodes. */
+         swap_relfilenodes(newIndexOID, attrs->indexOID);
+         setRelhasindex(OIDOldHeap, true, attrs->isPrimary, InvalidOid);
+ 
+         /* I'm not sure this one is needed, but let's be safe. */
+         CommandCounterIncrement();
+ 
+         /* Drop the new index, carrying the old filenode along. */
+         index_drop(newIndexOID);
+         CommandCounterIncrement();
+ 
+         pfree(attrs->classOID);
+         pfree(attrs);
+     }
+     freeList(indexes);
+ }
+ 
+ /* Swap the relfilenodes for two given relations.
+  */
+ void
+ swap_relfilenodes(Oid r1, Oid r2)
+ {
+     /* I can probably keep RelationRelationName open in the main
+      * function and pass the Relation around so I don't have to open
+      * it avery time.
+      */
+     Relation    relRelation,
+                 rel1, rel2,
+                 irels[Num_pg_class_indices];
+     HeapTuple    reltup[2];
+     Oid            tempRFNode;
+ 
+     rel1 = RelationIdGetRelation(r1);
+     rel2 = RelationIdGetRelation(r2);
+     RelationFlushRelation(rel1);
+     RelationFlushRelation(rel2);
+     RelationDecrementReferenceCount(rel1);
+     RelationDecrementReferenceCount(rel2);
+ 
+     CommandCounterIncrement();
+     /* We need both RelationRelationName tuples. */
+     relRelation = heap_openr(RelationRelationName, RowExclusiveLock);
+ 
+     reltup[0] = SearchSysCacheCopy(RELOID,
+                                    ObjectIdGetDatum(r1),
+                                    0, 0, 0);
+     if (!HeapTupleIsValid(reltup[0]))
+         elog(ERROR, "CLUSTER: Cannot find tuple for relation %u", r1);
+     reltup[1] = SearchSysCacheCopy(RELOID,
+                                    ObjectIdGetDatum(r2),
+                                    0, 0, 0);
+     if (!HeapTupleIsValid(reltup[1]))
+         elog(ERROR, "CLUSTER: Cannot find tuple for relation %u", r2);
+ 
+     /* Swap the filenodes */
+     tempRFNode = ((Form_pg_class) GETSTRUCT(reltup[0]))->relfilenode;
+     ((Form_pg_class) GETSTRUCT(reltup[0]))->relfilenode =
+         ((Form_pg_class) GETSTRUCT(reltup[1]))->relfilenode;
+     ((Form_pg_class) GETSTRUCT(reltup[1]))->relfilenode = tempRFNode;
+ 
+     /* Update the RelationRelationName tuples */
+     simple_heap_update(relRelation, &reltup[0]->t_self, reltup[0]);
+     simple_heap_update(relRelation, &reltup[1]->t_self, reltup[1]);
+ 
+     /* Keep system catalogs current */
+     CatalogOpenIndices(Num_pg_class_indices, Name_pg_class_indices, irels);
+     CatalogIndexInsert(irels, Num_pg_class_indices, relRelation, reltup[0]);
+     CatalogIndexInsert(irels, Num_pg_class_indices, relRelation, reltup[1]);
+     CatalogCloseIndices(Num_pg_class_indices, irels);
+ 
+     heap_close(relRelation, RowExclusiveLock);
+     heap_freetuple(reltup[0]);
+     heap_freetuple(reltup[1]);
+ 
+     CommandCounterIncrement();
+ 
+     rel1 = RelationIdGetRelation(r1);
+     rel2 = RelationIdGetRelation(r2);
+     RelationFlushRelation(rel1);
+     RelationFlushRelation(rel2);
+     RelationDecrementReferenceCount(rel1);
+     RelationDecrementReferenceCount(rel2);
+     CommandCounterIncrement();
  }
Index: src/backend/utils/cache/relcache.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/utils/cache/relcache.c,v
retrieving revision 1.166
diff -c -r1.166 relcache.c
*** src/backend/utils/cache/relcache.c    12 Jul 2002 18:43:18 -0000    1.166
--- src/backend/utils/cache/relcache.c    14 Jul 2002 17:03:46 -0000
***************
*** 238,249 ****
          (void *)&(RELATION->rd_id), \
          HASH_REMOVE, NULL); \
      if (idhentry == NULL) \
!         elog(WARNING, "trying to delete a reldesc that does not exist."); \
      nodentry = (RelNodeCacheEnt*)hash_search(RelationNodeCache, \
          (void *)&(RELATION->rd_node), \
          HASH_REMOVE, NULL); \
      if (nodentry == NULL) \
!         elog(WARNING, "trying to delete a reldesc that does not exist."); \
      if (IsSystemNamespace(RelationGetNamespace(RELATION))) \
      { \
          char *relname = RelationGetRelationName(RELATION); \
--- 238,249 ----
          (void *)&(RELATION->rd_id), \
          HASH_REMOVE, NULL); \
      if (idhentry == NULL) \
!         elog(WARNING, "trying to delete a reldesc (rd_id) that does not exist."); \
      nodentry = (RelNodeCacheEnt*)hash_search(RelationNodeCache, \
          (void *)&(RELATION->rd_node), \
          HASH_REMOVE, NULL); \
      if (nodentry == NULL) \
!         elog(WARNING, "trying to delete a reldesc (rd_node) that does not exist."); \
      if (IsSystemNamespace(RelationGetNamespace(RELATION))) \
      { \
          char *relname = RelationGetRelationName(RELATION); \
***************
*** 252,258 ****
              relname, \
              HASH_REMOVE, NULL); \
          if (namehentry == NULL) \
!             elog(WARNING, "trying to delete a reldesc that does not exist."); \
      } \
  } while(0)
--- 252,258 ----
              relname, \
              HASH_REMOVE, NULL); \
          if (namehentry == NULL) \
!             elog(WARNING, "trying to delete a reldesc (relname) that does not exist."); \
      } \
  } while(0)
***************
*** 281,287 ****
  #ifdef ENABLE_REINDEX_NAILED_RELATIONS
  static void RelationReloadClassinfo(Relation relation);
  #endif   /* ENABLE_REINDEX_NAILED_RELATIONS */
! static void RelationFlushRelation(Relation relation);
  static Relation RelationSysNameCacheGetRelation(const char *relationName);
  static bool load_relcache_init_file(void);
  static void write_relcache_init_file(void);
--- 281,287 ----
  #ifdef ENABLE_REINDEX_NAILED_RELATIONS
  static void RelationReloadClassinfo(Relation relation);
  #endif   /* ENABLE_REINDEX_NAILED_RELATIONS */
! //static void RelationFlushRelation(Relation relation);
  static Relation RelationSysNameCacheGetRelation(const char *relationName);
  static bool load_relcache_init_file(void);
  static void write_relcache_init_file(void);
***************
*** 683,689 ****
       */
      rewrite_desc = heap_openr(RewriteRelationName, AccessShareLock);
      rewrite_tupdesc = RelationGetDescr(rewrite_desc);
!     rewrite_scan = systable_beginscan(rewrite_desc,
                                        RewriteRelRulenameIndex,
                                        true, SnapshotNow,
                                        1, &key);
--- 683,689 ----
       */
      rewrite_desc = heap_openr(RewriteRelationName, AccessShareLock);
      rewrite_tupdesc = RelationGetDescr(rewrite_desc);
!     rewrite_scan = systable_beginscan(rewrite_desc,
                                        RewriteRelRulenameIndex,
                                        true, SnapshotNow,
                                        1, &key);
***************
*** 1651,1657 ****
   * (one with refcount > 0).  However, this routine just does whichever
   * it's told to do; callers must determine which they want.
   */
! static void
  RelationClearRelation(Relation relation, bool rebuildIt)
  {
      MemoryContext oldcxt;
--- 1651,1658 ----
   * (one with refcount > 0).  However, this routine just does whichever
   * it's told to do; callers must determine which they want.
   */
! //bjm
! void
  RelationClearRelation(Relation relation, bool rebuildIt)
  {
      MemoryContext oldcxt;
***************
*** 1803,1809 ****
   *
   *     Rebuild the relation if it is open (refcount > 0), else blow it away.
   */
! static void
  RelationFlushRelation(Relation relation)
  {
      bool        rebuildIt;
--- 1804,1810 ----
   *
   *     Rebuild the relation if it is open (refcount > 0), else blow it away.
   */
! void
  RelationFlushRelation(Relation relation)
  {
      bool        rebuildIt;
Tom Lane wrote:

> Alvaro Herrera <alvherre@atentus.com> writes:
> > The way the cluster is done is still the same: first cluster the heap
> > and swap relfilenodes, then drop the old heap; then rebuild each index,
> > swap relfilenodes with the old index, and drop the new one.
>
> I do not see anything in this patch that touches relfilenode.  Perhaps
> the patch is incomplete?

No, this is the old version, corrected according to your comments, for
inclusion in case the new version doesn't make it.  Perhaps you missed
the patch I posted some days ago that did swap relfilenodes (there was
even a function named swap_relfilenode, so it wasn't easy to miss).
It's archived in
http://archives.postgresql.org/pgsql-patches/2002-07/msg00079.php.

The code in question is this:

	tempRFNode = ((Form_pg_class) GETSTRUCT(reltup[0]))->relfilenode;
	((Form_pg_class) GETSTRUCT(reltup[0]))->relfilenode =
		((Form_pg_class) GETSTRUCT(reltup[1]))->relfilenode;
	((Form_pg_class) GETSTRUCT(reltup[1]))->relfilenode = tempRFNode;

	/* Update the RelationRelationName tuples */
	simple_heap_update(relRelation, &reltup[1]->t_self, reltup[1]);
	simple_heap_update(relRelation, &reltup[0]->t_self, reltup[0]);

Now, if I do CommandCounterIncrement() right after this, I get "Relation
deleted while still in use".  So I add

	CatalogOpenIndices(Num_pg_class_indices, Name_pg_class_indices, irels);
	CatalogIndexInsert(irels, Num_pg_class_indices, relRelation, reltup[1]);
	CatalogIndexInsert(irels, Num_pg_class_indices, relRelation, reltup[0]);
	CatalogCloseIndices(Num_pg_class_indices, irels);

and only then the CommandCounterIncrement.  But the server says

	TRAP: Failed Assertion("!(bufrel != ((void *)0)):", File: "localbuf.c", Line: 242)

(I think it's line 241 in pristine source.)  I tried doing
CatalogIndexInsert and CommandCounterIncrement after each
simple_heap_update, but the result is the same.  This assertion fails
at transaction commit, when LocalBufferSync is called.

> > But as soon as I try to do anything with it (the new, clustered
> > filenode) before transaction commit (building an index, say), the
> > local buffer manager fails an assertion (actually,
> > RelationNodeCacheGetRelation returns 0 for the given rnode), and the
> > transaction aborts.
>
> Hmm.  If you do the swap in the correct way (viz, update the relation's
> pg_class entry and then CommandCounterIncrement) I'd expect the relcache
> to respond correctly.  This does involve re-indexing the relcache entry
> under a new relfilenode value, but that's not significantly different
> from the case of renaming a relation.

That's what I think I'm doing, but I don't know what the relcache is
doing; something other than what I expect, I presume.  I tried using
RelationForgetRelation on both relations (following Bruce's idea of
invalidating the relcache entries), but I got lost in the relcache, so
I'm now in the dark.

-- 
Alvaro Herrera (<alvherre[a]atentus.com>)
"The ability to monopolize a planet is insignificant
next to the power of the source"
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> New patch attached.  Something like this is required, or
> heap_drop/index_drop will fail because it can't find the relation cache
> entries for the relation.  Maybe the trick is to properly invalidate
> the relation caches when pg_class is updated.

They should be updated *automatically* --- otherwise CLUSTER is hardly
the only thing that will fail.

> This is particularly significant for thisxactonly relations.

Yes.  After thinking awhile I realize that the real problem is that we
are trying to swap between an existing relation (!rd_myxactonly) and
a new relation (rd_myxactonly).  Buffers for one live in the main
buffer pool, for the other in the local buffer pool.  There's also the
little matter of the local state inside relcache.c.  While the update
to pg_class should make the right things happen to relfilenode, it
doesn't do anything to cause a change in rd_myxactonly status.

Not sure what to do about this.  Will think more.

			regards, tom lane
Alvaro Herrera wrote:
> Tom Lane wrote:
> > Alvaro Herrera <alvherre@atentus.com> writes:
> > > The way the cluster is done is still the same: first cluster the heap
> > > and swap relfilenodes, then drop the old heap; then rebuild each index,
> > > swap relfilenodes with the old index, and drop the new one.
> >
> > I do not see anything in this patch that touches relfilenode.  Perhaps
> > the patch is incomplete?
>
> No, this is the old version, corrected according to your comments, for
> inclusion in case the new version doesn't make it.

I sent out a new version of your patch that got farther, and it does
have those swap lines.

> Now, if I do CommandCounterIncrement() right after this, I get
> "Relation deleted while still in use".  So I add
>
>	CatalogOpenIndices(Num_pg_class_indices, Name_pg_class_indices, irels);
>	CatalogIndexInsert(irels, Num_pg_class_indices, relRelation, reltup[1]);
>	CatalogIndexInsert(irels, Num_pg_class_indices, relRelation, reltup[0]);
>	CatalogCloseIndices(Num_pg_class_indices, irels);

Yes, that is required.

> and only then the CommandCounterIncrement.  But the server says
>
>	TRAP: Failed Assertion("!(bufrel != ((void *)0)):", File: "localbuf.c", Line: 242)

Yes, the problem here is that dirty buffers can't look up the proper
relation to figure out how to complete the transaction.  Reloading the
cache after the swap seems to fix it, but the invalidation itself
throws WARNINGs because the relfilenodes of the old and new relation
structures don't match, and when it tries to delete the first one, it
throws the warning.

> (I think it's line 241 in pristine source.)  I tried doing
> CatalogIndexInsert and CommandCounterIncrement after each
> simple_heap_update, but the result is the same.  This assertion fails
> at transaction commit, when LocalBufferSync is called.

Yep, I think everything now points to changes needed in relcache.c to
handle the new condition of a relation swapping its relnode.  I looked
at the REINDEX code and it doesn't seem to have this problem.  Not sure
why.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
Tom Lane wrote:
> > This is particularly significant for thisxactonly relations.
>
> Yes.  After thinking awhile I realize that the real problem is that we
> are trying to swap between an existing relation (!rd_myxactonly) and
> a new relation (rd_myxactonly).  Buffers for one live in the main
> buffer pool, for the other in the local buffer pool.  There's also the
> little matter of the local state inside relcache.c.  While the update
> to pg_class should make the right things happen to relfilenode, it
> doesn't do anything to cause a change in rd_myxactonly status.

Yes, I had one intermediate patch where I was calling
RelationClearRelation directly and not using the reload flag computed
by RelationForgetRelation.  Then I found RelationFlushRelation and that
seemed to work better.

This case is actually more complicated, because after the swap pg_class
still says thisxactonly but the relfilenode points to non-thisxactonly
buffers, and vice versa.  One of the key things is that thisxactonly
relations can't be forgotten, because there are local buffers
associated with the relation, or something like that.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Yep, I think everything now points to changes needed in relcache.c to
> handle the new condition of a relation swapping its relnode.

I still don't believe that any changes are needed in relcache.c.

In particular, after thinking more I see that we do not need to change
the myxactonly status of either relation: the temp one is still temp,
the original one is still not.

What we do need to do is to flush all buffers for the two rels from the
local and global buffer caches respectively, since after the swap they
should get buffered in different places.  FlushRelationBuffers looks
like it will work for that.

			regards, tom lane
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> This case is actually more complicated, because after the swap pg_class
> still says thisxactonly but the relfilenode points to non-thisxactonly
> buffers, and vice versa.

pg_class doesn't say anything (if it did, we could make relcache.c
track it).

But yes, the problem is that myxactonly has to line up with where the
buffers are stored.  As I mentioned, I think it would work to flush out
all buffers for both rels before doing the swap.

			regards, tom lane
Tom Lane wrote:

> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > This case is actually more complicated, because after the swap pg_class
> > still says thisxactonly but the relfilenode points to non-thisxactonly
> > buffers, and vice versa.
>
> pg_class doesn't say anything (if it did, we could make relcache.c
> track it).
>
> But yes, the problem is that myxactonly has to line up with where the
> buffers are stored.  As I mentioned, I think it would work to flush out
> all buffers for both rels before doing the swap.

It does indeed work.  Actually, I had tried to do that before, but the
fact that FlushRelationBuffers requires a firstDelBlock made me turn
around.

I attach a new patch against current CVS HEAD.  It works as far as I've
tested it, but I may be missing some cases.  I'll post a documentation
patch later if this gets applied.

-- 
Alvaro Herrera (<alvherre[a]atentus.com>)
"Some men are heterosexual, and some are bisexual, and some
men don't think about sex at all... they become lawyers" (Woody Allen)
Attachment
Alvaro Herrera <alvherre@atentus.com> writes:
> It does indeed work.  Actually, I had tried to do that before, but the
> fact that FlushRelationBuffers requires a firstDelBlock made me turn
> around.

firstDelBlock should be 0, not length-of-relation; as given, this code
fails to get rid of the buffer entries!

			regards, tom lane
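Putting the two pieces of advice together, the flush step would look
roughly like the sketch below.  This is an illustration against the
7.3-era buffer-manager API (where FlushRelationBuffers takes a Relation
and a BlockNumber and returns a negative value on failure), with rel1
and rel2 assumed to be open Relation pointers for the two relations
being swapped; Alvaro's patch is the authoritative version:

    /*
     * Write out and invalidate every buffer belonging to both relations
     * before swapping the relfilenodes.  firstDelBlock = 0 means
     * "invalidate from block zero on", i.e. all of the relation's
     * buffers, so nothing stays cached under the old node in either
     * the local or the shared buffer pool.
     */
    if (FlushRelationBuffers(rel1, (BlockNumber) 0) < 0)
        elog(ERROR, "CLUSTER: cannot flush buffers for relation %s",
             RelationGetRelationName(rel1));
    if (FlushRelationBuffers(rel2, (BlockNumber) 0) < 0)
        elog(ERROR, "CLUSTER: cannot flush buffers for relation %s",
             RelationGetRelationName(rel2));

    /* Now the pg_class update and CommandCounterIncrement can proceed. */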
Tom Lane wrote:

> Alvaro Herrera <alvherre@atentus.com> writes:
> > It does indeed work.  Actually, I had tried to do that before, but the
> > fact that FlushRelationBuffers requires a firstDelBlock made me turn
> > around.
>
> firstDelBlock should be 0, not length-of-relation; as given, this code
> fails to get rid of the buffer entries!

Oh, then I failed to completely understand the purpose of firstDelBlock.

Anyway, there's still a big problem with this patch: the pg_depend
information gets messed up after CLUSTER, so a clustered table cannot
be dropped.  I don't know why this is.

As I said yesterday, I'm going on vacation tomorrow, and probably will
not fix this issue.  I can look into it when I'm back, in two weeks.

I attach a little doc patch.

-- 
Alvaro Herrera (<alvherre[a]atentus.com>)
"La gente vulgar solo piensa en pasar el tiempo;
el que tiene talento, en aprovecharlo"
Attachment
Alvaro Herrera wrote:
> Tom Lane wrote:
> > Alvaro Herrera <alvherre@atentus.com> writes:
> > > It does indeed work.  Actually, I had tried to do that before, but the
> > > fact that FlushRelationBuffers requires a firstDelBlock made me turn
> > > around.
> >
> > firstDelBlock should be 0, not length-of-relation; as given, this code
> > fails to get rid of the buffer entries!
>
> Oh, then I failed to completely understand the purpose of firstDelBlock.
>
> Anyway, there's still a big problem with this patch: the pg_depend
> information gets messed up after CLUSTER, so a clustered table cannot
> be dropped.  I don't know why this is.

I actually backed out Tom's recent change to cluster.c (attached),
applied your patch, then manually applied Tom's patch to cluster.c, so
I had a working version of your patch with the new dependencies.  It
seemed to work fine.  I will do the same when I apply your version.

> As I said yesterday, I'm going on vacation tomorrow, and probably will
> not fix this issue.  I can look into it when I'm back, in two weeks.

I will take care of applying your patch and the doc part too.  It will
all be done when you return.  Thanks for your work, and have a nice
vacation.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

Index: cluster.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/commands/cluster.c,v
retrieving revision 1.82
retrieving revision 1.83
diff -c -r1.82 -r1.83
*** cluster.c    20 Jun 2002 20:29:26 -0000    1.82
--- cluster.c    12 Jul 2002 18:43:15 -0000    1.83
***************
*** 15,21 ****
   *
   *
   * IDENTIFICATION
!  *    $Header: /cvsroot/pgsql/src/backend/commands/cluster.c,v 1.82 2002/06/20 20:29:26 momjian Exp $
   *
   *-------------------------------------------------------------------------
   */
--- 15,21 ----
   *
   *
   * IDENTIFICATION
!  *    $Header: /cvsroot/pgsql/src/backend/commands/cluster.c,v 1.83 2002/07/12 18:43:15 tgl Exp $
   *
   *-------------------------------------------------------------------------
   */
***************
*** 24,29 ****
--- 24,30 ----
  
  #include "access/genam.h"
  #include "access/heapam.h"
+ #include "catalog/dependency.h"
  #include "catalog/heap.h"
  #include "catalog/index.h"
  #include "catalog/pg_index.h"
***************
*** 64,69 ****
--- 65,71 ----
                  OldIndex;
      char        NewHeapName[NAMEDATALEN];
      char        NewIndexName[NAMEDATALEN];
+     ObjectAddress object;
  
      /*
       * We grab exclusive access to the target rel and index for the
***************
*** 119,127 ****
      CommandCounterIncrement();
  
      /* Destroy old heap (along with its index) and rename new. */
!     heap_drop_with_catalog(OIDOldHeap, allowSystemTableMods);
! 
      CommandCounterIncrement();
  
      renamerel(OIDNewHeap, oldrelation->relname);
  
--- 121,134 ----
      CommandCounterIncrement();
  
      /* Destroy old heap (along with its index) and rename new. */
!     object.classId = RelOid_pg_class;
!     object.objectId = OIDOldHeap;
!     object.objectSubId = 0;
! 
!     /* XXX better to use DROP_CASCADE here? */
!     performDeletion(&object, DROP_RESTRICT);
! 
!     /* performDeletion does CommandCounterIncrement at end */
  
      renamerel(OIDNewHeap, oldrelation->relname);
  
***************
*** 198,203 ****
--- 205,211 ----
                                 OldIndex->rd_rel->relam,
                                 OldIndex->rd_index->indclass,
                                 OldIndex->rd_index->indisprimary,
+                                false,    /* XXX losing constraint status */
                                 allowSystemTableMods);
  
      setRelhasindex(OIDNewHeap, true,
Bruce Momjian <pgman@candle.pha.pa.us> writes:
>> Anyway, there's still a big problem with this patch: the pg_depend
>> information gets messed up after CLUSTER, so a clustered table cannot be
>> dropped.  I don't know why this is.

> I actually backed out Tom's recent change to cluster.c (attached),
> applied your patch, then manually applied Tom's patch to cluster.c so I
> had a working version of your patch with the new dependencies.  It
> seemed to work fine.

The changes I made to cluster.c may or may not be correct in the
context of a redone CLUSTER implementation; it'll need to be looked at.

			regards, tom lane
Tom Lane wrote:
> What we do need to do is to flush all buffers for the two rels from the
> local and global buffer caches respectively, since after the swap they
> should get buffered in different places.  FlushRelationBuffers looks
> like it will work for that.

Sure, if you want to take the easy solution.  ;-)  Good idea: just flush
the buffers so we don't have to track the change.

Thanks for the assistance, Tom.  Getting a CLUSTER that doesn't have
embarrassing shortcomings is a major improvement for 7.3.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
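A sketch of the flush-then-swap ordering being settled on here; OldHeap and NewHeap are placeholder names for whatever the patch actually calls the two relations, and the final step is only indicated in a comment:

    /*
     * Sketch, not committed code: write out and discard all cached
     * pages of both relations before exchanging their relfilenodes.
     * The clustered copy is transaction-local, so its pages sit in the
     * local buffer pool, while the original heap's pages sit in shared
     * buffers; after the swap each relation's pages would be looked up
     * under the other node, so neither pool may keep stale entries.
     */
    FlushRelationBuffers(OldHeap, (BlockNumber) 0);    /* shared buffers */
    FlushRelationBuffers(NewHeap, (BlockNumber) 0);    /* local buffers  */

    /* ... then exchange the two pg_class relfilenode values ... */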
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> >> Anyway, there's still a big problem with this patch: the pg_depend
> >> information gets messed up after CLUSTER, so a clustered table cannot be
> >> dropped.  I don't know why this is.
>
> > I actually backed out Tom's recent change to cluster.c (attached),
> > applied your patch, then manually applied Tom's patch to cluster.c so I
> > had a working version of your patch with the new dependencies.  It
> > seemed to work fine.
>
> The changes I made to cluster.c may or may not be correct in the
> context of a redone CLUSTER implementation; it'll need to be looked at.

Tom, you are probably right, because the table being dropped doesn't
have anything associated with it, except that it has the same tuple
descriptor as the base table.  I wonder whether that is going to create
things like defaults that need to be cascade-deleted:

	OIDNewHeap = heap_create_with_catalog(NewName,
					 RelationGetNamespace(OldHeap),
					 tupdesc,
					 OldHeap->rd_rel->relkind,
					 OldHeap->rd_rel->relisshared,
					 OldHeap->rd_rel->relhasoids,
					 allowSystemTableMods);

If you can tell me which drop type is correct, I can apply the cluster
patch.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
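To make the open question concrete, here are the two candidate ways of destroying the transient heap, both taken from the cluster.c diff above; which behavior is wanted is exactly what is being asked:

    ObjectAddress object;

    /*
     * Dependency-aware drop, as in Tom's diff: under DROP_RESTRICT this
     * errors out if anything still depends on the relation (DROP_CASCADE
     * would delete the dependents as well).
     */
    object.classId = RelOid_pg_class;
    object.objectId = OIDOldHeap;
    object.objectSubId = 0;
    performDeletion(&object, DROP_RESTRICT);

    /*
     * Versus the direct drop that the diff replaced, which bypasses the
     * pg_depend machinery entirely:
     */
    heap_drop_with_catalog(OIDOldHeap, allowSystemTableMods);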
Your patch has been added to the PostgreSQL unapplied patches list at:

	http://candle.pha.pa.us/cgi-bin/pgpatches

I will try to apply it within the next 48 hours.

---------------------------------------------------------------------------

Alvaro Herrera wrote:
> Tom Lane wrote:
> > Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > > This case is actually more complicated because after the swap, pg_class
> > > still says thisxactonly but the relfilenode points to non-thisxactonly
> > > buffers, and vice versa.
> >
> > pg_class doesn't say anything (if it did, we could make relcache.c
> > track it).
> >
> > But yes, the problem is that myxactonly has to line up with where the
> > buffers are stored.  As I mentioned, I think it would work to flush out
> > all buffers for both rels before doing the swap.
>
> It does indeed work.  Actually, I tried to do that before, but the fact
> that RelationFlushBuffers requires a firstDelBlock made me turn around.
>
> I attach a new patch against current CVS HEAD.  It works as far as I've
> tested it, but I may be missing some cases.  I'll post a documentation
> patch later if this gets applied.
>
> --
> Alvaro Herrera (<alvherre[a]atentus.com>)
> "Some men are heterosexual, and some are bisexual, and some
> men don't think about sex at all... they become lawyers"  (Woody Allen)

Content-Description: [ Attachment, skipping... ]

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
Your patch has been added to the PostgreSQL unapplied patches list at:

	http://candle.pha.pa.us/cgi-bin/pgpatches

I will try to apply it within the next 48 hours.

---------------------------------------------------------------------------

Alvaro Herrera wrote:
> Tom Lane wrote:
> > Alvaro Herrera <alvherre@atentus.com> writes:
> > > It does indeed work.  Actually, I tried to do that before, but the fact
> > > that RelationFlushBuffers requires a firstDelBlock made me turn around.
> >
> > firstDelBlock should be 0, not length-of-relation; as given, this code
> > fails to get rid of the buffer entries!
>
> Oh, then I had completely misunderstood the purpose of firstDelBlock.
>
> Anyway, there's still a big problem with this patch: the pg_depend
> information gets messed up after CLUSTER, so a clustered table cannot be
> dropped.  I don't know why this is.
>
> As I said yesterday, I'm going on vacation tomorrow and probably will
> not fix this issue.  I can look into it when I'm back, in two weeks.
>
> I attach a little doc patch.
>
> --
> Alvaro Herrera (<alvherre[a]atentus.com>)
> "Vulgar people think only of passing the time;
> those with talent, of making use of it"

Content-Description: [ Attachment, skipping... ]

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> I will try to apply it within the next 48 hours.

Given that this patch is specifically stated not to work:

>> Anyway, there's still a big problem with this patch: the pg_depend
>> information gets messed up after CLUSTER, so a clustered table cannot be
>> dropped.  I don't know why this is.

I object to it being applied now.

			regards, tom lane
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > I will try to apply it within the next 48 hours.
>
> Given that this patch is specifically stated not to work:
>
> >> Anyway, there's still a big problem with this patch: the pg_depend
> >> information gets messed up after CLUSTER, so a clustered table cannot be
> >> dropped.  I don't know why this is.

Wow, that is strange, since the table has the same OID as before.

> I object to it being applied now.

Oh, I didn't see that part of his message; I will remove it from the
queue.  I was still waiting to hear whether the proper way to drop the
temp table is via the object-deletion machinery or a direct drop.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
Doc patch mentions the problem with dependencies after CLUSTER.  Patch
removed from queue.

---------------------------------------------------------------------------

Alvaro Herrera wrote:
> Tom Lane wrote:
> > Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > > This case is actually more complicated because after the swap, pg_class
> > > still says thisxactonly but the relfilenode points to non-thisxactonly
> > > buffers, and vice versa.
> >
> > pg_class doesn't say anything (if it did, we could make relcache.c
> > track it).
> >
> > But yes, the problem is that myxactonly has to line up with where the
> > buffers are stored.  As I mentioned, I think it would work to flush out
> > all buffers for both rels before doing the swap.
>
> It does indeed work.  Actually, I tried to do that before, but the fact
> that RelationFlushBuffers requires a firstDelBlock made me turn around.
>
> I attach a new patch against current CVS HEAD.  It works as far as I've
> tested it, but I may be missing some cases.  I'll post a documentation
> patch later if this gets applied.
>
> --
> Alvaro Herrera (<alvherre[a]atentus.com>)
> "Some men are heterosexual, and some are bisexual, and some
> men don't think about sex at all... they become lawyers"  (Woody Allen)

Content-Description: [ Attachment, skipping... ]

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
Bruce Momjian wrote:
> Doc patch mentions the problem with dependencies after CLUSTER.  Patch
> removed from queue.

It turns out I was using index_drop() rather than performDeletion().  I
have corrected it.  Please review this version; everything works here as
expected.

--
Alvaro Herrera (<alvherre[a]atentus.com>)
"Spring has come.  Nobody knows how it happened."  (A. Machado)
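A guess at what the correction looks like, borrowing the ObjectAddress idiom from the cluster.c diff earlier in the thread; OIDNewIndex is a placeholder name, not necessarily what the patch uses.  Going through performDeletion() lets the dependency machinery remove the transient index's pg_depend rows, which a bare index_drop() would leave behind:

    ObjectAddress object;

    /* An index is addressed as a pg_class object, just like a table. */
    object.classId = RelOid_pg_class;
    object.objectId = OIDNewIndex;      /* placeholder: the transient index */
    object.objectSubId = 0;

    /*
     * performDeletion() drops the index together with its pg_depend
     * entries, instead of leaving stale dependency rows around.
     */
    performDeletion(&object, DROP_RESTRICT);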