Thread: Maintaining cluster order on insert
While thinking about index-organized tables and similar ideas, it occurred to me that there's some low-hanging fruit: maintaining cluster order on inserts by trying to place new heap tuples close to other similar tuples. That involves asking the index am where on the heap the new tuple should go, and trying to insert it there before falling back to the FSM. Using the new fillfactor parameter makes it more likely that there's room on the page. We don't worry about the order within the page.

The API I'm thinking of introduces a new optional index am function, amsuggestblock (suggestions for a better name are welcome). It takes the same parameters as aminsert and returns the heap block number that would be the optimal place to put the new tuple. It'd be called from ExecInsert before inserting the heap tuple, and the suggestion is passed on to heap_insert and RelationGetBufferForTuple. I wrote a little patch to implement this for btree, attached.

This could be optimized by changing the existing aminsert API, because as it is, an insert will have to descend the btree twice: once in amsuggestblock and then in aminsert. amsuggestblock could keep the right index page pinned so aminsert could locate it more quickly, but I wanted to keep this simple for now. Another improvement might be to allow amsuggestblock to return a list of suggestions, but that makes inserts more expensive when there isn't room in the suggested pages, since heap_insert will have to try them all before giving up.

Comments regarding the general idea or the patch? There should probably be an index option to turn the feature on and off. You'll want to turn it off when you first load a table, and turn it on after CLUSTER to keep it clustered.

Since there's been discussion on keeping the TODO list more up-to-date, I hereby officially claim the "Automatically maintain clustering on a table" TODO item :). Feel free to bombard me with requests for status reports.
And just to be clear, I'm not trying to sneak this into 8.2 anymore, this is 8.3 stuff. I won't be implementing the background daemon described in the TODO item, since that would essentially be an online version of CLUSTER. Which sure would be nice, but that's a different story. - Heikki Index: doc/src/sgml/catalogs.sgml =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/doc/src/sgml/catalogs.sgml,v retrieving revision 2.129 diff -c -r2.129 catalogs.sgml *** doc/src/sgml/catalogs.sgml 31 Jul 2006 20:08:55 -0000 2.129 --- doc/src/sgml/catalogs.sgml 8 Aug 2006 16:17:21 -0000 *************** *** 499,504 **** --- 499,511 ---- <entry>Function to parse and validate reloptions for an index</entry> </row> + <row> + <entry><structfield>amsuggestblock</structfield></entry> + <entry><type>regproc</type></entry> + <entry><literal><link linkend="catalog-pg-proc"><structname>pg_proc</structname></link>.oid</literal></entry> + <entry>Get the best place in the heap to put a new tuple</entry> + </row> + </tbody> </tgroup> </table> Index: doc/src/sgml/indexam.sgml =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/doc/src/sgml/indexam.sgml,v retrieving revision 2.16 diff -c -r2.16 indexam.sgml *** doc/src/sgml/indexam.sgml 31 Jul 2006 20:08:59 -0000 2.16 --- doc/src/sgml/indexam.sgml 8 Aug 2006 17:15:25 -0000 *************** *** 391,396 **** --- 391,414 ---- <function>amoptions</> to test validity of options settings. </para> + <para> + <programlisting> + BlockNumber + amsuggestblock (Relation indexRelation, + Datum *values, + bool *isnull, + Relation heapRelation); + </programlisting> + Gets the optimal place in the heap for a new tuple. The parameters + correspond to the parameters for <literal>aminsert</literal>. 
+ This function is called on the clustered index before a new tuple + is inserted into the heap, and it should choose the optimal insertion + target page on the heap in such a manner that the heap stays as close + as possible to the index order. + <literal>amsuggestblock</literal> can return InvalidBlockNumber if + the index am doesn't have a suggestion. + </para> + </sect1> <sect1 id="index-scanning"> Index: src/backend/access/heap/heapam.c =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/heap/heapam.c,v retrieving revision 1.218 diff -c -r1.218 heapam.c *** src/backend/access/heap/heapam.c 31 Jul 2006 20:08:59 -0000 1.218 --- src/backend/access/heap/heapam.c 8 Aug 2006 16:17:21 -0000 *************** *** 1325,1330 **** --- 1325,1335 ---- * use_fsm is passed directly to RelationGetBufferForTuple, which see for * more info. * + * suggested_blk can be set by the caller to hint heap_insert which + * block would be the best place to put the new tuple in. heap_insert can + * ignore the suggestion if there's not enough room on that block. + * InvalidBlockNumber means no preference. + * * The return value is the OID assigned to the tuple (either here or by the * caller), or InvalidOid if no OID. The header fields of *tup are updated * to match the stored tuple; in particular tup->t_self receives the actual *************** *** 1333,1339 **** */ Oid heap_insert(Relation relation, HeapTuple tup, CommandId cid, ! bool use_wal, bool use_fsm) { TransactionId xid = GetCurrentTransactionId(); HeapTuple heaptup; --- 1338,1344 ---- */ Oid heap_insert(Relation relation, HeapTuple tup, CommandId cid, ! bool use_wal, bool use_fsm, BlockNumber suggested_blk) { TransactionId xid = GetCurrentTransactionId(); HeapTuple heaptup; *************** *** 1386,1392 **** /* Find buffer to insert this tuple into */ buffer = RelationGetBufferForTuple(relation, heaptup->t_len, ! 
InvalidBuffer, use_fsm); /* NO EREPORT(ERROR) from here till changes are logged */ START_CRIT_SECTION(); --- 1391,1397 ---- /* Find buffer to insert this tuple into */ buffer = RelationGetBufferForTuple(relation, heaptup->t_len, ! InvalidBuffer, use_fsm, suggested_blk); /* NO EREPORT(ERROR) from here till changes are logged */ START_CRIT_SECTION(); *************** *** 1494,1500 **** Oid simple_heap_insert(Relation relation, HeapTuple tup) { ! return heap_insert(relation, tup, GetCurrentCommandId(), true, true); } /* --- 1499,1506 ---- Oid simple_heap_insert(Relation relation, HeapTuple tup) { ! return heap_insert(relation, tup, GetCurrentCommandId(), true, ! true, InvalidBlockNumber); } /* *************** *** 2079,2085 **** { /* Assume there's no chance to put heaptup on same page. */ newbuf = RelationGetBufferForTuple(relation, heaptup->t_len, ! buffer, true); } else { --- 2085,2092 ---- { /* Assume there's no chance to put heaptup on same page. */ newbuf = RelationGetBufferForTuple(relation, heaptup->t_len, ! buffer, true, ! InvalidBlockNumber); } else { *************** *** 2096,2102 **** */ LockBuffer(buffer, BUFFER_LOCK_UNLOCK); newbuf = RelationGetBufferForTuple(relation, heaptup->t_len, ! buffer, true); } else { --- 2103,2110 ---- */ LockBuffer(buffer, BUFFER_LOCK_UNLOCK); newbuf = RelationGetBufferForTuple(relation, heaptup->t_len, ! buffer, true, ! InvalidBlockNumber); } else { Index: src/backend/access/heap/hio.c =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/heap/hio.c,v retrieving revision 1.63 diff -c -r1.63 hio.c *** src/backend/access/heap/hio.c 3 Jul 2006 22:45:37 -0000 1.63 --- src/backend/access/heap/hio.c 9 Aug 2006 18:03:01 -0000 *************** *** 93,98 **** --- 93,100 ---- * any committed data of other transactions. (See heap_insert's comments * for additional constraints needed for safe usage of this behavior.) 
* + * If the caller has a suggestion, it's passed in suggestedBlock. + * * We always try to avoid filling existing pages further than the fillfactor. * This is OK since this routine is not consulted when updating a tuple and * keeping it on the same page, which is the scenario fillfactor is meant *************** *** 103,109 **** */ Buffer RelationGetBufferForTuple(Relation relation, Size len, ! Buffer otherBuffer, bool use_fsm) { Buffer buffer = InvalidBuffer; Page pageHeader; --- 105,112 ---- */ Buffer RelationGetBufferForTuple(Relation relation, Size len, ! Buffer otherBuffer, bool use_fsm, ! BlockNumber suggestedBlock) { Buffer buffer = InvalidBuffer; Page pageHeader; *************** *** 135,142 **** otherBlock = InvalidBlockNumber; /* just to keep compiler quiet */ /* ! * We first try to put the tuple on the same page we last inserted a tuple ! * on, as cached in the relcache entry. If that doesn't work, we ask the * shared Free Space Map to locate a suitable page. Since the FSM's info * might be out of date, we have to be prepared to loop around and retry * multiple times. (To insure this isn't an infinite loop, we must update --- 138,147 ---- otherBlock = InvalidBlockNumber; /* just to keep compiler quiet */ /* ! * We first try to put the tuple on the page suggested by the caller, if ! * any. Then we try to put the tuple on the same page we last inserted a ! * tuple on, as cached in the relcache entry. If that doesn't work, we ! * ask the * shared Free Space Map to locate a suitable page. Since the FSM's info * might be out of date, we have to be prepared to loop around and retry * multiple times. (To insure this isn't an infinite loop, we must update *************** *** 144,152 **** * not to be suitable.) If the FSM has no record of a page with enough * free space, we give up and extend the relation. * ! * When use_fsm is false, we either put the tuple onto the existing target ! * page or extend the relation. 
*/ if (len + saveFreeSpace <= MaxTupleSize) targetBlock = relation->rd_targblock; else --- 149,167 ---- * not to be suitable.) If the FSM has no record of a page with enough * free space, we give up and extend the relation. * ! * When use_fsm is false, we skip the fsm lookup if neither the suggested ! * nor the cached last insertion page has enough room, and extend the ! * relation. ! * ! * The fillfactor is taken into account when calculating the free space ! * on the cached target block, and when using the FSM. The suggested page ! * is used whenever there's enough room in it, regardless of the fillfactor, ! * because that's exactly the purpose the space is reserved for in the ! * first place. */ + if (suggestedBlock != InvalidBlockNumber) + targetBlock = suggestedBlock; + else if (len + saveFreeSpace <= MaxTupleSize) targetBlock = relation->rd_targblock; else *************** *** 219,224 **** --- 234,244 ---- */ pageHeader = (Page) BufferGetPage(buffer); pageFreeSpace = PageGetFreeSpace(pageHeader); + + /* If we're trying the suggested block, don't care about fillfactor */ + if (targetBlock == suggestedBlock && len <= pageFreeSpace) + return buffer; + if (len + saveFreeSpace <= pageFreeSpace) { /* use this page as future insert target, too */ *************** *** 241,246 **** --- 261,275 ---- ReleaseBuffer(buffer); } + /* If we just tried the suggested block, try the cached target + * block next, before consulting the FSM. 
*/ + if(suggestedBlock == targetBlock) + { + targetBlock = relation->rd_targblock; + suggestedBlock = InvalidBlockNumber; + continue; + } + /* Without FSM, always fall out of the loop and extend */ if (!use_fsm) break; Index: src/backend/access/index/genam.c =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/index/genam.c,v retrieving revision 1.58 diff -c -r1.58 genam.c *** src/backend/access/index/genam.c 31 Jul 2006 20:08:59 -0000 1.58 --- src/backend/access/index/genam.c 8 Aug 2006 16:17:21 -0000 *************** *** 259,261 **** --- 259,275 ---- pfree(sysscan); } + + /* + * This is a dummy implementation of amsuggestblock, to be used for index + * access methods that don't or can't support it. It just returns + * InvalidBlockNumber, which means "no preference". + * + * This is probably not the best place for this function, but it doesn't + * fit naturally anywhere else either. + */ + Datum + dummysuggestblock(PG_FUNCTION_ARGS) + { + PG_RETURN_UINT32(InvalidBlockNumber); + } Index: src/backend/access/index/indexam.c =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/index/indexam.c,v retrieving revision 1.94 diff -c -r1.94 indexam.c *** src/backend/access/index/indexam.c 31 Jul 2006 20:08:59 -0000 1.94 --- src/backend/access/index/indexam.c 8 Aug 2006 16:17:21 -0000 *************** *** 18,23 **** --- 18,24 ---- * index_rescan - restart a scan of an index * index_endscan - end a scan * index_insert - insert an index tuple into a relation + * index_suggestblock - get desired insert location for a heap tuple * index_markpos - mark a scan position * index_restrpos - restore a scan position * index_getnext - get the next tuple from a scan *************** *** 202,207 **** --- 203,237 ---- BoolGetDatum(check_uniqueness))); } + /* ---------------- + * index_suggestblock - get desired insert location for 
a heap tuple + * + * The returned BlockNumber is the *heap* page that is the best place + * to insert the given tuple to, according to the index am. The best + * place is usually one that maintains the cluster order. + * ---------------- + */ + BlockNumber + index_suggestblock(Relation indexRelation, + Datum *values, + bool *isnull, + Relation heapRelation) + { + FmgrInfo *procedure; + + RELATION_CHECKS; + GET_REL_PROCEDURE(amsuggestblock); + + /* + * have the am's suggestblock proc do all the work. + */ + return DatumGetUInt32(FunctionCall4(procedure, + PointerGetDatum(indexRelation), + PointerGetDatum(values), + PointerGetDatum(isnull), + PointerGetDatum(heapRelation))); + } + /* * index_beginscan - start a scan of an index with amgettuple * Index: src/backend/access/nbtree/nbtinsert.c =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/nbtree/nbtinsert.c,v retrieving revision 1.142 diff -c -r1.142 nbtinsert.c *** src/backend/access/nbtree/nbtinsert.c 25 Jul 2006 19:13:00 -0000 1.142 --- src/backend/access/nbtree/nbtinsert.c 9 Aug 2006 17:51:33 -0000 *************** *** 146,151 **** --- 146,221 ---- } /* + * _bt_suggestblock() -- Find the heap block of the closest index tuple. + * + * The logic to find the target should match _bt_doinsert, otherwise + * we'll be making bad suggestions. 
+ */ + BlockNumber + _bt_suggestblock(Relation rel, IndexTuple itup, Relation heapRel) + { + int natts = rel->rd_rel->relnatts; + OffsetNumber offset; + Page page; + BTPageOpaque opaque; + + ScanKey itup_scankey; + BTStack stack; + Buffer buf; + IndexTuple curitup; + BlockNumber suggestion = InvalidBlockNumber; + + /* we need an insertion scan key to do our search, so build one */ + itup_scankey = _bt_mkscankey(rel, itup); + + /* find the first page containing this key */ + stack = _bt_search(rel, natts, itup_scankey, false, &buf, BT_READ); + if(!BufferIsValid(buf)) + { + /* The index was completely empty. No suggestion then. */ + return InvalidBlockNumber; + } + /* we don't need the stack, so free it right away */ + _bt_freestack(stack); + + page = BufferGetPage(buf); + opaque = (BTPageOpaque) PageGetSpecialPointer(page); + + /* Find the location in the page where the new index tuple would go to. */ + + offset = _bt_binsrch(rel, buf, natts, itup_scankey, false); + if (offset > PageGetMaxOffsetNumber(page)) + { + /* _bt_binsrch returned pointer to end-of-page. It means that + * there was no equal items on the page, and the new item should + * be inserted as the last tuple of the page. There could be equal + * items on the next page, however. + * + * At the moment, we just ignore the potential equal items on the + * right, and pretend there isn't any. We could instead walk right + * to the next page to check that, but let's keep it simple for now. + */ + offset = OffsetNumberPrev(offset); + } + if(offset < P_FIRSTDATAKEY(opaque)) + { + /* We landed on an empty page. We could step left or right until + * we find some items, but let's keep it simple for now. + */ + } else { + /* We're now positioned at the index tuple that we're interested in. 
*/ + + curitup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offset)); + suggestion = ItemPointerGetBlockNumber(&curitup->t_tid); + } + + _bt_relbuf(rel, buf); + _bt_freeskey(itup_scankey); + + return suggestion; + } + + /* * _bt_check_unique() -- Check for violation of unique index constraint * * Returns InvalidTransactionId if there is no conflict, else an xact ID Index: src/backend/access/nbtree/nbtree.c =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/nbtree/nbtree.c,v retrieving revision 1.149 diff -c -r1.149 nbtree.c *** src/backend/access/nbtree/nbtree.c 10 May 2006 23:18:39 -0000 1.149 --- src/backend/access/nbtree/nbtree.c 9 Aug 2006 18:04:02 -0000 *************** *** 228,233 **** --- 228,265 ---- } /* + * btsuggestblock() -- find the best place in the heap to put a new tuple. + * + * This uses the same logic as btinsert to find the place where the index + * tuple would go if this was a btinsert call. + * + * There's room for improvement here. An insert operation will descend + * the tree twice, first by btsuggestblock, then by btinsert. Things + * might have changed in between, so that the heap tuple is actually + * not inserted in the optimal page, but since this is just an + * optimization, it's ok if it happens sometimes. + */ + Datum + btsuggestblock(PG_FUNCTION_ARGS) + { + Relation rel = (Relation) PG_GETARG_POINTER(0); + Datum *values = (Datum *) PG_GETARG_POINTER(1); + bool *isnull = (bool *) PG_GETARG_POINTER(2); + Relation heapRel = (Relation) PG_GETARG_POINTER(3); + IndexTuple itup; + BlockNumber suggestion; + + /* generate an index tuple */ + itup = index_form_tuple(RelationGetDescr(rel), values, isnull); + + suggestion = _bt_suggestblock(rel, itup, heapRel); + + pfree(itup); + + PG_RETURN_UINT32(suggestion); + } + + /* * btgettuple() -- Get the next tuple in the scan. 
*/ Datum Index: src/backend/executor/execMain.c =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/executor/execMain.c,v retrieving revision 1.277 diff -c -r1.277 execMain.c *** src/backend/executor/execMain.c 31 Jul 2006 01:16:37 -0000 1.277 --- src/backend/executor/execMain.c 8 Aug 2006 16:17:21 -0000 *************** *** 892,897 **** --- 892,898 ---- resultRelInfo->ri_RangeTableIndex = resultRelationIndex; resultRelInfo->ri_RelationDesc = resultRelationDesc; resultRelInfo->ri_NumIndices = 0; + resultRelInfo->ri_ClusterIndex = -1; resultRelInfo->ri_IndexRelationDescs = NULL; resultRelInfo->ri_IndexRelationInfo = NULL; /* make a copy so as not to depend on relcache info not changing... */ *************** *** 1388,1394 **** heap_insert(estate->es_into_relation_descriptor, tuple, estate->es_snapshot->curcid, estate->es_into_relation_use_wal, ! false); /* never any point in using FSM */ /* we know there are no indexes to update */ heap_freetuple(tuple); IncrAppended(); --- 1389,1396 ---- heap_insert(estate->es_into_relation_descriptor, tuple, estate->es_snapshot->curcid, estate->es_into_relation_use_wal, ! false, /* never any point in using FSM */ ! InvalidBlockNumber); /* we know there are no indexes to update */ heap_freetuple(tuple); IncrAppended(); *************** *** 1419,1424 **** --- 1421,1427 ---- ResultRelInfo *resultRelInfo; Relation resultRelationDesc; Oid newId; + BlockNumber suggestedBlock; /* * get the heap tuple out of the tuple table slot, making sure we have a *************** *** 1467,1472 **** --- 1470,1479 ---- if (resultRelationDesc->rd_att->constr) ExecConstraints(resultRelInfo, slot, estate); + /* Ask the index am of the clustered index for the + * best place to put it */ + suggestedBlock = ExecSuggestBlock(slot, estate); + /* * insert the tuple * *************** *** 1475,1481 **** */ newId = heap_insert(resultRelationDesc, tuple, estate->es_snapshot->curcid, ! 
true, true); IncrAppended(); (estate->es_processed)++; --- 1482,1488 ---- */ newId = heap_insert(resultRelationDesc, tuple, estate->es_snapshot->curcid, ! true, true, suggestedBlock); IncrAppended(); (estate->es_processed)++; Index: src/backend/executor/execUtils.c =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/executor/execUtils.c,v retrieving revision 1.139 diff -c -r1.139 execUtils.c *** src/backend/executor/execUtils.c 4 Aug 2006 21:33:36 -0000 1.139 --- src/backend/executor/execUtils.c 9 Aug 2006 18:05:05 -0000 *************** *** 31,36 **** --- 31,37 ---- * ExecOpenIndices \ * ExecCloseIndices | referenced by InitPlan, EndPlan, * ExecInsertIndexTuples / ExecInsert, ExecUpdate + * ExecSuggestBlock Referenced by ExecInsert * * RegisterExprContextCallback Register function shutdown callback * UnregisterExprContextCallback Deregister function shutdown callback *************** *** 874,879 **** --- 875,881 ---- IndexInfo **indexInfoArray; resultRelInfo->ri_NumIndices = 0; + resultRelInfo->ri_ClusterIndex = -1; /* fast path if no indexes */ if (!RelationGetForm(resultRelation)->relhasindex) *************** *** 913,918 **** --- 915,925 ---- /* extract index key information from the index's pg_index info */ ii = BuildIndexInfo(indexDesc); + /* Remember which index is the clustered one. + * It's used to call the suggestblock-method on inserts */ + if(indexDesc->rd_index->indisclustered) + resultRelInfo->ri_ClusterIndex = i; + relationDescs[i] = indexDesc; indexInfoArray[i] = ii; i++; *************** *** 1062,1067 **** --- 1069,1137 ---- } } + /* ---------------------------------------------------------------- + * ExecSuggestBlock + * + * This routine asks the index am where a new heap tuple + * should be placed. 
+ * ---------------------------------------------------------------- + */ + BlockNumber + ExecSuggestBlock(TupleTableSlot *slot, + EState *estate) + { + ResultRelInfo *resultRelInfo; + int i; + Relation relationDesc; + Relation heapRelation; + ExprContext *econtext; + Datum values[INDEX_MAX_KEYS]; + bool isnull[INDEX_MAX_KEYS]; + IndexInfo *indexInfo; + + /* + * Get information from the result relation info structure. + */ + resultRelInfo = estate->es_result_relation_info; + i = resultRelInfo->ri_ClusterIndex; + if(i == -1) + return InvalidBlockNumber; /* there was no clustered index */ + + heapRelation = resultRelInfo->ri_RelationDesc; + relationDesc = resultRelInfo->ri_IndexRelationDescs[i]; + indexInfo = resultRelInfo->ri_IndexRelationInfo[i]; + + /* You can't cluster on a partial index */ + Assert(indexInfo->ii_Predicate == NIL); + + /* + * We will use the EState's per-tuple context for evaluating + * index expressions (creating it if it's not already there). + */ + econtext = GetPerTupleExprContext(estate); + + /* Arrange for econtext's scan tuple to be the tuple under test */ + econtext->ecxt_scantuple = slot; + + /* + * FormIndexDatum fills in its values and isnull parameters with the + * appropriate values for the column(s) of the index. + */ + FormIndexDatum(indexInfo, + slot, + estate, + values, + isnull); + + /* + * The index AM does the rest. 
+ */ + return index_suggestblock(relationDesc, /* index relation */ + values, /* array of index Datums */ + isnull, /* null flags */ + heapRelation); + } + /* * UpdateChangedParamSet * Add changed parameters to a plan node's chgParam set Index: src/include/access/genam.h =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/access/genam.h,v retrieving revision 1.65 diff -c -r1.65 genam.h *** src/include/access/genam.h 31 Jul 2006 20:09:05 -0000 1.65 --- src/include/access/genam.h 9 Aug 2006 17:53:44 -0000 *************** *** 93,98 **** --- 93,101 ---- ItemPointer heap_t_ctid, Relation heapRelation, bool check_uniqueness); + extern BlockNumber index_suggestblock(Relation indexRelation, + Datum *values, bool *isnull, + Relation heapRelation); extern IndexScanDesc index_beginscan(Relation heapRelation, Relation indexRelation, *************** *** 123,128 **** --- 126,133 ---- extern FmgrInfo *index_getprocinfo(Relation irel, AttrNumber attnum, uint16 procnum); + extern Datum dummysuggestblock(PG_FUNCTION_ARGS); + /* * index access method support routines (in genam.c) */ Index: src/include/access/heapam.h =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/access/heapam.h,v retrieving revision 1.114 diff -c -r1.114 heapam.h *** src/include/access/heapam.h 3 Jul 2006 22:45:39 -0000 1.114 --- src/include/access/heapam.h 8 Aug 2006 16:17:21 -0000 *************** *** 156,162 **** extern void setLastTid(const ItemPointer tid); extern Oid heap_insert(Relation relation, HeapTuple tup, CommandId cid, ! bool use_wal, bool use_fsm); extern HTSU_Result heap_delete(Relation relation, ItemPointer tid, ItemPointer ctid, TransactionId *update_xmax, CommandId cid, Snapshot crosscheck, bool wait); --- 156,162 ---- extern void setLastTid(const ItemPointer tid); extern Oid heap_insert(Relation relation, HeapTuple tup, CommandId cid, ! 
bool use_wal, bool use_fsm, BlockNumber suggestedblk); extern HTSU_Result heap_delete(Relation relation, ItemPointer tid, ItemPointer ctid, TransactionId *update_xmax, CommandId cid, Snapshot crosscheck, bool wait); Index: src/include/access/hio.h =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/access/hio.h,v retrieving revision 1.32 diff -c -r1.32 hio.h *** src/include/access/hio.h 13 Jul 2006 17:47:01 -0000 1.32 --- src/include/access/hio.h 8 Aug 2006 16:17:21 -0000 *************** *** 21,26 **** extern void RelationPutHeapTuple(Relation relation, Buffer buffer, HeapTuple tuple); extern Buffer RelationGetBufferForTuple(Relation relation, Size len, ! Buffer otherBuffer, bool use_fsm); #endif /* HIO_H */ --- 21,26 ---- extern void RelationPutHeapTuple(Relation relation, Buffer buffer, HeapTuple tuple); extern Buffer RelationGetBufferForTuple(Relation relation, Size len, ! Buffer otherBuffer, bool use_fsm, BlockNumber suggestedblk); #endif /* HIO_H */ Index: src/include/access/nbtree.h =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/access/nbtree.h,v retrieving revision 1.103 diff -c -r1.103 nbtree.h *** src/include/access/nbtree.h 7 Aug 2006 16:57:57 -0000 1.103 --- src/include/access/nbtree.h 8 Aug 2006 16:17:21 -0000 *************** *** 467,472 **** --- 467,473 ---- extern Datum btbulkdelete(PG_FUNCTION_ARGS); extern Datum btvacuumcleanup(PG_FUNCTION_ARGS); extern Datum btoptions(PG_FUNCTION_ARGS); + extern Datum btsuggestblock(PG_FUNCTION_ARGS); /* * prototypes for functions in nbtinsert.c *************** *** 476,481 **** --- 477,484 ---- extern Buffer _bt_getstackbuf(Relation rel, BTStack stack, int access); extern void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf, BTStack stack, bool is_root, bool is_only); + extern BlockNumber _bt_suggestblock(Relation rel, IndexTuple itup, + Relation 
heapRel); /* * prototypes for functions in nbtpage.c Index: src/include/catalog/pg_am.h =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/catalog/pg_am.h,v retrieving revision 1.46 diff -c -r1.46 pg_am.h *** src/include/catalog/pg_am.h 31 Jul 2006 20:09:05 -0000 1.46 --- src/include/catalog/pg_am.h 8 Aug 2006 16:17:21 -0000 *************** *** 65,70 **** --- 65,71 ---- regproc amvacuumcleanup; /* post-VACUUM cleanup function */ regproc amcostestimate; /* estimate cost of an indexscan */ regproc amoptions; /* parse AM-specific parameters */ + regproc amsuggestblock; /* suggest a block where to put heap tuple */ } FormData_pg_am; /* ---------------- *************** *** 78,84 **** * compiler constants for pg_am * ---------------- */ ! #define Natts_pg_am 23 #define Anum_pg_am_amname 1 #define Anum_pg_am_amstrategies 2 #define Anum_pg_am_amsupport 3 --- 79,85 ---- * compiler constants for pg_am * ---------------- */ ! #define Natts_pg_am 24 #define Anum_pg_am_amname 1 #define Anum_pg_am_amstrategies 2 #define Anum_pg_am_amsupport 3 *************** *** 102,123 **** #define Anum_pg_am_amvacuumcleanup 21 #define Anum_pg_am_amcostestimate 22 #define Anum_pg_am_amoptions 23 /* ---------------- * initial contents of pg_am * ---------------- */ ! DATA(insert OID = 403 ( btree 5 1 1 t t t t f t btinsert btbeginscan btgettuple btgetmulti btrescan btendscan btmarkpos btrestrpos btbuild btbulkdelete btvacuumcleanup btcostestimate btoptions )); DESCR("b-tree index access method"); #define BTREE_AM_OID 403 ! DATA(insert OID = 405 ( hash 1 1 0 f f f f f f hashinsert hashbeginscan hashgettuple hashgetmulti hashrescan hashendscan hashmarkpos hashrestrpos hashbuild hashbulkdelete hashvacuumcleanup hashcostestimate hashoptions )); DESCR("hash index access method"); #define HASH_AM_OID 405 ! 
DATA(insert OID = 783 ( gist 100 7 0 f t t t t t gistinsert gistbeginscan gistgettuple gistgetmulti gistrescan gistendscan gistmarkpos gistrestrpos gistbuild gistbulkdelete gistvacuumcleanup gistcostestimate gistoptions )); DESCR("GiST index access method"); #define GIST_AM_OID 783 ! DATA(insert OID = 2742 ( gin 100 4 0 f f f f t f gininsert ginbeginscan gingettuple gingetmulti ginrescan ginendscan ginmarkpos ginrestrpos ginbuild ginbulkdelete ginvacuumcleanup gincostestimate ginoptions )); DESCR("GIN index access method"); #define GIN_AM_OID 2742 --- 103,125 ---- #define Anum_pg_am_amvacuumcleanup 21 #define Anum_pg_am_amcostestimate 22 #define Anum_pg_am_amoptions 23 + #define Anum_pg_am_amsuggestblock 24 /* ---------------- * initial contents of pg_am * ---------------- */ ! DATA(insert OID = 403 ( btree 5 1 1 t t t t f t btinsert btbeginscan btgettuple btgetmulti btrescan btendscan btmarkpos btrestrpos btbuild btbulkdelete btvacuumcleanup btcostestimate btoptions btsuggestblock )); DESCR("b-tree index access method"); #define BTREE_AM_OID 403 ! DATA(insert OID = 405 ( hash 1 1 0 f f f f f f hashinsert hashbeginscan hashgettuple hashgetmulti hashrescan hashendscan hashmarkpos hashrestrpos hashbuild hashbulkdelete hashvacuumcleanup hashcostestimate hashoptions dummysuggestblock )); DESCR("hash index access method"); #define HASH_AM_OID 405 ! DATA(insert OID = 783 ( gist 100 7 0 f t t t t t gistinsert gistbeginscan gistgettuple gistgetmulti gistrescan gistendscan gistmarkpos gistrestrpos gistbuild gistbulkdelete gistvacuumcleanup gistcostestimate gistoptions dummysuggestblock )); DESCR("GiST index access method"); #define GIST_AM_OID 783 ! 
DATA(insert OID = 2742 ( gin 100 4 0 f f f f t f gininsert ginbeginscan gingettuple gingetmulti ginrescan ginendscan ginmarkpos ginrestrpos ginbuild ginbulkdelete ginvacuumcleanup gincostestimate ginoptions dummysuggestblock )); DESCR("GIN index access method"); #define GIN_AM_OID 2742 Index: src/include/catalog/pg_proc.h =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/catalog/pg_proc.h,v retrieving revision 1.420 diff -c -r1.420 pg_proc.h *** src/include/catalog/pg_proc.h 6 Aug 2006 03:53:44 -0000 1.420 --- src/include/catalog/pg_proc.h 9 Aug 2006 18:06:44 -0000 *************** *** 682,687 **** --- 682,689 ---- DESCR("btree(internal)"); DATA(insert OID = 2785 ( btoptions PGNSP PGUID 12 f f t f s 2 17 "1009 16" _null_ _null_ _null_ btoptions - _null_ )); DESCR("btree(internal)"); + DATA(insert OID = 2852 ( btsuggestblock PGNSP PGUID 12 f f t f v 4 23 "2281 2281 2281 2281" _null_ _null_ _null_ btsuggestblock - _null_ )); + DESCR("btree(internal)"); DATA(insert OID = 339 ( poly_same PGNSP PGUID 12 f f t f i 2 16 "604 604" _null_ _null_ _null_ poly_same - _null_ )); DESCR("same as?"); *************** *** 3936,3941 **** --- 3938,3946 ---- DATA(insert OID = 2749 ( arraycontained PGNSP PGUID 12 f f t f i 2 16 "2277 2277" _null_ _null_ _null_ arraycontained - _null_ )); DESCR("anyarray contained"); + DATA(insert OID = 2853 ( dummysuggestblock PGNSP PGUID 12 f f t f v 4 23 "2281 2281 2281 2281" _null_ _null_ _null_ dummysuggestblock - _null_ )); + DESCR("dummy amsuggestblock implementation (internal)"); + /* * Symbolic values for provolatile column: these indicate whether the result * of a function is dependent *only* on the values of its explicit arguments, Index: src/include/executor/executor.h =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/executor/executor.h,v retrieving revision 1.128 diff -c -r1.128 executor.h *** 
src/include/executor/executor.h 4 Aug 2006 21:33:36 -0000 1.128 --- src/include/executor/executor.h 8 Aug 2006 16:17:21 -0000 *************** *** 271,276 **** --- 271,277 ---- extern void ExecCloseIndices(ResultRelInfo *resultRelInfo); extern void ExecInsertIndexTuples(TupleTableSlot *slot, ItemPointer tupleid, EState *estate, bool is_vacuum); + extern BlockNumber ExecSuggestBlock(TupleTableSlot *slot, EState *estate); extern void RegisterExprContextCallback(ExprContext *econtext, ExprContextCallbackFunction function, Index: src/include/nodes/execnodes.h =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/nodes/execnodes.h,v retrieving revision 1.158 diff -c -r1.158 execnodes.h *** src/include/nodes/execnodes.h 4 Aug 2006 21:33:36 -0000 1.158 --- src/include/nodes/execnodes.h 8 Aug 2006 16:17:21 -0000 *************** *** 257,262 **** --- 257,264 ---- * NumIndices # of indices existing on result relation * IndexRelationDescs array of relation descriptors for indices * IndexRelationInfo array of key/attr info for indices + * ClusterIndex index to the IndexRelationInfo array of the + * clustered index, or -1 if there's none * TrigDesc triggers to be fired, if any * TrigFunctions cached lookup info for trigger functions * TrigInstrument optional runtime measurements for triggers *************** *** 272,277 **** --- 274,280 ---- int ri_NumIndices; RelationPtr ri_IndexRelationDescs; IndexInfo **ri_IndexRelationInfo; + int ri_ClusterIndex; TriggerDesc *ri_TrigDesc; FmgrInfo *ri_TrigFunctions; struct Instrumentation *ri_TrigInstrument; Index: src/include/utils/rel.h =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/utils/rel.h,v retrieving revision 1.91 diff -c -r1.91 rel.h *** src/include/utils/rel.h 3 Jul 2006 22:45:41 -0000 1.91 --- src/include/utils/rel.h 8 Aug 2006 16:17:21 -0000 *************** *** 116,121 **** --- 
116,122 ---- FmgrInfo amvacuumcleanup; FmgrInfo amcostestimate; FmgrInfo amoptions; + FmgrInfo amsuggestblock; } RelationAmInfo;
Heikki Linnakangas <heikki@enterprisedb.com> writes: > While thinking about index-organized-tables and similar ideas, it > occurred to me that there's some low-hanging-fruit: maintaining cluster > order on inserts by trying to place new heap tuples close to other > similar tuples. Doesn't this happen for free if there's enough space? UPDATE tries to place the new tuple on the same page it's already on. In practice people are only likely to cluster by primary keys (or other things that seldom change) so I don't particularly agree with inventing a large wart to support "optimally" placing things somewhere else... regards, tom lane
On 8/9/06, Tom Lane <tgl@sss.pgh.pa.us> wrote: > UPDATE tries to place the new tuple on the same page it's already > on. I think he meant for INSERT. -- Jonah H. Harris, Software Architect | phone: 732.331.1300 EnterpriseDB Corporation | fax: 732.331.1301 33 Wood Ave S, 2nd Floor | jharris@enterprisedb.com Iselin, New Jersey 08830 | http://www.enterprisedb.com/
I have a table that inserts lots of rows (million+ per day) with an int8 primary key, and I cluster by a timestamp which is approximately the timestamp of the insert and is therefore in increasing order and doesn't change. Most of the rows are updated about 3 times, roughly within the next 30 minutes. Should I assume that all of these updates will be on separate pages unless I perform a CLUSTER (which takes a long time), and that performance will suffer due to this? Is it possible to preallocate space on the same page for future updates (whatever the average number of updates may be per row), decreasing the number of page loads for querying?
Gene
--
Eugene Hart
Gene <genekhart@gmail.com> writes: > I have a table that inserts lots of rows (million+ per day) int8 as primary > key, and I cluster by a timestamp which is approximately the timestamp of > the insert beforehand and is therefore in increasing order and doesn't > change. Most of the rows are updated about 3 times over time roughly within > the next 30 minutes. ISTM you should hardly need to worry about clustering that --- the data will be in timestamp order pretty naturally. The main problem you're going to have is the update-3-times bit. You could keep updated rows on the same page as the original if you ran the table at fillfactor 25% (which you'll be able to do in 8.2) ... but while this might be sane for the leading edge of the table, you hardly want such low storage density in the stable part. You could reduce the fillfactor requirement if you could vacuum the table constantly (every 10 minutes or so) but I assume the table is large enough to make that unattractive. (Eventually we should have a version of vacuum that understands where the dirty stuff is, which might make this approach tenable ... but not in 8.2.) Your best bet might be to partition the table into two subtables, one with "stable" data and one with the fresh data, and transfer rows from one to the other once they get stable. Storage density in the "fresh" part would be poor, but it should be small enough you don't care. regards, tom lane
You are correct: the main part I'm worried about is the updates being so far from the originals. FYI, I am partitioning the tables by the timestamp column, vacuum analyzing once per hour, and creating one child partition per day in a cron job. Because I'm using Hibernate for database abstraction (stateless sessions), I can only have one RULE: having more than one insert rule will not return the correct number of updated rows, which confuses Hibernate. The one rule just directs inserts to the latest child partition, which seems to work well.
The reason I'm doing the clustering is I was hoping that with the "stable" non-updating partitions I could execute a CLUSTER at night (slow...) and it would compact the tables into their most efficient state for querying which always involves a date range. bad idea? In this fillfactor feature, will you be able to set it to 100% once you know that no more updates will occur? Or will doing a cluster effectively do this? Will the fill factor only apply for inserts?
"Your best bet might be to partition the table into two subtables, one
with "stable" data and one with the fresh data, and transfer rows from
one to the other once they get stable. Storage density in the "fresh"
part would be poor, but it should be small enough you don't care."
This sounds interesting. I could create a RULE/INSERT on the unstable table; I will know during the update if it is ready to be put in the stable table. What would be an efficient way to do the transfer? Since the updates occur somewhat randomly, wouldn't the tuples in the stable table then be out of natural timestamp order?
thanks for all of your help and comments! it is greatly appreciated!
Gene Hart
Jonah H. Harris wrote: > On 8/9/06, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> UPDATE tries to place the new tuple on the same page it's already >> on. > > I think he meant for INSERT. > Right. Update is indeed taken care of already. One example where this would help would be a customer_history table that stores all transactions of a customer. Primary key is (customer_id, transaction_id). You would want to keep the table clustered by customer_id to make it quick to retrieve all transactions of a customer. In general, any table with more or less random insert/delete activity that you want to keep in order would benefit. - Heikki
Gene <genekhart@gmail.com> writes: > "Your best bet might be to partition the table into two subtables, one > with "stable" data and one with the fresh data, and transfer rows from > one to the other once they get stable. Storage density in the "fresh" > part would be poor, but it should be small enough you don't care." > > This sounds interesting, I could create a RULE/INSERT on the unstable table, > I will know during the update if it is ready to be put in the stable table. > What would be an efficient way to do the transfer? Since the updates occur > somewhat randomly, wouldnt the tuples in the stable table then be out of > natural timestamp order? You may find it easier to handle some of the logic in a low level application layer or layer of stored procedures rather than trying to make it entirely transparent with rules. If you do want it to be transparent you might also consider whether you want triggers instead of rules. Another direction you might want to consider is whether the columns that you're updating would be more normalized in a separate table. You might really want to have a record of those past states as well. So you might find having three records in this other table for each of your regular records in your main table might actually work out better. Even if you only have a 1-1 relationship sometimes this kind of horizontal partitioning (or do people consider this vertical partitioning?) is still worthwhile. If the columns being updated are very small or often not needed at all then it may be reasonably efficient to look them up separately and still let you store the bulk of the data efficiently and access it in a fast sequential scan. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Tom Lane wrote: > Gene <genekhart@gmail.com> writes: >> I have a table that inserts lots of rows (million+ per day) int8 as primary >> key, and I cluster by a timestamp which is approximately the timestamp of >> the insert... > > ISTM you should hardly need to worry about clustering that --- the data > will be in timestamp order pretty naturally. In my case my biggest/slowest tables are clustered by zip-code (which does a reasonable job at keeping counties/cities/etc on the same pages too). Data comes in constantly (many records per minute, as we ramp up), pretty uniformly across the country; but most queries are geographically bounded. The data's pretty much insert-only. If I understand Heikki's patch, it would help for this use case. > Your best bet might be to partition the table into two subtables, one > with "stable" data and one with the fresh data, and transfer rows from > one to the other once they get stable. Storage density in the "fresh" > part would be poor, but it should be small enough you don't care. Hmm... that should work well for me too. Not sure if the use-case I mentioned above is still compelling anymore; since this seems like it'd give me much of the benefit; and I don't need an excessive fillfactor on the stable part of the table.
Ron Mayer wrote: > In my case my biggest/slowest tables are clustered by zip-code (which > does a reasonable job at keeping counties/cities/etc on the > same pages too). Data comes in constantly (many records per minute, as > we ramp up), pretty uniformly across the country; but most queries > are geographically bounded. The data's pretty much insert-only. No deletes? If the tables grow over time, you probably would need to run CLUSTER every now and then to get the best performance, though the patch would alleviate that quite a lot. Do you have a development environment where you could test what effect the patch would have? It would be interesting to have a real-world use case, since I don't have one myself at the moment. > If I understand Heikki's patch, it would help for this use case. Yes, it would. > > Your best bet might be to partition the table into two subtables, one > > with "stable" data and one with the fresh data, and transfer rows from > > one to the other once they get stable. Storage density in the "fresh" > > part would be poor, but it should be small enough you don't care. > > Hmm... that should work well for me too. Not sure if the use-case > I mentioned above is still compelling anymore; since this seems like > it'd give me much of the benefit; and I don't need an excessive > fillfactor on the stable part of the table. Umm, if your inserts are uniformly distributed across the country, you wouldn't have a stable part, right? - Heikki
Heikki Linnakangas wrote: > Ron Mayer wrote: >> In my case my biggest/slowest tables are clustered by zip-code (which >> does a reasonable job at keeping counties/cities/etc on the >> same pages too).... > > No deletes? If the tables grow over time, you probably would need to run > CLUSTER every now and then to get the best performance, though the patch > would alleviate that quite a lot. Yup, pretty much no deletes, since it's a historical archive of some government documents with address info. Though the live system may periodically expunge data, say, 10+ years old. > Do you have a development environment where you could test what effect > the patch would have? It would be interesting to have a real-world use > case, since I don't have one myself at the moment. I have a development environment, but it doesn't have the same real-time-growing behavior, and only a small region of the country. I suppose I could pre-load N-1 years and cluster it, and then incrementally insert the last year of data to simulate the effect. But sure, I'll try the patch, though I don't really have any good benchmarking environment to give definitive results. If an anecdotal "this is how it feels to me" is useful, I can give one of those. >> > Your best bet might be to partition the table into two subtables, one >> > with "stable" data and one with the fresh data. >> >> Hmm... that should work well for me too.... > > Umm, if your inserts are uniformly distributed across the country, you > wouldn't have a stable part, right? Hmm. Maybe. I was thinking that, when archiving to the large table, an "order by" clause when inserting from the new partition to the stable partition could at least make the big table "piecewise" clustered, so most records for a zip code fit in the same few disk pages, even though those pages would still end up lying around far apart on the disk.
I wonder what part of CLUSTER gives the most benefit: that most records of a type fit on a few blocks, or that those blocks are next to each other so they can be read sequentially?
[ Sorry, I only found this one recently. ] Your patch has been added to the PostgreSQL unapplied patches list at: http://momjian.postgresql.org/cgi-bin/pgpatches It will be applied as soon as one of the PostgreSQL committers reviews and approves it. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Ah, thanks! I had forgotten about it as well. Bruce Momjian wrote: > Your patch has been added to the PostgreSQL unapplied patches list at: > > http://momjian.postgresql.org/cgi-bin/pgpatches > > It will be applied as soon as one of the PostgreSQL committers reviews > and approves it. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
What about adding the ability to ask the FSM for a page that's near a given page? That way if you did have to go to the FSM you could at least try and insert close to the page you originally wanted. On Tue, May 15, 2007 at 11:26:51PM +0100, Heikki Linnakangas wrote: > Ah, thanks! I had forgotten about it as well. -- Jim Nasby decibel@decibel.org EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
Jim C. Nasby wrote: > What about adding the ability to ask the FSM for a page that's near a > given page? That way if you did have to go to the FSM you could at least > try and insert close to the page you originally wanted. Yeah, there's always room for improvement. I made the patch when I was working on clustered indexes, and was mostly concerned about getting inserts to the same page as other tuples with similar values so that the clustered index stays clustered. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On 5/16/07, Heikki Linnakangas <heikki@enterprisedb.com> wrote: > Jim C. Nasby wrote: > > What about adding the ability to ask the FSM for a page that's near a > > given page? That way if you did have to go to the FSM you could at least > > try and insert close to the page you originally wanted. > > Yeah, there's always room for improvement. I made the patch when I was > working on clustered indexes, and was mostly concerned about getting > inserts to the same page as other tuples with similar values so that the > clustered index stays clustered. > the patch doesn't apply in cvs... you'll need to update it... -- regards, Jaime Casanova "Programming today is a race between software engineers striving to build bigger and better idiot-proof programs and the universe trying to produce bigger and better idiots. So far, the universe is winning." Richard Cook
On 5/17/07, Jaime Casanova <systemguards@gmail.com> wrote: > the patch doesn't apply in cvs... you'll need to update it... cvs I wrote? HEAD I was meaning... sorry, too late for me =) -- regards, Jaime Casanova "Programming today is a race between software engineers striving to build bigger and better idiot-proof programs and the universe trying to produce bigger and better idiots. So far, the universe is winning." Richard Cook
Jaime Casanova wrote: > the patch doesn't apply in cvs... you'll need to update it... Oh, here you are. The implementation has changed a bit since August. I thought I had submitted an updated version in the winter but couldn't find it. Anyway, I updated and dusted off the source tree, tidied up the comments a little bit, and fixed some inconsistencies in pg_proc entries that made opr_sanity fail. The beef of the patch is two new optional indexam API functions: amprepareinsert and amfinishinsert. amprepareinsert is called before inserting the heap tuple. It descends the tree, finds and pins the right leaf page to insert to, and returns a suggestion on where the heap tuple should be inserted. amfinishinsert is called after inserting the heap tuple to actually insert the index tuple. Documentation for these functions needs to be added to indexam.sgml; I noticed that's not done yet. The cluster_inserts GUC option that you can use to enable/disable the feature should be removed before committing. The performance characteristics of this patch haven't been thoroughly discussed yet. The reason why you want to cluster your tables is to speed up SELECTs that return a bunch of tuples with similar values, for example range queries. The reason for keeping them clustered on inserts is to reduce the need to run CLUSTER as often. It doesn't come without a cost, however.
In the worst case, there is never room for new inserts on the suggested pages, and each insert needs to do one extra I/O to fetch the optimal heap page where the insert should go, see that there's no room, and then insert somewhere else. Using a non-zero fillfactor helps, but even when there is room on the page, it's often cheaper to just append to the end of the table and run CLUSTER at night, for example, than to do random access to insert to the "right" pages in the heap. So, should we have a WITH-option on the table to enable/disable this feature, and what would be the default? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com Index: doc/src/sgml/catalogs.sgml =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/doc/src/sgml/catalogs.sgml,v retrieving revision 2.137 diff -c -r2.137 catalogs.sgml *** doc/src/sgml/catalogs.sgml 12 Nov 2006 06:25:37 -0000 2.137 --- doc/src/sgml/catalogs.sgml 15 Jan 2007 17:24:52 -0000 *************** *** 499,504 **** --- 499,518 ---- <entry>Function to parse and validate <structfield>reloptions</> for an index</entry> </row> + <row> + <entry><structfield>amprepareinsert</structfield></entry> + <entry><type>regproc</type></entry> + <entry><literal><link linkend="catalog-pg-proc"><structname>pg_proc</structname></link>.oid</literal></entry> + <entry>Performs the 1st phase of a two-phase index insert, returning a suggestion of where in the heap to put a new tuple</entry> + </row> + + <row> + <entry><structfield>amfinishinsert</structfield></entry> + <entry><type>regproc</type></entry> + <entry><literal><link linkend="catalog-pg-proc"><structname>pg_proc</structname></link>.oid</literal></entry> + <entry>Finishes an index insert started with amprepareinsert</entry> + </row> + </tbody> </tgroup> </table> Index: src/backend/access/heap/heapam.c =================================================================== RCS file:
/home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/heap/heapam.c,v retrieving revision 1.222.2.1 diff -c -r1.222.2.1 heapam.c *** src/backend/access/heap/heapam.c 4 Feb 2007 20:00:49 -0000 1.222.2.1 --- src/backend/access/heap/heapam.c 18 May 2007 10:41:13 -0000 *************** *** 1363,1368 **** --- 1363,1372 ---- * use_fsm is passed directly to RelationGetBufferForTuple, which see for * more info. * + * If suggested_blk is a valid block number, the tuple will be inserted to + * that block if there's enough room. If it's full, a block will be chosen + * as if suggested_blk was not set. + * * The return value is the OID assigned to the tuple (either here or by the * caller), or InvalidOid if no OID. The header fields of *tup are updated * to match the stored tuple; in particular tup->t_self receives the actual *************** *** 1371,1377 **** */ Oid heap_insert(Relation relation, HeapTuple tup, CommandId cid, ! bool use_wal, bool use_fsm) { TransactionId xid = GetCurrentTransactionId(); HeapTuple heaptup; --- 1375,1381 ---- */ Oid heap_insert(Relation relation, HeapTuple tup, CommandId cid, ! bool use_wal, bool use_fsm, BlockNumber suggested_blk) { TransactionId xid = GetCurrentTransactionId(); HeapTuple heaptup; *************** *** 1421,1429 **** else heaptup = tup; ! /* Find buffer to insert this tuple into */ ! buffer = RelationGetBufferForTuple(relation, heaptup->t_len, ! InvalidBuffer, use_fsm); /* NO EREPORT(ERROR) from here till changes are logged */ START_CRIT_SECTION(); --- 1425,1467 ---- else heaptup = tup; ! /* Find buffer to insert this tuple into. Try the suggested block first ! * if caller gave one. ! */ ! if (suggested_blk != InvalidBlockNumber) ! { ! Buffer suggested_buf; ! Page pageHeader; ! Size pageFreeSpace; ! ! suggested_buf = ReadBuffer(relation, suggested_blk); ! pageHeader = (Page) BufferGetPage(suggested_buf); ! ! LockBuffer(suggested_buf, BUFFER_LOCK_EXCLUSIVE); ! ! /* Don't subtract fillfactor from the free space. That space is ! 
* reserved exactly for situations like this; keeping updated and ! * inserted tuples close to other tuples with similar values. ! */ ! pageFreeSpace = PageGetFreeSpace(pageHeader); ! ! if (heaptup->t_len <= pageFreeSpace) ! buffer = suggested_buf; ! else ! { ! /* Page was full. Release lock and pin and get another block ! * as if suggested_blk was not given. ! */ ! LockBuffer(suggested_buf, BUFFER_LOCK_UNLOCK); ! ReleaseBuffer(suggested_buf); ! ! buffer = RelationGetBufferForTuple(relation, heaptup->t_len, ! InvalidBuffer, use_fsm); ! } ! } ! else ! buffer = RelationGetBufferForTuple(relation, heaptup->t_len, ! InvalidBuffer, use_fsm); /* NO EREPORT(ERROR) from here till changes are logged */ START_CRIT_SECTION(); *************** *** 1531,1537 **** Oid simple_heap_insert(Relation relation, HeapTuple tup) { ! return heap_insert(relation, tup, GetCurrentCommandId(), true, true); } /* --- 1569,1576 ---- Oid simple_heap_insert(Relation relation, HeapTuple tup) { ! return heap_insert(relation, tup, GetCurrentCommandId(), true, ! 
true, InvalidBlockNumber); } /* Index: src/backend/access/index/indexam.c =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/index/indexam.c,v retrieving revision 1.95 diff -c -r1.95 indexam.c *** src/backend/access/index/indexam.c 4 Oct 2006 00:29:48 -0000 1.95 --- src/backend/access/index/indexam.c 18 May 2007 10:42:13 -0000 *************** *** 18,23 **** --- 18,25 ---- * index_rescan - restart a scan of an index * index_endscan - end a scan * index_insert - insert an index tuple into a relation + * index_prepareinsert - get desired insert location for a heap tuple + * index_finishinsert - insert a previously prepared index tuple * index_markpos - mark a scan position * index_restrpos - restore a scan position * index_getnext - get the next tuple from a scan *************** *** 202,207 **** --- 204,269 ---- BoolGetDatum(check_uniqueness))); } + /* ---------------- + * index_prepareinsert - get desired insert location for a heap tuple + * + * The returned BlockNumber is the *heap* page that is the best place + * to insert the given tuple to, according to the index am. The best + * place is one that maintains the cluster order. + * + * opaque should be passed to a later index_finishinsert to finish the + * insert. + * ---------------- + */ + BlockNumber + index_prepareinsert(Relation indexRelation, + Datum *values, + bool *isnull, + Relation heapRelation, + bool check_uniqueness, + void **opaque) + { + FmgrInfo *procedure; + + RELATION_CHECKS; + GET_REL_PROCEDURE(amprepareinsert); + + /* + * have the am's prepareinsert proc do all the work. 
+ */ + return DatumGetUInt32(FunctionCall6(procedure, + PointerGetDatum(indexRelation), + PointerGetDatum(values), + PointerGetDatum(isnull), + PointerGetDatum(heapRelation), + BoolGetDatum(check_uniqueness), + PointerGetDatum(opaque))); + } + + /* ---------------- + * index_finishinsert - insert a previously prepared index tuple + * + * Finishes an insert operation initiated by an earlier call to + * index_prepareinsert. + * ---------------- + */ + bool + index_finishinsert(Relation indexRelation, + ItemPointer heap_t_ctid, void *opaque) + { + FmgrInfo *procedure; + + RELATION_CHECKS; + GET_REL_PROCEDURE(amfinishinsert); + + /* + * have the am's finishinsert proc do all the work. + */ + return DatumGetBool(FunctionCall2(procedure, + PointerGetDatum(heap_t_ctid), + PointerGetDatum(opaque))); + } + /* * index_beginscan - start a scan of an index with amgettuple * Index: src/backend/access/nbtree/nbtinsert.c =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/nbtree/nbtinsert.c,v retrieving revision 1.146.2.1 diff -c -r1.146.2.1 nbtinsert.c *** src/backend/access/nbtree/nbtinsert.c 27 Jan 2007 20:53:36 -0000 1.146.2.1 --- src/backend/access/nbtree/nbtinsert.c 18 May 2007 10:19:05 -0000 *************** *** 86,99 **** /* we need an insertion scan key to do our search, so build one */ itup_scankey = _bt_mkscankey(rel, itup); - top: /* find the first page containing this key */ stack = _bt_search(rel, natts, itup_scankey, false, &buf, BT_WRITE); /* trade in our read lock for a write lock */ LockBuffer(buf, BUFFER_LOCK_UNLOCK); LockBuffer(buf, BT_WRITE); /* * If the page was split between the time that we surrendered our read * lock and acquired our write lock, then this page may no longer be the --- 86,198 ---- /* we need an insertion scan key to do our search, so build one */ itup_scankey = _bt_mkscankey(rel, itup); /* find the first page containing this key */ stack = _bt_search(rel, 
natts, itup_scankey, false, &buf, BT_WRITE); /* trade in our read lock for a write lock */ LockBuffer(buf, BUFFER_LOCK_UNLOCK); + + _bt_finishinsert(rel, heapRel, index_is_unique, itup, + itup_scankey, stack, buf); + } + + /* + * _bt_prepareinsert() -- Find the insert location for a new tuple + * + * Descends the tree and finds the location for a new index tuple. + * As a hint to the executor, returns the heap block number the previous + * index tuple at that location points to. By inserting the heap tuple + * to that block, the heap will stay better clustered than by inserting + * to a random block. + */ + BlockNumber + _bt_prepareinsert(Relation rel, IndexTuple itup, bool index_is_unique, + Relation heapRel, BTInsertInfo *opaquePtr) + { + int natts = rel->rd_rel->relnatts; + OffsetNumber offset; + Page page; + BTPageOpaque opaque; + + ScanKey itup_scankey; + BTStack stack; + Buffer buf; + BlockNumber suggestion = InvalidBlockNumber; + BTInsertInfo insert_opaque; + + /* we need an insertion scan key to do our search, so build one */ + itup_scankey = _bt_mkscankey(rel, itup); + + /* find the first page containing this key */ + stack = _bt_search(rel, natts, itup_scankey, false, &buf, BT_READ); + if(!BufferIsValid(buf)) + { + /* The index was completely empty. No suggestion then. */ + *opaquePtr = NULL; + return InvalidBlockNumber; + } + + page = BufferGetPage(buf); + opaque = (BTPageOpaque) PageGetSpecialPointer(page); + + /* Find the location in the page where the new index tuple would go. */ + + offset = _bt_binsrch(rel, buf, natts, itup_scankey, false); + if (offset > PageGetMaxOffsetNumber(page)) + { + /* _bt_binsrch returned pointer to end-of-page. It means that + * there were no equal items on the page, and the new item should + * be inserted as the last tuple of the page. There could be equal + * items on the next page, however. + * + * At the moment, we just ignore the potential equal items on the + * right, and pretend there aren't any. 
We could instead walk right + * to the next page to check that, but let's keep it simple for now. + */ + offset = OffsetNumberPrev(offset); + } + if(offset < P_FIRSTDATAKEY(opaque)) + { + /* We landed on an empty page. We could step left or right until + * we find some items, but let's keep it simple for now. + */ + } else { + /* We're now positioned at the index tuple that we're interested in. */ + ItemId iid = PageGetItemId(page, offset); + IndexTuple curitup = (IndexTuple) PageGetItem(page, iid); + + suggestion = ItemPointerGetBlockNumber(&curitup->t_tid); + } + + LockBuffer(buf, BUFFER_LOCK_UNLOCK); + + insert_opaque = *opaquePtr = palloc(sizeof(struct BTInsertInfoData)); + insert_opaque->rel = rel; + insert_opaque->heapRel = heapRel; + insert_opaque->index_is_unique = index_is_unique; + insert_opaque->itup = itup; + insert_opaque->itup_scankey = itup_scankey; + insert_opaque->stack = stack; + insert_opaque->buf = buf; + + return suggestion; + } + + /* + * _bt_finishinsert() -- Handle insertion of a single index tuple in the tree. + * + */ + void + _bt_finishinsert(Relation rel, Relation heapRel, bool index_is_unique, + IndexTuple itup, ScanKey itup_scankey, + BTStack stack, Buffer buf) + { + int natts = rel->rd_rel->relnatts; + LockBuffer(buf, BT_WRITE); + top: + /* * If the page was split between the time that we surrendered our read * lock and acquired our write lock, then this page may no longer be the *************** *** 133,138 **** --- 232,245 ---- XactLockTableWait(xwait); /* start over... 
*/ _bt_freestack(stack); + + /* find the first page containing this key */ + stack = _bt_search(rel, natts, itup_scankey, false, &buf, BT_WRITE); + + /* trade in our read lock for a write lock */ + LockBuffer(buf, BUFFER_LOCK_UNLOCK); + LockBuffer(buf, BT_WRITE); + goto top; } } *************** *** 143,148 **** --- 250,256 ---- /* be tidy */ _bt_freestack(stack); _bt_freeskey(itup_scankey); + pfree(itup); } /* Index: src/backend/access/nbtree/nbtree.c =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/nbtree/nbtree.c,v retrieving revision 1.153 diff -c -r1.153 nbtree.c *** src/backend/access/nbtree/nbtree.c 1 Nov 2006 19:43:17 -0000 1.153 --- src/backend/access/nbtree/nbtree.c 18 May 2007 10:20:00 -0000 *************** *** 223,229 **** _bt_doinsert(rel, itup, checkUnique, heapRel); ! pfree(itup); PG_RETURN_BOOL(true); } --- 223,278 ---- _bt_doinsert(rel, itup, checkUnique, heapRel); ! PG_RETURN_BOOL(true); ! } ! ! /* ! * btprepareinsert() -- find the best place in the heap to put a new tuple. ! * ! * This uses the same logic as btinsert to find the place where the index ! * tuple would go if this was a btinsert call. ! */ ! Datum ! btprepareinsert(PG_FUNCTION_ARGS) ! { ! Relation rel = (Relation) PG_GETARG_POINTER(0); ! Datum *values = (Datum *) PG_GETARG_POINTER(1); ! bool *isnull = (bool *) PG_GETARG_POINTER(2); ! Relation heapRel = (Relation) PG_GETARG_POINTER(3); ! bool checkUnique = PG_GETARG_BOOL(4); ! void **opaquePtr = (void **) PG_GETARG_POINTER(5); ! IndexTuple itup; ! BlockNumber suggestion; ! ! /* generate an index tuple */ ! itup = index_form_tuple(RelationGetDescr(rel), values, isnull); ! ! suggestion =_bt_prepareinsert(rel, itup, checkUnique, heapRel, ! (BTInsertInfo *) opaquePtr); ! ! PG_RETURN_UINT32(suggestion); ! } ! ! /* ! * btfinishinsert() -- finish insert ! */ ! Datum ! btfinishinsert(PG_FUNCTION_ARGS) ! { ! 
ItemPointer ht_ctid = (ItemPointer) PG_GETARG_POINTER(0); ! BTInsertInfo opaque = (void *) PG_GETARG_POINTER(1); ! ! opaque->itup->t_tid = *ht_ctid; ! ! _bt_finishinsert(opaque->rel, ! opaque->heapRel, ! opaque->index_is_unique, ! opaque->itup, ! opaque->itup_scankey, ! opaque->stack, ! opaque->buf); ! ! pfree(opaque); PG_RETURN_BOOL(true); } Index: src/backend/executor/execMain.c =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/executor/execMain.c,v retrieving revision 1.280.2.2 diff -c -r1.280.2.2 execMain.c *** src/backend/executor/execMain.c 2 Feb 2007 00:07:27 -0000 1.280.2.2 --- src/backend/executor/execMain.c 18 May 2007 09:48:19 -0000 *************** *** 53,58 **** --- 53,59 ---- #include "utils/lsyscache.h" #include "utils/memutils.h" + bool cluster_inserts = true; /* GUC */ typedef struct evalPlanQual { *************** *** 847,854 **** --- 848,857 ---- resultRelInfo->ri_RangeTableIndex = resultRelationIndex; resultRelInfo->ri_RelationDesc = resultRelationDesc; resultRelInfo->ri_NumIndices = 0; + resultRelInfo->ri_ClusterIndex = -1; resultRelInfo->ri_IndexRelationDescs = NULL; resultRelInfo->ri_IndexRelationInfo = NULL; + resultRelInfo->ri_PreparedInsertOpaque = NULL; /* make a copy so as not to depend on relcache info not changing... 
*/ resultRelInfo->ri_TrigDesc = CopyTriggerDesc(resultRelationDesc->trigdesc); if (resultRelInfo->ri_TrigDesc) *************** *** 1331,1336 **** --- 1334,1340 ---- ResultRelInfo *resultRelInfo; Relation resultRelationDesc; Oid newId; + BlockNumber suggestedBlock; /* * get the heap tuple out of the tuple table slot, making sure we have a *************** *** 1379,1384 **** --- 1383,1395 ---- if (resultRelationDesc->rd_att->constr) ExecConstraints(resultRelInfo, slot, estate); + /* Ask the index am of the clustered index for the + * best place to put it */ + if(cluster_inserts) + suggestedBlock = ExecPrepareIndexInsert(slot, estate); + else + suggestedBlock = InvalidBlockNumber; + /* * insert the tuple * *************** *** 1387,1393 **** */ newId = heap_insert(resultRelationDesc, tuple, estate->es_snapshot->curcid, ! true, true); IncrAppended(); (estate->es_processed)++; --- 1398,1404 ---- */ newId = heap_insert(resultRelationDesc, tuple, estate->es_snapshot->curcid, ! true, true, suggestedBlock); IncrAppended(); (estate->es_processed)++; *************** *** 2554,2560 **** tuple, estate->es_snapshot->curcid, estate->es_into_relation_use_wal, ! false); /* never any point in using FSM */ /* We know this is a newly created relation, so there are no indexes */ --- 2565,2572 ---- tuple, estate->es_snapshot->curcid, estate->es_into_relation_use_wal, ! false, /* never any point in using FSM */ ! 
InvalidBlockNumber); /* We know this is a newly created relation, so there are no indexes */ Index: src/backend/executor/execUtils.c =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/executor/execUtils.c,v retrieving revision 1.140.2.3 diff -c -r1.140.2.3 execUtils.c *** src/backend/executor/execUtils.c 6 Feb 2007 17:35:27 -0000 1.140.2.3 --- src/backend/executor/execUtils.c 18 May 2007 09:48:19 -0000 *************** *** 32,37 **** --- 32,38 ---- * ExecOpenIndices \ * ExecCloseIndices | referenced by InitPlan, EndPlan, * ExecInsertIndexTuples / ExecInsert, ExecUpdate + * ExecSuggestBlock Referenced by ExecInsert * * RegisterExprContextCallback Register function shutdown callback * UnregisterExprContextCallback Deregister function shutdown callback *************** *** 939,944 **** --- 940,946 ---- IndexInfo **indexInfoArray; resultRelInfo->ri_NumIndices = 0; + resultRelInfo->ri_ClusterIndex = -1; /* fast path if no indexes */ if (!RelationGetForm(resultRelation)->relhasindex) *************** *** 978,983 **** --- 980,990 ---- /* extract index key information from the index's pg_index info */ ii = BuildIndexInfo(indexDesc); + /* Remember which index is the clustered one. + * It's used to call the suggestblock-method on inserts */ + if(indexDesc->rd_index->indisclustered) + resultRelInfo->ri_ClusterIndex = i; + relationDescs[i] = indexDesc; indexInfoArray[i] = ii; i++; *************** *** 1044,1049 **** --- 1051,1058 ---- ExprContext *econtext; Datum values[INDEX_MAX_KEYS]; bool isnull[INDEX_MAX_KEYS]; + int clusterIndex; + bool preparedInsert; /* * Get information from the result relation info structure. 
*************** *** 1053,1058 **** --- 1062,1069 ---- relationDescs = resultRelInfo->ri_IndexRelationDescs; indexInfoArray = resultRelInfo->ri_IndexRelationInfo; heapRelation = resultRelInfo->ri_RelationDesc; + clusterIndex = resultRelInfo->ri_ClusterIndex; + preparedInsert = resultRelInfo->ri_PreparedInsertOpaque != NULL; /* * We will use the EState's per-tuple context for evaluating predicates *************** *** 1063,1068 **** --- 1074,1092 ---- /* Arrange for econtext's scan tuple to be the tuple under test */ econtext->ecxt_scantuple = slot; + if (preparedInsert) + { + index_finishinsert(relationDescs[clusterIndex], + tupleid, + resultRelInfo->ri_PreparedInsertOpaque); + resultRelInfo->ri_PreparedInsertOpaque = NULL; + + /* + * keep track of index inserts for debugging + */ + IncrIndexInserted(); + } + /* * for each index, form and insert the index tuple */ *************** *** 1073,1078 **** --- 1097,1105 ---- if (relationDescs[i] == NULL) continue; + if (preparedInsert && i == clusterIndex) + continue; /* insert to clustered index was already handled above */ + indexInfo = indexInfoArray[i]; /* Check for partial index */ *************** *** 1127,1132 **** --- 1154,1228 ---- } } + /* ---------------------------------------------------------------- + * ExecPrepareIndexInsert + * + * This routine asks the index am where a new heap tuple + * should be placed. + * ---------------------------------------------------------------- + */ + BlockNumber + ExecPrepareIndexInsert(TupleTableSlot *slot, + EState *estate) + { + ResultRelInfo *resultRelInfo; + int i; + Relation relationDesc; + Relation heapRelation; + ExprContext *econtext; + Datum values[INDEX_MAX_KEYS]; + bool isnull[INDEX_MAX_KEYS]; + IndexInfo *indexInfo; + + /* + * Get information from the result relation info structure. 
+ */ + resultRelInfo = estate->es_result_relation_info; + i = resultRelInfo->ri_ClusterIndex; + if (i == -1) + return InvalidBlockNumber; /* there was no clustered index */ + + heapRelation = resultRelInfo->ri_RelationDesc; + relationDesc = resultRelInfo->ri_IndexRelationDescs[i]; + indexInfo = resultRelInfo->ri_IndexRelationInfo[i]; + + if (!OidIsValid(relationDesc->rd_am->amprepareinsert)) + return InvalidBlockNumber; /* the indexam doesn't support the + * two-phase insert API */ + + /* You can't cluster on a partial index */ + Assert(indexInfo->ii_Predicate == NIL); + + /* + * We will use the EState's per-tuple context for evaluating + * index expressions (creating it if it's not already there). + */ + econtext = GetPerTupleExprContext(estate); + + /* Arrange for econtext's scan tuple to be the tuple under test */ + econtext->ecxt_scantuple = slot; + + /* + * FormIndexDatum fills in its values and isnull parameters with the + * appropriate values for the column(s) of the index. + */ + FormIndexDatum(indexInfo, + slot, + estate, + values, + isnull); + + /* + * The index AM does the rest. 
+ */ + return index_prepareinsert(relationDesc, /* index relation */ + values, /* array of index Datums */ + isnull, /* null flags */ + heapRelation, + relationDesc->rd_index->indisunique, + &resultRelInfo->ri_PreparedInsertOpaque); + } + /* * UpdateChangedParamSet * Add changed parameters to a plan node's chgParam set Index: src/backend/utils/misc/guc.c =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/utils/misc/guc.c,v retrieving revision 1.360.2.1 diff -c -r1.360.2.1 guc.c *** src/backend/utils/misc/guc.c 23 Apr 2007 15:13:30 -0000 1.360.2.1 --- src/backend/utils/misc/guc.c 18 May 2007 09:48:22 -0000 *************** *** 93,98 **** --- 93,99 ---- #define MS_PER_D (1000 * 60 * 60 * 24) /* XXX these should appear in other modules' header files */ + extern bool cluster_inserts; extern bool Log_disconnections; extern int CommitDelay; extern int CommitSiblings; *************** *** 403,408 **** --- 404,417 ---- static struct config_bool ConfigureNamesBool[] = { { + {"cluster_inserts", PGC_USERSET, DEVELOPER_OPTIONS, + gettext_noop("Tries to maintain cluster order on inserts."), + NULL + }, + &cluster_inserts, + true, NULL, NULL + }, + { {"enable_seqscan", PGC_USERSET, QUERY_TUNING_METHOD, gettext_noop("Enables the planner's use of sequential-scan plans."), NULL Index: src/include/access/genam.h =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/access/genam.h,v retrieving revision 1.65 diff -c -r1.65 genam.h *** src/include/access/genam.h 31 Jul 2006 20:09:05 -0000 1.65 --- src/include/access/genam.h 18 May 2007 10:21:49 -0000 *************** *** 93,98 **** --- 93,106 ---- ItemPointer heap_t_ctid, Relation heapRelation, bool check_uniqueness); + extern BlockNumber index_prepareinsert(Relation indexRelation, + Datum *values, bool *isnull, + Relation heapRelation, + bool check_uniqueness, + void **opaque); + extern bool 
index_finishinsert(Relation indexRelation, + ItemPointer heap_t_ctid, + void *opaque); extern IndexScanDesc index_beginscan(Relation heapRelation, Relation indexRelation, Index: src/include/access/heapam.h =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/access/heapam.h,v retrieving revision 1.117 diff -c -r1.117 heapam.h *** src/include/access/heapam.h 5 Nov 2006 22:42:10 -0000 1.117 --- src/include/access/heapam.h 15 Jan 2007 17:24:52 -0000 *************** *** 157,163 **** extern void setLastTid(const ItemPointer tid); extern Oid heap_insert(Relation relation, HeapTuple tup, CommandId cid, ! bool use_wal, bool use_fsm); extern HTSU_Result heap_delete(Relation relation, ItemPointer tid, ItemPointer ctid, TransactionId *update_xmax, CommandId cid, Snapshot crosscheck, bool wait); --- 157,163 ---- extern void setLastTid(const ItemPointer tid); extern Oid heap_insert(Relation relation, HeapTuple tup, CommandId cid, ! 
bool use_wal, bool use_fsm, BlockNumber suggestedblk); extern HTSU_Result heap_delete(Relation relation, ItemPointer tid, ItemPointer ctid, TransactionId *update_xmax, CommandId cid, Snapshot crosscheck, bool wait); Index: src/include/access/nbtree.h =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/access/nbtree.h,v retrieving revision 1.106 diff -c -r1.106 nbtree.h *** src/include/access/nbtree.h 1 Nov 2006 19:43:17 -0000 1.106 --- src/include/access/nbtree.h 18 May 2007 10:22:09 -0000 *************** *** 479,488 **** --- 479,511 ---- extern Datum btbulkdelete(PG_FUNCTION_ARGS); extern Datum btvacuumcleanup(PG_FUNCTION_ARGS); extern Datum btoptions(PG_FUNCTION_ARGS); + extern Datum btprepareinsert(PG_FUNCTION_ARGS); + extern Datum btfinishinsert(PG_FUNCTION_ARGS); + + /* Filled in by _bt_prepareinsert */ + typedef struct BTInsertInfoData + { + Relation rel; + Relation heapRel; + bool index_is_unique; + IndexTuple itup; + ScanKey itup_scankey; + Buffer buf; /* pinned, not locked */ + BTStack stack; + } BTInsertInfoData; + + typedef BTInsertInfoData *BTInsertInfo; /* * prototypes for functions in nbtinsert.c */ + extern BlockNumber _bt_prepareinsert(Relation rel, IndexTuple itup, + bool index_is_unique, Relation heapRel, + BTInsertInfo *opaquePtr); + extern void _bt_finishinsert(Relation rel, Relation heapRel, + bool check_uniqueness, + IndexTuple itup, ScanKey itup_scankey, + BTStack stack, Buffer buf); extern void _bt_doinsert(Relation rel, IndexTuple itup, bool index_is_unique, Relation heapRel); extern Buffer _bt_getstackbuf(Relation rel, BTStack stack, int access); Index: src/include/catalog/pg_am.h =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/catalog/pg_am.h,v retrieving revision 1.46 diff -c -r1.46 pg_am.h *** src/include/catalog/pg_am.h 31 Jul 2006 20:09:05 -0000 1.46 --- src/include/catalog/pg_am.h 
18 May 2007 10:25:26 -0000 *************** *** 65,70 **** --- 65,72 ---- regproc amvacuumcleanup; /* post-VACUUM cleanup function */ regproc amcostestimate; /* estimate cost of an indexscan */ regproc amoptions; /* parse AM-specific parameters */ + regproc amprepareinsert; /* get desired insert location on heap */ + regproc amfinishinsert; /* finish a prepared insert operation */ } FormData_pg_am; /* ---------------- *************** *** 78,84 **** * compiler constants for pg_am * ---------------- */ ! #define Natts_pg_am 23 #define Anum_pg_am_amname 1 #define Anum_pg_am_amstrategies 2 #define Anum_pg_am_amsupport 3 --- 80,86 ---- * compiler constants for pg_am * ---------------- */ ! #define Natts_pg_am 25 #define Anum_pg_am_amname 1 #define Anum_pg_am_amstrategies 2 #define Anum_pg_am_amsupport 3 *************** *** 102,123 **** #define Anum_pg_am_amvacuumcleanup 21 #define Anum_pg_am_amcostestimate 22 #define Anum_pg_am_amoptions 23 /* ---------------- * initial contents of pg_am * ---------------- */ ! DATA(insert OID = 403 ( btree 5 1 1 t t t t f t btinsert btbeginscan btgettuple btgetmulti btrescan btendscan btmarkposbtrestrpos btbuild btbulkdelete btvacuumcleanup btcostestimate btoptions )); DESCR("b-tree index access method"); #define BTREE_AM_OID 403 ! DATA(insert OID = 405 ( hash 1 1 0 f f f f f f hashinsert hashbeginscan hashgettuple hashgetmulti hashrescan hashendscanhashmarkpos hashrestrpos hashbuild hashbulkdelete hashvacuumcleanup hashcostestimate hashoptions )); DESCR("hash index access method"); #define HASH_AM_OID 405 ! DATA(insert OID = 783 ( gist 100 7 0 f t t t t t gistinsert gistbeginscan gistgettuple gistgetmulti gistrescan gistendscangistmarkpos gistrestrpos gistbuild gistbulkdelete gistvacuumcleanup gistcostestimate gistoptions )); DESCR("GiST index access method"); #define GIST_AM_OID 783 ! 
DATA(insert OID = 2742 ( gin 100 4 0 f f f f t f gininsert ginbeginscan gingettuple gingetmulti ginrescan ginendscanginmarkpos ginrestrpos ginbuild ginbulkdelete ginvacuumcleanup gincostestimate ginoptions )); DESCR("GIN index access method"); #define GIN_AM_OID 2742 --- 104,127 ---- #define Anum_pg_am_amvacuumcleanup 21 #define Anum_pg_am_amcostestimate 22 #define Anum_pg_am_amoptions 23 + #define Anum_pg_am_amprepareinsert 24 + #define Anum_pg_am_amfinishinsert 25 /* ---------------- * initial contents of pg_am * ---------------- */ ! DATA(insert OID = 403 ( btree 5 1 1 t t t t f t btinsert btbeginscan btgettuple btgetmulti btrescan btendscan btmarkposbtrestrpos btbuild btbulkdelete btvacuumcleanup btcostestimate btoptions btprepareinsert btfinishinsert)); DESCR("b-tree index access method"); #define BTREE_AM_OID 403 ! DATA(insert OID = 405 ( hash 1 1 0 f f f f f f hashinsert hashbeginscan hashgettuple hashgetmulti hashrescan hashendscanhashmarkpos hashrestrpos hashbuild hashbulkdelete hashvacuumcleanup hashcostestimate hashoptions - -)); DESCR("hash index access method"); #define HASH_AM_OID 405 ! DATA(insert OID = 783 ( gist 100 7 0 f t t t t t gistinsert gistbeginscan gistgettuple gistgetmulti gistrescan gistendscangistmarkpos gistrestrpos gistbuild gistbulkdelete gistvacuumcleanup gistcostestimate gistoptions - -)); DESCR("GiST index access method"); #define GIST_AM_OID 783 ! 
DATA(insert OID = 2742 ( gin 100 4 0 f f f f t f gininsert ginbeginscan gingettuple gingetmulti ginrescan ginendscanginmarkpos ginrestrpos ginbuild ginbulkdelete ginvacuumcleanup gincostestimate ginoptions - -)); DESCR("GIN index access method"); #define GIN_AM_OID 2742 Index: src/include/catalog/pg_proc.h =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/catalog/pg_proc.h,v retrieving revision 1.429 diff -c -r1.429 pg_proc.h *** src/include/catalog/pg_proc.h 28 Nov 2006 19:18:44 -0000 1.429 --- src/include/catalog/pg_proc.h 18 May 2007 11:32:04 -0000 *************** *** 682,687 **** --- 682,691 ---- DESCR("btree(internal)"); DATA(insert OID = 2785 ( btoptions PGNSP PGUID 12 f f t f s 2 17 "1009 16" _null_ _null_ _null_ btoptions -_null_ )); DESCR("btree(internal)"); + DATA(insert OID = 5433 ( btprepareinsert PGNSP PGUID 12 f f t f v 6 23 "2281 2281 2281 2281 2281 2281" _null_ _null__null_ btprepareinsert - _null_ )); + DESCR("btree(internal)"); + DATA(insert OID = 5430 ( btfinishinsert PGNSP PGUID 12 f f t f v 2 16 "2281 2281" _null_ _null_ _null_ btfinishinsert- _null_ )); + DESCR("btree(internal)"); DATA(insert OID = 339 ( poly_same PGNSP PGUID 12 f f t f i 2 16 "604 604" _null_ _null_ _null_ poly_same - _null_)); DESCR("same as?"); Index: src/include/executor/executor.h =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/executor/executor.h,v retrieving revision 1.130.2.2 diff -c -r1.130.2.2 executor.h *** src/include/executor/executor.h 2 Feb 2007 00:07:28 -0000 1.130.2.2 --- src/include/executor/executor.h 18 May 2007 09:48:24 -0000 *************** *** 275,280 **** --- 275,281 ---- extern void ExecCloseIndices(ResultRelInfo *resultRelInfo); extern void ExecInsertIndexTuples(TupleTableSlot *slot, ItemPointer tupleid, EState *estate, bool is_vacuum); + extern BlockNumber ExecPrepareIndexInsert(TupleTableSlot 
*slot, EState *estate); extern void RegisterExprContextCallback(ExprContext *econtext, ExprContextCallbackFunction function, Index: src/include/nodes/execnodes.h =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/nodes/execnodes.h,v retrieving revision 1.161.2.2 diff -c -r1.161.2.2 execnodes.h *** src/include/nodes/execnodes.h 26 Apr 2007 23:24:57 -0000 1.161.2.2 --- src/include/nodes/execnodes.h 18 May 2007 09:48:25 -0000 *************** *** 259,264 **** --- 259,266 ---- * NumIndices # of indices existing on result relation * IndexRelationDescs array of relation descriptors for indices * IndexRelationInfo array of key/attr info for indices + * ClusterIndex index to the IndexRelationInfo array of the + * clustered index, or -1 if there's none * TrigDesc triggers to be fired, if any * TrigFunctions cached lookup info for trigger functions * TrigInstrument optional runtime measurements for triggers *************** *** 275,286 **** --- 277,291 ---- int ri_NumIndices; RelationPtr ri_IndexRelationDescs; IndexInfo **ri_IndexRelationInfo; + int ri_ClusterIndex; TriggerDesc *ri_TrigDesc; FmgrInfo *ri_TrigFunctions; struct Instrumentation *ri_TrigInstrument; List **ri_ConstraintExprs; JunkFilter *ri_junkFilter; ProjectionInfo *ri_projectReturning; + + void *ri_PreparedInsertOpaque; } ResultRelInfo; /* ---------------- Index: src/include/utils/rel.h =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/utils/rel.h,v retrieving revision 1.92 diff -c -r1.92 rel.h *** src/include/utils/rel.h 4 Oct 2006 00:30:10 -0000 1.92 --- src/include/utils/rel.h 15 Jan 2007 17:24:52 -0000 *************** *** 116,121 **** --- 116,123 ---- FmgrInfo amvacuumcleanup; FmgrInfo amcostestimate; FmgrInfo amoptions; + FmgrInfo amprepareinsert; + FmgrInfo amfinishinsert; } RelationAmInfo;
Your patch has been added to the PostgreSQL unapplied patches list at: http://momjian.postgresql.org/cgi-bin/pgpatches It will be applied as soon as one of the PostgreSQL committers reviews and approves it. --------------------------------------------------------------------------- Heikki Linnakangas wrote: > Jaime Casanova wrote: > > On 5/16/07, Heikki Linnakangas <heikki@enterprisedb.com> wrote: > >> Jim C. Nasby wrote: > >> > What about adding the ability to ask the FSM for a page that's near a > >> > given page? That way if you did have to go to the FSM you could at > >> least > >> > try and insert close to the page you originally wanted. > >> > >> Yeah, there's always room for improvement. I made the patch when I was > >> working on clustered indexes, and was mostly concerned about getting > >> inserts to the same page as other tuples with similar values so that the > >> clustered index stays clustered. > >> > > > > the patch doesn't apply in cvs... you'll need to update it... > > Oh, here you are. > > The implementation has changed a bit since August. I thought I had > submitted an updated version in the winter but couldn't find it. Anyway, > I updated and dusted off the source tree, tidied up the comments a > little bit, and fixed some inconsistencies in pg_proc entries that made > opr_sanity to fail. > > The beef of the patch is two new optional indexam API functions: > amprepareinsert and amfinishinsert. amprepareinsert is called before > inserting the heap tuple. It descends the tree and finds and pins the > right leaf page to insert to, and returns a suggestion on where the heap > tuple should be inserted. amfinishinsert is called after inserting the > heap tuple to actually insert the index tuple. Documentation for these > functions need to be added indexam.sgml, I noticed that that's not done yet. > > The cluster_inserts GUC option that you can use to enable/disable the > feature should be removed before committing. 
>
> The performance characteristics of this patch haven't been thoroughly
> discussed yet. The reason why you want to cluster your tables is to
> speed up SELECTs that return a bunch of tuples with similar values, for
> example range queries. The reason for keeping them clustered on inserts
> is to reduce the need to run CLUSTER as often.
>
> It doesn't come without a cost, however. In the worst case, there never
> is room for new inserts on pages, and each insert needs to do one extra
> I/O to fetch the optimal heap page where the insert should go, see that
> there's no room, and then insert somewhere else. Using a non-zero
> fillfactor helps, but even when there is room on the page, it's often
> cheaper to just append to the end of the table and run CLUSTER at
> night, for example, than to do random access to insert to the "right"
> pages in the heap.
>
> So, should we have a WITH-option on the table to enable/disable this
> feature, and what would be the default?
>
> --
> Heikki Linnakangas
> EnterpriseDB   http://www.enterprisedb.com
>
> ---------------------------(end of broadcast)---------------------------
> TIP 1: if posting/reading through Usenet, please send an appropriate
>        subscribe-nomail command to majordomo@postgresql.org so that your
>        message can get through to the mailing list cleanly

--
Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                             http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +
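The fillfactor interaction mentioned above has a subtlety visible in the heap_insert hunk earlier in the thread: the suggested-block path deliberately does not subtract the fillfactor reservation from the page's free space, since that reserved space exists precisely for keeping similar tuples together. A minimal sketch of the arithmetic (assuming the default 8 kB block size; the function names are invented):

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_PAGE_SIZE 8192   /* PostgreSQL's default block size */

/* Space a given fillfactor (integer percent) reserves on each heap page. */
static int
reserved_space(int fillfactor)
{
    return TOY_PAGE_SIZE * (100 - fillfactor) / 100;
}

/* An ordinary FSM-driven insert respects the fillfactor reservation... */
static bool
fits_ordinary(int page_free, int tuple_len, int fillfactor)
{
    return tuple_len <= page_free - reserved_space(fillfactor);
}

/* ...but the patch's suggested-block path uses all remaining free space:
 * the reservation is there exactly for clustered inserts and updates. */
static bool
fits_suggested(int page_free, int tuple_len)
{
    return tuple_len <= page_free;
}
```

With fillfactor 90, each page keeps 819 bytes back from ordinary inserts, so a page with 1000 bytes free rejects a 500-byte tuple through the FSM path yet accepts it when it is the suggested block.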
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> The beef of the patch is two new optional indexam API functions:
> amprepareinsert and amfinishinsert. amprepareinsert is called before
> inserting the heap tuple. It descends the tree, finds and pins the
> right leaf page to insert to, and returns a suggestion on where the heap
> tuple should be inserted. amfinishinsert is called after inserting the
> heap tuple, to actually insert the index tuple. Documentation for these
> functions still needs to be added to indexam.sgml; I noticed that that's
> not done yet.

What happens when there's more than one index?

Is there a risk of deadlock during concurrent insertions (from different
processes trying to lock the same buffers in different orders)?

			regards, tom lane
Tom Lane wrote:
> Heikki Linnakangas <heikki@enterprisedb.com> writes:
>> The beef of the patch is two new optional indexam API functions:
>> amprepareinsert and amfinishinsert. amprepareinsert is called before
>> inserting the heap tuple. It descends the tree, finds and pins the
>> right leaf page to insert to, and returns a suggestion on where the heap
>> tuple should be inserted. amfinishinsert is called after inserting the
>> heap tuple, to actually insert the index tuple. Documentation for these
>> functions still needs to be added to indexam.sgml; I noticed that that's
>> not done yet.
>
> What happens when there's more than one index?

You can only cluster a table on one index. The other index inserts will
use the aminsert function, after inserting the heap tuple as usual.

> Is there a risk of deadlock during concurrent insertions (from different
> processes trying to lock the same buffers in different orders)?

No, it should be OK. The btprepareinsert function will release the locks
(but not the pin) after descending the tree, and btfinishinsert will
reacquire them. The locking order is the same as today.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
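The locking argument above can be made concrete with a small trace model. Because btprepareinsert drops the leaf-page lock (keeping only the pin) before the heap tuple is inserted, no index-page lock is ever held across the heap insert, so the buffer-lock ordering other backends observe matches a plain btinsert. This sketch and its event encoding are invented for illustration, not taken from the patch:

```c
#include <assert.h>

/* Events in a single backend's insert path. */
typedef enum { EV_LOCK_LEAF, EV_UNLOCK_LEAF, EV_HEAP_INSERT } Event;

/* Number of leaf-page locks held at the moment the heap insert happens.
 * Zero means the index-page and heap-page lock domains never overlap,
 * which is the property that keeps the protocol deadlock-free. */
static int
leaf_locks_at_heap_insert(const Event *trace, int n)
{
    int held = 0, at_insert = 0;

    for (int i = 0; i < n; i++)
    {
        if (trace[i] == EV_LOCK_LEAF)
            held++;
        else if (trace[i] == EV_UNLOCK_LEAF)
            held--;
        else
            at_insert = held;
    }
    return at_insert;
}
```

The prepare/finish trace (lock during descent, unlock, heap insert, relock, unlock) yields zero; holding the leaf lock across the heap insert would not.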
On 5/18/07, Heikki Linnakangas <heikki@enterprisedb.com> wrote:
> Jaime Casanova wrote:
> >
> > the patch doesn't apply in cvs... you'll need to update it...
>
> Oh, here you are.
>
> The implementation has changed a bit since August. I thought I had
> submitted an updated version in the winter but couldn't find it. Anyway,
> I updated and dusted off the source tree, tidied up the comments a
> little bit, and fixed some inconsistencies in pg_proc entries that made
> opr_sanity fail.
>

this one doesn't apply either... there are problems with nbtinsert.c
and pg_am.h

--
regards,
Jaime Casanova

"Programming today is a race between software engineers striving to
build bigger and better idiot-proof programs and the universe trying
to produce bigger and better idiots. So far, the universe is winning."
                                                       Richard Cook
Jaime Casanova wrote: > On 5/18/07, Heikki Linnakangas <heikki@enterprisedb.com> wrote: >> Jaime Casanova wrote: >> > >> > the patch doesn't apply in cvs... you'll need to update it... >> >> Oh, here you are. >> >> The implementation has changed a bit since August. I thought I had >> submitted an updated version in the winter but couldn't find it. Anyway, >> I updated and dusted off the source tree, tidied up the comments a >> little bit, and fixed some inconsistencies in pg_proc entries that made >> opr_sanity to fail. >> > > this one doesn't apply either... there are problems with nbtinsert.c and > pg_am.h Ah, sorry about that. For some reason my source tree was checked out from the 8.2 branch, instead of CVS HEAD. Here you are. Thanks for looking at this! -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com Index: doc/src/sgml/catalogs.sgml =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/doc/src/sgml/catalogs.sgml,v retrieving revision 2.152 diff -c -r2.152 catalogs.sgml *** doc/src/sgml/catalogs.sgml 15 May 2007 19:13:54 -0000 2.152 --- doc/src/sgml/catalogs.sgml 19 May 2007 16:23:49 -0000 *************** *** 517,522 **** --- 517,536 ---- <entry>Function to parse and validate <structfield>reloptions</> for an index</entry> </row> + <row> + <entry><structfield>amprepareinsert</structfield></entry> + <entry><type>regproc</type></entry> + <entry><literal><link linkend="catalog-pg-proc"><structname>pg_proc</structname></link>.oid</literal></entry> + <entry>Performs the 1st phase of a two phase index insert, returning a suggestion of where in the heap to put a newtuple</entry> + </row> + + <row> + <entry><structfield>amfinishinsert</structfield></entry> + <entry><type>regproc</type></entry> + <entry><literal><link linkend="catalog-pg-proc"><structname>pg_proc</structname></link>.oid</literal></entry> + <entry>Finishes an index insert started with amprepareinsert</entry> + </row> + </tbody> 
</tgroup> </table> Index: src/backend/access/heap/heapam.c =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/heap/heapam.c,v retrieving revision 1.232 diff -c -r1.232 heapam.c *** src/backend/access/heap/heapam.c 8 Apr 2007 01:26:27 -0000 1.232 --- src/backend/access/heap/heapam.c 19 May 2007 16:45:14 -0000 *************** *** 1368,1373 **** --- 1368,1377 ---- * Note that use_wal and use_fsm will be applied when inserting into the * heap's TOAST table, too, if the tuple requires any out-of-line data. * + * If suggested_blk is a valid block number, the tuple will be inserted to + * that block if there's enough room. If it's full, a block will be chosen + * as if suggested_blk was not set. + * * The return value is the OID assigned to the tuple (either here or by the * caller), or InvalidOid if no OID. The header fields of *tup are updated * to match the stored tuple; in particular tup->t_self receives the actual *************** *** 1376,1382 **** */ Oid heap_insert(Relation relation, HeapTuple tup, CommandId cid, ! bool use_wal, bool use_fsm) { TransactionId xid = GetCurrentTransactionId(); HeapTuple heaptup; --- 1380,1386 ---- */ Oid heap_insert(Relation relation, HeapTuple tup, CommandId cid, ! bool use_wal, bool use_fsm, BlockNumber suggested_blk) { TransactionId xid = GetCurrentTransactionId(); HeapTuple heaptup; *************** *** 1432,1440 **** else heaptup = tup; ! /* Find buffer to insert this tuple into */ ! buffer = RelationGetBufferForTuple(relation, heaptup->t_len, ! InvalidBuffer, use_fsm); /* NO EREPORT(ERROR) from here till changes are logged */ START_CRIT_SECTION(); --- 1436,1478 ---- else heaptup = tup; ! /* Find buffer to insert this tuple into. Try the suggested block first ! * if caller gave one. ! */ ! if (suggested_blk != InvalidBlockNumber) ! { ! Buffer suggested_buf; ! Page pageHeader; ! Size pageFreeSpace; ! ! 
suggested_buf = ReadBuffer(relation, suggested_blk); ! pageHeader = (Page) BufferGetPage(suggested_buf); ! ! LockBuffer(suggested_buf, BUFFER_LOCK_EXCLUSIVE); ! ! /* Don't subtract fillfactor from the free space. That space is ! * reserved exactly for situations like this; keeping updated and ! * inserted tuples close to other tuples with similar values. ! */ ! pageFreeSpace = PageGetFreeSpace(pageHeader); ! ! if (heaptup->t_len <= pageFreeSpace) ! buffer = suggested_buf; ! else ! { ! /* Page was full. Release lock and pin and get another block ! * as if suggested_blk was not given. ! */ ! LockBuffer(suggested_buf, BUFFER_LOCK_UNLOCK); ! ReleaseBuffer(suggested_buf); ! ! buffer = RelationGetBufferForTuple(relation, heaptup->t_len, ! InvalidBuffer, use_fsm); ! } ! } ! else ! buffer = RelationGetBufferForTuple(relation, heaptup->t_len, ! InvalidBuffer, use_fsm); /* NO EREPORT(ERROR) from here till changes are logged */ START_CRIT_SECTION(); *************** *** 1544,1550 **** Oid simple_heap_insert(Relation relation, HeapTuple tup) { ! return heap_insert(relation, tup, GetCurrentCommandId(), true, true); } /* --- 1582,1589 ---- Oid simple_heap_insert(Relation relation, HeapTuple tup) { ! return heap_insert(relation, tup, GetCurrentCommandId(), true, ! true, InvalidBlockNumber); } /* Index: src/backend/access/heap/tuptoaster.c =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/heap/tuptoaster.c,v retrieving revision 1.74 diff -c -r1.74 tuptoaster.c *** src/backend/access/heap/tuptoaster.c 6 Apr 2007 04:21:41 -0000 1.74 --- src/backend/access/heap/tuptoaster.c 19 May 2007 16:45:39 -0000 *************** *** 1146,1152 **** if (!HeapTupleIsValid(toasttup)) elog(ERROR, "failed to build TOAST tuple"); ! heap_insert(toastrel, toasttup, mycid, use_wal, use_fsm); /* * Create the index entry. 
We cheat a little here by not using --- 1146,1153 ---- if (!HeapTupleIsValid(toasttup)) elog(ERROR, "failed to build TOAST tuple"); ! heap_insert(toastrel, toasttup, mycid, use_wal, use_fsm, ! InvalidBlockNumber); /* * Create the index entry. We cheat a little here by not using Index: src/backend/access/index/indexam.c =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/index/indexam.c,v retrieving revision 1.97 diff -c -r1.97 indexam.c *** src/backend/access/index/indexam.c 5 Jan 2007 22:19:23 -0000 1.97 --- src/backend/access/index/indexam.c 19 May 2007 16:23:57 -0000 *************** *** 18,23 **** --- 18,25 ---- * index_rescan - restart a scan of an index * index_endscan - end a scan * index_insert - insert an index tuple into a relation + * index_prepareinsert - get desired insert location for a heap tuple + * index_finishinsert - insert a previously prepared index tuple * index_markpos - mark a scan position * index_restrpos - restore a scan position * index_getnext - get the next tuple from a scan *************** *** 202,207 **** --- 204,269 ---- BoolGetDatum(check_uniqueness))); } + /* ---------------- + * index_prepareinsert - get desired insert location for a heap tuple + * + * The returned BlockNumber is the *heap* page that is the best place + * to insert the given tuple to, according to the index am. The best + * place is one that maintains the cluster order. + * + * opaque should be passed to a later index_finishinsert to finish the + * insert. + * ---------------- + */ + BlockNumber + index_prepareinsert(Relation indexRelation, + Datum *values, + bool *isnull, + Relation heapRelation, + bool check_uniqueness, + void **opaque) + { + FmgrInfo *procedure; + + RELATION_CHECKS; + GET_REL_PROCEDURE(amprepareinsert); + + /* + * have the am's prepareinsert proc do all the work. 
+ */ + return DatumGetUInt32(FunctionCall6(procedure, + PointerGetDatum(indexRelation), + PointerGetDatum(values), + PointerGetDatum(isnull), + PointerGetDatum(heapRelation), + BoolGetDatum(check_uniqueness), + PointerGetDatum(opaque))); + } + + /* ---------------- + * index_finishinsert - insert a previously prepared index tuple + * + * Finishes an insert operation initiated by an earlier call to + * index_prepareinsert. + * ---------------- + */ + bool + index_finishinsert(Relation indexRelation, + ItemPointer heap_t_ctid, void *opaque) + { + FmgrInfo *procedure; + + RELATION_CHECKS; + GET_REL_PROCEDURE(amfinishinsert); + + /* + * have the am's finishinsert proc do all the work. + */ + return DatumGetBool(FunctionCall2(procedure, + PointerGetDatum(heap_t_ctid), + PointerGetDatum(opaque))); + } + /* * index_beginscan - start a scan of an index with amgettuple * Index: src/backend/access/nbtree/nbtinsert.c =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/nbtree/nbtinsert.c,v retrieving revision 1.156 diff -c -r1.156 nbtinsert.c *** src/backend/access/nbtree/nbtinsert.c 11 Apr 2007 20:47:37 -0000 1.156 --- src/backend/access/nbtree/nbtinsert.c 19 May 2007 18:12:25 -0000 *************** *** 96,114 **** /* we need an insertion scan key to do our search, so build one */ itup_scankey = _bt_mkscankey(rel, itup); - top: /* find the first page containing this key */ stack = _bt_search(rel, natts, itup_scankey, false, &buf, BT_WRITE); offset = InvalidOffsetNumber; ! /* trade in our read lock for a write lock */ LockBuffer(buf, BUFFER_LOCK_UNLOCK); LockBuffer(buf, BT_WRITE); /* * If the page was split between the time that we surrendered our read ! * lock and acquired our write lock, then this page may no longer be the * right place for the key we want to insert. In this case, we need to * move right in the tree. See Lehman and Yao for an excruciatingly * precise description. 
--- 96,224 ---- /* we need an insertion scan key to do our search, so build one */ itup_scankey = _bt_mkscankey(rel, itup); /* find the first page containing this key */ stack = _bt_search(rel, natts, itup_scankey, false, &buf, BT_WRITE); offset = InvalidOffsetNumber; ! /* release our read lock. _bt_finishinsert will relock the page in ! * exclusive mode. ! */ LockBuffer(buf, BUFFER_LOCK_UNLOCK); + + _bt_finishinsert(rel, heapRel, index_is_unique, itup, + itup_scankey, stack, buf); + } + + /* + * _bt_prepareinsert() -- Find the insert location for a new tuple + * + * Descends the tree and finds the location for a new index tuple. + * As a hint to the executor, returns the heap block number the previous + * index tuple at that location points to. By inserting the heap tuple + * to that block, the heap will stay better clustered than by inserting + * to a random block. + * + * The leaf page is pinned and a reference to it, among other information + * needed to finish the insert, is stored in opaquePtr. + */ + BlockNumber + _bt_prepareinsert(Relation rel, IndexTuple itup, bool index_is_unique, + Relation heapRel, BTInsertInfo *opaquePtr) + { + int natts = rel->rd_rel->relnatts; + OffsetNumber offset; + Page page; + BTPageOpaque opaque; + + ScanKey itup_scankey; + BTStack stack; + Buffer buf; + BlockNumber suggestion = InvalidBlockNumber; + BTInsertInfo insert_opaque; + + /* we need an insertion scan key to do our search, so build one */ + itup_scankey = _bt_mkscankey(rel, itup); + + /* find the first page containing this key */ + stack = _bt_search(rel, natts, itup_scankey, false, &buf, BT_READ); + if(!BufferIsValid(buf)) + { + /* The index was completely empty. No suggestion then. */ + *opaquePtr = NULL; + return InvalidBlockNumber; + } + + page = BufferGetPage(buf); + opaque = (BTPageOpaque) PageGetSpecialPointer(page); + + /* Find the location in the page where the new index tuple would go to. 
*/ + + offset = _bt_binsrch(rel, buf, natts, itup_scankey, false); + if (offset > PageGetMaxOffsetNumber(page)) + { + /* _bt_binsrch returned pointer to end-of-page. It means that + * there was no equal items on the page, and the new item should + * be inserted as the last tuple of the page. There could be equal + * items on the next page, however. + * + * At the moment, we just ignore the potential equal items on the + * right, and pretend there isn't any. We could instead walk right + * to the next page to check that, but let's keep it simple for now. + */ + offset = OffsetNumberPrev(offset); + } + if(offset < P_FIRSTDATAKEY(opaque)) + { + /* We landed on an empty page. We could step left or right until + * we find some items, but let's keep it simple for now. + */ + } else { + /* We're now positioned at the index tuple that we're interested in. */ + ItemId iid = PageGetItemId(page, offset); + IndexTuple curitup = (IndexTuple) PageGetItem(page, iid); + + suggestion = ItemPointerGetBlockNumber(&curitup->t_tid); + } + + /* Release the read lock. _bt_finishinsert will later reacquire it in + * exclusive mode. Keeping the buffer locked would be deadlock-prone + * as well; who knows what the caller is going to do, and what pages to + * lock, before calling finishinsert. + */ + LockBuffer(buf, BUFFER_LOCK_UNLOCK); + + /* Return a struct with all information needed to finish this insert. 
*/ + insert_opaque = *opaquePtr = palloc(sizeof(struct BTInsertInfoData)); + insert_opaque->rel = rel; + insert_opaque->heapRel = heapRel; + insert_opaque->index_is_unique = index_is_unique; + insert_opaque->itup = itup; + insert_opaque->itup_scankey = itup_scankey; + insert_opaque->stack = stack; + insert_opaque->buf = buf; + + return suggestion; + } + + /* + * _bt_finishinsert() -- Finish an insert prepared with prepareinsert + */ + void + _bt_finishinsert(Relation rel, Relation heapRel, bool index_is_unique, + IndexTuple itup, ScanKey itup_scankey, + BTStack stack, Buffer buf) + { + int natts = rel->rd_rel->relnatts; + OffsetNumber offset = InvalidOffsetNumber; + LockBuffer(buf, BT_WRITE); + top: + /* * If the page was split between the time that we surrendered our read ! * lock in _bt_prepareinsert or _bt_doinsert, and acquired our write lock, then this page may no longer be the * right place for the key we want to insert. In this case, we need to * move right in the tree. See Lehman and Yao for an excruciatingly * precise description. *************** *** 146,151 **** --- 256,269 ---- XactLockTableWait(xwait); /* start over... 
*/ _bt_freestack(stack); + + /* find the first page containing this key */ + stack = _bt_search(rel, natts, itup_scankey, false, &buf, BT_WRITE); + + /* trade in our read lock for a write lock */ + LockBuffer(buf, BUFFER_LOCK_UNLOCK); + LockBuffer(buf, BT_WRITE); + goto top; } } *************** *** 157,162 **** --- 275,281 ---- /* be tidy */ _bt_freestack(stack); _bt_freeskey(itup_scankey); + pfree(itup); } /* Index: src/backend/access/nbtree/nbtree.c =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/nbtree/nbtree.c,v retrieving revision 1.154 diff -c -r1.154 nbtree.c *** src/backend/access/nbtree/nbtree.c 5 Jan 2007 22:19:23 -0000 1.154 --- src/backend/access/nbtree/nbtree.c 19 May 2007 16:23:58 -0000 *************** *** 223,229 **** _bt_doinsert(rel, itup, checkUnique, heapRel); ! pfree(itup); PG_RETURN_BOOL(true); } --- 223,278 ---- _bt_doinsert(rel, itup, checkUnique, heapRel); ! PG_RETURN_BOOL(true); ! } ! ! /* ! * btprepareinsert() -- find the best place in the heap to put a new tuple. ! * ! * This uses the same logic as btinsert to find the place where the index ! * tuple would go if this was a btinsert call. ! */ ! Datum ! btprepareinsert(PG_FUNCTION_ARGS) ! { ! Relation rel = (Relation) PG_GETARG_POINTER(0); ! Datum *values = (Datum *) PG_GETARG_POINTER(1); ! bool *isnull = (bool *) PG_GETARG_POINTER(2); ! Relation heapRel = (Relation) PG_GETARG_POINTER(3); ! bool checkUnique = PG_GETARG_BOOL(4); ! void **opaquePtr = (void **) PG_GETARG_POINTER(5); ! IndexTuple itup; ! BlockNumber suggestion; ! ! /* generate an index tuple */ ! itup = index_form_tuple(RelationGetDescr(rel), values, isnull); ! ! suggestion =_bt_prepareinsert(rel, itup, checkUnique, heapRel, ! (BTInsertInfo *) opaquePtr); ! ! PG_RETURN_UINT32(suggestion); ! } ! ! /* ! * btfinishinsert() -- finish insert ! */ ! Datum ! btfinishinsert(PG_FUNCTION_ARGS) ! { ! 
ItemPointer ht_ctid = (ItemPointer) PG_GETARG_POINTER(0); ! BTInsertInfo opaque = (void *) PG_GETARG_POINTER(1); ! ! opaque->itup->t_tid = *ht_ctid; ! ! _bt_finishinsert(opaque->rel, ! opaque->heapRel, ! opaque->index_is_unique, ! opaque->itup, ! opaque->itup_scankey, ! opaque->stack, ! opaque->buf); ! ! pfree(opaque); PG_RETURN_BOOL(true); } Index: src/backend/commands/copy.c =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/commands/copy.c,v retrieving revision 1.283 diff -c -r1.283 copy.c *** src/backend/commands/copy.c 27 Apr 2007 22:05:46 -0000 1.283 --- src/backend/commands/copy.c 19 May 2007 17:14:53 -0000 *************** *** 2109,2115 **** ExecConstraints(resultRelInfo, slot, estate); /* OK, store the tuple and create index entries for it */ ! heap_insert(cstate->rel, tuple, mycid, use_wal, use_fsm); if (resultRelInfo->ri_NumIndices > 0) ExecInsertIndexTuples(slot, &(tuple->t_self), estate, false); --- 2109,2116 ---- ExecConstraints(resultRelInfo, slot, estate); /* OK, store the tuple and create index entries for it */ ! heap_insert(cstate->rel, tuple, mycid, use_wal, use_fsm, ! 
InvalidBlockNumber); if (resultRelInfo->ri_NumIndices > 0) ExecInsertIndexTuples(slot, &(tuple->t_self), estate, false); Index: src/backend/executor/execMain.c =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/executor/execMain.c,v retrieving revision 1.293 diff -c -r1.293 execMain.c *** src/backend/executor/execMain.c 27 Apr 2007 22:05:47 -0000 1.293 --- src/backend/executor/execMain.c 19 May 2007 16:24:01 -0000 *************** *** 53,58 **** --- 53,59 ---- #include "utils/lsyscache.h" #include "utils/memutils.h" + bool cluster_inserts = true; /* GUC */ typedef struct evalPlanQual { *************** *** 869,876 **** --- 870,879 ---- resultRelInfo->ri_RangeTableIndex = resultRelationIndex; resultRelInfo->ri_RelationDesc = resultRelationDesc; resultRelInfo->ri_NumIndices = 0; + resultRelInfo->ri_ClusterIndex = -1; resultRelInfo->ri_IndexRelationDescs = NULL; resultRelInfo->ri_IndexRelationInfo = NULL; + resultRelInfo->ri_PreparedInsertOpaque = NULL; /* make a copy so as not to depend on relcache info not changing... */ resultRelInfo->ri_TrigDesc = CopyTriggerDesc(resultRelationDesc->trigdesc); if (resultRelInfo->ri_TrigDesc) *************** *** 1353,1358 **** --- 1356,1362 ---- ResultRelInfo *resultRelInfo; Relation resultRelationDesc; Oid newId; + BlockNumber suggestedBlock; /* * get the heap tuple out of the tuple table slot, making sure we have a *************** *** 1401,1406 **** --- 1405,1417 ---- if (resultRelationDesc->rd_att->constr) ExecConstraints(resultRelInfo, slot, estate); + /* Ask the index am of the clustered index for the + * best place to put it */ + if(cluster_inserts) + suggestedBlock = ExecPrepareIndexInsert(slot, estate); + else + suggestedBlock = InvalidBlockNumber; + /* * insert the tuple * *************** *** 1409,1415 **** */ newId = heap_insert(resultRelationDesc, tuple, estate->es_snapshot->curcid, ! 
true, true); IncrAppended(); (estate->es_processed)++; --- 1420,1426 ---- */ newId = heap_insert(resultRelationDesc, tuple, estate->es_snapshot->curcid, ! true, true, suggestedBlock); IncrAppended(); (estate->es_processed)++; *************** *** 2600,2606 **** tuple, estate->es_snapshot->curcid, estate->es_into_relation_use_wal, ! false); /* never any point in using FSM */ /* We know this is a newly created relation, so there are no indexes */ --- 2611,2618 ---- tuple, estate->es_snapshot->curcid, estate->es_into_relation_use_wal, ! false, /* never any point in using FSM */ ! InvalidBlockNumber); /* We know this is a newly created relation, so there are no indexes */ Index: src/backend/executor/execUtils.c =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/executor/execUtils.c,v retrieving revision 1.147 diff -c -r1.147 execUtils.c *** src/backend/executor/execUtils.c 27 Feb 2007 01:11:25 -0000 1.147 --- src/backend/executor/execUtils.c 19 May 2007 18:22:33 -0000 *************** *** 31,36 **** --- 31,37 ---- * ExecOpenIndices \ * ExecCloseIndices | referenced by InitPlan, EndPlan, * ExecInsertIndexTuples / ExecInsert, ExecUpdate + * ExecPrepareIndexInsert Referenced by ExecInsert * * RegisterExprContextCallback Register function shutdown callback * UnregisterExprContextCallback Deregister function shutdown callback *************** *** 902,907 **** --- 903,909 ---- IndexInfo **indexInfoArray; resultRelInfo->ri_NumIndices = 0; + resultRelInfo->ri_ClusterIndex = -1; /* fast path if no indexes */ if (!RelationGetForm(resultRelation)->relhasindex) *************** *** 941,946 **** --- 943,953 ---- /* extract index key information from the index's pg_index info */ ii = BuildIndexInfo(indexDesc); + /* Remember which index is the clustered one. 
+ * It's used to call the suggestblock-method on inserts */ + if(indexDesc->rd_index->indisclustered) + resultRelInfo->ri_ClusterIndex = i; + relationDescs[i] = indexDesc; indexInfoArray[i] = ii; i++; *************** *** 1007,1012 **** --- 1014,1021 ---- ExprContext *econtext; Datum values[INDEX_MAX_KEYS]; bool isnull[INDEX_MAX_KEYS]; + int clusterIndex; + bool preparedInsert; /* * Get information from the result relation info structure. *************** *** 1016,1021 **** --- 1025,1049 ---- relationDescs = resultRelInfo->ri_IndexRelationDescs; indexInfoArray = resultRelInfo->ri_IndexRelationInfo; heapRelation = resultRelInfo->ri_RelationDesc; + clusterIndex = resultRelInfo->ri_ClusterIndex; + preparedInsert = resultRelInfo->ri_PreparedInsertOpaque != NULL; + + /* + * If the insert to the clustering index was already prepared, + * finish it. + */ + if (preparedInsert) + { + index_finishinsert(relationDescs[clusterIndex], + tupleid, + resultRelInfo->ri_PreparedInsertOpaque); + resultRelInfo->ri_PreparedInsertOpaque = NULL; + + /* + * keep track of index inserts for debugging + */ + IncrIndexInserted(); + } /* * We will use the EState's per-tuple context for evaluating predicates *************** *** 1036,1041 **** --- 1064,1072 ---- if (relationDescs[i] == NULL) continue; + if (preparedInsert && i == clusterIndex) + continue; /* insert to clustered index was already handled above */ + indexInfo = indexInfoArray[i]; /* Check for partial index */ *************** *** 1090,1095 **** --- 1121,1196 ---- } } + /* ---------------------------------------------------------------- + * ExecPrepareIndexInsert + * + * This routine asks the index am where a new heap tuple + * should be placed. 
+ * ---------------------------------------------------------------- + */ + BlockNumber + ExecPrepareIndexInsert(TupleTableSlot *slot, + EState *estate) + { + ResultRelInfo *resultRelInfo; + int clusterIndex; + Relation relationDesc; + Relation heapRelation; + ExprContext *econtext; + Datum values[INDEX_MAX_KEYS]; + bool isnull[INDEX_MAX_KEYS]; + IndexInfo *indexInfo; + + /* + * Get information from the result relation info structure. + */ + resultRelInfo = estate->es_result_relation_info; + clusterIndex = resultRelInfo->ri_ClusterIndex; + + if (clusterIndex == -1) + return InvalidBlockNumber; /* there was no clustered index */ + + heapRelation = resultRelInfo->ri_RelationDesc; + relationDesc = resultRelInfo->ri_IndexRelationDescs[clusterIndex]; + indexInfo = resultRelInfo->ri_IndexRelationInfo[clusterIndex]; + + if (!OidIsValid(relationDesc->rd_am->amprepareinsert)) + return InvalidBlockNumber; /* the indexam doesn't support the + * two-phase insert API */ + + /* You can't cluster on a partial index */ + Assert(indexInfo->ii_Predicate == NIL); + + /* + * We will use the EState's per-tuple context for evaluating + * index expressions (creating it if it's not already there). + */ + econtext = GetPerTupleExprContext(estate); + + /* Arrange for econtext's scan tuple to be the tuple under test */ + econtext->ecxt_scantuple = slot; + + /* + * FormIndexDatum fills in its values and isnull parameters with the + * appropriate values for the column(s) of the index. + */ + FormIndexDatum(indexInfo, + slot, + estate, + values, + isnull); + + /* + * The index AM does the rest. 
+ */ + return index_prepareinsert(relationDesc, /* index relation */ + values, /* array of index Datums */ + isnull, /* null flags */ + heapRelation, + relationDesc->rd_index->indisunique, + &resultRelInfo->ri_PreparedInsertOpaque); + } + /* * UpdateChangedParamSet * Add changed parameters to a plan node's chgParam set Index: src/backend/utils/misc/guc.c =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/utils/misc/guc.c,v retrieving revision 1.391 diff -c -r1.391 guc.c *** src/backend/utils/misc/guc.c 8 May 2007 16:33:51 -0000 1.391 --- src/backend/utils/misc/guc.c 19 May 2007 16:24:17 -0000 *************** *** 99,104 **** --- 99,105 ---- #define MS_PER_D (1000 * 60 * 60 * 24) /* XXX these should appear in other modules' header files */ + extern bool cluster_inserts; extern bool Log_disconnections; extern int CommitDelay; extern int CommitSiblings; *************** *** 427,432 **** --- 428,441 ---- static struct config_bool ConfigureNamesBool[] = { { + {"cluster_inserts", PGC_USERSET, DEVELOPER_OPTIONS, + gettext_noop("Tries to maintain cluster order on inserts."), + NULL + }, + &cluster_inserts, + true, NULL, NULL + }, + { {"enable_seqscan", PGC_USERSET, QUERY_TUNING_METHOD, gettext_noop("Enables the planner's use of sequential-scan plans."), NULL Index: src/include/access/genam.h =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/access/genam.h,v retrieving revision 1.66 diff -c -r1.66 genam.h *** src/include/access/genam.h 5 Jan 2007 22:19:50 -0000 1.66 --- src/include/access/genam.h 19 May 2007 16:24:26 -0000 *************** *** 93,98 **** --- 93,106 ---- ItemPointer heap_t_ctid, Relation heapRelation, bool check_uniqueness); + extern BlockNumber index_prepareinsert(Relation indexRelation, + Datum *values, bool *isnull, + Relation heapRelation, + bool check_uniqueness, + void **opauqe); + extern bool 
  index_finishinsert(Relation indexRelation,
+ 				   ItemPointer heap_t_ctid,
+ 				   void *opaque);
  extern IndexScanDesc index_beginscan(Relation heapRelation,
  					 Relation indexRelation,

Index: src/include/access/heapam.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/access/heapam.h,v
retrieving revision 1.123
diff -c -r1.123 heapam.h
*** src/include/access/heapam.h	8 Apr 2007 01:26:33 -0000	1.123
--- src/include/access/heapam.h	19 May 2007 16:24:26 -0000
***************
*** 157,163 ****
  extern void setLastTid(const ItemPointer tid);

  extern Oid heap_insert(Relation relation, HeapTuple tup, CommandId cid,
! 			bool use_wal, bool use_fsm);
  extern HTSU_Result heap_delete(Relation relation, ItemPointer tid,
  			ItemPointer ctid, TransactionId *update_xmax,
  			CommandId cid, Snapshot crosscheck, bool wait);
--- 157,163 ----
  extern void setLastTid(const ItemPointer tid);

  extern Oid heap_insert(Relation relation, HeapTuple tup, CommandId cid,
! 			bool use_wal, bool use_fsm, BlockNumber suggestedblk);
  extern HTSU_Result heap_delete(Relation relation, ItemPointer tid,
  			ItemPointer ctid, TransactionId *update_xmax,
  			CommandId cid, Snapshot crosscheck, bool wait);
Index: src/include/access/nbtree.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/access/nbtree.h,v
retrieving revision 1.113
diff -c -r1.113 nbtree.h
*** src/include/access/nbtree.h	11 Apr 2007 20:47:38 -0000	1.113
--- src/include/access/nbtree.h	19 May 2007 16:24:26 -0000
***************
*** 508,517 ****
--- 508,540 ----
  extern Datum btbulkdelete(PG_FUNCTION_ARGS);
  extern Datum btvacuumcleanup(PG_FUNCTION_ARGS);
  extern Datum btoptions(PG_FUNCTION_ARGS);
+ extern Datum btprepareinsert(PG_FUNCTION_ARGS);
+ extern Datum btfinishinsert(PG_FUNCTION_ARGS);
+
+ /* Filled in by _bt_prepareinsert */
+ typedef struct BTInsertInfoData
+ {
+ 	Relation	rel;
+ 	Relation	heapRel;
+ 	bool		index_is_unique;
+ 	IndexTuple	itup;
+ 	ScanKey		itup_scankey;
+ 	Buffer		buf;			/* pinned, not locked */
+ 	BTStack		stack;
+ } BTInsertInfoData;
+
+ typedef BTInsertInfoData *BTInsertInfo;

  /*
   * prototypes for functions in nbtinsert.c
   */
+ extern BlockNumber _bt_prepareinsert(Relation rel, IndexTuple itup,
+ 			bool index_is_unique, Relation heapRel,
+ 			BTInsertInfo *opaquePtr);
+ extern void _bt_finishinsert(Relation rel, Relation heapRel,
+ 			bool check_uniqueness,
+ 			IndexTuple itup, ScanKey itup_scankey,
+ 			BTStack stack, Buffer buf);
  extern void _bt_doinsert(Relation rel, IndexTuple itup,
  			bool index_is_unique, Relation heapRel);
  extern Buffer _bt_getstackbuf(Relation rel, BTStack stack, int access);
Index: src/include/catalog/pg_am.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/catalog/pg_am.h,v
retrieving revision 1.51
diff -c -r1.51 pg_am.h
*** src/include/catalog/pg_am.h	6 Apr 2007 22:33:43 -0000	1.51
--- src/include/catalog/pg_am.h	19 May 2007 16:42:48 -0000
***************
*** 66,71 ****
--- 66,73 ----
  	regproc		amvacuumcleanup;	/* post-VACUUM cleanup function */
  	regproc		amcostestimate;		/* estimate cost of an indexscan */
  	regproc		amoptions;			/* parse AM-specific parameters */
+ 	regproc		amprepareinsert;	/* get desired insert location on heap */
+ 	regproc		amfinishinsert;		/* finish a prepared insert operation */
  } FormData_pg_am;

  /* ----------------
***************
*** 79,85 ****
   * compiler constants for pg_am
   * ----------------
   */
! #define Natts_pg_am						24
  #define Anum_pg_am_amname				1
  #define Anum_pg_am_amstrategies			2
  #define Anum_pg_am_amsupport			3
--- 81,87 ----
   * compiler constants for pg_am
   * ----------------
   */
! #define Natts_pg_am						26
  #define Anum_pg_am_amname				1
  #define Anum_pg_am_amstrategies			2
  #define Anum_pg_am_amsupport			3
***************
*** 104,125 ****
  #define Anum_pg_am_amvacuumcleanup		22
  #define Anum_pg_am_amcostestimate		23
  #define Anum_pg_am_amoptions			24

  /* ----------------
   *		initial contents of pg_am
   * ----------------
   */
! DATA(insert OID = 403 (  btree	5 1 t t t t t t f t btinsert btbeginscan btgettuple btgetmulti btrescan btendscan btmarkpos btrestrpos btbuild btbulkdelete btvacuumcleanup btcostestimate btoptions ));
  DESCR("b-tree index access method");
  #define BTREE_AM_OID 403
! DATA(insert OID = 405 (  hash	1 1 f f f f f f f f hashinsert hashbeginscan hashgettuple hashgetmulti hashrescan hashendscan hashmarkpos hashrestrpos hashbuild hashbulkdelete hashvacuumcleanup hashcostestimate hashoptions ));
  DESCR("hash index access method");
  #define HASH_AM_OID 405
! DATA(insert OID = 783 (  gist	0 7 f f t t t t t t gistinsert gistbeginscan gistgettuple gistgetmulti gistrescan gistendscan gistmarkpos gistrestrpos gistbuild gistbulkdelete gistvacuumcleanup gistcostestimate gistoptions ));
  DESCR("GiST index access method");
  #define GIST_AM_OID 783
! DATA(insert OID = 2742 (  gin	0 4 f f f f f f t f gininsert ginbeginscan gingettuple gingetmulti ginrescan ginendscan ginmarkpos ginrestrpos ginbuild ginbulkdelete ginvacuumcleanup gincostestimate ginoptions ));
  DESCR("GIN index access method");
  #define GIN_AM_OID 2742
--- 106,129 ----
  #define Anum_pg_am_amvacuumcleanup		22
  #define Anum_pg_am_amcostestimate		23
  #define Anum_pg_am_amoptions			24
+ #define Anum_pg_am_amprepareinsert		25
+ #define Anum_pg_am_amfinishinsert		26

  /* ----------------
   *		initial contents of pg_am
   * ----------------
   */
! DATA(insert OID = 403 (  btree	5 1 t t t t t t f t btinsert btbeginscan btgettuple btgetmulti btrescan btendscan btmarkpos btrestrpos btbuild btbulkdelete btvacuumcleanup btcostestimate btoptions btprepareinsert btfinishinsert ));
  DESCR("b-tree index access method");
  #define BTREE_AM_OID 403
! DATA(insert OID = 405 (  hash	1 1 f f f f f f f f hashinsert hashbeginscan hashgettuple hashgetmulti hashrescan hashendscan hashmarkpos hashrestrpos hashbuild hashbulkdelete hashvacuumcleanup hashcostestimate hashoptions - - ));
  DESCR("hash index access method");
  #define HASH_AM_OID 405
! DATA(insert OID = 783 (  gist	0 7 f f t t t t t t gistinsert gistbeginscan gistgettuple gistgetmulti gistrescan gistendscan gistmarkpos gistrestrpos gistbuild gistbulkdelete gistvacuumcleanup gistcostestimate gistoptions - - ));
  DESCR("GiST index access method");
  #define GIST_AM_OID 783
! DATA(insert OID = 2742 (  gin	0 4 f f f f f f t f gininsert ginbeginscan gingettuple gingetmulti ginrescan ginendscan ginmarkpos ginrestrpos ginbuild ginbulkdelete ginvacuumcleanup gincostestimate ginoptions - - ));
  DESCR("GIN index access method");
  #define GIN_AM_OID 2742
Index: src/include/catalog/pg_proc.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/catalog/pg_proc.h,v
retrieving revision 1.455
diff -c -r1.455 pg_proc.h
*** src/include/catalog/pg_proc.h	8 May 2007 18:56:48 -0000	1.455
--- src/include/catalog/pg_proc.h	19 May 2007 17:20:23 -0000
***************
*** 688,693 ****
--- 688,697 ----
  DESCR("btree(internal)");
  DATA(insert OID = 2785 (  btoptions		   PGNSP PGUID 12 1 0 f f t f s 2 17 "1009 16" _null_ _null_ _null_ btoptions - _null_ ));
  DESCR("btree(internal)");
+ DATA(insert OID = 5433 (  btprepareinsert  PGNSP PGUID 12 1 0 f f t f v 6 23 "2281 2281 2281 2281 2281 2281" _null_ _null_ _null_ btprepareinsert - _null_ ));
+ DESCR("btree(internal)");
+ DATA(insert OID = 5430 (  btfinishinsert   PGNSP PGUID 12 1 0 f f t f v 2 16 "2281 2281" _null_ _null_ _null_ btfinishinsert - _null_ ));
+ DESCR("btree(internal)");

  DATA(insert OID = 339 (  poly_same		   PGNSP PGUID 12 1 0 f f t f i 2 16 "604 604" _null_ _null_ _null_ poly_same - _null_ ));
  DESCR("same as?");
Index: src/include/executor/executor.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/executor/executor.h,v
retrieving revision 1.139
diff -c -r1.139 executor.h
*** src/include/executor/executor.h	27 Feb 2007 01:11:25 -0000	1.139
--- src/include/executor/executor.h	19 May 2007 16:24:27 -0000
***************
*** 276,281 ****
--- 276,282 ----
  extern void ExecCloseIndices(ResultRelInfo *resultRelInfo);
  extern void ExecInsertIndexTuples(TupleTableSlot *slot, ItemPointer tupleid,
  					  EState *estate, bool is_vacuum);
+ extern BlockNumber ExecPrepareIndexInsert(TupleTableSlot *slot, EState *estate);

  extern void RegisterExprContextCallback(ExprContext *econtext,
  					  ExprContextCallbackFunction function,
Index: src/include/nodes/execnodes.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/nodes/execnodes.h,v
retrieving revision 1.174
diff -c -r1.174 execnodes.h
*** src/include/nodes/execnodes.h	17 May 2007 19:35:08 -0000	1.174
--- src/include/nodes/execnodes.h	19 May 2007 16:24:27 -0000
***************
*** 264,269 ****
--- 264,271 ----
   *		NumIndices				# of indices existing on result relation
   *		IndexRelationDescs		array of relation descriptors for indices
   *		IndexRelationInfo		array of key/attr info for indices
+  *		ClusterIndex			index to the IndexRelationInfo array of the
+  *								clustered index, or -1 if there's none
   *		TrigDesc				triggers to be fired, if any
   *		TrigFunctions			cached lookup info for trigger functions
   *		TrigInstrument			optional runtime measurements for triggers
***************
*** 280,291 ****
--- 282,296 ----
  	int			ri_NumIndices;
  	RelationPtr ri_IndexRelationDescs;
  	IndexInfo **ri_IndexRelationInfo;
+ 	int			ri_ClusterIndex;
  	TriggerDesc *ri_TrigDesc;
  	FmgrInfo   *ri_TrigFunctions;
  	struct Instrumentation *ri_TrigInstrument;
  	List	  **ri_ConstraintExprs;
  	JunkFilter *ri_junkFilter;
  	ProjectionInfo *ri_projectReturning;
+
+ 	void	   *ri_PreparedInsertOpaque;
  } ResultRelInfo;

  /* ----------------
Index: src/include/utils/rel.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/utils/rel.h,v
retrieving revision 1.100
diff -c -r1.100 rel.h
*** src/include/utils/rel.h	29 Mar 2007 00:15:39 -0000	1.100
--- src/include/utils/rel.h	19 May 2007 16:24:29 -0000
***************
*** 117,122 ****
--- 117,124 ----
  	FmgrInfo	amvacuumcleanup;
  	FmgrInfo	amcostestimate;
  	FmgrInfo	amoptions;
+ 	FmgrInfo	amprepareinsert;
+ 	FmgrInfo	amfinishinsert;
  } RelationAmInfo;
Your patch has been added to the PostgreSQL unapplied patches list at:

	http://momjian.postgresql.org/cgi-bin/pgpatches

It will be applied as soon as one of the PostgreSQL committers reviews
and approves it.

---------------------------------------------------------------------------

Heikki Linnakangas wrote:
> Jaime Casanova wrote:
>> On 5/18/07, Heikki Linnakangas <heikki@enterprisedb.com> wrote:
>>> Jaime Casanova wrote:
>>>> the patch doesn't apply in cvs... you'll need to update it...
>>>
>>> Oh, here you are.
>>>
>>> The implementation has changed a bit since August. I thought I had
>>> submitted an updated version in the winter but couldn't find it. Anyway,
>>> I updated and dusted off the source tree, tidied up the comments a
>>> little bit, and fixed some inconsistencies in pg_proc entries that made
>>> opr_sanity to fail.
>>
>> this one doesn't apply either... there are problems with nbtinsert.c and
>> pg_am.h
>
> Ah, sorry about that. For some reason my source tree was checked out
> from the 8.2 branch, instead of CVS HEAD.
>
> Here you are. Thanks for looking at this!
>
> --
> Heikki Linnakangas
> EnterpriseDB   http://www.enterprisedb.com
>
> ---------------------------(end of broadcast)---------------------------
> TIP 1: if posting/reading through Usenet, please send an appropriate
>        subscribe-nomail command to majordomo@postgresql.org so that your
>        message can get through to the mailing list cleanly

--
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +
On 5/19/07, Heikki Linnakangas <heikki@enterprisedb.com > wrote:
Ah, sorry about that. For some reason my source tree was checked out
from the 8.2 branch, instead of CVS HEAD.
I looked at the patch. Not that I am very comfortable with this part
of the code, but nevertheless here are my comments:
I am wondering if we can optimize _bt_moveright() code by remembering
the PageLSN of the page before releasing the READ LOCK. If the page
is indeed changed since we last saw it, the PageLSN must be different
than what we had remembered and gives us a hint that we might need
to move right.
! /* Page was full. Release lock and pin and get another block
! * as if suggested_blk was not given.
! */
! LockBuffer(suggested_buf, BUFFER_LOCK_UNLOCK);
! ReleaseBuffer(suggested_buf);
May be you want to use UnlockReleaseBuffer() or _bt_relbuf() just as a shorthand.
Btw, I wonder why _bt_relbuf() exists at all, and if it must, why it
takes a "Relation" argument?
Thanks,
Pavan
--
EnterpriseDB http://www.enterprisedb.com
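The _bt_moveright() optimization suggested above can be sketched as a standalone toy model (the types and function names below are invented stand-ins, not the real PostgreSQL ones): remember the page's LSN before releasing the read lock, and on reacquiring it compare LSNs. An unchanged LSN means no WAL-logged change (and hence no split) happened in between, so the move-right check can be skipped.

```c
/*
 * Toy model of the PageLSN hint for _bt_moveright(), under invented
 * stand-in types.  Every WAL-logged change to a page bumps its LSN, so an
 * unchanged LSN proves the page was not modified while it was unlocked.
 */
#include <stdint.h>
#include <stdbool.h>
#include <assert.h>

typedef uint64_t ToyPageLSN;

typedef struct ToyPage
{
	ToyPageLSN	lsn;			/* bumped on every WAL-logged change */
} ToyPage;

/* Record the LSN just before releasing the read lock. */
static ToyPageLSN
remember_lsn(const ToyPage *page)
{
	return page->lsn;
}

/*
 * After re-locking: if the LSN moved, the page changed while unlocked and
 * we might need to move right; otherwise the high-key check can be skipped.
 */
static bool
might_need_moveright(const ToyPage *page, ToyPageLSN remembered)
{
	return page->lsn != remembered;
}
```

This only shows the comparison logic; whether the saved LSN can be carried across the unlock/relock window cheaply in the real code is exactly the open question in the message above.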
On 5/21/07, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
On 5/19/07, Heikki Linnakangas < heikki@enterprisedb.com > wrote:
Ah, sorry about that. For some reason my source tree was checked out
from the 8.2 branch, instead of CVS HEAD.
I looked at the patch. Not that I am very comfortable with this part
of the code, but nevertheless here are my comments:
Another problem that I noticed with the patch is that it disregards
the fillfactor while inserting the tuples. ISTM that this is a correct
behavior when a tuple is being inserted in the block to preserve cluster
ordering. But when the tuple is being inserted as a normal operation,
IMO we should respect the fillfactor.
The easiest way to reproduce this is to insert tuples sorted on the
cluster index key.
Thanks,
Pavan
--
EnterpriseDB http://www.enterprisedb.com
Pavan Deolasee wrote:
> On 5/21/07, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
>> On 5/19/07, Heikki Linnakangas <heikki@enterprisedb.com> wrote:
>>> Ah, sorry about that. For some reason my source tree was checked out
>>> from the 8.2 branch, instead of CVS HEAD.
>>
>> I looked at the patch. Not that I am very comfortable with this part
>> of the code, but nevertheless here are my comments:
>
> Another problem that I noticed with the patch is that it disregards
> the fillfactor while inserting the tuples. ISTM that this is a correct
> behavior when a tuple is being inserted in the block to preserve cluster
> ordering. But when the tuple is being inserted as a normal operation,
> IMO we should respect the fillfactor.
>
> The easiest way to reproduce this is to insert tuples sorted on the
> cluster index key.

Actually it does respect the fillfactor when the block suggested by
indexam is full. When you insert tuples sorted on the cluster index key,
the suggested page after the first tuple on the new page is always the
last page. Try inserting in random order instead.

IOW it's working as designed. But maybe it's not the desired behavior.
Should we have a special case and always respect the fillfactor when
inserting to the last page of the heap?

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
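The target-block choice being debated here can be sketched as a standalone toy model (all names and the in-memory "heap" below are invented for illustration, not the real RelationGetBufferForTuple): the suggested block is tried first with the fillfactor deliberately ignored, then the free space map is consulted respecting the fillfactor, and finally the relation is extended.

```c
/*
 * Toy model of the insert-target fallback: suggested block (ignore
 * fillfactor) -> FSM lookup (respect fillfactor) -> extend the heap.
 */
#include <assert.h>

#define TOY_BLCKSZ 8192
#define InvalidBlock (-1)

typedef struct ToyHeap
{
	int		nblocks;
	int		freespace[16];	/* free bytes per block */
	int		fillfactor;		/* percent, 10..100 */
} ToyHeap;

/* Free space an ordinary insert must leave behind on a page. */
static int
target_free(const ToyHeap *heap)
{
	return TOY_BLCKSZ * (100 - heap->fillfactor) / 100;
}

static int
choose_block(const ToyHeap *heap, int suggested, int tuplen)
{
	/* 1. Suggested block: use any free space, ignoring fillfactor. */
	if (suggested != InvalidBlock && heap->freespace[suggested] >= tuplen)
		return suggested;

	/* 2. FSM lookup: first block with room above the fillfactor target. */
	for (int blk = 0; blk < heap->nblocks; blk++)
		if (heap->freespace[blk] >= tuplen + target_free(heap))
			return blk;

	/* 3. Give up and extend the relation (new block number). */
	return heap->nblocks;
}
```

Step 1 is what Pavan flags: with sorted input the suggested block is always the last page, so the fillfactor never takes effect there, which is also why a special case for the last heap page is being discussed.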
On Mon, May 21, 2007 at 10:48:59AM +0100, Heikki Linnakangas wrote:
> >Another problem that I noticed with the patch is that it disregards
> >the fillfactor while inserting the tuples. ISTM that this is a correct
> >behavior when a tuple is being inserted in the block to preserve cluster
> >ordering. But when the tuple is being inserted as a normal operation,
> >IMO we should respect the fillfactor.
> >
> >The easiest way to reproduce this is to insert tuples sorted on the
> >cluster index key.
>
> Actually it does respect the fillfactor when the block suggested by
> indexam is full. When you insert tuples sorted on the cluster index key,
> the suggested page after the first tuple on the new page is always the
> last page. Try inserting in random order instead.
>
> IOW it's working as designed. But maybe it's not the desired behavior.
> Should we have a special case and always respect the fillfactor when
> inserting to the last page of the heap?

I think that would be following with least surprise.
--
Jim Nasby                                      decibel@decibel.org
EnterpriseDB   http://enterprisedb.com         512.569.9461 (cell)
On 5/27/07, Jim C. Nasby <decibel@decibel.org> wrote:
> On Mon, May 21, 2007 at 10:48:59AM +0100, Heikki Linnakangas wrote:
> >
> > IOW it's working as designed. But maybe it's not the desired behavior.
> > Should we have a special case and always respect the fillfactor when
> > inserting to the last page of the heap?
>
> I think that would be following with least surprise.

What's the status of this patch? Are we waiting for an update?

AFAIU, it's not fair to say that the patch maintains cluster order... it
just tries to keep similar rows on the same page if possible (which is
not the same thing)... if it can't, then it simply inserts at the last
page as usual, but we have wasted time in the try... so the real
question is whether there is any performance win from this... do you
have some numbers?

Another question: if the fillfactor is 100%, then it is a complete waste
of time to look for a suggested block. Maybe we could check for that?

--
regards,
Jaime Casanova

"Programming today is a race between software engineers striving to
build bigger and better idiot-proof programs and the universe trying to
produce bigger and better idiots. So far, the universe is winning."
                                                       Richard Cook
"Jaime Casanova" <systemguards@gmail.com> writes: > another question: if the fillfactor is 100% then is a complete waste > of time to look for a suggested block. maybe we could check for that? No, it isn't, since the page might have been vacuumed since it was last filled up. regards, tom lane
On 6/16/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Jaime Casanova" <systemguards@gmail.com> writes:
> > another question: if the fillfactor is 100% then is a complete waste
> > of time to look for a suggested block. maybe we could check for that?
>
> No, it isn't, since the page might have been vacuumed since it was last
> filled up.
>
>                        regards, tom lane

ahh... that vacuum thing!!! yeah!!! ;)

--
regards,
Jaime Casanova

"Programming today is a race between software engineers striving to
build bigger and better idiot-proof programs and the universe trying to
produce bigger and better idiots. So far, the universe is winning."
                                                       Richard Cook
Jaime Casanova wrote:
> On 5/27/07, Jim C. Nasby <decibel@decibel.org> wrote:
>> On Mon, May 21, 2007 at 10:48:59AM +0100, Heikki Linnakangas wrote:
>>>
>>> IOW it's working as designed. But maybe it's not the desired behavior.
>>> Should we have a special case and always respect the fillfactor when
>>> inserting to the last page of the heap?
>>
>> I think that would be following with least surprise.
>
> What's the status of this patch? are we waiting an update?
>
> AFAIU, it's not fair to say that the patch maintain cluster order...
> it just try to keep similar rows on the same page if possible (it's
> not the same thing)... if it can't then it simply insert at the last
> page as usual but we have wasted time in the try...

That's right. Or more accurately, if there's no room on the same page,
it uses the FSM and if it still can't find room, it inserts at the last
page.

> so the real question is if there is any performance win on this...
> have you some numbers?

Hmm. I ran a small test with random inserts/deletes back in August. The
conclusion was that the table loses clustering a lot slower, but I can't
find the results now.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> The implementation has changed a bit since August. I thought I had
> submitted an updated version in the winter but couldn't find it. Anyway,
> I updated and dusted off the source tree, tidied up the comments a
> little bit, and fixed some inconsistencies in pg_proc entries that made
> opr_sanity to fail.

I started looking at this patch. My first reaction is that I liked last
August's API (an independent "suggestblock" call) a lot better. I think
trying to hold an index page lock across the heap insert is an extremely
bad idea; it will hurt concurrency and possibly cause deadlocks
(especially for unique indexes).

The other question is why is execMain involved in this? That makes the
design nonfunctional for tuples inserted in any other way than through
the main executor --- COPY for instance. Also, if this is successful, I
could see using it on system catalogs eventually. I'm inclined to think
that the right design is for heap_insert to call amsuggestblock for
itself, or maybe even push that down to RelationGetBufferForTuple.
(Note: having heap_insert contain logic that duplicates
RelationGetBufferForTuple's is another bit of bad design here, but
that's at least correctable locally.)

Now the difficulty in that is that the heapam.c routines don't have hold
of any data structure containing index knowledge ... but they do have
hold of the Relation structure for the heap. I suggest making
RelationGetIndexList() cache the OID of the clustered index (if any) in
relcache entries, and add RelationGetClusterIndex much like
RelationGetOidIndex, and then heap_insert can use that.

			regards, tom lane
Tom Lane wrote:
> Heikki Linnakangas <heikki@enterprisedb.com> writes:
>> The implementation has changed a bit since August. I thought I had
>> submitted an updated version in the winter but couldn't find it. Anyway,
>> I updated and dusted off the source tree, tidied up the comments a
>> little bit, and fixed some inconsistencies in pg_proc entries that made
>> opr_sanity to fail.
>
> I started looking at this patch. My first reaction is that I liked last
> August's API (an independent "suggestblock" call) a lot better. I think
> trying to hold an index page lock across the heap insert is an extremely
> bad idea; it will hurt concurrency and possibly cause deadlocks
> (especially for unique indexes).

The index page is not kept locked across the calls. Just pinned.

The reason for switching to the new API instead of the amsuggestblock
API is CPU overhead. It avoids constructing the IndexTuple twice and
descending the tree twice. Clustering is mainly useful when the table
doesn't fit in cache, so one could argue that if you care about
clustering you're most likely I/O bound and don't care about the CPU
overhead that much. Nevertheless, avoiding it seems like a good idea to
me. The amsuggestblock API is simpler, though. We might choose it on
those grounds alone.

> The other question is why is execMain involved in this? That makes the
> design nonfunctional for tuples inserted in any other way than through
> the main executor --- COPY for instance. Also, if this is successful,
> I could see using it on system catalogs eventually. I'm inclined to
> think that the right design is for heap_insert to call amsuggestblock
> for itself, or maybe even push that down to RelationGetBufferForTuple.
> (Note: having heap_insert contain logic that duplicates
> RelationGetBufferForTuple's is another bit of bad design here, but
> that's at least correctable locally.)
>
> Now the difficulty in that is that the heapam.c routines don't have
> hold of any data structure containing index knowledge ... but they do
> have hold of the Relation structure for the heap. I suggest making
> RelationGetIndexList() cache the OID of the clustered index (if any)
> in relcache entries, and add RelationGetClusterIndex much like
> RelationGetOidIndex, and then heap_insert can use that.

Hmm. My first reaction on that is that having heap_insert reach into an
index is a modularity violation. It wouldn't be too hard to make similar
changes to COPY that I did to the executor.

I doubt it's very useful for system catalogs; they should never grow
very large, and should stay mostly cached anyway.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
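Tom's relcache suggestion above can be sketched as a standalone toy model (all types and names below are invented stand-ins for the real relcache structures): scan the index list once, cache the clustered index's OID in the relation's cache entry, and return the cached value on subsequent calls, much as RelationGetIndexList caches rd_oidindex.

```c
/*
 * Toy sketch of caching the clustered index's OID in a relcache-like
 * entry, so a hot path such as heap_insert can fetch it cheaply.
 */
#include <stdbool.h>
#include <assert.h>

typedef unsigned int ToyOid;
#define ToyInvalidOid ((ToyOid) 0)

typedef struct ToyIndex
{
	ToyOid	oid;
	bool	indisclustered;		/* pg_index.indisclustered analogue */
} ToyIndex;

typedef struct ToyRelation
{
	ToyIndex   *indexes;
	int			nindexes;
	bool		indexvalid;		/* cache populated? (cleared on inval) */
	ToyOid		clusterindex;	/* cached clustered-index OID */
} ToyRelation;

/* Analogue of a RelationGetClusterIndex built on the index-list scan. */
static ToyOid
toy_get_cluster_index(ToyRelation *rel)
{
	if (!rel->indexvalid)
	{
		rel->clusterindex = ToyInvalidOid;
		for (int i = 0; i < rel->nindexes; i++)
			if (rel->indexes[i].indisclustered)
				rel->clusterindex = rel->indexes[i].oid;
		rel->indexvalid = true;		/* scan only happens once */
	}
	return rel->clusterindex;
}
```

As Tom notes in a later message, this yields only an OID, not an open Relation, so the caller still pays an index_open/index_close cycle.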
"Heikki Linnakangas" <heikki@enterprisedb.com> writes: > The reason for switching to the new API instead of the amsuggestblock API is > CPU overhead. It avoids constructing the IndexTuple twice and descending the > tree twice. I wonder if it's possible to finesse this. Have the suggestblock function remember the last block it suggested and somehow recognize when it's given the same tuple to index. Keeping the block pinned would still be the sticky point though. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
[ back to this thread... ]

Heikki Linnakangas <heikki@enterprisedb.com> writes:
> Tom Lane wrote:
>> The other question is why is execMain involved in this? That makes the
>> design nonfunctional for tuples inserted in any other way than through
>> the main executor --- COPY for instance. Also, if this is successful,
>> I could see using it on system catalogs eventually. I'm inclined to
>> think that the right design is for heap_insert to call amsuggestblock
>> for itself, or maybe even push that down to RelationGetBufferForTuple.
>
> Hmm. My first reaction on that is that having heap_insert reach into an
> index is a modularity violation. It wouldn't be too hard to make similar
> changes to COPY that I did to the executor.

Well, this entire patch is a modularity violation by definition.
Bringing the executor into it just makes the damage worse, by involving
relatively high-level code in the communication between two relatively
low-level bits of code, and ensuring that we'll have to involve still
other bits of high-level code in future if we want to make the behavior
work in more scenarios.

There is a problem with my suggestion of relying on the relcache: that
would only give you the index relation's OID, not a handle to an open
Relation for it. I'm not sure this is a killer, however, as the cost of
an index_open/index_close cycle isn't really much more than a dynahash
search if the index is already open and locked by some upper code level.

			regards, tom lane
[ back to the cluster-order patch ]

Awhile back, Heikki Linnakangas <heikki@enterprisedb.com> wrote:
> The performance characteristics of this patch hasn't been thoroughly
> discussed yet. The reason why you want to cluster your tables is to
> speed up SELECTs that return a bunch of tuples with similar values, for
> example range queries. The reason for keeping them clustered on inserts
> is to reduce the need to run CLUSTER as often.
>
> It doesn't come without a cost, however. In the worst case, there never
> is room for new inserts on pages, and each insert needs to do one extra
> I/O to fetch the optimal heap page where the insert should go, see that
> there's no room, and then insert somewhere else. Using a non-zero
> fillfactor helps, but even when there is room on the page, it's often
> cheaper to just append to the end of the table and running CLUSTER at
> night for example, than do random access to insert to the "right" pages
> in the heap.

I looked through the thread and could not find any clear evidence that
anyone had done any performance testing of this patch at all, so I
hacked together a crude test that alternates between inserting/deleting
a random subset of rows and SELECTing a range of rows. The results do
not look very good: while there is detectable improvement in the SELECT
performance, it's not much, and it comes at a *very* sizable penalty in
INSERT performance.

Attached is a test program, which I ran against today's CVS HEAD with
and without the v8 version of the patch. The test program sets up a
clustered million-row table with a row width chosen to fit about 20 rows
per page. It then iterates a loop of:

	* delete a random 2% of the table
	* vacuum to recover space
	* insert a random 2% of the table
	* select (about) 1000 consecutively-numbered rows
	* select all the rows (this is just a cross check that the number
	  of rows isn't changing too much)

What you would hope to see as the benefit of the patch is that the time
for the range SELECT degrades more slowly as more of the table is
replaced. Ignoring the first SELECT as being a startup transient, it
looks like HEAD degrades from about 3 msec to 6 msec over 10 iterations
(20% replacement of the table), whereas with the patch it's about 3 msec
to about 4 and a half. However, the INSERT steps went from around 20 sec
each to about twice that.

I used just the following nondefault parameter settings:

shared_buffers = 10000			# min 128kB or max_connections*16kB
maintenance_work_mem = 160MB		# min 1MB
max_fsm_pages = 2048000			# min max_fsm_relations*16, 6 bytes each
wal_buffers = 640kB			# min 32kB
checkpoint_segments = 30		# in logfile segments, min 1, 16MB each
enable_bitmapscan = off
enable_seqscan = off
autovacuum = off			# enable autovacuum subprocess?

which for the most part are just 10x the factory defaults. The hardware
is just a Dell x86_64 workstation with crappy IDE disk, so maybe things
would look better elsewhere, but it's all I have to work with.

Considering that the patch is really severely ugly from a modularity and
layering standpoint, I'm now inclined to reject it. AFAICT this test
case is showing the patch at its best possible advantage; under
real-world conditions the benefit would be less. It doesn't look to me
like the gain is worth the loss of system understandability and
maintainability that the patch would impose.

			regards, tom lane

/*
 * derived from testlibpq3.c
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include "libpq-fe.h"

#define TABLESIZE 1000000		/* rows in table */
#define ITERS 10

static void
exit_nicely(PGconn *conn)
{
	PQfinish(conn);
	exit(1);
}

static char *
elapsed_time(struct timeval start_t, struct timeval elapse_t)
{
	static char buf[64];

	if (elapse_t.tv_usec < start_t.tv_usec)
	{
		elapse_t.tv_sec--;
		elapse_t.tv_usec += 1000000;
	}
	sprintf(buf, "%3ld.%06ld",
			(long) (elapse_t.tv_sec - start_t.tv_sec),
			(long) (elapse_t.tv_usec - start_t.tv_usec));
	return buf;
}

static void
do_cmd(PGconn *conn, const char *cmd)
{
	PGresult   *res;
	struct timeval start_t;
	struct timeval elapse_t;

	gettimeofday(&start_t, NULL);
	res = PQexec(conn, cmd);
	gettimeofday(&elapse_t, NULL);
	if (PQresultStatus(res) != PGRES_COMMAND_OK)
	{
		fprintf(stderr, "%s\nCommand failed: %s", cmd, PQerrorMessage(conn));
		PQclear(res);
		exit_nicely(conn);
	}
	printf("executed %-.6s in %s sec\n", cmd,
		   elapsed_time(start_t, elapse_t));
	fflush(stdout);
	PQclear(res);
}

static void
do_select(PGconn *conn, const char *cmd)
{
	PGresult   *res;
	struct timeval start_t;
	struct timeval elapse_t;

	gettimeofday(&start_t, NULL);
	res = PQexec(conn, cmd);
	gettimeofday(&elapse_t, NULL);
	if (PQresultStatus(res) != PGRES_TUPLES_OK)
	{
		fprintf(stderr, "%s\nCommand failed: %s", cmd, PQerrorMessage(conn));
		PQclear(res);
		exit_nicely(conn);
	}
	printf("retrieved %4s tuples in %s sec\n",
		   PQgetvalue(res, 0, 0),
		   elapsed_time(start_t, elapse_t));
	fflush(stdout);
	PQclear(res);
}

int
main(int argc, char **argv)
{
	const char *conninfo;
	PGconn	   *conn;
	PGresult   *res;
	char		cmd[1024];
	int			i;

	/*
	 * If the user supplies a parameter on the command line, use it as the
	 * conninfo string; otherwise default to setting dbname=postgres and using
	 * environment variables or defaults for all other connection parameters.
	 */
	if (argc > 1)
		conninfo = argv[1];
	else
		conninfo = "dbname = postgres";

	/* Make a connection to the database */
	conn = PQconnectdb(conninfo);

	/* Check to see that the backend connection was successfully made */
	if (PQstatus(conn) != CONNECTION_OK)
	{
		fprintf(stderr, "Connection to database failed: %s",
				PQerrorMessage(conn));
		exit_nicely(conn);
	}

	/* Ignore any error while dropping old table */
	res = PQexec(conn, "DROP TABLE testtab");
	PQclear(res);

	/* Set up test table */
	do_cmd(conn, "CREATE TABLE testtab(k int, d text)");
	sprintf(cmd, "INSERT INTO testtab "
			"SELECT x, repeat('x',350) FROM generate_series(1,%d) x",
			TABLESIZE);
	do_cmd(conn, cmd);
	do_cmd(conn, "CREATE INDEX testtabi ON testtab(k)");
	do_cmd(conn, "CLUSTER testtab USING testtabi");
	do_cmd(conn, "VACUUM ANALYZE testtab");

	for (i = 0; i < ITERS; i++)
	{
		int			st;

		sprintf(cmd, "DELETE FROM testtab WHERE random() < 1.0/50");
		do_cmd(conn, cmd);
		do_cmd(conn, "VACUUM testtab");
		sprintf(cmd, "INSERT INTO testtab "
				"SELECT (random() * %d)::int, repeat('x',350) "
				"FROM generate_series(1,%d)",
				TABLESIZE, TABLESIZE/50);
		do_cmd(conn, cmd);
		st = ((double) random() * (TABLESIZE - 1000)) / 0x7FFFFFFF;
		sprintf(cmd, "SELECT count(*) FROM testtab "
				"WHERE k BETWEEN %d AND %d",
				st, st + 1000);
		do_select(conn, cmd);
		do_select(conn, "SELECT count(*) FROM testtab");
	}

	/* close the connection to the database and cleanup */
	PQfinish(conn);

	return 0;
}

executed CREATE in 0.066563 sec
executed INSERT in 40.465653 sec
executed CREATE in 9.152698 sec
executed CLUSTE in 20.036375 sec
executed VACUUM in 1.440232 sec
executed DELETE in 14.770937 sec
executed VACUUM in 10.663301 sec
executed INSERT in 2.449248 sec
retrieved  995 tuples in 0.170685 sec
retrieved 999971 tuples in 0.648237 sec
executed DELETE in 33.760709 sec
executed VACUUM in 28.762174 sec
executed INSERT in 12.212027 sec
retrieved  999 tuples in 0.003139 sec
retrieved 999642 tuples in 0.646765 sec
executed DELETE in 30.541053 sec
executed VACUUM in 13.204475 sec
executed INSERT in 17.972502 sec
retrieved  985 tuples in 0.003621 sec
retrieved 999612 tuples in 0.623749 sec
executed DELETE in 14.911813 sec
executed VACUUM in 27.443921 sec
executed INSERT in 19.125950 sec
retrieved 1002 tuples in 0.004900 sec
retrieved 999784 tuples in 0.667716 sec
executed DELETE in 22.651369 sec
executed VACUUM in 10.743926 sec
executed INSERT in 21.631076 sec
retrieved  987 tuples in 0.004566 sec
retrieved 999711 tuples in 0.632185 sec
executed DELETE in 11.587629 sec
executed VACUUM in 15.278964 sec
executed INSERT in 23.725325 sec
retrieved 1011 tuples in 0.005960 sec
retrieved 999624 tuples in 0.680135 sec
executed DELETE in 3.905152 sec
executed VACUUM in 42.848288 sec
executed INSERT in 18.619609 sec
retrieved 1006 tuples in 0.007106 sec
retrieved 999479 tuples in 0.678316 sec
executed DELETE in 27.288273 sec
executed VACUUM in 9.329839 sec
executed INSERT in 23.361110 sec
retrieved  983 tuples in 0.005997 sec
retrieved 999354 tuples in 0.615004 sec
executed DELETE in 20.416503 sec
executed VACUUM in 17.537463 sec
executed INSERT in 19.781416 sec
retrieved 1021 tuples in 0.006894 sec
retrieved 999688 tuples in 0.670918 sec
executed DELETE in 14.063273 sec
executed VACUUM in 17.439971 sec
executed INSERT in 17.381126 sec
retrieved  987 tuples in 0.006664 sec
retrieved 999930 tuples in 0.632494 sec

executed CREATE in 0.086862 sec
executed INSERT in 50.746362 sec
executed CREATE in 12.115655 sec
executed CLUSTE in 33.656341 sec
executed VACUUM in 4.306563 sec
executed DELETE in 18.062664 sec
executed VACUUM in 28.487570 sec
executed INSERT in 25.638022 sec
retrieved  998 tuples in 1.498475 sec
retrieved 1000019 tuples in 0.624082 sec
executed DELETE in 30.211317 sec
executed VACUUM in 24.147135 sec
executed INSERT in 40.759404 sec
retrieved 1006 tuples in 0.002711 sec
retrieved 1000184 tuples in 0.984822 sec
executed DELETE in 31.616131 sec
executed VACUUM in 22.383567 sec
executed INSERT in 36.174291 sec
retrieved  990 tuples in 0.002840 sec
retrieved 1000275 tuples in 0.723260 sec
executed DELETE in 34.279871 sec
executed VACUUM in 34.855060 sec
executed INSERT in 36.652868 sec
retrieved  990 tuples in 0.003554 sec
retrieved 1000376 tuples in 0.715215 sec
executed DELETE in 37.396236 sec
executed VACUUM in 17.721296 sec
executed INSERT in 32.413756 sec
retrieved  974 tuples in 0.003498 sec
retrieved 1000383 tuples in 0.723111 sec
executed DELETE in 44.696337 sec
executed VACUUM in 13.850628 sec
executed INSERT in 47.649557 sec
retrieved 1002 tuples in 0.004165 sec
retrieved 1000554 tuples in 0.722704 sec
executed DELETE in 17.854653 sec
executed VACUUM in 52.446585 sec
executed INSERT in 41.293512 sec
retrieved 1038 tuples in 0.004169 sec
retrieved 1000422 tuples in 0.692400 sec
executed DELETE in 32.102991 sec
executed VACUUM in 22.018916 sec
executed INSERT in 44.296194 sec
retrieved  991 tuples in 0.004636 sec
retrieved 1000713 tuples in 0.739701 sec
executed DELETE in 31.678825 sec
executed VACUUM in 12.643271 sec
executed INSERT in 53.179401 sec
retrieved 1015 tuples in 0.004745 sec
retrieved 1000797 tuples in 0.708603 sec
executed DELETE in 35.682182 sec
executed VACUUM in 42.612243 sec
executed INSERT in 45.726663 sec
retrieved 1017 tuples in 0.004753 sec
retrieved 1000755 tuples in 0.696441 sec
Tom Lane wrote:
> What you would hope to see as the benefit of the patch is that the time
> for the range SELECT degrades more slowly as more of the table is
> replaced. Ignoring the first SELECT as being a startup transient, it
> looks like HEAD degrades from about 3 msec to 6 msec over 10 iterations
> (20% replacement of the table), whereas with the patch it's about 3 msec
> to about 4 and a half. However, the INSERT steps went from around 20
> sec each to about twice that.

Clustering in general isn't very important on a table that fits in cache. Also note that the overhead of descending the b-tree twice on INSERTs hurts the most in a CPU-bound test like that. The overhead of the two-phase method I proposed should be lower.

Test 1:
> executed CREATE in 0.066563 sec
> executed INSERT in 40.465653 sec
> executed CREATE in 9.152698 sec
> executed CLUSTE in 20.036375 sec
> executed VACUUM in 1.440232 sec

Test 2:
> executed CREATE in 0.086862 sec
> executed INSERT in 50.746362 sec
> executed CREATE in 12.115655 sec
> executed CLUSTE in 33.656341 sec
> executed VACUUM in 4.306563 sec

Why is there such a big difference in all these initialization steps as well? Is it just random noise?

I played a bit with that test program; results from my laptop attached. I used a version patched with the latest patch I submitted, because that's what I had readily available. I used the same patched binaries in both test runs; I just didn't CLUSTER the table in the other run. The main difference to your results is that the DELETE, VACUUM and INSERT operations are much faster both with and without the patch. Most INSERTs, for example, took < 1 s, while in your results they took > 15 s. Any idea why?
--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com

executed CREATE in 0.098872 sec executed INSERT in 36.958824 sec executed CREATE in 7.320412 sec executed CLUSTE in 22.579578 sec executed VACUUM in 0.809287 sec
executed DELETE in 8.567730 sec executed VACUUM in 5.194736 sec executed INSERT in 0.882852 sec
executing select: SELECT count(*) FROM testtab WHERE k BETWEEN 839347 AND 840347
retrieved 993 tuples in 0.239210 sec
executing select: SELECT count(*) FROM testtab
retrieved 999898 tuples in 0.417749 sec
executed DELETE in 5.398222 sec executed VACUUM in 0.949003 sec executed INSERT in 0.771617 sec
executing select: SELECT count(*) FROM testtab WHERE k BETWEEN 393988 AND 394988
retrieved 1015 tuples in 0.001261 sec
executing select: SELECT count(*) FROM testtab
retrieved 1000038 tuples in 0.373199 sec
executed DELETE in 5.613083 sec executed VACUUM in 1.289501 sec executed INSERT in 1.434903 sec
executing select: SELECT count(*) FROM testtab WHERE k BETWEEN 782316 AND 783316
retrieved 1011 tuples in 0.001723 sec
executing select: SELECT count(*) FROM testtab
retrieved 999826 tuples in 0.353616 sec
executed DELETE in 3.729799 sec executed VACUUM in 1.406940 sec executed INSERT in 1.169133 sec
executing select: SELECT count(*) FROM testtab WHERE k BETWEEN 797641 AND 798641
retrieved 991 tuples in 0.001118 sec
executing select: SELECT count(*) FROM testtab
retrieved 999731 tuples in 0.340721 sec
executed DELETE in 1.448537 sec executed VACUUM in 0.918642 sec executed INSERT in 0.471951 sec
executing select: SELECT count(*) FROM testtab WHERE k BETWEEN 910735 AND 911735
retrieved 1029 tuples in 0.001174 sec
executing select: SELECT count(*) FROM testtab
retrieved 999773 tuples in 0.334880 sec
executed DELETE in 0.972467 sec executed VACUUM in 0.925557 sec executed INSERT in 0.645214 sec
executing select: SELECT count(*) FROM testtab WHERE k BETWEEN 197353 AND 198353
retrieved 975 tuples in 0.001076 sec
executing select: SELECT count(*) FROM testtab
retrieved 999751 tuples in 0.334343 sec
executed DELETE in 0.966780 sec executed VACUUM in 0.923394 sec executed INSERT in 0.468693 sec
executing select: SELECT count(*) FROM testtab WHERE k BETWEEN 334887 AND 335887
retrieved 983 tuples in 0.001381 sec
executing select: SELECT count(*) FROM testtab
retrieved 999801 tuples in 0.335184 sec
executed DELETE in 0.792456 sec executed VACUUM in 0.934451 sec executed INSERT in 0.435028 sec
executing select: SELECT count(*) FROM testtab WHERE k BETWEEN 767461 AND 768461
retrieved 998 tuples in 0.001279 sec
executing select: SELECT count(*) FROM testtab
retrieved 999854 tuples in 0.338010 sec
executed DELETE in 0.553884 sec executed VACUUM in 0.933886 sec executed INSERT in 0.435980 sec
executing select: SELECT count(*) FROM testtab WHERE k BETWEEN 277496 AND 278496
retrieved 1017 tuples in 0.001257 sec
executing select: SELECT count(*) FROM testtab
retrieved 999923 tuples in 0.333156 sec
executed DELETE in 0.518371 sec executed VACUUM in 0.932207 sec executed INSERT in 18.031924 sec
executing select: SELECT count(*) FROM testtab WHERE k BETWEEN 553415 AND 554415
retrieved 1012 tuples in 0.001280 sec
executing select: SELECT count(*) FROM testtab
retrieved 1000135 tuples in 0.343191 sec

executed CREATE in 0.014632 sec executed INSERT in 40.509772 sec executed CREATE in 4.485714 sec executed VACUUM in 0.396134 sec
executed DELETE in 15.366893 sec executed VACUUM in 4.004228 sec executed INSERT in 1.794740 sec
executing select: SELECT count(*) FROM testtab WHERE k BETWEEN 839347 AND 840347
retrieved 1008 tuples in 0.001182 sec
executing select: SELECT count(*) FROM testtab
retrieved 1000088 tuples in 0.314738 sec
executed DELETE in 19.191114 sec executed VACUUM in 0.862771 sec executed INSERT in 0.595838 sec
executing select: SELECT count(*) FROM testtab WHERE k BETWEEN 393988 AND 394988
retrieved 1010 tuples in 0.001081 sec
executing select: SELECT count(*) FROM testtab
retrieved 1000243 tuples in 0.312002 sec
executed DELETE in 4.152384 sec executed VACUUM in 0.876611 sec executed INSERT in 0.748715 sec
executing select: SELECT count(*) FROM testtab WHERE k BETWEEN 782316 AND 783316
retrieved 1006 tuples in 0.001168 sec
executing select: SELECT count(*) FROM testtab
retrieved 1000469 tuples in 0.313215 sec
executed DELETE in 1.860138 sec executed VACUUM in 0.887234 sec executed INSERT in 0.381926 sec
executing select: SELECT count(*) FROM testtab WHERE k BETWEEN 797641 AND 798641
retrieved 1001 tuples in 0.001170 sec
executing select: SELECT count(*) FROM testtab
retrieved 1000399 tuples in 0.313861 sec
executed DELETE in 1.203454 sec executed VACUUM in 0.894531 sec executed INSERT in 0.414772 sec
executing select: SELECT count(*) FROM testtab WHERE k BETWEEN 910735 AND 911735
retrieved 1005 tuples in 0.001192 sec
executing select: SELECT count(*) FROM testtab
retrieved 1000324 tuples in 0.313666 sec
executed DELETE in 1.103641 sec executed VACUUM in 0.911423 sec executed INSERT in 0.579772 sec
executing select: SELECT count(*) FROM testtab WHERE k BETWEEN 197353 AND 198353
retrieved 1012 tuples in 0.001281 sec
executing select: SELECT count(*) FROM testtab
retrieved 1000150 tuples in 0.314685 sec
executed DELETE in 0.757131 sec executed VACUUM in 0.904153 sec executed INSERT in 0.412712 sec
executing select: SELECT count(*) FROM testtab WHERE k BETWEEN 334887 AND 335887
retrieved 1003 tuples in 0.001255 sec
executing select: SELECT count(*) FROM testtab
retrieved 1000248 tuples in 0.342517 sec
executed DELETE in 0.572931 sec executed VACUUM in 0.910369 sec executed INSERT in 0.701230 sec
executing select: SELECT count(*) FROM testtab WHERE k BETWEEN 767461 AND 768461
retrieved 1022 tuples in 0.001361 sec
executing select: SELECT count(*) FROM testtab
retrieved 1000353 tuples in 0.313835 sec
executed DELETE in 0.529490 sec executed VACUUM in 0.913697 sec executed INSERT in 0.407656 sec
executing select: SELECT count(*) FROM testtab WHERE k BETWEEN 277496 AND 278496
retrieved 1031 tuples in 0.001368 sec
executing select: SELECT count(*) FROM testtab
retrieved 1000435 tuples in 0.314925 sec
executed DELETE in 4.871267 sec executed VACUUM in 9.618490 sec executed INSERT in 2.496079 sec
executing select: SELECT count(*) FROM testtab WHERE k BETWEEN 553415 AND 554415
retrieved 987 tuples in 0.001641 sec
executing select: SELECT count(*) FROM testtab
retrieved 1000648 tuples in 0.326064 sec

max_connections = 100			# (change requires restart)
shared_buffers = 700MB			# min 128kB or max_connections*16kB
maintenance_work_mem = 160MB		# min 1MB
max_fsm_pages = 204800			# min max_fsm_relations*16, 6 bytes each
full_page_writes = on			# recover from partial page writes
wal_buffers = 640kB			# min 32kB
checkpoint_segments = 30		# in logfile segments, min 1, 16MB each
checkpoint_timeout = 30min		# range 30s-1h
checkpoint_warning = 600s		# 0 is off
autovacuum = off			# enable autovacuum subprocess?
datestyle = 'iso, dmy'
lc_messages = 'en_GB.UTF-8'		# locale for system error message
lc_monetary = 'en_GB.UTF-8'		# locale for monetary formatting
lc_numeric = 'en_GB.UTF-8'		# locale for number formatting
lc_time = 'en_GB.UTF-8'			# locale for time formatting
"Tom Lane" <tgl@sss.pgh.pa.us> writes:
> * delete a random 2% of the table
> * vacuum to recover space
> * insert a random 2% of the table
> * select (about) 1000 consecutively-numbered rows
> * select all the rows (this is just a cross check that the
> number of rows isn't changing too much)
>
> What you would hope to see as the benefit of the patch is that the time
> for the range SELECT degrades more slowly as more of the table is
> replaced. Ignoring the first SELECT as being a startup transient, it
> looks like HEAD degrades from about 3 msec to 6 msec over 10 iterations
> (20% replacement of the table), whereas with the patch it's about 3 msec
> to about 4 and a half. However, the INSERT steps went from around 20
> sec each to about twice that.

I would suggest:

a) continuing the test until you have 100% replacement.
b) defining the tables with a fill-factor large enough to hold at least one extra tuple
c) considering this patch alongside GII where it is especially important.

It's pretty serious what you're suggesting since it means that we'll basically never have a real cluster feature. I would sure hope we're missing something and there's a way to make this work usefully.

--
Gregory Stark
EnterpriseDB   http://www.enterprisedb.com
Gregory Stark wrote:
> It's pretty serious what you're suggesting since it means that we'll basically
> never have a real cluster feature. I would sure hope we're missing something
> and there's a way to make this work usefully.

Another approach would be an online CLUSTER command. That means there'll be a lot of churn when tuples need to be moved back and forth, along with updating indexes, but it'd take the overhead out of the critical path.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
On Mon, 9 Jul 2007, Tom Lane wrote:

> The hardware is just a Dell x86_64 workstation with crappy IDE disk, so
> maybe things would look better elsewhere, but it's all I have to work
> with.

Do you have write-caching turned off on the drive, so INSERTs are being rate-limited by WAL syncs? Trying to characterize the class of setup you're using.

The part that seemed curious to me about your results in the unpatched version is why the first INSERT takes 2.4 seconds, while the second takes 12.2 and later ones settle at 17-23 s. I could understand the 12-23 variation, but that the first one fires off in 2.4 seems kind of fishy; why so fast? Is there something that still fits in memory in that case but is missing in the patched version?

results-head:
...
executed DELETE in 14.770937 sec
executed VACUUM in 10.663301 sec
executed INSERT in 2.449248 sec (1st)
...
executed INSERT in 12.212027 sec (2nd)

results-patch:
...
executed DELETE in 18.062664 sec
executed VACUUM in 28.487570 sec
executed INSERT in 25.638022 sec (1st)
...
executed INSERT in 40.759404 sec (2nd)

--
* Greg Smith  gsmith@gregsmith.com  http://www.gregsmith.com  Baltimore, MD
Gregory Stark <stark@enterprisedb.com> writes:
> It's pretty serious what you're suggesting since it means that we'll
> basically never have a real cluster feature.

It seems entirely likely that this is not the way to go about "real clustering".

			regards, tom lane
Greg Smith <gsmith@gregsmith.com> writes:
> On Mon, 9 Jul 2007, Tom Lane wrote:
>> The hardware is just a Dell x86_64 workstation with crappy IDE disk, so
>> maybe things would look better elsewhere, but it's all I have to work
>> with.

> Do you have write-caching turned off on the drive so INSERTs are being
> rate-limited by WAL syncs? Trying to characterize the class of setup
> you're using.

It's just desktop-grade junk :-(. Feel free to repeat the test on something more serious.

> The part that seemed curious to me about your results in the unpatched
> version is why the first INSERT takes 2.4 seconds, while the second takes
> 12.2 then later ones settle from 17-23 s. I could understand the 12-23
> variation, but that the first one fires off in 2.4 seems kind of fishy;

The test seemed to be mostly I/O bound according to vmstat. I supposed that it was fast up until the kernel's available space was full of dirty pages, and then subsequent writes would be constrained by actual disk write bandwidth. Anyway, the numbers seemed relatively consistent after the setup and first test cycle, so I think the part I was actually trying to draw conclusions from was probably real enough.

			regards, tom lane
On Tue, 10 Jul 2007, Tom Lane wrote:

> It's just desktop-grade junk :-(. Feel free to repeat the test on
> something more serious.

Right, but even such junk can be set up so that the disks honor commits; I just wanted to confirm you didn't go out of your way to do that--sounds like you didn't. My mini-lab at home has two PG test systems: one with a real write cache, the other an IDE drive with the cache turned off, so commits actually wait to hit the platter. The painful slowness of writes in that situation keeps me grounded as to how much data integrity actually costs if you're not accelerating it, and the differences in benchmarks between the two systems (which are otherwise a similar class of hardware) can be informative.

> Anyway the numbers seemed relatively consistent after the setup and
> first test cycle, so I think the part I was actually trying to draw
> conclusions from was probably real enough.

Agreed; I was just figuring out the test ramp-up situation, and your explanation for that quirk sounds reasonable.

--
* Greg Smith  gsmith@gregsmith.com  http://www.gregsmith.com  Baltimore, MD
This idea has been rejected due to poor performance results reported later in the thread.

---------------------------------------------------------------------------

Heikki Linnakangas wrote:
> While thinking about index-organized-tables and similar ideas, it
> occurred to me that there's some low-hanging-fruit: maintaining cluster
> order on inserts by trying to place new heap tuples close to other
> similar tuples. That involves asking the index am where on the heap the
> new tuple should go, and trying to insert it there before using the FSM.
> Using the new fillfactor parameter makes it more likely that there's
> room on the page. We don't worry about the order within the page.
>
> The API I'm thinking of introduces a new optional index am function,
> amsuggestblock (suggestions for a better name are welcome). It gets the
> same parameters as aminsert, and returns the heap block number that
> would be optimal place to put the new tuple. It's be called from
> ExecInsert before inserting the heap tuple, and the suggestion is passed
> on to heap_insert and RelationGetBufferForTuple.
>
> I wrote a little patch to implement this for btree, attached.
>
> This could be optimized by changing the existing aminsert API, because
> as it is, an insert will have to descend the btree twice. Once in
> amsuggestblock and then in aminsert. amsuggestblock could keep the right
> index page pinned so aminsert could locate it quicker. But I wanted to
> keep this simple for now. Another improvement might be to allow
> amsuggestblock to return a list of suggestions, but that makes it more
> expensive to insert if there isn't room in the suggested pages, since
> heap_insert will have to try them all before giving up.
>
> Comments regarding the general idea or the patch? There should probably
> be a index option to turn the feature on and off. You'll want to turn it
> off when you first load a table, and turn it on after CLUSTER to keep it
> clustered.
>
> Since there's been discussion on keeping the TODO list more up-to-date,
> I hereby officially claim the "Automatically maintain clustering on a
> table" TODO item :). Feel free to bombard me with requests for status
> reports. And just to be clear, I'm not trying to sneak this into 8.2
> anymore, this is 8.3 stuff.
>
> I won't be implementing a background daemon described on the TODO item,
> since that would essentially be an online version of CLUSTER. Which sure
> would be nice, but that's a different story.
>
> - Heikki
> [ text/x-patch is unsupported, treating like TEXT/PLAIN ]
> Index: doc/src/sgml/catalogs.sgml
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/doc/src/sgml/catalogs.sgml,v
> retrieving revision 2.129
> diff -c -r2.129 catalogs.sgml
> *** doc/src/sgml/catalogs.sgml	31 Jul 2006 20:08:55 -0000	2.129
> --- doc/src/sgml/catalogs.sgml	8 Aug 2006 16:17:21 -0000
> ***************
> *** 499,504 ****
> --- 499,511 ----
> <entry>Function to parse and validate reloptions for an index</entry>
> </row>
>
> + <row>
> + <entry><structfield>amsuggestblock</structfield></entry>
> + <entry><type>regproc</type></entry>
> + <entry><literal><link linkend="catalog-pg-proc"><structname>pg_proc</structname></link>.oid</literal></entry>
> + <entry>Get the best place in the heap to put a new tuple</entry>
> + </row>
> +
> </tbody>
> </tgroup>
> </table>
> Index: doc/src/sgml/indexam.sgml
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/doc/src/sgml/indexam.sgml,v
> retrieving revision 2.16
> diff -c -r2.16 indexam.sgml
> *** doc/src/sgml/indexam.sgml	31 Jul 2006 20:08:59 -0000	2.16
> --- doc/src/sgml/indexam.sgml	8 Aug 2006 17:15:25 -0000
> ***************
> *** 391,396 ****
> --- 391,414 ----
> <function>amoptions</> to test validity of options settings.
> </para>
>
> + <para>
> + <programlisting>
> + BlockNumber
> + amsuggestblock (Relation indexRelation,
> + Datum *values,
> + bool *isnull,
> + Relation heapRelation);
> + </programlisting>
> + Gets the optimal place in the heap for a new tuple. The parameters
> + correspond the parameters for <literal>aminsert</literal>.
> + This function is called on the clustered index before a new tuple
> + is inserted to the heap, and it should choose the optimal insertion
> + target page on the heap in such manner that the heap stays as close
> + as possible to the index order.
> + <literal>amsuggestblock</literal> can return InvalidBlockNumber if
> + the index am doesn't have a suggestion.
> + </para>
> +
> </sect1>
>
> <sect1 id="index-scanning">
> Index: src/backend/access/heap/heapam.c
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/heap/heapam.c,v
> retrieving revision 1.218
> diff -c -r1.218 heapam.c
> *** src/backend/access/heap/heapam.c	31 Jul 2006 20:08:59 -0000	1.218
> --- src/backend/access/heap/heapam.c	8 Aug 2006 16:17:21 -0000
> ***************
> *** 1325,1330 ****
> --- 1325,1335 ----
> * use_fsm is passed directly to RelationGetBufferForTuple, which see for
> * more info.
> *
> + * suggested_blk can be set by the caller to hint heap_insert which
> + * block would be the best place to put the new tuple in. heap_insert can
> + * ignore the suggestion, if there's not enough room on that block.
> + * InvalidBlockNumber means no preference.
> + *
> * The return value is the OID assigned to the tuple (either here or by the
> * caller), or InvalidOid if no OID. The header fields of *tup are updated
> * to match the stored tuple; in particular tup->t_self receives the actual
> ***************
> *** 1333,1339 ****
> */
> Oid
> heap_insert(Relation relation, HeapTuple tup, CommandId cid,
> ! bool use_wal, bool use_fsm)
> {
> TransactionId xid = GetCurrentTransactionId();
> HeapTuple heaptup;
> --- 1338,1344 ----
> */
> Oid
> heap_insert(Relation relation, HeapTuple tup, CommandId cid,
> ! bool use_wal, bool use_fsm, BlockNumber suggested_blk)
> {
> TransactionId xid = GetCurrentTransactionId();
> HeapTuple heaptup;
> ***************
> *** 1386,1392 ****
>
> /* Find buffer to insert this tuple into */
> buffer = RelationGetBufferForTuple(relation, heaptup->t_len,
> ! InvalidBuffer, use_fsm);
>
> /* NO EREPORT(ERROR) from here till changes are logged */
> START_CRIT_SECTION();
> --- 1391,1397 ----
>
> /* Find buffer to insert this tuple into */
> buffer = RelationGetBufferForTuple(relation, heaptup->t_len,
> ! InvalidBuffer, use_fsm, suggested_blk);
>
> /* NO EREPORT(ERROR) from here till changes are logged */
> START_CRIT_SECTION();
> ***************
> *** 1494,1500 ****
> Oid
> simple_heap_insert(Relation relation, HeapTuple tup)
> {
> ! return heap_insert(relation, tup, GetCurrentCommandId(), true, true);
> }
>
> /*
> --- 1499,1506 ----
> Oid
> simple_heap_insert(Relation relation, HeapTuple tup)
> {
> ! return heap_insert(relation, tup, GetCurrentCommandId(), true,
> ! true, InvalidBlockNumber);
> }
>
> /*
> ***************
> *** 2079,2085 ****
> {
> /* Assume there's no chance to put heaptup on same page. */
> newbuf = RelationGetBufferForTuple(relation, heaptup->t_len,
> ! buffer, true);
> }
> else
> {
> --- 2085,2092 ----
> {
> /* Assume there's no chance to put heaptup on same page. */
> newbuf = RelationGetBufferForTuple(relation, heaptup->t_len,
> ! buffer, true,
> ! InvalidBlockNumber);
> }
> else
> {
> ***************
> *** 2096,2102 ****
> */
> LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
> newbuf = RelationGetBufferForTuple(relation, heaptup->t_len,
> ! buffer, true);
> }
> else
> {
> --- 2103,2110 ----
> */
> LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
> newbuf = RelationGetBufferForTuple(relation, heaptup->t_len,
> ! buffer, true,
> ! InvalidBlockNumber);
> }
> else
> {
> Index: src/backend/access/heap/hio.c
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/heap/hio.c,v
> retrieving revision 1.63
> diff -c -r1.63 hio.c
> *** src/backend/access/heap/hio.c	3 Jul 2006 22:45:37 -0000	1.63
> --- src/backend/access/heap/hio.c	9 Aug 2006 18:03:01 -0000
> ***************
> *** 93,98 ****
> --- 93,100 ----
> * any committed data of other transactions. (See heap_insert's comments
> * for additional constraints needed for safe usage of this behavior.)
> *
> + * If the caller has a suggestion, it's passed in suggestedBlock.
> + *
> * We always try to avoid filling existing pages further than the fillfactor.
> * This is OK since this routine is not consulted when updating a tuple and
> * keeping it on the same page, which is the scenario fillfactor is meant
> ***************
> *** 103,109 ****
> */
> Buffer
> RelationGetBufferForTuple(Relation relation, Size len,
> ! Buffer otherBuffer, bool use_fsm)
> {
> Buffer buffer = InvalidBuffer;
> Page pageHeader;
> --- 105,112 ----
> */
> Buffer
> RelationGetBufferForTuple(Relation relation, Size len,
> ! Buffer otherBuffer, bool use_fsm,
> ! BlockNumber suggestedBlock)
> {
> Buffer buffer = InvalidBuffer;
> Page pageHeader;
> ***************
> *** 135,142 ****
> otherBlock = InvalidBlockNumber; /* just to keep compiler quiet */
>
> /*
> ! * We first try to put the tuple on the same page we last inserted a tuple
> ! * on, as cached in the relcache entry. If that doesn't work, we ask the
> * shared Free Space Map to locate a suitable page. Since the FSM's info
> * might be out of date, we have to be prepared to loop around and retry
> * multiple times. (To insure this isn't an infinite loop, we must update
> --- 138,147 ----
> otherBlock = InvalidBlockNumber; /* just to keep compiler quiet */
>
> /*
> ! * We first try to put the tuple on the page suggested by the caller, if
> ! * any. Then we try to put the tuple on the same page we last inserted a
> ! * tuple on, as cached in the relcache entry. If that doesn't work, we
> ! * ask the
> * shared Free Space Map to locate a suitable page. Since the FSM's info
> * might be out of date, we have to be prepared to loop around and retry
> * multiple times. (To insure this isn't an infinite loop, we must update
> ***************
> *** 144,152 ****
> * not to be suitable.) If the FSM has no record of a page with enough
> * free space, we give up and extend the relation.
> *
> ! * When use_fsm is false, we either put the tuple onto the existing target
> ! * page or extend the relation.
> */
> if (len + saveFreeSpace <= MaxTupleSize)
> targetBlock = relation->rd_targblock;
> else
> --- 149,167 ----
> * not to be suitable.) If the FSM has no record of a page with enough
> * free space, we give up and extend the relation.
> *
> ! * When use_fsm is false, we skip the fsm lookup if neither the suggested
> ! * nor the cached last insertion page has enough room, and extend the
> ! * relation.
> ! *
> ! * The fillfactor is taken into account when calculating the free space
> ! * on the cached target block, and when using the FSM. The suggested page
> ! * is used whenever there's enough room in it, regardless of the fillfactor,
> ! * because that's exactly the purpose the space is reserved for in the
> ! * first place.
> */
> + if (suggestedBlock != InvalidBlockNumber)
> + targetBlock = suggestedBlock;
> + else
> if (len + saveFreeSpace <= MaxTupleSize)
> targetBlock = relation->rd_targblock;
> else
> ***************
> *** 219,224 ****
> --- 234,244 ----
> */
> pageHeader = (Page) BufferGetPage(buffer);
> pageFreeSpace = PageGetFreeSpace(pageHeader);
> +
> + /* If we're trying the suggested block, don't care about fillfactor */
> + if (targetBlock == suggestedBlock && len <= pageFreeSpace)
> + return buffer;
> +
> if (len + saveFreeSpace <= pageFreeSpace)
> {
> /* use this page as future insert target, too */
> ***************
> *** 241,246 ****
> --- 261,275 ----
> ReleaseBuffer(buffer);
> }
>
> + /* If we just tried the suggested block, try the cached target
> + * block next, before consulting the FSM. */
> + if(suggestedBlock == targetBlock)
> + {
> + targetBlock = relation->rd_targblock;
> + suggestedBlock = InvalidBlockNumber;
> + continue;
> + }
> +
> /* Without FSM, always fall out of the loop and extend */
> if (!use_fsm)
> break;
> Index: src/backend/access/index/genam.c
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/index/genam.c,v
> retrieving revision 1.58
> diff -c -r1.58 genam.c
> *** src/backend/access/index/genam.c	31 Jul 2006 20:08:59 -0000	1.58
> --- src/backend/access/index/genam.c	8 Aug 2006 16:17:21 -0000
> ***************
> *** 259,261 ****
> --- 259,275 ----
>
> pfree(sysscan);
> }
> +
> + /*
> + * This is a dummy implementation of amsuggestblock, to be used for index
> + * access methods that don't or can't support it. It just returns
> + * InvalidBlockNumber, which means "no preference".
> + *
> + * This is probably not a good best place for this function, but it doesn't
> + * fit naturally anywhere else either.
> + */
> + Datum
> + dummysuggestblock(PG_FUNCTION_ARGS)
> + {
> + PG_RETURN_UINT32(InvalidBlockNumber);
> + }
> Index: src/backend/access/index/indexam.c
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/index/indexam.c,v
> retrieving revision 1.94
> diff -c -r1.94 indexam.c
> *** src/backend/access/index/indexam.c	31 Jul 2006 20:08:59 -0000	1.94
> --- src/backend/access/index/indexam.c	8 Aug 2006 16:17:21 -0000
> ***************
> *** 18,23 ****
> --- 18,24 ----
> * index_rescan - restart a scan of an index
> * index_endscan - end a scan
> * index_insert - insert an index tuple into a relation
> + * index_suggestblock - get desired insert location for a heap tuple
> * index_markpos - mark a scan position
> * index_restrpos - restore a scan position
> * index_getnext - get the next tuple from a scan
> ***************
> *** 202,207 ****
> --- 203,237 ----
> BoolGetDatum(check_uniqueness)));
> }
>
> + /* ----------------
> + * index_suggestblock - get desired insert location for a heap tuple
> + *
> + * The returned BlockNumber is the *heap* page that is the best place
> + * to insert the given tuple to, according to the index am. The best
> + * place is usually one that maintains the cluster order.
> + * ----------------
> + */
> + BlockNumber
> + index_suggestblock(Relation indexRelation,
> + Datum *values,
> + bool *isnull,
> + Relation heapRelation)
> + {
> + FmgrInfo *procedure;
> +
> + RELATION_CHECKS;
> + GET_REL_PROCEDURE(amsuggestblock);
> +
> + /*
> + * have the am's suggestblock proc do all the work.
> + */ > + return DatumGetUInt32(FunctionCall4(procedure, > + PointerGetDatum(indexRelation), > + PointerGetDatum(values), > + PointerGetDatum(isnull), > + PointerGetDatum(heapRelation))); > + } > + > /* > * index_beginscan - start a scan of an index with amgettuple > * > Index: src/backend/access/nbtree/nbtinsert.c > =================================================================== > RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/nbtree/nbtinsert.c,v > retrieving revision 1.142 > diff -c -r1.142 nbtinsert.c > *** src/backend/access/nbtree/nbtinsert.c 25 Jul 2006 19:13:00 -0000 1.142 > --- src/backend/access/nbtree/nbtinsert.c 9 Aug 2006 17:51:33 -0000 > *************** > *** 146,151 **** > --- 146,221 ---- > } > > /* > + * _bt_suggestblock() -- Find the heap block of the closest index tuple. > + * > + * The logic to find the target should match _bt_doinsert, otherwise > + * we'll be making bad suggestions. > + */ > + BlockNumber > + _bt_suggestblock(Relation rel, IndexTuple itup, Relation heapRel) > + { > + int natts = rel->rd_rel->relnatts; > + OffsetNumber offset; > + Page page; > + BTPageOpaque opaque; > + > + ScanKey itup_scankey; > + BTStack stack; > + Buffer buf; > + IndexTuple curitup; > + BlockNumber suggestion = InvalidBlockNumber; > + > + /* we need an insertion scan key to do our search, so build one */ > + itup_scankey = _bt_mkscankey(rel, itup); > + > + /* find the first page containing this key */ > + stack = _bt_search(rel, natts, itup_scankey, false, &buf, BT_READ); > + if(!BufferIsValid(buf)) > + { > + /* The index was completely empty. No suggestion then. */ > + return InvalidBlockNumber; > + } > + /* we don't need the stack, so free it right away */ > + _bt_freestack(stack); > + > + page = BufferGetPage(buf); > + opaque = (BTPageOpaque) PageGetSpecialPointer(page); > + > + /* Find the location in the page where the new index tuple would go to. 
*/
> + 
> + offset = _bt_binsrch(rel, buf, natts, itup_scankey, false);
> + if (offset > PageGetMaxOffsetNumber(page))
> + {
> + /* _bt_binsrch returned pointer to end-of-page. It means that
> + * there were no equal items on the page, and the new item should
> + * be inserted as the last tuple of the page. There could be equal
> + * items on the next page, however.
> + *
> + * At the moment, we just ignore the potential equal items on the
> + * right, and pretend there aren't any. We could instead walk right
> + * to the next page to check that, but let's keep it simple for now.
> + */
> + offset = OffsetNumberPrev(offset);
> + }
> + if(offset < P_FIRSTDATAKEY(opaque))
> + {
> + /* We landed on an empty page. We could step left or right until
> + * we find some items, but let's keep it simple for now.
> + */
> + } else {
> + /* We're now positioned at the index tuple that we're interested in. */
> + 
> + curitup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offset));
> + suggestion = ItemPointerGetBlockNumber(&curitup->t_tid);
> + }
> + 
> + _bt_relbuf(rel, buf);
> + _bt_freeskey(itup_scankey);
> + 
> + return suggestion;
> + }
> + 
> + /*
> * _bt_check_unique() -- Check for violation of unique index constraint
> *
> * Returns InvalidTransactionId if there is no conflict, else an xact ID
> Index: src/backend/access/nbtree/nbtree.c
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/nbtree/nbtree.c,v
> retrieving revision 1.149
> diff -c -r1.149 nbtree.c
> *** src/backend/access/nbtree/nbtree.c 10 May 2006 23:18:39 -0000 1.149
> --- src/backend/access/nbtree/nbtree.c 9 Aug 2006 18:04:02 -0000
> ***************
> *** 228,233 ****
> --- 228,265 ----
> }
> 
> /*
> + * btsuggestblock() -- find the best place in the heap to put a new tuple.
> + *
> + * This uses the same logic as btinsert to find the place where the index
> + * tuple would go if this were a btinsert call. 
> + *
> + * There's room for improvement here. An insert operation will descend
> + * the tree twice, first by btsuggestblock, then by btinsert. Things
> + * might have changed in between, so that the heap tuple is actually
> + * not inserted in the optimal page, but since this is just an
> + * optimization, it's ok if it happens sometimes.
> + */
> + Datum
> + btsuggestblock(PG_FUNCTION_ARGS)
> + {
> + Relation rel = (Relation) PG_GETARG_POINTER(0);
> + Datum *values = (Datum *) PG_GETARG_POINTER(1);
> + bool *isnull = (bool *) PG_GETARG_POINTER(2);
> + Relation heapRel = (Relation) PG_GETARG_POINTER(3);
> + IndexTuple itup;
> + BlockNumber suggestion;
> + 
> + /* generate an index tuple */
> + itup = index_form_tuple(RelationGetDescr(rel), values, isnull);
> + 
> + suggestion = _bt_suggestblock(rel, itup, heapRel);
> + 
> + pfree(itup);
> + 
> + PG_RETURN_UINT32(suggestion);
> + }
> + 
> + /*
> * btgettuple() -- Get the next tuple in the scan.
> */
> Datum
> Index: src/backend/executor/execMain.c
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/executor/execMain.c,v
> retrieving revision 1.277
> diff -c -r1.277 execMain.c
> *** src/backend/executor/execMain.c 31 Jul 2006 01:16:37 -0000 1.277
> --- src/backend/executor/execMain.c 8 Aug 2006 16:17:21 -0000
> ***************
> *** 892,897 ****
> --- 892,898 ----
> resultRelInfo->ri_RangeTableIndex = resultRelationIndex;
> resultRelInfo->ri_RelationDesc = resultRelationDesc;
> resultRelInfo->ri_NumIndices = 0;
> + resultRelInfo->ri_ClusterIndex = -1;
> resultRelInfo->ri_IndexRelationDescs = NULL;
> resultRelInfo->ri_IndexRelationInfo = NULL;
> /* make a copy so as not to depend on relcache info not changing... */
> ***************
> *** 1388,1394 ****
> heap_insert(estate->es_into_relation_descriptor, tuple,
> estate->es_snapshot->curcid,
> estate->es_into_relation_use_wal,
> ! 
false); /* never any point in using FSM */ > /* we know there are no indexes to update */ > heap_freetuple(tuple); > IncrAppended(); > --- 1389,1396 ---- > heap_insert(estate->es_into_relation_descriptor, tuple, > estate->es_snapshot->curcid, > estate->es_into_relation_use_wal, > ! false, /* never any point in using FSM */ > ! InvalidBlockNumber); > /* we know there are no indexes to update */ > heap_freetuple(tuple); > IncrAppended(); > *************** > *** 1419,1424 **** > --- 1421,1427 ---- > ResultRelInfo *resultRelInfo; > Relation resultRelationDesc; > Oid newId; > + BlockNumber suggestedBlock; > > /* > * get the heap tuple out of the tuple table slot, making sure we have a > *************** > *** 1467,1472 **** > --- 1470,1479 ---- > if (resultRelationDesc->rd_att->constr) > ExecConstraints(resultRelInfo, slot, estate); > > + /* Ask the index am of the clustered index for the > + * best place to put it */ > + suggestedBlock = ExecSuggestBlock(slot, estate); > + > /* > * insert the tuple > * > *************** > *** 1475,1481 **** > */ > newId = heap_insert(resultRelationDesc, tuple, > estate->es_snapshot->curcid, > ! true, true); > > IncrAppended(); > (estate->es_processed)++; > --- 1482,1488 ---- > */ > newId = heap_insert(resultRelationDesc, tuple, > estate->es_snapshot->curcid, > ! 
true, true, suggestedBlock);
> 
> IncrAppended();
> (estate->es_processed)++;
> Index: src/backend/executor/execUtils.c
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/executor/execUtils.c,v
> retrieving revision 1.139
> diff -c -r1.139 execUtils.c
> *** src/backend/executor/execUtils.c 4 Aug 2006 21:33:36 -0000 1.139
> --- src/backend/executor/execUtils.c 9 Aug 2006 18:05:05 -0000
> ***************
> *** 31,36 ****
> --- 31,37 ----
> * ExecOpenIndices \
> * ExecCloseIndices | referenced by InitPlan, EndPlan,
> * ExecInsertIndexTuples / ExecInsert, ExecUpdate
> + * ExecSuggestBlock referenced by ExecInsert
> *
> * RegisterExprContextCallback Register function shutdown callback
> * UnregisterExprContextCallback Deregister function shutdown callback
> ***************
> *** 874,879 ****
> --- 875,881 ----
> IndexInfo **indexInfoArray;
> 
> resultRelInfo->ri_NumIndices = 0;
> + resultRelInfo->ri_ClusterIndex = -1;
> 
> /* fast path if no indexes */
> if (!RelationGetForm(resultRelation)->relhasindex)
> ***************
> *** 913,918 ****
> --- 915,925 ----
> /* extract index key information from the index's pg_index info */
> ii = BuildIndexInfo(indexDesc);
> 
> + /* Remember which index is the clustered one.
> + * It's used to call the suggestblock method on inserts */
> + if(indexDesc->rd_index->indisclustered)
> + resultRelInfo->ri_ClusterIndex = i;
> + 
> relationDescs[i] = indexDesc;
> indexInfoArray[i] = ii;
> i++;
> ***************
> *** 1062,1067 ****
> --- 1069,1137 ----
> }
> }
> 
> + /* ----------------------------------------------------------------
> + * ExecSuggestBlock
> + *
> + * This routine asks the index am where a new heap tuple
> + * should be placed. 
> + * ---------------------------------------------------------------- > + */ > + BlockNumber > + ExecSuggestBlock(TupleTableSlot *slot, > + EState *estate) > + { > + ResultRelInfo *resultRelInfo; > + int i; > + Relation relationDesc; > + Relation heapRelation; > + ExprContext *econtext; > + Datum values[INDEX_MAX_KEYS]; > + bool isnull[INDEX_MAX_KEYS]; > + IndexInfo *indexInfo; > + > + /* > + * Get information from the result relation info structure. > + */ > + resultRelInfo = estate->es_result_relation_info; > + i = resultRelInfo->ri_ClusterIndex; > + if(i == -1) > + return InvalidBlockNumber; /* there was no clustered index */ > + > + heapRelation = resultRelInfo->ri_RelationDesc; > + relationDesc = resultRelInfo->ri_IndexRelationDescs[i]; > + indexInfo = resultRelInfo->ri_IndexRelationInfo[i]; > + > + /* You can't cluster on a partial index */ > + Assert(indexInfo->ii_Predicate == NIL); > + > + /* > + * We will use the EState's per-tuple context for evaluating > + * index expressions (creating it if it's not already there). > + */ > + econtext = GetPerTupleExprContext(estate); > + > + /* Arrange for econtext's scan tuple to be the tuple under test */ > + econtext->ecxt_scantuple = slot; > + > + /* > + * FormIndexDatum fills in its values and isnull parameters with the > + * appropriate values for the column(s) of the index. > + */ > + FormIndexDatum(indexInfo, > + slot, > + estate, > + values, > + isnull); > + > + /* > + * The index AM does the rest. 
> + */ > + return index_suggestblock(relationDesc, /* index relation */ > + values, /* array of index Datums */ > + isnull, /* null flags */ > + heapRelation); > + } > + > /* > * UpdateChangedParamSet > * Add changed parameters to a plan node's chgParam set > Index: src/include/access/genam.h > =================================================================== > RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/access/genam.h,v > retrieving revision 1.65 > diff -c -r1.65 genam.h > *** src/include/access/genam.h 31 Jul 2006 20:09:05 -0000 1.65 > --- src/include/access/genam.h 9 Aug 2006 17:53:44 -0000 > *************** > *** 93,98 **** > --- 93,101 ---- > ItemPointer heap_t_ctid, > Relation heapRelation, > bool check_uniqueness); > + extern BlockNumber index_suggestblock(Relation indexRelation, > + Datum *values, bool *isnull, > + Relation heapRelation); > > extern IndexScanDesc index_beginscan(Relation heapRelation, > Relation indexRelation, > *************** > *** 123,128 **** > --- 126,133 ---- > extern FmgrInfo *index_getprocinfo(Relation irel, AttrNumber attnum, > uint16 procnum); > > + extern Datum dummysuggestblock(PG_FUNCTION_ARGS); > + > /* > * index access method support routines (in genam.c) > */ > Index: src/include/access/heapam.h > =================================================================== > RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/access/heapam.h,v > retrieving revision 1.114 > diff -c -r1.114 heapam.h > *** src/include/access/heapam.h 3 Jul 2006 22:45:39 -0000 1.114 > --- src/include/access/heapam.h 8 Aug 2006 16:17:21 -0000 > *************** > *** 156,162 **** > extern void setLastTid(const ItemPointer tid); > > extern Oid heap_insert(Relation relation, HeapTuple tup, CommandId cid, > ! 
bool use_wal, bool use_fsm); > extern HTSU_Result heap_delete(Relation relation, ItemPointer tid, > ItemPointer ctid, TransactionId *update_xmax, > CommandId cid, Snapshot crosscheck, bool wait); > --- 156,162 ---- > extern void setLastTid(const ItemPointer tid); > > extern Oid heap_insert(Relation relation, HeapTuple tup, CommandId cid, > ! bool use_wal, bool use_fsm, BlockNumber suggestedblk); > extern HTSU_Result heap_delete(Relation relation, ItemPointer tid, > ItemPointer ctid, TransactionId *update_xmax, > CommandId cid, Snapshot crosscheck, bool wait); > Index: src/include/access/hio.h > =================================================================== > RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/access/hio.h,v > retrieving revision 1.32 > diff -c -r1.32 hio.h > *** src/include/access/hio.h 13 Jul 2006 17:47:01 -0000 1.32 > --- src/include/access/hio.h 8 Aug 2006 16:17:21 -0000 > *************** > *** 21,26 **** > extern void RelationPutHeapTuple(Relation relation, Buffer buffer, > HeapTuple tuple); > extern Buffer RelationGetBufferForTuple(Relation relation, Size len, > ! Buffer otherBuffer, bool use_fsm); > > #endif /* HIO_H */ > --- 21,26 ---- > extern void RelationPutHeapTuple(Relation relation, Buffer buffer, > HeapTuple tuple); > extern Buffer RelationGetBufferForTuple(Relation relation, Size len, > ! 
Buffer otherBuffer, bool use_fsm, BlockNumber suggestedblk); > > #endif /* HIO_H */ > Index: src/include/access/nbtree.h > =================================================================== > RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/access/nbtree.h,v > retrieving revision 1.103 > diff -c -r1.103 nbtree.h > *** src/include/access/nbtree.h 7 Aug 2006 16:57:57 -0000 1.103 > --- src/include/access/nbtree.h 8 Aug 2006 16:17:21 -0000 > *************** > *** 467,472 **** > --- 467,473 ---- > extern Datum btbulkdelete(PG_FUNCTION_ARGS); > extern Datum btvacuumcleanup(PG_FUNCTION_ARGS); > extern Datum btoptions(PG_FUNCTION_ARGS); > + extern Datum btsuggestblock(PG_FUNCTION_ARGS); > > /* > * prototypes for functions in nbtinsert.c > *************** > *** 476,481 **** > --- 477,484 ---- > extern Buffer _bt_getstackbuf(Relation rel, BTStack stack, int access); > extern void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf, > BTStack stack, bool is_root, bool is_only); > + extern BlockNumber _bt_suggestblock(Relation rel, IndexTuple itup, > + Relation heapRel); > > /* > * prototypes for functions in nbtpage.c > Index: src/include/catalog/pg_am.h > =================================================================== > RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/catalog/pg_am.h,v > retrieving revision 1.46 > diff -c -r1.46 pg_am.h > *** src/include/catalog/pg_am.h 31 Jul 2006 20:09:05 -0000 1.46 > --- src/include/catalog/pg_am.h 8 Aug 2006 16:17:21 -0000 > *************** > *** 65,70 **** > --- 65,71 ---- > regproc amvacuumcleanup; /* post-VACUUM cleanup function */ > regproc amcostestimate; /* estimate cost of an indexscan */ > regproc amoptions; /* parse AM-specific parameters */ > + regproc amsuggestblock; /* suggest a block where to put heap tuple */ > } FormData_pg_am; > > /* ---------------- > *************** > *** 78,84 **** > * compiler constants for pg_am > * ---------------- > */ > ! 
#define Natts_pg_am 23
> #define Anum_pg_am_amname 1
> #define Anum_pg_am_amstrategies 2
> #define Anum_pg_am_amsupport 3
> --- 79,85 ----
> * compiler constants for pg_am
> * ----------------
> */
> ! #define Natts_pg_am 24
> #define Anum_pg_am_amname 1
> #define Anum_pg_am_amstrategies 2
> #define Anum_pg_am_amsupport 3
> ***************
> *** 102,123 ****
> #define Anum_pg_am_amvacuumcleanup 21
> #define Anum_pg_am_amcostestimate 22
> #define Anum_pg_am_amoptions 23
> 
> /* ----------------
> * initial contents of pg_am
> * ----------------
> */
> 
> ! DATA(insert OID = 403 ( btree 5 1 1 t t t t f t btinsert btbeginscan btgettuple btgetmulti btrescan btendscan btmarkpos btrestrpos btbuild btbulkdelete btvacuumcleanup btcostestimate btoptions ));
> DESCR("b-tree index access method");
> #define BTREE_AM_OID 403
> ! DATA(insert OID = 405 ( hash 1 1 0 f f f f f f hashinsert hashbeginscan hashgettuple hashgetmulti hashrescan hashendscan hashmarkpos hashrestrpos hashbuild hashbulkdelete hashvacuumcleanup hashcostestimate hashoptions ));
> DESCR("hash index access method");
> #define HASH_AM_OID 405
> ! DATA(insert OID = 783 ( gist 100 7 0 f t t t t t gistinsert gistbeginscan gistgettuple gistgetmulti gistrescan gistendscan gistmarkpos gistrestrpos gistbuild gistbulkdelete gistvacuumcleanup gistcostestimate gistoptions ));
> DESCR("GiST index access method");
> #define GIST_AM_OID 783
> ! DATA(insert OID = 2742 ( gin 100 4 0 f f f f t f gininsert ginbeginscan gingettuple gingetmulti ginrescan ginendscan ginmarkpos ginrestrpos ginbuild ginbulkdelete ginvacuumcleanup gincostestimate ginoptions ));
> DESCR("GIN index access method");
> #define GIN_AM_OID 2742
> 
> --- 103,125 ----
> #define Anum_pg_am_amvacuumcleanup 21
> #define Anum_pg_am_amcostestimate 22
> #define Anum_pg_am_amoptions 23
> + #define Anum_pg_am_amsuggestblock 24
> 
> /* ----------------
> * initial contents of pg_am
> * ----------------
> */
> 
> ! 
DATA(insert OID = 403 ( btree 5 1 1 t t t t f t btinsert btbeginscan btgettuple btgetmulti btrescan btendscan btmarkpos btrestrpos btbuild btbulkdelete btvacuumcleanup btcostestimate btoptions btsuggestblock ));
> DESCR("b-tree index access method");
> #define BTREE_AM_OID 403
> ! DATA(insert OID = 405 ( hash 1 1 0 f f f f f f hashinsert hashbeginscan hashgettuple hashgetmulti hashrescan hashendscan hashmarkpos hashrestrpos hashbuild hashbulkdelete hashvacuumcleanup hashcostestimate hashoptions dummysuggestblock ));
> DESCR("hash index access method");
> #define HASH_AM_OID 405
> ! DATA(insert OID = 783 ( gist 100 7 0 f t t t t t gistinsert gistbeginscan gistgettuple gistgetmulti gistrescan gistendscan gistmarkpos gistrestrpos gistbuild gistbulkdelete gistvacuumcleanup gistcostestimate gistoptions dummysuggestblock ));
> DESCR("GiST index access method");
> #define GIST_AM_OID 783
> ! DATA(insert OID = 2742 ( gin 100 4 0 f f f f t f gininsert ginbeginscan gingettuple gingetmulti ginrescan ginendscan ginmarkpos ginrestrpos ginbuild ginbulkdelete ginvacuumcleanup gincostestimate ginoptions dummysuggestblock ));
> DESCR("GIN index access method");
> #define GIN_AM_OID 2742
> 
> Index: src/include/catalog/pg_proc.h
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/catalog/pg_proc.h,v
> retrieving revision 1.420
> diff -c -r1.420 pg_proc.h
> *** src/include/catalog/pg_proc.h 6 Aug 2006 03:53:44 -0000 1.420
> --- src/include/catalog/pg_proc.h 9 Aug 2006 18:06:44 -0000
> ***************
> *** 682,687 ****
> --- 682,689 ----
> DESCR("btree(internal)");
> DATA(insert OID = 2785 ( btoptions PGNSP PGUID 12 f f t f s 2 17 "1009 16" _null_ _null_ _null_ btoptions - _null_ ));
> DESCR("btree(internal)");
> + DATA(insert OID = 2852 ( btsuggestblock PGNSP PGUID 12 f f t f v 4 23 "2281 2281 2281 2281" _null_ _null_ _null_ btsuggestblock - _null_ ));
> + DESCR("btree(internal)");
> 
> DATA(insert OID = 339 ( 
poly_same PGNSP PGUID 12 f f t f i 2 16 "604 604" _null_ _null_ _null_ poly_same - _null_ ));
> DESCR("same as?");
> ***************
> *** 3936,3941 ****
> --- 3938,3946 ----
> DATA(insert OID = 2749 ( arraycontained PGNSP PGUID 12 f f t f i 2 16 "2277 2277" _null_ _null_ _null_ arraycontained - _null_ ));
> DESCR("anyarray contained");
> 
> + DATA(insert OID = 2853 ( dummysuggestblock PGNSP PGUID 12 f f t f v 4 23 "2281 2281 2281 2281" _null_ _null_ _null_ dummysuggestblock - _null_ ));
> + DESCR("dummy amsuggestblock implementation (internal)");
> 
> /*
> * Symbolic values for provolatile column: these indicate whether the result
> * of a function is dependent *only* on the values of its explicit arguments,
> Index: src/include/executor/executor.h
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/executor/executor.h,v
> retrieving revision 1.128
> diff -c -r1.128 executor.h
> *** src/include/executor/executor.h 4 Aug 2006 21:33:36 -0000 1.128
> --- src/include/executor/executor.h 8 Aug 2006 16:17:21 -0000
> ***************
> *** 271,276 ****
> --- 271,277 ----
> extern void ExecCloseIndices(ResultRelInfo *resultRelInfo);
> extern void ExecInsertIndexTuples(TupleTableSlot *slot, ItemPointer tupleid,
> EState *estate, bool is_vacuum);
> + extern BlockNumber ExecSuggestBlock(TupleTableSlot *slot, EState *estate);
> 
> extern void RegisterExprContextCallback(ExprContext *econtext,
> ExprContextCallbackFunction function,
> Index: src/include/nodes/execnodes.h
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/nodes/execnodes.h,v
> retrieving revision 1.158
> diff -c -r1.158 execnodes.h
> *** src/include/nodes/execnodes.h 4 Aug 2006 21:33:36 -0000 1.158
> --- src/include/nodes/execnodes.h 8 Aug 2006 16:17:21 -0000
> ***************
> *** 257,262 ****
> --- 257,264 ----
> * NumIndices # of indices existing on 
result relation
> * IndexRelationDescs array of relation descriptors for indices
> * IndexRelationInfo array of key/attr info for indices
> + * ClusterIndex index into the IndexRelationInfo array of the
> + * clustered index, or -1 if there's none
> * TrigDesc triggers to be fired, if any
> * TrigFunctions cached lookup info for trigger functions
> * TrigInstrument optional runtime measurements for triggers
> ***************
> *** 272,277 ****
> --- 274,280 ----
> int ri_NumIndices;
> RelationPtr ri_IndexRelationDescs;
> IndexInfo **ri_IndexRelationInfo;
> + int ri_ClusterIndex;
> TriggerDesc *ri_TrigDesc;
> FmgrInfo *ri_TrigFunctions;
> struct Instrumentation *ri_TrigInstrument;
> Index: src/include/utils/rel.h
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/utils/rel.h,v
> retrieving revision 1.91
> diff -c -r1.91 rel.h
> *** src/include/utils/rel.h 3 Jul 2006 22:45:41 -0000 1.91
> --- src/include/utils/rel.h 8 Aug 2006 16:17:21 -0000
> ***************
> *** 116,121 ****
> --- 116,122 ----
> FmgrInfo amvacuumcleanup;
> FmgrInfo amcostestimate;
> FmgrInfo amoptions;
> + FmgrInfo amsuggestblock;
> } RelationAmInfo;
> 
> -- 
>   Bruce Momjian   <bruce@momjian.us>   http://momjian.us
>   EnterpriseDB                         http://enterprisedb.com
> 
>   + If your life is a hard drive, Christ can be your backup. +
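For readers skimming the patch: the heap-side policy it wires into heap_insert and RelationGetBufferForTuple is tiny, so here is a toy, self-contained C sketch of just that decision. This is not code from the patch; the fixed-size freespace[] array and the helpers fsm_find()/choose_block() are made-up stand-ins for the real free space map and buffer lookup, kept only to show the "try the suggested block before the FSM" idea.

```c
#include <stdint.h>

typedef uint32_t BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

/* Toy model: bytes of free space on each heap page.  In the real code
 * this information lives in the free space map and the pages themselves. */
enum { NPAGES = 4 };
static int freespace[NPAGES] = {60, 10, 200, 200};

/* Stand-in for an FSM lookup: first page with enough room, or invalid
 * (where the real code would extend the relation). */
static BlockNumber fsm_find(int len)
{
    for (BlockNumber blk = 0; blk < NPAGES; blk++)
        if (freespace[blk] >= len)
            return blk;
    return InvalidBlockNumber;
}

/* The placement policy: honour the index am's suggestion when that page
 * has room for the tuple, otherwise fall back to the normal FSM search. */
static BlockNumber choose_block(BlockNumber suggested, int len)
{
    if (suggested != InvalidBlockNumber &&
        suggested < NPAGES &&
        freespace[suggested] >= len)
        return suggested;
    return fsm_find(len);
}
```

Note how an am whose amsuggestblock always returns InvalidBlockNumber (as dummysuggestblock does) degenerates to the plain FSM path, which is why the dummy is a safe default for hash, GiST and GIN.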