Thread: Maintaining cluster order on insert

From: Heikki Linnakangas

While thinking about index-organized tables and similar ideas, it
occurred to me that there's some low-hanging fruit: maintaining cluster
order on inserts by trying to place new heap tuples close to other
similar tuples. That involves asking the index am where on the heap the
new tuple should go, and trying to insert it there before falling back
to the FSM. Using the new fillfactor parameter makes it more likely that
there's room on the page. We don't worry about the order within the page.
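To illustrate the intended behaviour, here's a minimal, hypothetical sketch
in plain C of the page-selection rule described above (choose_target_block
and all of its parameters are invented for illustration; this is not the
patch's actual code):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

/*
 * Simplified model of the insertion-target choice: the suggested block
 * is accepted as long as the tuple fits at all (the fillfactor reserve
 * is ignored there, since keeping the table clustered is what that
 * reserve is for), while the cached last-insert block must also leave
 * the fillfactor reserve intact. InvalidBlockNumber means "fall through
 * to the FSM, or extend the relation".
 */
static BlockNumber
choose_target_block(BlockNumber suggested, size_t suggested_free,
                    BlockNumber cached, size_t cached_free,
                    size_t tuple_len, size_t fillfactor_reserve)
{
    if (suggested != InvalidBlockNumber && tuple_len <= suggested_free)
        return suggested;
    if (cached != InvalidBlockNumber &&
        tuple_len + fillfactor_reserve <= cached_free)
        return cached;
    return InvalidBlockNumber;
}
```

The asymmetry is deliberate: the fillfactor reserve exists precisely so
that later cluster-preserving inserts can consume it.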

The API I'm thinking of introduces a new optional index am function,
amsuggestblock (suggestions for a better name are welcome). It takes the
same parameters as aminsert, and returns the heap block number that
would be the optimal place to put the new tuple. It would be called from
ExecInsert before inserting the heap tuple, and the suggestion would be
passed on to heap_insert and RelationGetBufferForTuple.

I wrote a little patch to implement this for btree, attached.

This could be optimized by changing the existing aminsert API, because
as it is, an insert will have to descend the btree twice: once in
amsuggestblock and again in aminsert. amsuggestblock could keep the right
index page pinned so that aminsert could locate it more quickly, but I
wanted to keep this simple for now. Another improvement might be to allow
amsuggestblock to return a list of suggestions, but that makes inserts
more expensive when there isn't room in the suggested pages, since
heap_insert will have to try them all before giving up.
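As a rough sketch of that multi-suggestion variant (hypothetical;
try_suggestions and the toy free-space table are invented here and are
not part of the patch):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

/* Toy stand-in for reading each heap page's free space from its header. */
static size_t page_free_space[] = {10, 200, 40, 0, 120, 5, 60, 300};

/*
 * heap_insert would probe each suggested block in order and fall back
 * to the normal FSM/extend path once the list is exhausted. Every miss
 * costs an extra page read, which is exactly why a long suggestion list
 * makes failed inserts more expensive.
 */
static BlockNumber
try_suggestions(const BlockNumber *suggestions, int n, size_t tuple_len)
{
    for (int i = 0; i < n; i++)
        if (tuple_len <= page_free_space[suggestions[i]])
            return suggestions[i];
    return InvalidBlockNumber;  /* give up; use the FSM or extend */
}
```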

Comments regarding the general idea or the patch? There should probably
be an index option to turn the feature on and off. You'll want to turn it
off when you first load a table, and turn it on after CLUSTER to keep it
clustered.

Since there's been discussion on keeping the TODO list more up-to-date,
I hereby officially claim the "Automatically maintain clustering on a
table" TODO item :). Feel free to bombard me with requests for status
reports. And just to be clear, I'm not trying to sneak this into 8.2
anymore, this is 8.3 stuff.

I won't be implementing the background daemon described in the TODO item,
since that would essentially be an online version of CLUSTER. Which sure
would be nice, but that's a different story.

- Heikki

Index: doc/src/sgml/catalogs.sgml
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/doc/src/sgml/catalogs.sgml,v
retrieving revision 2.129
diff -c -r2.129 catalogs.sgml
*** doc/src/sgml/catalogs.sgml    31 Jul 2006 20:08:55 -0000    2.129
--- doc/src/sgml/catalogs.sgml    8 Aug 2006 16:17:21 -0000
***************
*** 499,504 ****
--- 499,511 ----
        <entry>Function to parse and validate reloptions for an index</entry>
       </row>

+      <row>
+       <entry><structfield>amsuggestblock</structfield></entry>
+       <entry><type>regproc</type></entry>
+       <entry><literal><link linkend="catalog-pg-proc"><structname>pg_proc</structname></link>.oid</literal></entry>
+       <entry>Get the best place in the heap to put a new tuple</entry>
+      </row>
+
      </tbody>
     </tgroup>
    </table>
Index: doc/src/sgml/indexam.sgml
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/doc/src/sgml/indexam.sgml,v
retrieving revision 2.16
diff -c -r2.16 indexam.sgml
*** doc/src/sgml/indexam.sgml    31 Jul 2006 20:08:59 -0000    2.16
--- doc/src/sgml/indexam.sgml    8 Aug 2006 17:15:25 -0000
***************
*** 391,396 ****
--- 391,414 ----
     <function>amoptions</> to test validity of options settings.
    </para>

+   <para>
+ <programlisting>
+ BlockNumber
+ amsuggestblock (Relation indexRelation,
+                 Datum *values,
+                 bool *isnull,
+                 Relation heapRelation);
+ </programlisting>
+    Gets the optimal place in the heap for a new tuple. The parameters
+    correspond to the parameters of <literal>aminsert</literal>.
+    This function is called on the clustered index before a new tuple
+    is inserted into the heap, and it should choose the optimal insertion
+    target page on the heap in such a manner that the heap stays as close
+    as possible to the index order.
+    <literal>amsuggestblock</literal> can return InvalidBlockNumber if
+    the index am doesn't have a suggestion.
+   </para>
+
   </sect1>

   <sect1 id="index-scanning">
Index: src/backend/access/heap/heapam.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/heap/heapam.c,v
retrieving revision 1.218
diff -c -r1.218 heapam.c
*** src/backend/access/heap/heapam.c    31 Jul 2006 20:08:59 -0000    1.218
--- src/backend/access/heap/heapam.c    8 Aug 2006 16:17:21 -0000
***************
*** 1325,1330 ****
--- 1325,1335 ----
   * use_fsm is passed directly to RelationGetBufferForTuple, which see for
   * more info.
   *
+  * suggested_blk can be set by the caller to hint to heap_insert which
+  * block would be the best place to put the new tuple. heap_insert can
+  * ignore the suggestion if there's not enough room on that block.
+  * InvalidBlockNumber means no preference.
+  *
   * The return value is the OID assigned to the tuple (either here or by the
   * caller), or InvalidOid if no OID.  The header fields of *tup are updated
   * to match the stored tuple; in particular tup->t_self receives the actual
***************
*** 1333,1339 ****
   */
  Oid
  heap_insert(Relation relation, HeapTuple tup, CommandId cid,
!             bool use_wal, bool use_fsm)
  {
      TransactionId xid = GetCurrentTransactionId();
      HeapTuple    heaptup;
--- 1338,1344 ----
   */
  Oid
  heap_insert(Relation relation, HeapTuple tup, CommandId cid,
!             bool use_wal, bool use_fsm, BlockNumber suggested_blk)
  {
      TransactionId xid = GetCurrentTransactionId();
      HeapTuple    heaptup;
***************
*** 1386,1392 ****

      /* Find buffer to insert this tuple into */
      buffer = RelationGetBufferForTuple(relation, heaptup->t_len,
!                                        InvalidBuffer, use_fsm);

      /* NO EREPORT(ERROR) from here till changes are logged */
      START_CRIT_SECTION();
--- 1391,1397 ----

      /* Find buffer to insert this tuple into */
      buffer = RelationGetBufferForTuple(relation, heaptup->t_len,
!                                        InvalidBuffer, use_fsm, suggested_blk);

      /* NO EREPORT(ERROR) from here till changes are logged */
      START_CRIT_SECTION();
***************
*** 1494,1500 ****
  Oid
  simple_heap_insert(Relation relation, HeapTuple tup)
  {
!     return heap_insert(relation, tup, GetCurrentCommandId(), true, true);
  }

  /*
--- 1499,1506 ----
  Oid
  simple_heap_insert(Relation relation, HeapTuple tup)
  {
!     return heap_insert(relation, tup, GetCurrentCommandId(), true,
!                        true, InvalidBlockNumber);
  }

  /*
***************
*** 2079,2085 ****
          {
              /* Assume there's no chance to put heaptup on same page. */
              newbuf = RelationGetBufferForTuple(relation, heaptup->t_len,
!                                                buffer, true);
          }
          else
          {
--- 2085,2092 ----
          {
              /* Assume there's no chance to put heaptup on same page. */
              newbuf = RelationGetBufferForTuple(relation, heaptup->t_len,
!                                                buffer, true,
!                                                InvalidBlockNumber);
          }
          else
          {
***************
*** 2096,2102 ****
                   */
                  LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
                  newbuf = RelationGetBufferForTuple(relation, heaptup->t_len,
!                                                    buffer, true);
              }
              else
              {
--- 2103,2110 ----
                   */
                  LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
                  newbuf = RelationGetBufferForTuple(relation, heaptup->t_len,
!                                                    buffer, true,
!                                                    InvalidBlockNumber);
              }
              else
              {
Index: src/backend/access/heap/hio.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/heap/hio.c,v
retrieving revision 1.63
diff -c -r1.63 hio.c
*** src/backend/access/heap/hio.c    3 Jul 2006 22:45:37 -0000    1.63
--- src/backend/access/heap/hio.c    9 Aug 2006 18:03:01 -0000
***************
*** 93,98 ****
--- 93,100 ----
   *    any committed data of other transactions.  (See heap_insert's comments
   *    for additional constraints needed for safe usage of this behavior.)
   *
+  *    If the caller has a suggestion, it's passed in suggestedBlock.
+  *
   *    We always try to avoid filling existing pages further than the fillfactor.
   *    This is OK since this routine is not consulted when updating a tuple and
   *    keeping it on the same page, which is the scenario fillfactor is meant
***************
*** 103,109 ****
   */
  Buffer
  RelationGetBufferForTuple(Relation relation, Size len,
!                           Buffer otherBuffer, bool use_fsm)
  {
      Buffer        buffer = InvalidBuffer;
      Page        pageHeader;
--- 105,112 ----
   */
  Buffer
  RelationGetBufferForTuple(Relation relation, Size len,
!                           Buffer otherBuffer, bool use_fsm,
!                           BlockNumber suggestedBlock)
  {
      Buffer        buffer = InvalidBuffer;
      Page        pageHeader;
***************
*** 135,142 ****
          otherBlock = InvalidBlockNumber;        /* just to keep compiler quiet */

      /*
!      * We first try to put the tuple on the same page we last inserted a tuple
!      * on, as cached in the relcache entry.  If that doesn't work, we ask the
       * shared Free Space Map to locate a suitable page.  Since the FSM's info
       * might be out of date, we have to be prepared to loop around and retry
       * multiple times.    (To insure this isn't an infinite loop, we must update
--- 138,147 ----
          otherBlock = InvalidBlockNumber;        /* just to keep compiler quiet */

      /*
!      * We first try to put the tuple on the page suggested by the caller,
!      * if any. Then we try to put the tuple on the same page we last
!      * inserted a tuple on, as cached in the relcache entry. If that
!      * doesn't work, we ask the
       * shared Free Space Map to locate a suitable page.  Since the FSM's info
       * might be out of date, we have to be prepared to loop around and retry
       * multiple times.    (To insure this isn't an infinite loop, we must update
***************
*** 144,152 ****
       * not to be suitable.)  If the FSM has no record of a page with enough
       * free space, we give up and extend the relation.
       *
!      * When use_fsm is false, we either put the tuple onto the existing target
!      * page or extend the relation.
       */
      if (len + saveFreeSpace <= MaxTupleSize)
          targetBlock = relation->rd_targblock;
      else
--- 149,167 ----
       * not to be suitable.)  If the FSM has no record of a page with enough
       * free space, we give up and extend the relation.
       *
!      * When use_fsm is false, we skip the FSM lookup if neither the suggested
!      * page nor the cached last-insertion page has enough room, and simply
!      * extend the relation.
!      *
!      * The fillfactor is taken into account when calculating the free space
!      * on the cached target block, and when using the FSM. The suggested page
!      * is used whenever there's enough room in it, regardless of the fillfactor,
!      * because that's exactly the purpose the space is reserved for in the
!      * first place.
       */
+     if (suggestedBlock != InvalidBlockNumber)
+         targetBlock = suggestedBlock;
+     else
      if (len + saveFreeSpace <= MaxTupleSize)
          targetBlock = relation->rd_targblock;
      else
***************
*** 219,224 ****
--- 234,244 ----
           */
          pageHeader = (Page) BufferGetPage(buffer);
          pageFreeSpace = PageGetFreeSpace(pageHeader);
+
+         /* If we're trying the suggested block, don't care about fillfactor */
+         if (targetBlock == suggestedBlock && len <= pageFreeSpace)
+             return buffer;
+
          if (len + saveFreeSpace <= pageFreeSpace)
          {
              /* use this page as future insert target, too */
***************
*** 241,246 ****
--- 261,275 ----
              ReleaseBuffer(buffer);
          }

+         /* If we just tried the suggested block, try the cached target
+          * block next, before consulting the FSM. */
+         if(suggestedBlock == targetBlock)
+         {
+             targetBlock = relation->rd_targblock;
+             suggestedBlock = InvalidBlockNumber;
+             continue;
+         }
+
          /* Without FSM, always fall out of the loop and extend */
          if (!use_fsm)
              break;
Index: src/backend/access/index/genam.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/index/genam.c,v
retrieving revision 1.58
diff -c -r1.58 genam.c
*** src/backend/access/index/genam.c    31 Jul 2006 20:08:59 -0000    1.58
--- src/backend/access/index/genam.c    8 Aug 2006 16:17:21 -0000
***************
*** 259,261 ****
--- 259,275 ----

      pfree(sysscan);
  }
+
+ /*
+  * This is a dummy implementation of amsuggestblock, to be used for index
+  * access methods that don't or can't support it. It just returns
+  * InvalidBlockNumber, which means "no preference".
+  *
+  * This is probably not the best place for this function, but it doesn't
+  * fit naturally anywhere else either.
+  */
+ Datum
+ dummysuggestblock(PG_FUNCTION_ARGS)
+ {
+     PG_RETURN_UINT32(InvalidBlockNumber);
+ }
Index: src/backend/access/index/indexam.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/index/indexam.c,v
retrieving revision 1.94
diff -c -r1.94 indexam.c
*** src/backend/access/index/indexam.c    31 Jul 2006 20:08:59 -0000    1.94
--- src/backend/access/index/indexam.c    8 Aug 2006 16:17:21 -0000
***************
*** 18,23 ****
--- 18,24 ----
   *        index_rescan    - restart a scan of an index
   *        index_endscan    - end a scan
   *        index_insert    - insert an index tuple into a relation
+  *        index_suggestblock    - get desired insert location for a heap tuple
   *        index_markpos    - mark a scan position
   *        index_restrpos    - restore a scan position
   *        index_getnext    - get the next tuple from a scan
***************
*** 202,207 ****
--- 203,237 ----
                                        BoolGetDatum(check_uniqueness)));
  }

+ /* ----------------
+  *        index_suggestblock - get desired insert location for a heap tuple
+  *
+  * The returned BlockNumber is the *heap* page that is the best place
+  * to insert the given tuple to, according to the index am. The best
+  * place is usually one that maintains the cluster order.
+  * ----------------
+  */
+ BlockNumber
+ index_suggestblock(Relation indexRelation,
+                    Datum *values,
+                    bool *isnull,
+                    Relation heapRelation)
+ {
+     FmgrInfo   *procedure;
+
+     RELATION_CHECKS;
+     GET_REL_PROCEDURE(amsuggestblock);
+
+     /*
+      * have the am's suggestblock proc do all the work.
+      */
+     return DatumGetUInt32(FunctionCall4(procedure,
+                                       PointerGetDatum(indexRelation),
+                                       PointerGetDatum(values),
+                                       PointerGetDatum(isnull),
+                                       PointerGetDatum(heapRelation)));
+ }
+
  /*
   * index_beginscan - start a scan of an index with amgettuple
   *
Index: src/backend/access/nbtree/nbtinsert.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/nbtree/nbtinsert.c,v
retrieving revision 1.142
diff -c -r1.142 nbtinsert.c
*** src/backend/access/nbtree/nbtinsert.c    25 Jul 2006 19:13:00 -0000    1.142
--- src/backend/access/nbtree/nbtinsert.c    9 Aug 2006 17:51:33 -0000
***************
*** 146,151 ****
--- 146,221 ----
  }

  /*
+  *    _bt_suggestblock() -- Find the heap block of the closest index tuple.
+  *
+  * The logic to find the target should match _bt_doinsert, otherwise
+  * we'll be making bad suggestions.
+  */
+ BlockNumber
+ _bt_suggestblock(Relation rel, IndexTuple itup, Relation heapRel)
+ {
+     int            natts = rel->rd_rel->relnatts;
+     OffsetNumber offset;
+     Page        page;
+     BTPageOpaque opaque;
+
+     ScanKey        itup_scankey;
+     BTStack        stack;
+     Buffer        buf;
+     IndexTuple    curitup;
+     BlockNumber suggestion = InvalidBlockNumber;
+
+     /* we need an insertion scan key to do our search, so build one */
+     itup_scankey = _bt_mkscankey(rel, itup);
+
+     /* find the first page containing this key */
+     stack = _bt_search(rel, natts, itup_scankey, false, &buf, BT_READ);
+     if (!BufferIsValid(buf))
+     {
+         _bt_freeskey(itup_scankey);     /* index was empty: no suggestion */
+         return InvalidBlockNumber;
+     }
+     /* we don't need the stack, so free it right away */
+     _bt_freestack(stack);
+
+     page = BufferGetPage(buf);
+     opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+
+     /* Find the location in the page where the new index tuple would go to. */
+
+     offset = _bt_binsrch(rel, buf, natts, itup_scankey, false);
+     if (offset > PageGetMaxOffsetNumber(page))
+     {
+         /* _bt_binsrch returned a pointer to end-of-page. That means
+          * there were no equal items on the page, and the new item would
+          * be inserted as the last tuple of the page. There could be equal
+          * items on the next page, however.
+          *
+          * At the moment, we just ignore any potential equal items to the
+          * right, and pretend there aren't any. We could instead walk right
+          * to the next page to check that, but let's keep it simple for now.
+          */
+         offset = OffsetNumberPrev(offset);
+     }
+     if(offset < P_FIRSTDATAKEY(opaque))
+     {
+         /* We landed on an empty page. We could step left or right until
+          * we find some items, but let's keep it simple for now.
+          */
+     } else {
+         /* We're now positioned at the index tuple that we're interested in. */
+
+         curitup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offset));
+         suggestion = ItemPointerGetBlockNumber(&curitup->t_tid);
+     }
+
+     _bt_relbuf(rel, buf);
+     _bt_freeskey(itup_scankey);
+
+     return suggestion;
+ }
+
+ /*
   *    _bt_check_unique() -- Check for violation of unique index constraint
   *
   * Returns InvalidTransactionId if there is no conflict, else an xact ID
Index: src/backend/access/nbtree/nbtree.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/nbtree/nbtree.c,v
retrieving revision 1.149
diff -c -r1.149 nbtree.c
*** src/backend/access/nbtree/nbtree.c    10 May 2006 23:18:39 -0000    1.149
--- src/backend/access/nbtree/nbtree.c    9 Aug 2006 18:04:02 -0000
***************
*** 228,233 ****
--- 228,265 ----
  }

  /*
+  *    btsuggestblock() -- find the best place in the heap to put a new tuple.
+  *
+  *        This uses the same logic as btinsert to find the place where the
+  *        index tuple would go if this were a btinsert call.
+  *
+  *        There's room for improvement here. An insert operation will descend
+  *        the tree twice, first by btsuggestblock, then by btinsert. Things
+  *        might have changed in between, so that the heap tuple is actually
+  *        not inserted on the optimal page, but since this is just an
+  *        optimization, it's OK if that happens sometimes.
+  */
+ Datum
+ btsuggestblock(PG_FUNCTION_ARGS)
+ {
+     Relation    rel = (Relation) PG_GETARG_POINTER(0);
+     Datum       *values = (Datum *) PG_GETARG_POINTER(1);
+     bool       *isnull = (bool *) PG_GETARG_POINTER(2);
+     Relation    heapRel = (Relation) PG_GETARG_POINTER(3);
+     IndexTuple    itup;
+     BlockNumber suggestion;
+
+     /* generate an index tuple */
+     itup = index_form_tuple(RelationGetDescr(rel), values, isnull);
+
+     suggestion = _bt_suggestblock(rel, itup, heapRel);
+
+     pfree(itup);
+
+     PG_RETURN_UINT32(suggestion);
+ }
+
+ /*
   *    btgettuple() -- Get the next tuple in the scan.
   */
  Datum
Index: src/backend/executor/execMain.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/executor/execMain.c,v
retrieving revision 1.277
diff -c -r1.277 execMain.c
*** src/backend/executor/execMain.c    31 Jul 2006 01:16:37 -0000    1.277
--- src/backend/executor/execMain.c    8 Aug 2006 16:17:21 -0000
***************
*** 892,897 ****
--- 892,898 ----
      resultRelInfo->ri_RangeTableIndex = resultRelationIndex;
      resultRelInfo->ri_RelationDesc = resultRelationDesc;
      resultRelInfo->ri_NumIndices = 0;
+     resultRelInfo->ri_ClusterIndex = -1;
      resultRelInfo->ri_IndexRelationDescs = NULL;
      resultRelInfo->ri_IndexRelationInfo = NULL;
      /* make a copy so as not to depend on relcache info not changing... */
***************
*** 1388,1394 ****
          heap_insert(estate->es_into_relation_descriptor, tuple,
                      estate->es_snapshot->curcid,
                      estate->es_into_relation_use_wal,
!                     false);        /* never any point in using FSM */
          /* we know there are no indexes to update */
          heap_freetuple(tuple);
          IncrAppended();
--- 1389,1396 ----
          heap_insert(estate->es_into_relation_descriptor, tuple,
                      estate->es_snapshot->curcid,
                      estate->es_into_relation_use_wal,
!                     false, /* never any point in using FSM */
!                     InvalidBlockNumber);
          /* we know there are no indexes to update */
          heap_freetuple(tuple);
          IncrAppended();
***************
*** 1419,1424 ****
--- 1421,1427 ----
      ResultRelInfo *resultRelInfo;
      Relation    resultRelationDesc;
      Oid            newId;
+     BlockNumber suggestedBlock;

      /*
       * get the heap tuple out of the tuple table slot, making sure we have a
***************
*** 1467,1472 ****
--- 1470,1479 ----
      if (resultRelationDesc->rd_att->constr)
          ExecConstraints(resultRelInfo, slot, estate);

+     /* Ask the index am of the clustered index for the
+      * best place to put it */
+     suggestedBlock = ExecSuggestBlock(slot, estate);
+
      /*
       * insert the tuple
       *
***************
*** 1475,1481 ****
       */
      newId = heap_insert(resultRelationDesc, tuple,
                          estate->es_snapshot->curcid,
!                         true, true);

      IncrAppended();
      (estate->es_processed)++;
--- 1482,1488 ----
       */
      newId = heap_insert(resultRelationDesc, tuple,
                          estate->es_snapshot->curcid,
!                         true, true, suggestedBlock);

      IncrAppended();
      (estate->es_processed)++;
Index: src/backend/executor/execUtils.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/executor/execUtils.c,v
retrieving revision 1.139
diff -c -r1.139 execUtils.c
*** src/backend/executor/execUtils.c    4 Aug 2006 21:33:36 -0000    1.139
--- src/backend/executor/execUtils.c    9 Aug 2006 18:05:05 -0000
***************
*** 31,36 ****
--- 31,37 ----
   *        ExecOpenIndices            \
   *        ExecCloseIndices         | referenced by InitPlan, EndPlan,
   *        ExecInsertIndexTuples    /  ExecInsert, ExecUpdate
+  *        ExecSuggestBlock        Referenced by ExecInsert
   *
   *        RegisterExprContextCallback    Register function shutdown callback
   *        UnregisterExprContextCallback  Deregister function shutdown callback
***************
*** 874,879 ****
--- 875,881 ----
      IndexInfo **indexInfoArray;

      resultRelInfo->ri_NumIndices = 0;
+     resultRelInfo->ri_ClusterIndex = -1;

      /* fast path if no indexes */
      if (!RelationGetForm(resultRelation)->relhasindex)
***************
*** 913,918 ****
--- 915,925 ----
          /* extract index key information from the index's pg_index info */
          ii = BuildIndexInfo(indexDesc);

+         /* Remember which index is the clustered one.
+          * It's used to call the suggestblock method on inserts. */
+         if(indexDesc->rd_index->indisclustered)
+             resultRelInfo->ri_ClusterIndex = i;
+
          relationDescs[i] = indexDesc;
          indexInfoArray[i] = ii;
          i++;
***************
*** 1062,1067 ****
--- 1069,1137 ----
      }
  }

+ /* ----------------------------------------------------------------
+  *        ExecSuggestBlock
+  *
+  *        This routine asks the index am where a new heap tuple
+  *        should be placed.
+  * ----------------------------------------------------------------
+  */
+ BlockNumber
+ ExecSuggestBlock(TupleTableSlot *slot,
+                  EState *estate)
+ {
+     ResultRelInfo *resultRelInfo;
+     int            i;
+     Relation    relationDesc;
+     Relation    heapRelation;
+     ExprContext *econtext;
+     Datum        values[INDEX_MAX_KEYS];
+     bool        isnull[INDEX_MAX_KEYS];
+     IndexInfo  *indexInfo;
+
+     /*
+      * Get information from the result relation info structure.
+      */
+     resultRelInfo = estate->es_result_relation_info;
+     i = resultRelInfo->ri_ClusterIndex;
+     if(i == -1)
+         return InvalidBlockNumber; /* there was no clustered index */
+
+     heapRelation = resultRelInfo->ri_RelationDesc;
+     relationDesc = resultRelInfo->ri_IndexRelationDescs[i];
+     indexInfo = resultRelInfo->ri_IndexRelationInfo[i];
+
+     /* You can't cluster on a partial index */
+     Assert(indexInfo->ii_Predicate == NIL);
+
+     /*
+      * We will use the EState's per-tuple context for evaluating
+      * index expressions (creating it if it's not already there).
+      */
+     econtext = GetPerTupleExprContext(estate);
+
+     /* Arrange for econtext's scan tuple to be the tuple under test */
+     econtext->ecxt_scantuple = slot;
+
+     /*
+      * FormIndexDatum fills in its values and isnull parameters with the
+      * appropriate values for the column(s) of the index.
+      */
+     FormIndexDatum(indexInfo,
+                    slot,
+                    estate,
+                    values,
+                    isnull);
+
+     /*
+      * The index AM does the rest.
+      */
+     return index_suggestblock(relationDesc,    /* index relation */
+                  values,    /* array of index Datums */
+                  isnull,    /* null flags */
+                  heapRelation);
+ }
+
  /*
   * UpdateChangedParamSet
   *        Add changed parameters to a plan node's chgParam set
Index: src/include/access/genam.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/access/genam.h,v
retrieving revision 1.65
diff -c -r1.65 genam.h
*** src/include/access/genam.h    31 Jul 2006 20:09:05 -0000    1.65
--- src/include/access/genam.h    9 Aug 2006 17:53:44 -0000
***************
*** 93,98 ****
--- 93,101 ----
               ItemPointer heap_t_ctid,
               Relation heapRelation,
               bool check_uniqueness);
+ extern BlockNumber index_suggestblock(Relation indexRelation,
+              Datum *values, bool *isnull,
+              Relation heapRelation);

  extern IndexScanDesc index_beginscan(Relation heapRelation,
                  Relation indexRelation,
***************
*** 123,128 ****
--- 126,133 ----
  extern FmgrInfo *index_getprocinfo(Relation irel, AttrNumber attnum,
                    uint16 procnum);

+ extern Datum dummysuggestblock(PG_FUNCTION_ARGS);
+
  /*
   * index access method support routines (in genam.c)
   */
Index: src/include/access/heapam.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/access/heapam.h,v
retrieving revision 1.114
diff -c -r1.114 heapam.h
*** src/include/access/heapam.h    3 Jul 2006 22:45:39 -0000    1.114
--- src/include/access/heapam.h    8 Aug 2006 16:17:21 -0000
***************
*** 156,162 ****
  extern void setLastTid(const ItemPointer tid);

  extern Oid heap_insert(Relation relation, HeapTuple tup, CommandId cid,
!             bool use_wal, bool use_fsm);
  extern HTSU_Result heap_delete(Relation relation, ItemPointer tid,
              ItemPointer ctid, TransactionId *update_xmax,
              CommandId cid, Snapshot crosscheck, bool wait);
--- 156,162 ----
  extern void setLastTid(const ItemPointer tid);

  extern Oid heap_insert(Relation relation, HeapTuple tup, CommandId cid,
!             bool use_wal, bool use_fsm, BlockNumber suggestedblk);
  extern HTSU_Result heap_delete(Relation relation, ItemPointer tid,
              ItemPointer ctid, TransactionId *update_xmax,
              CommandId cid, Snapshot crosscheck, bool wait);
Index: src/include/access/hio.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/access/hio.h,v
retrieving revision 1.32
diff -c -r1.32 hio.h
*** src/include/access/hio.h    13 Jul 2006 17:47:01 -0000    1.32
--- src/include/access/hio.h    8 Aug 2006 16:17:21 -0000
***************
*** 21,26 ****
  extern void RelationPutHeapTuple(Relation relation, Buffer buffer,
                       HeapTuple tuple);
  extern Buffer RelationGetBufferForTuple(Relation relation, Size len,
!                     Buffer otherBuffer, bool use_fsm);

  #endif   /* HIO_H */
--- 21,26 ----
  extern void RelationPutHeapTuple(Relation relation, Buffer buffer,
                       HeapTuple tuple);
  extern Buffer RelationGetBufferForTuple(Relation relation, Size len,
!                     Buffer otherBuffer, bool use_fsm, BlockNumber suggestedblk);

  #endif   /* HIO_H */
Index: src/include/access/nbtree.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/access/nbtree.h,v
retrieving revision 1.103
diff -c -r1.103 nbtree.h
*** src/include/access/nbtree.h    7 Aug 2006 16:57:57 -0000    1.103
--- src/include/access/nbtree.h    8 Aug 2006 16:17:21 -0000
***************
*** 467,472 ****
--- 467,473 ----
  extern Datum btbulkdelete(PG_FUNCTION_ARGS);
  extern Datum btvacuumcleanup(PG_FUNCTION_ARGS);
  extern Datum btoptions(PG_FUNCTION_ARGS);
+ extern Datum btsuggestblock(PG_FUNCTION_ARGS);

  /*
   * prototypes for functions in nbtinsert.c
***************
*** 476,481 ****
--- 477,484 ----
  extern Buffer _bt_getstackbuf(Relation rel, BTStack stack, int access);
  extern void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
                    BTStack stack, bool is_root, bool is_only);
+ extern BlockNumber _bt_suggestblock(Relation rel, IndexTuple itup,
+              Relation heapRel);

  /*
   * prototypes for functions in nbtpage.c
Index: src/include/catalog/pg_am.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/catalog/pg_am.h,v
retrieving revision 1.46
diff -c -r1.46 pg_am.h
*** src/include/catalog/pg_am.h    31 Jul 2006 20:09:05 -0000    1.46
--- src/include/catalog/pg_am.h    8 Aug 2006 16:17:21 -0000
***************
*** 65,70 ****
--- 65,71 ----
      regproc        amvacuumcleanup;    /* post-VACUUM cleanup function */
      regproc        amcostestimate; /* estimate cost of an indexscan */
      regproc        amoptions;        /* parse AM-specific parameters */
+     regproc        amsuggestblock;    /* suggest a block where to put heap tuple */
  } FormData_pg_am;

  /* ----------------
***************
*** 78,84 ****
   *        compiler constants for pg_am
   * ----------------
   */
! #define Natts_pg_am                        23
  #define Anum_pg_am_amname                1
  #define Anum_pg_am_amstrategies            2
  #define Anum_pg_am_amsupport            3
--- 79,85 ----
   *        compiler constants for pg_am
   * ----------------
   */
! #define Natts_pg_am                        24
  #define Anum_pg_am_amname                1
  #define Anum_pg_am_amstrategies            2
  #define Anum_pg_am_amsupport            3
***************
*** 102,123 ****
  #define Anum_pg_am_amvacuumcleanup        21
  #define Anum_pg_am_amcostestimate        22
  #define Anum_pg_am_amoptions            23

  /* ----------------
   *        initial contents of pg_am
   * ----------------
   */

! DATA(insert OID = 403 (  btree    5 1 1 t t t t f t btinsert btbeginscan btgettuple btgetmulti btrescan btendscan btmarkpos btrestrpos btbuild btbulkdelete btvacuumcleanup btcostestimate btoptions )); 
  DESCR("b-tree index access method");
  #define BTREE_AM_OID 403
! DATA(insert OID = 405 (  hash    1 1 0 f f f f f f hashinsert hashbeginscan hashgettuple hashgetmulti hashrescan hashendscan hashmarkpos hashrestrpos hashbuild hashbulkdelete hashvacuumcleanup hashcostestimate hashoptions )); 
  DESCR("hash index access method");
  #define HASH_AM_OID 405
! DATA(insert OID = 783 (  gist    100 7 0 f t t t t t gistinsert gistbeginscan gistgettuple gistgetmulti gistrescan gistendscan gistmarkpos gistrestrpos gistbuild gistbulkdelete gistvacuumcleanup gistcostestimate gistoptions )); 
  DESCR("GiST index access method");
  #define GIST_AM_OID 783
! DATA(insert OID = 2742 (  gin    100 4 0 f f f f t f gininsert ginbeginscan gingettuple gingetmulti ginrescan ginendscan ginmarkpos ginrestrpos ginbuild ginbulkdelete ginvacuumcleanup gincostestimate ginoptions )); 
  DESCR("GIN index access method");
  #define GIN_AM_OID 2742

--- 103,125 ----
  #define Anum_pg_am_amvacuumcleanup        21
  #define Anum_pg_am_amcostestimate        22
  #define Anum_pg_am_amoptions            23
+ #define Anum_pg_am_amsuggestblock        24

  /* ----------------
   *        initial contents of pg_am
   * ----------------
   */

! DATA(insert OID = 403 (  btree    5 1 1 t t t t f t btinsert btbeginscan btgettuple btgetmulti btrescan btendscan btmarkpos btrestrpos btbuild btbulkdelete btvacuumcleanup btcostestimate btoptions btsuggestblock )); 
  DESCR("b-tree index access method");
  #define BTREE_AM_OID 403
! DATA(insert OID = 405 (  hash    1 1 0 f f f f f f hashinsert hashbeginscan hashgettuple hashgetmulti hashrescan hashendscan hashmarkpos hashrestrpos hashbuild hashbulkdelete hashvacuumcleanup hashcostestimate hashoptions dummysuggestblock )); 
  DESCR("hash index access method");
  #define HASH_AM_OID 405
! DATA(insert OID = 783 (  gist    100 7 0 f t t t t t gistinsert gistbeginscan gistgettuple gistgetmulti gistrescan gistendscan gistmarkpos gistrestrpos gistbuild gistbulkdelete gistvacuumcleanup gistcostestimate gistoptions dummysuggestblock )); 
  DESCR("GiST index access method");
  #define GIST_AM_OID 783
! DATA(insert OID = 2742 (  gin    100 4 0 f f f f t f gininsert ginbeginscan gingettuple gingetmulti ginrescan ginendscan ginmarkpos ginrestrpos ginbuild ginbulkdelete ginvacuumcleanup gincostestimate ginoptions dummysuggestblock )); 
  DESCR("GIN index access method");
  #define GIN_AM_OID 2742

Index: src/include/catalog/pg_proc.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/catalog/pg_proc.h,v
retrieving revision 1.420
diff -c -r1.420 pg_proc.h
*** src/include/catalog/pg_proc.h    6 Aug 2006 03:53:44 -0000    1.420
--- src/include/catalog/pg_proc.h    9 Aug 2006 18:06:44 -0000
***************
*** 682,687 ****
--- 682,689 ----
  DESCR("btree(internal)");
  DATA(insert OID = 2785 (  btoptions           PGNSP PGUID 12 f f t f s 2 17 "1009 16" _null_ _null_ _null_  btoptions - _null_ )); 
  DESCR("btree(internal)");
+ DATA(insert OID = 2852 (  btsuggestblock   PGNSP PGUID 12 f f t f v 4 23 "2281 2281 2281 2281" _null_ _null_ _null_ btsuggestblock - _null_ )); 
+ DESCR("btree(internal)");

  DATA(insert OID = 339 (  poly_same           PGNSP PGUID 12 f f t f i 2 16 "604 604" _null_ _null_ _null_ poly_same - _null_ )); 
  DESCR("same as?");
***************
*** 3936,3941 ****
--- 3938,3946 ----
  DATA(insert OID = 2749 (  arraycontained       PGNSP PGUID 12 f f t f i 2 16 "2277 2277" _null_ _null_ _null_ arraycontained - _null_ )); 
  DESCR("anyarray contained");

+ DATA(insert OID = 2853 (  dummysuggestblock   PGNSP PGUID 12 f f t f v 4 23 "2281 2281 2281 2281" _null_ _null_ _null_ dummysuggestblock - _null_ )); 
+ DESCR("dummy amsuggestblock implementation (internal)");
+
  /*
   * Symbolic values for provolatile column: these indicate whether the result
   * of a function is dependent *only* on the values of its explicit arguments,
Index: src/include/executor/executor.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/executor/executor.h,v
retrieving revision 1.128
diff -c -r1.128 executor.h
*** src/include/executor/executor.h    4 Aug 2006 21:33:36 -0000    1.128
--- src/include/executor/executor.h    8 Aug 2006 16:17:21 -0000
***************
*** 271,276 ****
--- 271,277 ----
  extern void ExecCloseIndices(ResultRelInfo *resultRelInfo);
  extern void ExecInsertIndexTuples(TupleTableSlot *slot, ItemPointer tupleid,
                        EState *estate, bool is_vacuum);
+ extern BlockNumber ExecSuggestBlock(TupleTableSlot *slot, EState *estate);

  extern void RegisterExprContextCallback(ExprContext *econtext,
                              ExprContextCallbackFunction function,
Index: src/include/nodes/execnodes.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/nodes/execnodes.h,v
retrieving revision 1.158
diff -c -r1.158 execnodes.h
*** src/include/nodes/execnodes.h    4 Aug 2006 21:33:36 -0000    1.158
--- src/include/nodes/execnodes.h    8 Aug 2006 16:17:21 -0000
***************
*** 257,262 ****
--- 257,264 ----
   *        NumIndices                # of indices existing on result relation
   *        IndexRelationDescs        array of relation descriptors for indices
   *        IndexRelationInfo        array of key/attr info for indices
+  *        ClusterIndex            index to the IndexRelationInfo array of the
+  *                                clustered index, or -1 if there's none
   *        TrigDesc                triggers to be fired, if any
   *        TrigFunctions            cached lookup info for trigger functions
   *        TrigInstrument            optional runtime measurements for triggers
***************
*** 272,277 ****
--- 274,280 ----
      int            ri_NumIndices;
      RelationPtr ri_IndexRelationDescs;
      IndexInfo **ri_IndexRelationInfo;
+     int         ri_ClusterIndex;
      TriggerDesc *ri_TrigDesc;
      FmgrInfo   *ri_TrigFunctions;
      struct Instrumentation *ri_TrigInstrument;
Index: src/include/utils/rel.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/utils/rel.h,v
retrieving revision 1.91
diff -c -r1.91 rel.h
*** src/include/utils/rel.h    3 Jul 2006 22:45:41 -0000    1.91
--- src/include/utils/rel.h    8 Aug 2006 16:17:21 -0000
***************
*** 116,121 ****
--- 116,122 ----
      FmgrInfo    amvacuumcleanup;
      FmgrInfo    amcostestimate;
      FmgrInfo    amoptions;
+     FmgrInfo    amsuggestblock;
  } RelationAmInfo;



Re: Maintaining cluster order on insert

From
Tom Lane
Date:
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> While thinking about index-organized-tables and similar ideas, it
> occurred to me that there's some low-hanging-fruit: maintaining cluster
> order on inserts by trying to place new heap tuples close to other
> similar tuples.

Doesn't this happen for free if there's enough space?  UPDATE tries to
place the new tuple on the same page it's already on.  In practice
people are only likely to cluster by primary keys (or other things that
seldom change) so I don't particularly agree with inventing a large wart
to support "optimally" placing things somewhere else...

            regards, tom lane

Re: Maintaining cluster order on insert

From
"Jonah H. Harris"
Date:
On 8/9/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> UPDATE tries to place the new tuple on the same page it's already
> on.

I think he meant for INSERT.

--
Jonah H. Harris, Software Architect | phone: 732.331.1300
EnterpriseDB Corporation            | fax: 732.331.1301
33 Wood Ave S, 2nd Floor            | jharris@enterprisedb.com
Iselin, New Jersey 08830            | http://www.enterprisedb.com/

Re: [HACKERS] Maintaining cluster order on insert

From
Gene
Date:
I have a table that gets lots of inserts (a million+ rows per day) with an int8 primary key, and I cluster by a timestamp that is approximately the timestamp of the insert, so it is in increasing order and doesn't change. Most of the rows are updated about 3 times, roughly within the next 30 minutes. Should I assume that all of these updates will end up on separate pages unless I perform a CLUSTER (which takes a long time), and that performance will suffer because of this? Is it possible to preallocate space on the same page for future updates (whatever the average number of updates per row may be), decreasing the number of page loads for querying?

Gene

On 8/9/06, Jonah H. Harris <jonah.harris@gmail.com> wrote:
On 8/9/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> UPDATE tries to place the new tuple on the same page it's already
> on.

I think he meant for INSERT.

--
Jonah H. Harris, Software Architect | phone: 732.331.1300
EnterpriseDB Corporation            | fax: 732.331.1301
33 Wood Ave S, 2nd Floor            | jharris@enterprisedb.com
Iselin, New Jersey 08830            | http://www.enterprisedb.com/




--
Eugene Hart

Re: [HACKERS] Maintaining cluster order on insert

From
Tom Lane
Date:
Gene <genekhart@gmail.com> writes:
> I have a table that inserts lots of rows (million+ per day) int8 as primary
> key, and I cluster by a timestamp which is approximately the timestamp of
> the insert beforehand and is therefore in increasing order and doesn't
> change. Most of the rows are updated about 3 times over time roughly within
> the next 30 minutes.

ISTM you should hardly need to worry about clustering that --- the data
will be in timestamp order pretty naturally.

The main problem you're going to have is the update-3-times bit.  You
could keep updated rows on the same page as the original if you ran the
table at fillfactor 25% (which you'll be able to do in 8.2) ... but
while this might be sane for the leading edge of the table, you hardly
want such low storage density in the stable part.

You could reduce the fillfactor requirement if you could vacuum the
table constantly (every 10 minutes or so) but I assume the table is
large enough to make that unattractive.  (Eventually we should have
a version of vacuum that understands where the dirty stuff is, which
might make this approach tenable ... but not in 8.2.)

Your best bet might be to partition the table into two subtables, one
with "stable" data and one with the fresh data, and transfer rows from
one to the other once they get stable.  Storage density in the "fresh"
part would be poor, but it should be small enough you don't care.
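A rough sketch of that two-table layout (all names are illustrative, and the 25% figure is just the number mentioned above; `WITH (fillfactor = ...)` is the 8.2 storage-parameter syntax):

```sql
-- Sketch only: table and column names invented for illustration.
-- Fresh partition: low fillfactor leaves room on each page for the
-- handful of updates every row receives shortly after insert.
CREATE TABLE events_fresh (
    id      int8 PRIMARY KEY,
    ts      timestamp NOT NULL,
    payload text
) WITH (fillfactor = 25);

-- Stable partition: packed tightly, since rows here no longer change.
CREATE TABLE events_stable (LIKE events_fresh) WITH (fillfactor = 100);

-- Periodically move rows that have settled down:
INSERT INTO events_stable
    SELECT * FROM events_fresh WHERE ts < now() - interval '1 hour';
DELETE FROM events_fresh WHERE ts < now() - interval '1 hour';
```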

            regards, tom lane

Re: [HACKERS] Maintaining cluster order on insert

From
Gene
Date:
You are correct, the main part I'm worried about is the updates being so far from the originals. FYI, I am partitioning the tables by the timestamp column, vacuum-analyzing once per hour, and creating one child partition per day in a cron job. Because I'm using Hibernate for database abstraction (stateless sessions), I can only have one RULE: having more than one insert rule will not return the correct number of updated rows, which confuses Hibernate. The one rule just directs inserts to the latest child partition, which seems to work well.
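(For concreteness, the setup described above might look something like this; all names and dates are made up:)

```sql
-- Hypothetical sketch of the partitioning scheme described above.
CREATE TABLE log (id int8 PRIMARY KEY, ts timestamp NOT NULL, status text);

-- One child partition per day, created by the cron job:
CREATE TABLE log_20060809 (
    CHECK (ts >= '2006-08-09' AND ts < '2006-08-10')
) INHERITS (log);

-- The single rule: redirect all inserts to the newest child, so that
-- Hibernate still sees the expected row count from its one statement.
CREATE OR REPLACE RULE log_insert AS ON INSERT TO log
    DO INSTEAD INSERT INTO log_20060809 VALUES (NEW.id, NEW.ts, NEW.status);
```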

The reason I'm doing the clustering is that I was hoping that, with the "stable" non-updating partitions, I could execute a CLUSTER at night (slow...) and it would compact the tables into their most efficient state for querying, which always involves a date range. Bad idea? With this fillfactor feature, will you be able to set it to 100% once you know that no more updates will occur? Or will doing a CLUSTER effectively do this? Will the fillfactor only apply to inserts?

"Your best bet might be to partition the table into two subtables, one
with "stable" data and one with the fresh data, and transfer rows from
one to the other once they get stable.  Storage density in the "fresh"
part would be poor, but it should be small enough you don't care."

This sounds interesting. I could create a RULE/INSERT on the unstable table; I will know during the update if it is ready to be put in the stable table. What would be an efficient way to do the transfer? Since the updates occur somewhat randomly, wouldn't the tuples in the stable table then be out of natural timestamp order?

thanks for all of your help and comments! it is greatly appreciated!
Gene Hart

On 8/9/06, Tom Lane <tgl@sss.pgh.pa.us > wrote:
Gene <genekhart@gmail.com> writes:
> I have a table that inserts lots of rows (million+ per day) int8 as primary
> key, and I cluster by a timestamp which is approximately the timestamp of
> the insert beforehand and is therefore in increasing order and doesn't
> change. Most of the rows are updated about 3 times over time roughly within
> the next 30 minutes.

ISTM you should hardly need to worry about clustering that --- the data
will be in timestamp order pretty naturally.

The main problem you're going to have is the update-3-times bit.  You
could keep updated rows on the same page as the original if you ran the
table at fillfactor 25% (which you'll be able to do in 8.2) ... but
while this might be sane for the leading edge of the table, you hardly
want such low storage density in the stable part.

You could reduce the fillfactor requirement if you could vacuum the
table constantly (every 10 minutes or so) but I assume the table is
large enough to make that unattractive.  (Eventually we should have
a version of vacuum that understands where the dirty stuff is, which
might make this approach tenable ... but not in 8.2.)

Your best bet might be to partition the table into two subtables, one
with "stable" data and one with the fresh data, and transfer rows from
one to the other once they get stable.  Storage density in the "fresh"
part would be poor, but it should be small enough you don't care.

                        regards, tom lane



--
Eugene Hart

Re: Maintaining cluster order on insert

From
Heikki Linnakangas
Date:
Jonah H. Harris wrote:
> On 8/9/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> UPDATE tries to place the new tuple on the same page it's already
>> on.
>
> I think he meant for INSERT.
>

Right. Update is indeed taken care of already.

One example where this would help would be a customer_history table that
stores all transactions of a customer. Primary key is (customer_id,
transaction_id). You would want to keep the table clustered by
customer_id to make it quick to retrieve all transactions of a customer.
In general, any table with more or less random insert/delete activity
that you want to keep in order would benefit.
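As a concrete sketch (schema invented for illustration; the CLUSTER syntax shown is the pre-8.3 `CLUSTER index ON table` form):

```sql
-- Illustrative schema: all transactions of a customer stored together.
CREATE TABLE customer_history (
    customer_id    int NOT NULL,
    transaction_id int NOT NULL,
    amount         numeric,
    PRIMARY KEY (customer_id, transaction_id)
);

-- Order the heap by the primary key once.  The proposed amsuggestblock
-- hook would then try to place each new insert on the same heap page as
-- existing rows with a similar key, instead of wherever the FSM points,
-- so the table stays roughly clustered between CLUSTER runs.
CLUSTER customer_history_pkey ON customer_history;

-- The query pattern that benefits from the physical clustering:
SELECT * FROM customer_history WHERE customer_id = 42;
```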

- Heikki


Re: [HACKERS] Maintaining cluster order on insert

From
stark
Date:
Gene <genekhart@gmail.com> writes:

> "Your best bet might be to partition the table into two subtables, one
> with "stable" data and one with the fresh data, and transfer rows from
> one to the other once they get stable.  Storage density in the "fresh"
> part would be poor, but it should be small enough you don't care."
>
> This sounds interesting, I could create a RULE/INSERT on the unstable table,
> I will know during the update if it is ready to be put in the stable table.
> What would be an efficient way to do the transfer? Since the updates occur
> somewhat randomly, wouldnt the tuples in the stable table then be out of
> natural timestamp order?

You may find it easier to handle some of the logic in a low level application
layer or layer of stored procedures rather than trying to make it entirely
transparent with rules. If you do want it to be transparent you might also
consider whether you want triggers instead of rules.

Another direction you might want to consider is whether the columns that
you're updating would be more normalized in a separate table. You might really
want to have a record of those past states as well. So you might find having
three records in this other table for each of your regular records in your
main table might actually work out better.

Even if you only have a 1-1 relationship sometimes this kind of horizontal
partitioning (or do people consider this vertical partitioning?) is still
worthwhile. If the columns being updated are very small or often not needed at
all then it may be reasonably efficient to look them up separately and still
let you store the bulk of the data efficiently and access it in a fast
sequential scan.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com


Re: [HACKERS] Maintaining cluster order on insert

From
Ron Mayer
Date:
Tom Lane wrote:
> Gene <genekhart@gmail.com> writes:
>> I have a table that inserts lots of rows (million+ per day) int8 as primary
>> key, and I cluster by a timestamp which is approximately the timestamp of
>> the insert...
>
> ISTM you should hardly need to worry about clustering that --- the data
> will be in timestamp order pretty naturally.

In my case my biggest/slowest tables are clustered by zip-code (which
does a reasonable job at keeping counties/cities/etc on the
same pages too).  Data comes in constantly (many records per minute, as
we ramp up), pretty uniformly across the country; but most queries
are geographically bounded.  The data's pretty much insert-only.

If I understand Heikki's patch, it would help for this use case.


> Your best bet might be to partition the table into two subtables, one
> with "stable" data and one with the fresh data, and transfer rows from
> one to the other once they get stable.  Storage density in the "fresh"
> part would be poor, but it should be small enough you don't care.

Hmm... that should work well for me too.  Not sure the use case
I mentioned above is still compelling, since this seems like
it'd give me much of the benefit, and I don't need an excessive
fillfactor on the stable part of the table.

Re: [HACKERS] Maintaining cluster order on insert

From
Heikki Linnakangas
Date:
Ron Mayer wrote:
> In my case my biggest/slowest tables are clustered by zip-code (which
> does a reasonable job at keeping counties/cities/etc on the
> same pages too).  Data comes in constantly (many records per minute, as
> we ramp up), pretty uniformly across the country; but most queries
> are geographically bounded.  The data's pretty much insert-only.

No deletes? If the tables grow over time, you probably would need to run
CLUSTER every now and then to get the best performance, though the patch
would alleviate that quite a lot.

Do you have a development environment where you could test what effect
the patch would have? It would be interesting to have a real-world use
case, since I don't have one myself at the moment.

> If I understand Heikki's patch, it would help for this use case.

Yes, it would.

> > Your best bet might be to partition the table into two subtables, one
> > with "stable" data and one with the fresh data, and transfer rows from
> > one to the other once they get stable.  Storage density in the "fresh"
> > part would be poor, but it should be small enough you don't care.
>
> Hmm... that should work well for me too.  Not sure if the use-case
> I mentioned above is still compelling anymore; since this seems like
> it'd give me much of the benefit; and I don't need an excessive
> fillfactor on the stable part of the table.

Umm, if your inserts are uniformly distributed across the country, you
wouldn't have a stable part, right?

- Heikki

Re: [HACKERS] Maintaining cluster order on insert

From
Ron Mayer
Date:
Heikki Linnakangas wrote:
> Ron Mayer wrote:
>> In my case my biggest/slowest tables are clustered by zip-code (which
>> does a reasonable job at keeping counties/cities/etc on the
>> same pages too)....
>
> No deletes? If the tables grow over time, you probably would need to run
> CLUSTER every now and then to get the best performance, though the patch
> would alleviate that quite a lot.

Yup, pretty much no deletes, since it's a historical archive of
some government documents with address info.  Though the
live system may periodically expunge data, say, 10+ years old.

> Do you have a development environment where you could test what effect
> the patch would have? It would be interesting to have a real-world use
> case, since I don't have one myself at the moment.

I have a development environment, but it doesn't have the same
real-time-growing behavior, and only a small region of the country.
I suppose I could pre-load N-1 years and cluster it, and then
incrementally insert the last year of data to simulate the effect.

But sure, I'll attempt to try the patch; but don't really have any
good benchmarking environment to give any definitive results.  If
an anecdotal "this is how it feels to me" is useful, I can give one
of those.

>> > Your best bet might be to partition the table into two subtables, one
>> > with "stable" data and one with the fresh data.
>>
>> Hmm... that should work well for me too....
>
> Umm, if your inserts are uniformly distributed across the country, you
> wouldn't have a stable part, right?

Hmm.  Maybe.  I was thinking that, when archiving to the large table,
an "order by" clause on the insert from the fresh partition into the
stable partition could at least make the big table "piecewise"
clustered, so that most records for a zip code fit in the same few disk
pages, even though those pages would still end up scattered
far apart on the disk.
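Something like this, I mean (table and column names invented):

```sql
-- Hypothetical sketch: archive each batch sorted by zip, so the rows
-- land on disk grouped by zip code and the stable table ends up
-- "piecewise" clustered even without running CLUSTER on it.
INSERT INTO docs_stable
    SELECT * FROM docs_fresh
    ORDER BY zip;
DELETE FROM docs_fresh;
```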

I wonder what part of "CLUSTER" gives the most benefit: that
most records of a type fit on a few blocks, or that those blocks
are next to each other and so can be read sequentially?





Re: Maintaining cluster order on insert

From
Bruce Momjian
Date:
[ Sorry, I only found this one recently. ]

Your patch has been added to the PostgreSQL unapplied patches list at:

    http://momjian.postgresql.org/cgi-bin/pgpatches

It will be applied as soon as one of the PostgreSQL committers reviews
and approves it.

---------------------------------------------------------------------------


Heikki Linnakangas wrote:
> While thinking about index-organized-tables and similar ideas, it
> occurred to me that there's some low-hanging-fruit: maintaining cluster
> order on inserts by trying to place new heap tuples close to other
> similar tuples. That involves asking the index am where on the heap the
> new tuple should go, and trying to insert it there before using the FSM.
> Using the new fillfactor parameter makes it more likely that there's
> room on the page. We don't worry about the order within the page.
>
> The API I'm thinking of introduces a new optional index am function,
> amsuggestblock (suggestions for a better name are welcome). It gets the
> same parameters as aminsert, and returns the heap block number that
> would be optimal place to put the new tuple. It's be called from
> ExecInsert before inserting the heap tuple, and the suggestion is passed
> on to heap_insert and RelationGetBufferForTuple.
>
> I wrote a little patch to implement this for btree, attached.
>
> This could be optimized by changing the existing aminsert API, because
> as it is, an insert will have to descend the btree twice. Once in
> amsuggestblock and then in aminsert. amsuggestblock could keep the right
> index page pinned so aminsert could locate it quicker. But I wanted to
> keep this simple for now. Another improvement might be to allow
> amsuggestblock to return a list of suggestions, but that makes it more
> expensive to insert if there isn't room in the suggested pages, since
> heap_insert will have to try them all before giving up.
>
> Comments regarding the general idea or the patch? There should probably
> be a index option to turn the feature on and off. You'll want to turn it
> off when you first load a table, and turn it on after CLUSTER to keep it
> clustered.
>
> Since there's been discussion on keeping the TODO list more up-to-date,
> I hereby officially claim the "Automatically maintain clustering on a
> table" TODO item :). Feel free to bombard me with requests for status
> reports. And just to be clear, I'm not trying to sneak this into 8.2
> anymore, this is 8.3 stuff.
>
> I won't be implementing a background daemon described on the TODO item,
> since that would essentially be an online version of CLUSTER. Which sure
> would be nice, but that's a different story.
>
> - Heikki
>


--
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: Maintaining cluster order on insert

From
Heikki Linnakangas
Date:
Ah, thanks! I had forgotten about it as well.

Bruce Momjian wrote:
> [ Sorry, I only found this one recently. ]
>
> Your patch has been added to the PostgreSQL unapplied patches list at:
>
>     http://momjian.postgresql.org/cgi-bin/pgpatches
>
> It will be applied as soon as one of the PostgreSQL committers reviews
> and approves it.
>


--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Maintaining cluster order on insert

From
"Jim C. Nasby"
Date:
What about adding the ability to ask the FSM for a page that's near a
given page? That way if you did have to go to the FSM you could at least
try and insert close to the page you originally wanted.

On Tue, May 15, 2007 at 11:26:51PM +0100, Heikki Linnakangas wrote:
> Ah, thanks! I had forgotten about it as well.

--
Jim Nasby                                      decibel@decibel.org
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)

Re: Maintaining cluster order on insert

From
Heikki Linnakangas
Date:
Jim C. Nasby wrote:
> What about adding the ability to ask the FSM for a page that's near a
> given page? That way if you did have to go to the FSM you could at least
> try and insert close to the page you originally wanted.

Yeah, there's always room for improvement. I made the patch when I was
working on clustered indexes, and was mostly concerned about getting
inserts to the same page as other tuples with similar values so that the
clustered index stays clustered.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Maintaining cluster order on insert

From
"Jaime Casanova"
Date:
On 5/16/07, Heikki Linnakangas <heikki@enterprisedb.com> wrote:
> Jim C. Nasby wrote:
> > What about adding the ability to ask the FSM for a page that's near a
> > given page? That way if you did have to go to the FSM you could at least
> > try and insert close to the page you originally wanted.
>
> Yeah, there's always room for improvement. I made the patch when I was
> working on clustered indexes, and was mostly concerned about getting
> inserts to the same page as other tuples with similar values so that the
> clustered index stays clustered.
>

the patch doesn't apply in cvs... you'll need to update it...

--
regards,
Jaime Casanova

"Programming today is a race between software engineers striving to
build bigger and better idiot-proof programs and the universe trying
to produce bigger and better idiots.
So far, the universe is winning."
                                       Richard Cook

Re: Maintaining cluster order on insert

From
"Jaime Casanova"
Date:
On 5/17/07, Jaime Casanova <systemguards@gmail.com> wrote:
> On 5/16/07, Heikki Linnakangas <heikki@enterprisedb.com> wrote:
> > Jim C. Nasby wrote:
> > > What about adding the ability to ask the FSM for a page that's near a
> > > given page? That way if you did have to go to the FSM you could at least
> > > try and insert close to the page you originally wanted.
> >
> > Yeah, there's always room for improvement. I made the patch when I was
> > working on clustered indexes, and was mostly concerned about getting
> > inserts to the same page as other tuples with similar values so that the
> > clustered index stays clustered.
> >
>
> the patch doesn't apply in cvs... you'll need to update it...
>

cvs i wrote? head i was meaning... sorry, too late for me =)

--
regards,
Jaime Casanova

"Programming today is a race between software engineers striving to
build bigger and better idiot-proof programs and the universe trying
to produce bigger and better idiots.
So far, the universe is winning."
                                       Richard Cook

Re: Maintaining cluster order on insert

From
Heikki Linnakangas
Date:
Jaime Casanova wrote:
> On 5/16/07, Heikki Linnakangas <heikki@enterprisedb.com> wrote:
>> Jim C. Nasby wrote:
>> > What about adding the ability to ask the FSM for a page that's near a
>> > given page? That way if you did have to go to the FSM you could at
>> least
>> > try and insert close to the page you originally wanted.
>>
>> Yeah, there's always room for improvement. I made the patch when I was
>> working on clustered indexes, and was mostly concerned about getting
>> inserts to the same page as other tuples with similar values so that the
>> clustered index stays clustered.
>>
>
> the patch doesn't apply in cvs... you'll need to update it...

Oh, here you are.

The implementation has changed a bit since August. I thought I had
submitted an updated version in the winter but couldn't find it. Anyway,
I updated and dusted off the source tree, tidied up the comments a
little bit, and fixed some inconsistencies in pg_proc entries that made
opr_sanity fail.

The beef of the patch is two new optional indexam API functions:
amprepareinsert and amfinishinsert. amprepareinsert is called before
inserting the heap tuple. It descends the tree and finds and pins the
right leaf page to insert to, and returns a suggestion on where the heap
tuple should be inserted. amfinishinsert is called after inserting the
heap tuple to actually insert the index tuple. Documentation for these
functions needs to be added to indexam.sgml; I noticed that's not done yet.

The cluster_inserts GUC option that you can use to enable/disable the
feature should be removed before committing.

The performance characteristics of this patch haven't been thoroughly
discussed yet. The reason why you want to cluster your tables is to
speed up SELECTs that return a bunch of tuples with similar values, for
example range queries. The reason for keeping them clustered on inserts
is to reduce the need to run CLUSTER as often.

It doesn't come without a cost, however. In the worst case, there never
is room for new inserts on pages, and each insert needs to do one extra
I/O to fetch the optimal heap page where the insert should go, see that
there's no room, and then insert somewhere else. Using a fillfactor
below 100 helps, but even when there is room on the page, it's often
cheaper to just append to the end of the table and run CLUSTER at
night, for example, than to do random-access inserts to the "right"
pages in the heap.

So, should we have a WITH-option on the table to enable/disable this
feature, and what would be the default?

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com
Index: doc/src/sgml/catalogs.sgml
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/doc/src/sgml/catalogs.sgml,v
retrieving revision 2.137
diff -c -r2.137 catalogs.sgml
*** doc/src/sgml/catalogs.sgml    12 Nov 2006 06:25:37 -0000    2.137
--- doc/src/sgml/catalogs.sgml    15 Jan 2007 17:24:52 -0000
***************
*** 499,504 ****
--- 499,518 ----
        <entry>Function to parse and validate <structfield>reloptions</> for an index</entry>
       </row>

+      <row>
+       <entry><structfield>amprepareinsert</structfield></entry>
+       <entry><type>regproc</type></entry>
+       <entry><literal><link linkend="catalog-pg-proc"><structname>pg_proc</structname></link>.oid</literal></entry>
+       <entry>Performs the first phase of a two-phase index insert, returning a suggestion of where in the heap to put a new tuple</entry>
+      </row>
+
+      <row>
+       <entry><structfield>amfinishinsert</structfield></entry>
+       <entry><type>regproc</type></entry>
+       <entry><literal><link linkend="catalog-pg-proc"><structname>pg_proc</structname></link>.oid</literal></entry>
+       <entry>Finishes an index insert started with amprepareinsert</entry>
+      </row>
+
      </tbody>
     </tgroup>
    </table>
Index: src/backend/access/heap/heapam.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/heap/heapam.c,v
retrieving revision 1.222.2.1
diff -c -r1.222.2.1 heapam.c
*** src/backend/access/heap/heapam.c    4 Feb 2007 20:00:49 -0000    1.222.2.1
--- src/backend/access/heap/heapam.c    18 May 2007 10:41:13 -0000
***************
*** 1363,1368 ****
--- 1363,1372 ----
   * use_fsm is passed directly to RelationGetBufferForTuple, which see for
   * more info.
   *
+  * If suggested_blk is a valid block number, the tuple will be inserted to
+  * that block if there's enough room. If it's full, a block will be chosen
+  * as if suggested_blk was not set.
+  *
   * The return value is the OID assigned to the tuple (either here or by the
   * caller), or InvalidOid if no OID.  The header fields of *tup are updated
   * to match the stored tuple; in particular tup->t_self receives the actual
***************
*** 1371,1377 ****
   */
  Oid
  heap_insert(Relation relation, HeapTuple tup, CommandId cid,
!             bool use_wal, bool use_fsm)
  {
      TransactionId xid = GetCurrentTransactionId();
      HeapTuple    heaptup;
--- 1375,1381 ----
   */
  Oid
  heap_insert(Relation relation, HeapTuple tup, CommandId cid,
!             bool use_wal, bool use_fsm, BlockNumber suggested_blk)
  {
      TransactionId xid = GetCurrentTransactionId();
      HeapTuple    heaptup;
***************
*** 1421,1429 ****
      else
          heaptup = tup;

!     /* Find buffer to insert this tuple into */
!     buffer = RelationGetBufferForTuple(relation, heaptup->t_len,
!                                        InvalidBuffer, use_fsm);

      /* NO EREPORT(ERROR) from here till changes are logged */
      START_CRIT_SECTION();
--- 1425,1467 ----
      else
          heaptup = tup;

!     /* Find buffer to insert this tuple into. Try the suggested block first
!      * if caller gave one.
!      */
!     if (suggested_blk != InvalidBlockNumber)
!     {
!         Buffer suggested_buf;
!         Page pageHeader;
!         Size pageFreeSpace;
!
!         suggested_buf = ReadBuffer(relation, suggested_blk);
!         pageHeader = (Page) BufferGetPage(suggested_buf);
!
!         LockBuffer(suggested_buf, BUFFER_LOCK_EXCLUSIVE);
!
!         /* Don't subtract fillfactor from the free space. That space is
!          * reserved exactly for situations like this; keeping updated and
!          * inserted tuples close to other tuples with similar values.
!          */
!         pageFreeSpace = PageGetFreeSpace(pageHeader);
!
!         if (heaptup->t_len <= pageFreeSpace)
!             buffer = suggested_buf;
!         else
!         {
!             /* Page was full. Release lock and pin and get another block
!              * as if suggested_blk was not given.
!              */
!             LockBuffer(suggested_buf, BUFFER_LOCK_UNLOCK);
!             ReleaseBuffer(suggested_buf);
!
!             buffer = RelationGetBufferForTuple(relation, heaptup->t_len,
!                                                InvalidBuffer, use_fsm);
!         }
!     }
!     else
!         buffer = RelationGetBufferForTuple(relation, heaptup->t_len,
!                                            InvalidBuffer, use_fsm);

      /* NO EREPORT(ERROR) from here till changes are logged */
      START_CRIT_SECTION();
***************
*** 1531,1537 ****
  Oid
  simple_heap_insert(Relation relation, HeapTuple tup)
  {
!     return heap_insert(relation, tup, GetCurrentCommandId(), true, true);
  }

  /*
--- 1569,1576 ----
  Oid
  simple_heap_insert(Relation relation, HeapTuple tup)
  {
!     return heap_insert(relation, tup, GetCurrentCommandId(), true,
!                        true, InvalidBlockNumber);
  }

  /*
Index: src/backend/access/index/indexam.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/index/indexam.c,v
retrieving revision 1.95
diff -c -r1.95 indexam.c
*** src/backend/access/index/indexam.c    4 Oct 2006 00:29:48 -0000    1.95
--- src/backend/access/index/indexam.c    18 May 2007 10:42:13 -0000
***************
*** 18,23 ****
--- 18,25 ----
   *        index_rescan    - restart a scan of an index
   *        index_endscan    - end a scan
   *        index_insert    - insert an index tuple into a relation
+  *        index_prepareinsert    - get desired insert location for a heap tuple
+  *        index_finishinsert    - insert a previously prepared index tuple
   *        index_markpos    - mark a scan position
   *        index_restrpos    - restore a scan position
   *        index_getnext    - get the next tuple from a scan
***************
*** 202,207 ****
--- 204,269 ----
                                        BoolGetDatum(check_uniqueness)));
  }

+ /* ----------------
+  *        index_prepareinsert - get desired insert location for a heap tuple
+  *
+  * The returned BlockNumber is the *heap* page that is the best place
+  * to insert the given tuple to, according to the index am. The best
+  * place is one that maintains the cluster order.
+  *
+  * opaque should be passed to a later index_finishinsert to finish the
+  * insert.
+  * ----------------
+  */
+ BlockNumber
+ index_prepareinsert(Relation indexRelation,
+                     Datum *values,
+                     bool *isnull,
+                     Relation heapRelation,
+                     bool check_uniqueness,
+                     void **opaque)
+ {
+     FmgrInfo   *procedure;
+
+     RELATION_CHECKS;
+     GET_REL_PROCEDURE(amprepareinsert);
+
+     /*
+      * have the am's prepareinsert proc do all the work.
+      */
+     return DatumGetUInt32(FunctionCall6(procedure,
+                                         PointerGetDatum(indexRelation),
+                                         PointerGetDatum(values),
+                                         PointerGetDatum(isnull),
+                                         PointerGetDatum(heapRelation),
+                                         BoolGetDatum(check_uniqueness),
+                                         PointerGetDatum(opaque)));
+ }
+
+ /* ----------------
+  *        index_finishinsert - insert a previously prepared index tuple
+  *
+  * Finishes an insert operation initiated by an earlier call to
+  * index_prepareinsert.
+  * ----------------
+  */
+ bool
+ index_finishinsert(Relation indexRelation,
+                    ItemPointer heap_t_ctid, void *opaque)
+ {
+     FmgrInfo   *procedure;
+
+     RELATION_CHECKS;
+     GET_REL_PROCEDURE(amfinishinsert);
+
+     /*
+      * have the am's finishinsert proc do all the work.
+      */
+     return DatumGetBool(FunctionCall2(procedure,
+                                       PointerGetDatum(heap_t_ctid),
+                                       PointerGetDatum(opaque)));
+ }
+
  /*
   * index_beginscan - start a scan of an index with amgettuple
   *
Index: src/backend/access/nbtree/nbtinsert.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/nbtree/nbtinsert.c,v
retrieving revision 1.146.2.1
diff -c -r1.146.2.1 nbtinsert.c
*** src/backend/access/nbtree/nbtinsert.c    27 Jan 2007 20:53:36 -0000    1.146.2.1
--- src/backend/access/nbtree/nbtinsert.c    18 May 2007 10:19:05 -0000
***************
*** 86,99 ****
      /* we need an insertion scan key to do our search, so build one */
      itup_scankey = _bt_mkscankey(rel, itup);

- top:
      /* find the first page containing this key */
      stack = _bt_search(rel, natts, itup_scankey, false, &buf, BT_WRITE);

      /* trade in our read lock for a write lock */
      LockBuffer(buf, BUFFER_LOCK_UNLOCK);
      LockBuffer(buf, BT_WRITE);

      /*
       * If the page was split between the time that we surrendered our read
       * lock and acquired our write lock, then this page may no longer be the
--- 86,198 ----
      /* we need an insertion scan key to do our search, so build one */
      itup_scankey = _bt_mkscankey(rel, itup);

      /* find the first page containing this key */
      stack = _bt_search(rel, natts, itup_scankey, false, &buf, BT_WRITE);

      /* trade in our read lock for a write lock */
      LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
+     _bt_finishinsert(rel, heapRel, index_is_unique, itup,
+                      itup_scankey, stack, buf);
+ }
+
+ /*
+  *    _bt_prepareinsert() -- Find the insert location for a new tuple
+  *
+  * Descends the tree and finds the location for a new index tuple.
+  * As a hint to the executor, returns the heap block number the previous
+  * index tuple at that location points to. By inserting the heap tuple
+  * to that block, the heap will stay better clustered than by inserting
+  * to a random block.
+  */
+ BlockNumber
+ _bt_prepareinsert(Relation rel, IndexTuple itup, bool index_is_unique,
+                   Relation heapRel, BTInsertInfo *opaquePtr)
+ {
+     int            natts = rel->rd_rel->relnatts;
+     OffsetNumber offset;
+     Page        page;
+     BTPageOpaque opaque;
+
+     ScanKey        itup_scankey;
+     BTStack        stack;
+     Buffer        buf;
+     BlockNumber suggestion = InvalidBlockNumber;
+     BTInsertInfo insert_opaque;
+
+     /* we need an insertion scan key to do our search, so build one */
+     itup_scankey = _bt_mkscankey(rel, itup);
+
+     /* find the first page containing this key */
+     stack = _bt_search(rel, natts, itup_scankey, false, &buf, BT_READ);
+     if(!BufferIsValid(buf))
+     {
+         /* The index was completely empty. No suggestion then. */
+         *opaquePtr = NULL;
+         return InvalidBlockNumber;
+     }
+
+     page = BufferGetPage(buf);
+     opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+
+     /* Find the location in the page where the new index tuple would go to. */
+
+     offset = _bt_binsrch(rel, buf, natts, itup_scankey, false);
+     if (offset > PageGetMaxOffsetNumber(page))
+     {
+         /* _bt_binsrch returned a pointer to end-of-page. It means that
+          * there were no equal items on the page, and the new item should
+          * be inserted as the last tuple of the page. There could be equal
+          * items on the next page, however.
+          *
+          * At the moment, we just ignore the potential equal items on the
+          * right, and pretend there isn't any. We could instead walk right
+          * to the next page to check that, but let's keep it simple for now.
+          */
+         offset = OffsetNumberPrev(offset);
+     }
+     if(offset < P_FIRSTDATAKEY(opaque))
+     {
+         /* We landed on an empty page. We could step left or right until
+          * we find some items, but let's keep it simple for now.
+          */
+     } else {
+         /* We're now positioned at the index tuple that we're interested in. */
+         ItemId iid = PageGetItemId(page, offset);
+         IndexTuple curitup = (IndexTuple) PageGetItem(page, iid);
+
+         suggestion = ItemPointerGetBlockNumber(&curitup->t_tid);
+     }
+
+     LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
+     insert_opaque = *opaquePtr = palloc(sizeof(struct BTInsertInfoData));
+     insert_opaque->rel = rel;
+     insert_opaque->heapRel = heapRel;
+     insert_opaque->index_is_unique = index_is_unique;
+     insert_opaque->itup = itup;
+     insert_opaque->itup_scankey = itup_scankey;
+     insert_opaque->stack = stack;
+     insert_opaque->buf = buf;
+
+     return suggestion;
+ }
+
+ /*
+  *    _bt_finishinsert() -- Handle insertion of a single index tuple in the tree.
+  *
+  */
+ void
+ _bt_finishinsert(Relation rel, Relation heapRel, bool index_is_unique,
+                  IndexTuple itup, ScanKey itup_scankey,
+                  BTStack stack, Buffer buf)
+ {
+     int            natts = rel->rd_rel->relnatts;
+
      LockBuffer(buf, BT_WRITE);

+ top:
+
      /*
       * If the page was split between the time that we surrendered our read
       * lock and acquired our write lock, then this page may no longer be the
***************
*** 133,138 ****
--- 232,245 ----
              XactLockTableWait(xwait);
              /* start over... */
              _bt_freestack(stack);
+
+             /* find the first page containing this key */
+             stack = _bt_search(rel, natts, itup_scankey, false, &buf, BT_WRITE);
+
+             /* trade in our read lock for a write lock */
+             LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+             LockBuffer(buf, BT_WRITE);
+
              goto top;
          }
      }
***************
*** 143,148 ****
--- 250,256 ----
      /* be tidy */
      _bt_freestack(stack);
      _bt_freeskey(itup_scankey);
+     pfree(itup);
  }

  /*
Index: src/backend/access/nbtree/nbtree.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/nbtree/nbtree.c,v
retrieving revision 1.153
diff -c -r1.153 nbtree.c
*** src/backend/access/nbtree/nbtree.c    1 Nov 2006 19:43:17 -0000    1.153
--- src/backend/access/nbtree/nbtree.c    18 May 2007 10:20:00 -0000
***************
*** 223,229 ****

      _bt_doinsert(rel, itup, checkUnique, heapRel);

!     pfree(itup);

      PG_RETURN_BOOL(true);
  }
--- 223,278 ----

      _bt_doinsert(rel, itup, checkUnique, heapRel);

!     PG_RETURN_BOOL(true);
! }
!
! /*
!  *    btprepareinsert() -- find the best place in the heap to put a new tuple.
!  *
!  *        This uses the same logic as btinsert to find the place where the index
!  *        tuple would go if this was a btinsert call.
!  */
! Datum
! btprepareinsert(PG_FUNCTION_ARGS)
! {
!     Relation    rel = (Relation) PG_GETARG_POINTER(0);
!     Datum       *values = (Datum *) PG_GETARG_POINTER(1);
!     bool       *isnull = (bool *) PG_GETARG_POINTER(2);
!     Relation    heapRel = (Relation) PG_GETARG_POINTER(3);
!     bool        checkUnique = PG_GETARG_BOOL(4);
!     void      **opaquePtr = (void **) PG_GETARG_POINTER(5);
!     IndexTuple    itup;
!     BlockNumber suggestion;
!
!     /* generate an index tuple */
!     itup = index_form_tuple(RelationGetDescr(rel), values, isnull);
!
!     suggestion = _bt_prepareinsert(rel, itup, checkUnique, heapRel,
!                                   (BTInsertInfo *) opaquePtr);
!
!     PG_RETURN_UINT32(suggestion);
! }
!
! /*
!  *    btfinishinsert() -- finish insert
!  */
! Datum
! btfinishinsert(PG_FUNCTION_ARGS)
! {
!     ItemPointer ht_ctid = (ItemPointer) PG_GETARG_POINTER(0);
!     BTInsertInfo opaque = (void *) PG_GETARG_POINTER(1);
!
!     opaque->itup->t_tid = *ht_ctid;
!
!     _bt_finishinsert(opaque->rel,
!                      opaque->heapRel,
!                      opaque->index_is_unique,
!                      opaque->itup,
!                      opaque->itup_scankey,
!                      opaque->stack,
!                      opaque->buf);
!
!     pfree(opaque);

      PG_RETURN_BOOL(true);
  }
Index: src/backend/executor/execMain.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/executor/execMain.c,v
retrieving revision 1.280.2.2
diff -c -r1.280.2.2 execMain.c
*** src/backend/executor/execMain.c    2 Feb 2007 00:07:27 -0000    1.280.2.2
--- src/backend/executor/execMain.c    18 May 2007 09:48:19 -0000
***************
*** 53,58 ****
--- 53,59 ----
  #include "utils/lsyscache.h"
  #include "utils/memutils.h"

+ bool cluster_inserts = true; /* GUC */

  typedef struct evalPlanQual
  {
***************
*** 847,854 ****
--- 848,857 ----
      resultRelInfo->ri_RangeTableIndex = resultRelationIndex;
      resultRelInfo->ri_RelationDesc = resultRelationDesc;
      resultRelInfo->ri_NumIndices = 0;
+     resultRelInfo->ri_ClusterIndex = -1;
      resultRelInfo->ri_IndexRelationDescs = NULL;
      resultRelInfo->ri_IndexRelationInfo = NULL;
+     resultRelInfo->ri_PreparedInsertOpaque = NULL;
      /* make a copy so as not to depend on relcache info not changing... */
      resultRelInfo->ri_TrigDesc = CopyTriggerDesc(resultRelationDesc->trigdesc);
      if (resultRelInfo->ri_TrigDesc)
***************
*** 1331,1336 ****
--- 1334,1340 ----
      ResultRelInfo *resultRelInfo;
      Relation    resultRelationDesc;
      Oid            newId;
+     BlockNumber suggestedBlock;

      /*
       * get the heap tuple out of the tuple table slot, making sure we have a
***************
*** 1379,1384 ****
--- 1383,1395 ----
      if (resultRelationDesc->rd_att->constr)
          ExecConstraints(resultRelInfo, slot, estate);

+     /* Ask the index am of the clustered index for the
+      * best place to put it */
+     if(cluster_inserts)
+         suggestedBlock = ExecPrepareIndexInsert(slot, estate);
+     else
+         suggestedBlock = InvalidBlockNumber;
+
      /*
       * insert the tuple
       *
***************
*** 1387,1393 ****
       */
      newId = heap_insert(resultRelationDesc, tuple,
                          estate->es_snapshot->curcid,
!                         true, true);

      IncrAppended();
      (estate->es_processed)++;
--- 1398,1404 ----
       */
      newId = heap_insert(resultRelationDesc, tuple,
                          estate->es_snapshot->curcid,
!                         true, true, suggestedBlock);

      IncrAppended();
      (estate->es_processed)++;
***************
*** 2554,2560 ****
                  tuple,
                  estate->es_snapshot->curcid,
                  estate->es_into_relation_use_wal,
!                 false);            /* never any point in using FSM */

      /* We know this is a newly created relation, so there are no indexes */

--- 2565,2572 ----
                  tuple,
                  estate->es_snapshot->curcid,
                  estate->es_into_relation_use_wal,
!                 false,             /* never any point in using FSM */
!                 InvalidBlockNumber);

      /* We know this is a newly created relation, so there are no indexes */

Index: src/backend/executor/execUtils.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/executor/execUtils.c,v
retrieving revision 1.140.2.3
diff -c -r1.140.2.3 execUtils.c
*** src/backend/executor/execUtils.c    6 Feb 2007 17:35:27 -0000    1.140.2.3
--- src/backend/executor/execUtils.c    18 May 2007 09:48:19 -0000
***************
*** 32,37 ****
--- 32,38 ----
   *        ExecOpenIndices            \
   *        ExecCloseIndices         | referenced by InitPlan, EndPlan,
   *        ExecInsertIndexTuples    /  ExecInsert, ExecUpdate
+  *        ExecSuggestBlock        Referenced by ExecInsert
   *
   *        RegisterExprContextCallback    Register function shutdown callback
   *        UnregisterExprContextCallback  Deregister function shutdown callback
***************
*** 939,944 ****
--- 940,946 ----
      IndexInfo **indexInfoArray;

      resultRelInfo->ri_NumIndices = 0;
+     resultRelInfo->ri_ClusterIndex = -1;

      /* fast path if no indexes */
      if (!RelationGetForm(resultRelation)->relhasindex)
***************
*** 978,983 ****
--- 980,990 ----
          /* extract index key information from the index's pg_index info */
          ii = BuildIndexInfo(indexDesc);

+         /* Remember which index is the clustered one.
+          * It's used to call the suggestblock-method on inserts */
+         if(indexDesc->rd_index->indisclustered)
+             resultRelInfo->ri_ClusterIndex = i;
+
          relationDescs[i] = indexDesc;
          indexInfoArray[i] = ii;
          i++;
***************
*** 1044,1049 ****
--- 1051,1058 ----
      ExprContext *econtext;
      Datum        values[INDEX_MAX_KEYS];
      bool        isnull[INDEX_MAX_KEYS];
+     int            clusterIndex;
+     bool        preparedInsert;

      /*
       * Get information from the result relation info structure.
***************
*** 1053,1058 ****
--- 1062,1069 ----
      relationDescs = resultRelInfo->ri_IndexRelationDescs;
      indexInfoArray = resultRelInfo->ri_IndexRelationInfo;
      heapRelation = resultRelInfo->ri_RelationDesc;
+     clusterIndex = resultRelInfo->ri_ClusterIndex;
+     preparedInsert = resultRelInfo->ri_PreparedInsertOpaque != NULL;

      /*
       * We will use the EState's per-tuple context for evaluating predicates
***************
*** 1063,1068 ****
--- 1074,1092 ----
      /* Arrange for econtext's scan tuple to be the tuple under test */
      econtext->ecxt_scantuple = slot;

+     if (preparedInsert)
+     {
+         index_finishinsert(relationDescs[clusterIndex],
+                            tupleid,
+                            resultRelInfo->ri_PreparedInsertOpaque);
+         resultRelInfo->ri_PreparedInsertOpaque = NULL;
+
+         /*
+          * keep track of index inserts for debugging
+          */
+         IncrIndexInserted();
+     }
+
      /*
       * for each index, form and insert the index tuple
       */
***************
*** 1073,1078 ****
--- 1097,1105 ----
          if (relationDescs[i] == NULL)
              continue;

+         if (preparedInsert && i == clusterIndex)
+             continue; /* insert to clustered index was already handled above */
+
          indexInfo = indexInfoArray[i];

          /* Check for partial index */
***************
*** 1127,1132 ****
--- 1154,1228 ----
      }
  }

+ /* ----------------------------------------------------------------
+  *        ExecPrepareIndexInsert
+  *
+  *        This routine asks the index am where a new heap tuple
+  *        should be placed.
+  * ----------------------------------------------------------------
+  */
+ BlockNumber
+ ExecPrepareIndexInsert(TupleTableSlot *slot,
+                        EState *estate)
+ {
+     ResultRelInfo *resultRelInfo;
+     int            i;
+     Relation    relationDesc;
+     Relation    heapRelation;
+     ExprContext *econtext;
+     Datum        values[INDEX_MAX_KEYS];
+     bool        isnull[INDEX_MAX_KEYS];
+     IndexInfo  *indexInfo;
+
+     /*
+      * Get information from the result relation info structure.
+      */
+     resultRelInfo = estate->es_result_relation_info;
+     i = resultRelInfo->ri_ClusterIndex;
+     if (i == -1)
+         return InvalidBlockNumber; /* there was no clustered index */
+
+     heapRelation = resultRelInfo->ri_RelationDesc;
+     relationDesc = resultRelInfo->ri_IndexRelationDescs[i];
+     indexInfo = resultRelInfo->ri_IndexRelationInfo[i];
+
+     if (!OidIsValid(relationDesc->rd_am->amprepareinsert))
+         return InvalidBlockNumber; /* the indexam doesn't support the
+                                     * two-phase insert API */
+
+     /* You can't cluster on a partial index */
+     Assert(indexInfo->ii_Predicate == NIL);
+
+     /*
+      * We will use the EState's per-tuple context for evaluating
+      * index expressions (creating it if it's not already there).
+      */
+     econtext = GetPerTupleExprContext(estate);
+
+     /* Arrange for econtext's scan tuple to be the tuple under test */
+     econtext->ecxt_scantuple = slot;
+
+     /*
+      * FormIndexDatum fills in its values and isnull parameters with the
+      * appropriate values for the column(s) of the index.
+      */
+     FormIndexDatum(indexInfo,
+                    slot,
+                    estate,
+                    values,
+                    isnull);
+
+     /*
+      * The index AM does the rest.
+      */
+     return index_prepareinsert(relationDesc,    /* index relation */
+                  values,    /* array of index Datums */
+                  isnull,    /* null flags */
+                  heapRelation,
+                  relationDesc->rd_index->indisunique,
+                  &resultRelInfo->ri_PreparedInsertOpaque);
+ }
+
  /*
   * UpdateChangedParamSet
   *        Add changed parameters to a plan node's chgParam set
Index: src/backend/utils/misc/guc.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/utils/misc/guc.c,v
retrieving revision 1.360.2.1
diff -c -r1.360.2.1 guc.c
*** src/backend/utils/misc/guc.c    23 Apr 2007 15:13:30 -0000    1.360.2.1
--- src/backend/utils/misc/guc.c    18 May 2007 09:48:22 -0000
***************
*** 93,98 ****
--- 93,99 ----
  #define MS_PER_D (1000 * 60 * 60 * 24)

  /* XXX these should appear in other modules' header files */
+ extern bool cluster_inserts;
  extern bool Log_disconnections;
  extern int    CommitDelay;
  extern int    CommitSiblings;
***************
*** 403,408 ****
--- 404,417 ----
  static struct config_bool ConfigureNamesBool[] =
  {
      {
+         {"cluster_inserts", PGC_USERSET, DEVELOPER_OPTIONS,
+             gettext_noop("Tries to maintain cluster order on inserts."),
+             NULL
+         },
+         &cluster_inserts,
+         true, NULL, NULL
+     },
+     {
          {"enable_seqscan", PGC_USERSET, QUERY_TUNING_METHOD,
              gettext_noop("Enables the planner's use of sequential-scan plans."),
              NULL
Index: src/include/access/genam.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/access/genam.h,v
retrieving revision 1.65
diff -c -r1.65 genam.h
*** src/include/access/genam.h    31 Jul 2006 20:09:05 -0000    1.65
--- src/include/access/genam.h    18 May 2007 10:21:49 -0000
***************
*** 93,98 ****
--- 93,106 ----
               ItemPointer heap_t_ctid,
               Relation heapRelation,
               bool check_uniqueness);
+ extern BlockNumber index_prepareinsert(Relation indexRelation,
+              Datum *values, bool *isnull,
+              Relation heapRelation,
+              bool check_uniqueness,
+              void **opaque);
+ extern bool index_finishinsert(Relation indexRelation,
+              ItemPointer heap_t_ctid,
+              void *opaque);

  extern IndexScanDesc index_beginscan(Relation heapRelation,
                  Relation indexRelation,
Index: src/include/access/heapam.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/access/heapam.h,v
retrieving revision 1.117
diff -c -r1.117 heapam.h
*** src/include/access/heapam.h    5 Nov 2006 22:42:10 -0000    1.117
--- src/include/access/heapam.h    15 Jan 2007 17:24:52 -0000
***************
*** 157,163 ****
  extern void setLastTid(const ItemPointer tid);

  extern Oid heap_insert(Relation relation, HeapTuple tup, CommandId cid,
!             bool use_wal, bool use_fsm);
  extern HTSU_Result heap_delete(Relation relation, ItemPointer tid,
              ItemPointer ctid, TransactionId *update_xmax,
              CommandId cid, Snapshot crosscheck, bool wait);
--- 157,163 ----
  extern void setLastTid(const ItemPointer tid);

  extern Oid heap_insert(Relation relation, HeapTuple tup, CommandId cid,
!             bool use_wal, bool use_fsm, BlockNumber suggestedblk);
  extern HTSU_Result heap_delete(Relation relation, ItemPointer tid,
              ItemPointer ctid, TransactionId *update_xmax,
              CommandId cid, Snapshot crosscheck, bool wait);
Index: src/include/access/nbtree.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/access/nbtree.h,v
retrieving revision 1.106
diff -c -r1.106 nbtree.h
*** src/include/access/nbtree.h    1 Nov 2006 19:43:17 -0000    1.106
--- src/include/access/nbtree.h    18 May 2007 10:22:09 -0000
***************
*** 479,488 ****
--- 479,511 ----
  extern Datum btbulkdelete(PG_FUNCTION_ARGS);
  extern Datum btvacuumcleanup(PG_FUNCTION_ARGS);
  extern Datum btoptions(PG_FUNCTION_ARGS);
+ extern Datum btprepareinsert(PG_FUNCTION_ARGS);
+ extern Datum btfinishinsert(PG_FUNCTION_ARGS);
+
+ /* Filled in by _bt_prepareinsert */
+ typedef struct BTInsertInfoData
+ {
+     Relation rel;
+     Relation heapRel;
+     bool index_is_unique;
+     IndexTuple itup;
+     ScanKey itup_scankey;
+     Buffer buf; /* pinned, not locked */
+     BTStack stack;
+ } BTInsertInfoData;
+
+ typedef BTInsertInfoData *BTInsertInfo;

  /*
   * prototypes for functions in nbtinsert.c
   */
+ extern BlockNumber _bt_prepareinsert(Relation rel, IndexTuple itup,
+                                      bool index_is_unique, Relation heapRel,
+                                      BTInsertInfo *opaquePtr);
+ extern void _bt_finishinsert(Relation rel, Relation heapRel,
+                              bool check_uniqueness,
+                              IndexTuple itup, ScanKey itup_scankey,
+                              BTStack stack, Buffer buf);
  extern void _bt_doinsert(Relation rel, IndexTuple itup,
               bool index_is_unique, Relation heapRel);
  extern Buffer _bt_getstackbuf(Relation rel, BTStack stack, int access);
Index: src/include/catalog/pg_am.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/catalog/pg_am.h,v
retrieving revision 1.46
diff -c -r1.46 pg_am.h
*** src/include/catalog/pg_am.h    31 Jul 2006 20:09:05 -0000    1.46
--- src/include/catalog/pg_am.h    18 May 2007 10:25:26 -0000
***************
*** 65,70 ****
--- 65,72 ----
      regproc        amvacuumcleanup;    /* post-VACUUM cleanup function */
      regproc        amcostestimate; /* estimate cost of an indexscan */
      regproc        amoptions;        /* parse AM-specific parameters */
+     regproc        amprepareinsert;    /* get desired insert location on heap */
+     regproc        amfinishinsert;    /* finish a prepared insert operation */
  } FormData_pg_am;

  /* ----------------
***************
*** 78,84 ****
   *        compiler constants for pg_am
   * ----------------
   */
! #define Natts_pg_am                        23
  #define Anum_pg_am_amname                1
  #define Anum_pg_am_amstrategies            2
  #define Anum_pg_am_amsupport            3
--- 80,86 ----
   *        compiler constants for pg_am
   * ----------------
   */
! #define Natts_pg_am                        25
  #define Anum_pg_am_amname                1
  #define Anum_pg_am_amstrategies            2
  #define Anum_pg_am_amsupport            3
***************
*** 102,123 ****
  #define Anum_pg_am_amvacuumcleanup        21
  #define Anum_pg_am_amcostestimate        22
  #define Anum_pg_am_amoptions            23

  /* ----------------
   *        initial contents of pg_am
   * ----------------
   */

! DATA(insert OID = 403 (  btree    5 1 1 t t t t f t btinsert btbeginscan btgettuple btgetmulti btrescan btendscan btmarkpos btrestrpos btbuild btbulkdelete btvacuumcleanup btcostestimate btoptions ));
  DESCR("b-tree index access method");
  #define BTREE_AM_OID 403
! DATA(insert OID = 405 (  hash    1 1 0 f f f f f f hashinsert hashbeginscan hashgettuple hashgetmulti hashrescan hashendscan hashmarkpos hashrestrpos hashbuild hashbulkdelete hashvacuumcleanup hashcostestimate hashoptions ));
  DESCR("hash index access method");
  #define HASH_AM_OID 405
! DATA(insert OID = 783 (  gist    100 7 0 f t t t t t gistinsert gistbeginscan gistgettuple gistgetmulti gistrescan gistendscan gistmarkpos gistrestrpos gistbuild gistbulkdelete gistvacuumcleanup gistcostestimate gistoptions ));
  DESCR("GiST index access method");
  #define GIST_AM_OID 783
! DATA(insert OID = 2742 (  gin    100 4 0 f f f f t f gininsert ginbeginscan gingettuple gingetmulti ginrescan ginendscan ginmarkpos ginrestrpos ginbuild ginbulkdelete ginvacuumcleanup gincostestimate ginoptions ));
  DESCR("GIN index access method");
  #define GIN_AM_OID 2742

--- 104,127 ----
  #define Anum_pg_am_amvacuumcleanup        21
  #define Anum_pg_am_amcostestimate        22
  #define Anum_pg_am_amoptions            23
+ #define Anum_pg_am_amprepareinsert        24
+ #define Anum_pg_am_amfinishinsert        25

  /* ----------------
   *        initial contents of pg_am
   * ----------------
   */

! DATA(insert OID = 403 (  btree    5 1 1 t t t t f t btinsert btbeginscan btgettuple btgetmulti btrescan btendscan btmarkpos btrestrpos btbuild btbulkdelete btvacuumcleanup btcostestimate btoptions btprepareinsert btfinishinsert ));
  DESCR("b-tree index access method");
  #define BTREE_AM_OID 403
! DATA(insert OID = 405 (  hash    1 1 0 f f f f f f hashinsert hashbeginscan hashgettuple hashgetmulti hashrescan hashendscan hashmarkpos hashrestrpos hashbuild hashbulkdelete hashvacuumcleanup hashcostestimate hashoptions - - ));
  DESCR("hash index access method");
  #define HASH_AM_OID 405
! DATA(insert OID = 783 (  gist    100 7 0 f t t t t t gistinsert gistbeginscan gistgettuple gistgetmulti gistrescan gistendscan gistmarkpos gistrestrpos gistbuild gistbulkdelete gistvacuumcleanup gistcostestimate gistoptions - - ));
  DESCR("GiST index access method");
  #define GIST_AM_OID 783
! DATA(insert OID = 2742 (  gin    100 4 0 f f f f t f gininsert ginbeginscan gingettuple gingetmulti ginrescan ginendscan ginmarkpos ginrestrpos ginbuild ginbulkdelete ginvacuumcleanup gincostestimate ginoptions - - ));
  DESCR("GIN index access method");
  #define GIN_AM_OID 2742

Index: src/include/catalog/pg_proc.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/catalog/pg_proc.h,v
retrieving revision 1.429
diff -c -r1.429 pg_proc.h
*** src/include/catalog/pg_proc.h    28 Nov 2006 19:18:44 -0000    1.429
--- src/include/catalog/pg_proc.h    18 May 2007 11:32:04 -0000
***************
*** 682,687 ****
--- 682,691 ----
  DESCR("btree(internal)");
DATA(insert OID = 2785 (  btoptions           PGNSP PGUID 12 f f t f s 2 17 "1009 16" _null_ _null_ _null_  btoptions - _null_ ));
  DESCR("btree(internal)");
+ DATA(insert OID = 5433 (  btprepareinsert   PGNSP PGUID 12 f f t f v 6 23 "2281 2281 2281 2281 2281 2281" _null_ _null_ _null_    btprepareinsert - _null_ ));
+ DESCR("btree(internal)");
+ DATA(insert OID = 5430 (  btfinishinsert   PGNSP PGUID 12 f f t f v 2 16 "2281 2281" _null_ _null_ _null_    btfinishinsert - _null_ ));
+ DESCR("btree(internal)");

DATA(insert OID = 339 (  poly_same           PGNSP PGUID 12 f f t f i 2 16 "604 604" _null_ _null_ _null_ poly_same - _null_ ));
  DESCR("same as?");
Index: src/include/executor/executor.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/executor/executor.h,v
retrieving revision 1.130.2.2
diff -c -r1.130.2.2 executor.h
*** src/include/executor/executor.h    2 Feb 2007 00:07:28 -0000    1.130.2.2
--- src/include/executor/executor.h    18 May 2007 09:48:24 -0000
***************
*** 275,280 ****
--- 275,281 ----
  extern void ExecCloseIndices(ResultRelInfo *resultRelInfo);
  extern void ExecInsertIndexTuples(TupleTableSlot *slot, ItemPointer tupleid,
                        EState *estate, bool is_vacuum);
+ extern BlockNumber ExecPrepareIndexInsert(TupleTableSlot *slot, EState *estate);

  extern void RegisterExprContextCallback(ExprContext *econtext,
                              ExprContextCallbackFunction function,
Index: src/include/nodes/execnodes.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/nodes/execnodes.h,v
retrieving revision 1.161.2.2
diff -c -r1.161.2.2 execnodes.h
*** src/include/nodes/execnodes.h    26 Apr 2007 23:24:57 -0000    1.161.2.2
--- src/include/nodes/execnodes.h    18 May 2007 09:48:25 -0000
***************
*** 259,264 ****
--- 259,266 ----
   *        NumIndices                # of indices existing on result relation
   *        IndexRelationDescs        array of relation descriptors for indices
   *        IndexRelationInfo        array of key/attr info for indices
+  *        ClusterIndex            index into the IndexRelationInfo array of the
+  *                                clustered index, or -1 if there's none
   *        TrigDesc                triggers to be fired, if any
   *        TrigFunctions            cached lookup info for trigger functions
   *        TrigInstrument            optional runtime measurements for triggers
***************
*** 275,286 ****
--- 277,291 ----
      int            ri_NumIndices;
      RelationPtr ri_IndexRelationDescs;
      IndexInfo **ri_IndexRelationInfo;
+     int         ri_ClusterIndex;
      TriggerDesc *ri_TrigDesc;
      FmgrInfo   *ri_TrigFunctions;
      struct Instrumentation *ri_TrigInstrument;
      List      **ri_ConstraintExprs;
      JunkFilter *ri_junkFilter;
      ProjectionInfo *ri_projectReturning;
+
+     void        *ri_PreparedInsertOpaque;
  } ResultRelInfo;

  /* ----------------
Index: src/include/utils/rel.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/utils/rel.h,v
retrieving revision 1.92
diff -c -r1.92 rel.h
*** src/include/utils/rel.h    4 Oct 2006 00:30:10 -0000    1.92
--- src/include/utils/rel.h    15 Jan 2007 17:24:52 -0000
***************
*** 116,121 ****
--- 116,123 ----
      FmgrInfo    amvacuumcleanup;
      FmgrInfo    amcostestimate;
      FmgrInfo    amoptions;
+     FmgrInfo    amprepareinsert;
+     FmgrInfo    amfinishinsert;
  } RelationAmInfo;



Re: Maintaining cluster order on insert

From
Bruce Momjian
Date:
Your patch has been added to the PostgreSQL unapplied patches list at:

    http://momjian.postgresql.org/cgi-bin/pgpatches

It will be applied as soon as one of the PostgreSQL committers reviews
and approves it.

---------------------------------------------------------------------------


Heikki Linnakangas wrote:
> Jaime Casanova wrote:
> > On 5/16/07, Heikki Linnakangas <heikki@enterprisedb.com> wrote:
> >> Jim C. Nasby wrote:
> >> > What about adding the ability to ask the FSM for a page that's near a
> >> > given page? That way if you did have to go to the FSM you could at
> >> least
> >> > try and insert close to the page you originally wanted.
> >>
> >> Yeah, there's always room for improvement. I made the patch when I was
> >> working on clustered indexes, and was mostly concerned about getting
> >> inserts to the same page as other tuples with similar values so that the
> >> clustered index stays clustered.
> >>
> >
> > the patch doesn't apply in cvs... you'll need to update it...
>
> Oh, here you are.
>
> The implementation has changed a bit since August. I thought I had
> submitted an updated version in the winter but couldn't find it. Anyway,
> I updated and dusted off the source tree, tidied up the comments a
> little bit, and fixed some inconsistencies in pg_proc entries that made
> opr_sanity fail.
>
> The beef of the patch is two new optional indexam API functions:
> amprepareinsert and amfinishinsert. amprepareinsert is called before
> inserting the heap tuple. It descends the tree and finds and pins the
> right leaf page to insert to, and returns a suggestion on where the heap
> tuple should be inserted. amfinishinsert is called after inserting the
> heap tuple to actually insert the index tuple. Documentation for these
> functions needs to be added to indexam.sgml; I noticed that's not done yet.
>
> The cluster_inserts GUC option that you can use to enable/disable the
> feature should be removed before committing.
>
> The performance characteristics of this patch hasn't been thoroughly
> discussed yet. The reason why you want to cluster your tables is to
> speed up SELECTs that return a bunch of tuples with similar values, for
> example range queries. The reason for keeping them clustered on inserts
> is to reduce the need to run CLUSTER as often.
>
> It doesn't come without a cost, however. In the worst case, there never
> is room for new inserts on pages, and each insert needs to do one extra
> I/O to fetch the optimal heap page where the insert should go, see that
> there's no room, and then insert somewhere else. Using a non-zero
> fillfactor helps, but even when there is room on the page, it's often
> cheaper to just append to the end of the table and running CLUSTER at
> night for example, than do random access to insert to the "right" pages
> in the heap.
>
> So, should we have a WITH-option on the table to enable/disable this
> feature, and what would be the default?
>
> --
>    Heikki Linnakangas
>    EnterpriseDB   http://www.enterprisedb.com



--
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +
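The cost trade-off Heikki describes above comes down to how heap_insert treats the suggestion: try the index AM's suggested block first, and fall back to the normal free-space search only when that page is full. A toy model of just that fallback decision (plain C, not PostgreSQL internals; all names and sizes here are made up for illustration):

```c
#include <assert.h>

#define NPAGES   4      /* toy heap: a handful of pages */
#define PAGESIZE 100    /* free bytes on a fresh page */

static int freespace[NPAGES];   /* free bytes remaining per heap page */

/* Insert a tuple of tuplen bytes. Try the suggested block first;
 * if it lacks room (or no suggestion was given), fall back to the
 * first page with enough space, standing in for the FSM lookup.
 * Returns the chosen block, or -1 meaning "extend the relation". */
static int
toy_insert(int tuplen, int suggested_blk)
{
    if (suggested_blk >= 0 && suggested_blk < NPAGES &&
        freespace[suggested_blk] >= tuplen)
    {
        freespace[suggested_blk] -= tuplen;
        return suggested_blk;
    }

    for (int blk = 0; blk < NPAGES; blk++)
    {
        if (freespace[blk] >= tuplen)
        {
            freespace[blk] -= tuplen;
            return blk;
        }
    }
    return -1;
}
```

In the patch itself the fallback path goes through RelationGetBufferForTuple and the FSM; the linear scan above merely stands in for that.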

Re: Maintaining cluster order on insert

From
Tom Lane
Date:
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> The beef of the patch is two new optional indexam API functions:
> amprepareinsert and amfinishinsert. amprepareinsert is called before
> inserting the heap tuple. It descends the tree and finds and pins the
> right leaf page to insert to, and returns a suggestion on where the heap
> tuple should be inserted. amfinishinsert is called after inserting the
> heap tuple to actually insert the index tuple. Documentation for these
> functions needs to be added to indexam.sgml; I noticed that's not done yet.

What happens when there's more than one index?

Is there a risk of deadlock during concurrent insertions (from different
processes trying to lock the same buffers in different orders)?

            regards, tom lane

Re: Maintaining cluster order on insert

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> Heikki Linnakangas <heikki@enterprisedb.com> writes:
>> The beef of the patch is two new optional indexam API functions:
>> amprepareinsert and amfinishinsert. amprepareinsert is called before
>> inserting the heap tuple. It descends the tree and finds and pins the
>> right leaf page to insert to, and returns a suggestion on where the heap
>> tuple should be inserted. amfinishinsert is called after inserting the
>> heap tuple to actually insert the index tuple. Documentation for these
>> functions needs to be added to indexam.sgml; I noticed that's not done yet.
>
> What happens when there's more than one index?

You can only cluster a table on one index. Inserts into the other indexes
use the plain aminsert function after the heap tuple has been inserted, as usual.

> Is there a risk of deadlock during concurrent insertions (from different
> processes trying to lock the same buffers in different orders)?

No, it should be OK. The btprepareinsert function releases the locks (but
not the pin) after descending the tree, and btfinishinsert reacquires
them. The locking order is the same as today.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com
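The locking discipline described here can be sketched as a call trace: the prepare step descends the tree, pins the target leaf, and releases its lock; the heap insert then runs with no index page locked; the finish step re-locks the leaf and inserts the index tuple. A minimal simulation of that sequence (plain C; these functions are illustrative stand-ins, not the real btprepareinsert/btfinishinsert):

```c
#include <assert.h>
#include <string.h>

static char call_log[128];      /* records the order of the steps */

static void
trace(const char *ev)
{
    strcat(call_log, ev);
    strcat(call_log, ";");
}

/* ~ btprepareinsert: descend, pin the leaf, drop the lock,
 * and hand back a suggested heap block. */
static int
prepare_insert(void)
{
    trace("descend");
    trace("pin+unlock-leaf");
    return 1;                   /* made-up suggested heap block */
}

/* ~ heap_insert with a suggestion: runs with no index lock held */
static void
heap_insert_toy(int suggested_blk)
{
    (void) suggested_blk;
    trace("heap-insert");
}

/* ~ btfinishinsert: re-lock the pinned leaf, insert the index tuple */
static void
finish_insert(void)
{
    trace("relock-leaf");
    trace("insert-index-tuple");
    trace("unlock+unpin");
}
```

Because only a pin is held across the heap insert, the buffer-lock acquisition order matches the single-phase aminsert path, which is the point of Heikki's answer about deadlocks.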

Re: Maintaining cluster order on insert

From
"Jaime Casanova"
Date:
On 5/18/07, Heikki Linnakangas <heikki@enterprisedb.com> wrote:
> Jaime Casanova wrote:
> >
> > the patch doesn't apply in cvs... you'll need to update it...
>
> Oh, here you are.
>
> The implementation has changed a bit since August. I thought I had
> submitted an updated version in the winter but couldn't find it. Anyway,
> I updated and dusted off the source tree, tidied up the comments a
> little bit, and fixed some inconsistencies in pg_proc entries that made
> opr_sanity fail.
>

this one doesn't apply either... there are problems with nbtinsert.c and pg_am.h

--
regards,
Jaime Casanova

"Programming today is a race between software engineers striving to
build bigger and better idiot-proof programs and the universe trying
to produce bigger and better idiots.
So far, the universe is winning."
                                       Richard Cook

Re: Maintaining cluster order on insert

From
Heikki Linnakangas
Date:
Jaime Casanova wrote:
> On 5/18/07, Heikki Linnakangas <heikki@enterprisedb.com> wrote:
>> Jaime Casanova wrote:
>> >
>> > the patch doesn't apply in cvs... you'll need to update it...
>>
>> Oh, here you are.
>>
>> The implementation has changed a bit since August. I thought I had
>> submitted an updated version in the winter but couldn't find it. Anyway,
>> I updated and dusted off the source tree, tidied up the comments a
>> little bit, and fixed some inconsistencies in pg_proc entries that made
>> opr_sanity fail.
>>
>
> this one doesn't apply either... there are problems with nbtinsert.c and
> pg_am.h

Ah, sorry about that. For some reason my source tree was checked out
from the 8.2 branch, instead of CVS HEAD.

Here you are. Thanks for looking at this!

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com
Index: doc/src/sgml/catalogs.sgml
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/doc/src/sgml/catalogs.sgml,v
retrieving revision 2.152
diff -c -r2.152 catalogs.sgml
*** doc/src/sgml/catalogs.sgml    15 May 2007 19:13:54 -0000    2.152
--- doc/src/sgml/catalogs.sgml    19 May 2007 16:23:49 -0000
***************
*** 517,522 ****
--- 517,536 ----
        <entry>Function to parse and validate <structfield>reloptions</> for an index</entry>
       </row>

+      <row>
+       <entry><structfield>amprepareinsert</structfield></entry>
+       <entry><type>regproc</type></entry>
+       <entry><literal><link linkend="catalog-pg-proc"><structname>pg_proc</structname></link>.oid</literal></entry>
+       <entry>Performs the first phase of a two-phase index insert, returning a suggestion of where in the heap to put a new tuple</entry>
+      </row>
+
+      <row>
+       <entry><structfield>amfinishinsert</structfield></entry>
+       <entry><type>regproc</type></entry>
+       <entry><literal><link linkend="catalog-pg-proc"><structname>pg_proc</structname></link>.oid</literal></entry>
+       <entry>Finishes an index insert started with amprepareinsert</entry>
+      </row>
+
      </tbody>
     </tgroup>
    </table>
Index: src/backend/access/heap/heapam.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/heap/heapam.c,v
retrieving revision 1.232
diff -c -r1.232 heapam.c
*** src/backend/access/heap/heapam.c    8 Apr 2007 01:26:27 -0000    1.232
--- src/backend/access/heap/heapam.c    19 May 2007 16:45:14 -0000
***************
*** 1368,1373 ****
--- 1368,1377 ----
   * Note that use_wal and use_fsm will be applied when inserting into the
   * heap's TOAST table, too, if the tuple requires any out-of-line data.
   *
+  * If suggested_blk is a valid block number, the tuple will be inserted
+  * into that block if there's enough room. If it's full, a block will be
+  * chosen as if suggested_blk had not been set.
+  *
   * The return value is the OID assigned to the tuple (either here or by the
   * caller), or InvalidOid if no OID.  The header fields of *tup are updated
   * to match the stored tuple; in particular tup->t_self receives the actual
***************
*** 1376,1382 ****
   */
  Oid
  heap_insert(Relation relation, HeapTuple tup, CommandId cid,
!             bool use_wal, bool use_fsm)
  {
      TransactionId xid = GetCurrentTransactionId();
      HeapTuple    heaptup;
--- 1380,1386 ----
   */
  Oid
  heap_insert(Relation relation, HeapTuple tup, CommandId cid,
!             bool use_wal, bool use_fsm, BlockNumber suggested_blk)
  {
      TransactionId xid = GetCurrentTransactionId();
      HeapTuple    heaptup;
***************
*** 1432,1440 ****
      else
          heaptup = tup;

!     /* Find buffer to insert this tuple into */
!     buffer = RelationGetBufferForTuple(relation, heaptup->t_len,
!                                        InvalidBuffer, use_fsm);

      /* NO EREPORT(ERROR) from here till changes are logged */
      START_CRIT_SECTION();
--- 1436,1478 ----
      else
          heaptup = tup;

!     /* Find buffer to insert this tuple into. Try the suggested block first
!      * if the caller gave one.
!      */
!     if (suggested_blk != InvalidBlockNumber)
!     {
!         Buffer suggested_buf;
!         Page pageHeader;
!         Size pageFreeSpace;
!
!         suggested_buf = ReadBuffer(relation, suggested_blk);
!         pageHeader = (Page) BufferGetPage(suggested_buf);
!
!         LockBuffer(suggested_buf, BUFFER_LOCK_EXCLUSIVE);
!
!         /* Don't subtract fillfactor from the free space. That space is
!          * reserved exactly for situations like this: keeping updated and
!          * inserted tuples close to other tuples with similar values.
!          */
!         pageFreeSpace = PageGetFreeSpace(pageHeader);
!
!         if (heaptup->t_len <= pageFreeSpace)
!             buffer = suggested_buf;
!         else
!         {
!             /* Page was full. Release lock and pin and get another block
!              * as if suggested_blk was not given.
!              */
!             LockBuffer(suggested_buf, BUFFER_LOCK_UNLOCK);
!             ReleaseBuffer(suggested_buf);
!
!             buffer = RelationGetBufferForTuple(relation, heaptup->t_len,
!                                                InvalidBuffer, use_fsm);
!         }
!     }
!     else
!         buffer = RelationGetBufferForTuple(relation, heaptup->t_len,
!                                            InvalidBuffer, use_fsm);

      /* NO EREPORT(ERROR) from here till changes are logged */
      START_CRIT_SECTION();
***************
*** 1544,1550 ****
  Oid
  simple_heap_insert(Relation relation, HeapTuple tup)
  {
!     return heap_insert(relation, tup, GetCurrentCommandId(), true, true);
  }

  /*
--- 1582,1589 ----
  Oid
  simple_heap_insert(Relation relation, HeapTuple tup)
  {
!     return heap_insert(relation, tup, GetCurrentCommandId(), true,
!                        true, InvalidBlockNumber);
  }

  /*
Index: src/backend/access/heap/tuptoaster.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/heap/tuptoaster.c,v
retrieving revision 1.74
diff -c -r1.74 tuptoaster.c
*** src/backend/access/heap/tuptoaster.c    6 Apr 2007 04:21:41 -0000    1.74
--- src/backend/access/heap/tuptoaster.c    19 May 2007 16:45:39 -0000
***************
*** 1146,1152 ****
          if (!HeapTupleIsValid(toasttup))
              elog(ERROR, "failed to build TOAST tuple");

!         heap_insert(toastrel, toasttup, mycid, use_wal, use_fsm);

          /*
           * Create the index entry.    We cheat a little here by not using
--- 1146,1153 ----
          if (!HeapTupleIsValid(toasttup))
              elog(ERROR, "failed to build TOAST tuple");

!         heap_insert(toastrel, toasttup, mycid, use_wal, use_fsm,
!                     InvalidBlockNumber);

          /*
           * Create the index entry.    We cheat a little here by not using
Index: src/backend/access/index/indexam.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/index/indexam.c,v
retrieving revision 1.97
diff -c -r1.97 indexam.c
*** src/backend/access/index/indexam.c    5 Jan 2007 22:19:23 -0000    1.97
--- src/backend/access/index/indexam.c    19 May 2007 16:23:57 -0000
***************
*** 18,23 ****
--- 18,25 ----
   *        index_rescan    - restart a scan of an index
   *        index_endscan    - end a scan
   *        index_insert    - insert an index tuple into a relation
+  *        index_prepareinsert    - get desired insert location for a heap tuple
+  *        index_finishinsert    - insert a previously prepared index tuple
   *        index_markpos    - mark a scan position
   *        index_restrpos    - restore a scan position
   *        index_getnext    - get the next tuple from a scan
***************
*** 202,207 ****
--- 204,269 ----
                                        BoolGetDatum(check_uniqueness)));
  }

+ /* ----------------
+  *        index_prepareinsert - get desired insert location for a heap tuple
+  *
+  * The returned BlockNumber is the *heap* page that is the best place
+  * to insert the given tuple, according to the index AM. The best
+  * place is one that maintains the cluster order.
+  *
+  * opaque should be passed to a later index_finishinsert to finish the
+  * insert.
+  * ----------------
+  */
+ BlockNumber
+ index_prepareinsert(Relation indexRelation,
+                     Datum *values,
+                     bool *isnull,
+                     Relation heapRelation,
+                     bool check_uniqueness,
+                     void **opaque)
+ {
+     FmgrInfo   *procedure;
+
+     RELATION_CHECKS;
+     GET_REL_PROCEDURE(amprepareinsert);
+
+     /*
+      * have the am's prepareinsert proc do all the work.
+      */
+     return DatumGetUInt32(FunctionCall6(procedure,
+                                         PointerGetDatum(indexRelation),
+                                         PointerGetDatum(values),
+                                         PointerGetDatum(isnull),
+                                         PointerGetDatum(heapRelation),
+                                         BoolGetDatum(check_uniqueness),
+                                         PointerGetDatum(opaque)));
+ }
+
+ /* ----------------
+  *        index_finishinsert - insert a previously prepared index tuple
+  *
+  * Finishes an insert operation initiated by an earlier call to
+  * index_prepareinsert.
+  * ----------------
+  */
+ bool
+ index_finishinsert(Relation indexRelation,
+                    ItemPointer heap_t_ctid, void *opaque)
+ {
+     FmgrInfo   *procedure;
+
+     RELATION_CHECKS;
+     GET_REL_PROCEDURE(amfinishinsert);
+
+     /*
+      * have the am's finishinsert proc do all the work.
+      */
+     return DatumGetBool(FunctionCall2(procedure,
+                                       PointerGetDatum(heap_t_ctid),
+                                       PointerGetDatum(opaque)));
+ }
+
  /*
   * index_beginscan - start a scan of an index with amgettuple
   *
Index: src/backend/access/nbtree/nbtinsert.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/nbtree/nbtinsert.c,v
retrieving revision 1.156
diff -c -r1.156 nbtinsert.c
*** src/backend/access/nbtree/nbtinsert.c    11 Apr 2007 20:47:37 -0000    1.156
--- src/backend/access/nbtree/nbtinsert.c    19 May 2007 18:12:25 -0000
***************
*** 96,114 ****
      /* we need an insertion scan key to do our search, so build one */
      itup_scankey = _bt_mkscankey(rel, itup);

- top:
      /* find the first page containing this key */
      stack = _bt_search(rel, natts, itup_scankey, false, &buf, BT_WRITE);

      offset = InvalidOffsetNumber;

!     /* trade in our read lock for a write lock */
      LockBuffer(buf, BUFFER_LOCK_UNLOCK);
      LockBuffer(buf, BT_WRITE);

      /*
       * If the page was split between the time that we surrendered our read
!      * lock and acquired our write lock, then this page may no longer be the
       * right place for the key we want to insert.  In this case, we need to
       * move right in the tree.    See Lehman and Yao for an excruciatingly
       * precise description.
--- 96,224 ----
      /* we need an insertion scan key to do our search, so build one */
      itup_scankey = _bt_mkscankey(rel, itup);

      /* find the first page containing this key */
      stack = _bt_search(rel, natts, itup_scankey, false, &buf, BT_WRITE);

      offset = InvalidOffsetNumber;

!     /* release our read lock. _bt_finishinsert will relock the page in
!      * exclusive mode.
!      */
      LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
+     _bt_finishinsert(rel, heapRel, index_is_unique, itup,
+                      itup_scankey, stack, buf);
+ }
+
+ /*
+  *    _bt_prepareinsert() -- Find the insert location for a new tuple
+  *
+  * Descends the tree and finds the location for a new index tuple.
+  * As a hint to the executor, returns the heap block number that the
+  * previous index tuple at that location points to. By inserting the
+  * heap tuple into that block, the heap stays better clustered than
+  * by inserting into a random block.
+  *
+  * The leaf page is pinned and a reference to it, among other information
+  * needed to finish the insert, is stored in opaquePtr.
+  */
+ BlockNumber
+ _bt_prepareinsert(Relation rel, IndexTuple itup, bool index_is_unique,
+                   Relation heapRel, BTInsertInfo *opaquePtr)
+ {
+     int            natts = rel->rd_rel->relnatts;
+     OffsetNumber offset;
+     Page        page;
+     BTPageOpaque opaque;
+
+     ScanKey        itup_scankey;
+     BTStack        stack;
+     Buffer        buf;
+     BlockNumber suggestion = InvalidBlockNumber;
+     BTInsertInfo insert_opaque;
+
+     /* we need an insertion scan key to do our search, so build one */
+     itup_scankey = _bt_mkscankey(rel, itup);
+
+     /* find the first page containing this key */
+     stack = _bt_search(rel, natts, itup_scankey, false, &buf, BT_READ);
+     if (!BufferIsValid(buf))
+     {
+         /* The index was completely empty. No suggestion then. */
+         *opaquePtr = NULL;
+         return InvalidBlockNumber;
+     }
+
+     page = BufferGetPage(buf);
+     opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+
+     /* Find the location in the page where the new index tuple would go. */
+
+     offset = _bt_binsrch(rel, buf, natts, itup_scankey, false);
+     if (offset > PageGetMaxOffsetNumber(page))
+     {
+         /* _bt_binsrch returned a pointer to end-of-page. It means that
+          * there were no equal items on the page, and the new item should
+          * be inserted as the last tuple of the page. There could be equal
+          * items on the next page, however.
+          *
+          * At the moment, we just ignore the potential equal items on the
+          * right, and pretend there aren't any. We could instead walk right
+          * to the next page to check that, but let's keep it simple for now.
+          */
+         offset = OffsetNumberPrev(offset);
+     }
+     if (offset < P_FIRSTDATAKEY(opaque))
+     {
+         /* We landed on an empty page. We could step left or right until
+          * we find some items, but let's keep it simple for now.
+          */
+     }
+     else
+     {
+         /* We're now positioned at the index tuple that we're interested in. */
+         ItemId iid = PageGetItemId(page, offset);
+         IndexTuple curitup = (IndexTuple) PageGetItem(page, iid);
+
+         suggestion = ItemPointerGetBlockNumber(&curitup->t_tid);
+     }
+
+     /* Release the read lock. _bt_finishinsert will later reacquire the
+      * buffer lock in exclusive mode. Keeping the buffer locked would also
+      * be deadlock-prone; we can't know what the caller is going to do, or
+      * which pages it will lock, before calling _bt_finishinsert.
+      */
+     LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
+     /* Return a struct with all information needed to finish this insert. */
+     insert_opaque = *opaquePtr = palloc(sizeof(struct BTInsertInfoData));
+     insert_opaque->rel = rel;
+     insert_opaque->heapRel = heapRel;
+     insert_opaque->index_is_unique = index_is_unique;
+     insert_opaque->itup = itup;
+     insert_opaque->itup_scankey = itup_scankey;
+     insert_opaque->stack = stack;
+     insert_opaque->buf = buf;
+
+     return suggestion;
+ }
+
+ /*
+  *    _bt_finishinsert() -- Finish an insert prepared with prepareinsert
+  */
+ void
+ _bt_finishinsert(Relation rel, Relation heapRel, bool index_is_unique,
+                  IndexTuple itup, ScanKey itup_scankey,
+                  BTStack stack, Buffer buf)
+ {
+     int            natts = rel->rd_rel->relnatts;
+     OffsetNumber offset = InvalidOffsetNumber;
+
      LockBuffer(buf, BT_WRITE);

+ top:
+
      /*
       * If the page was split between the time that we surrendered our read
!      * lock in _bt_prepareinsert or _bt_doinsert, and acquired our write
!      * lock, then this page may no longer be the
       * right place for the key we want to insert.  In this case, we need to
       * move right in the tree.    See Lehman and Yao for an excruciatingly
       * precise description.
***************
*** 146,151 ****
--- 256,269 ----
              XactLockTableWait(xwait);
              /* start over... */
              _bt_freestack(stack);
+
+             /* find the first page containing this key */
+             stack = _bt_search(rel, natts, itup_scankey, false, &buf, BT_WRITE);
+
+             /* trade in our read lock for a write lock */
+             LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+             LockBuffer(buf, BT_WRITE);
+
              goto top;
          }
      }
***************
*** 157,162 ****
--- 275,281 ----
      /* be tidy */
      _bt_freestack(stack);
      _bt_freeskey(itup_scankey);
+     pfree(itup);
  }

  /*
Index: src/backend/access/nbtree/nbtree.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/nbtree/nbtree.c,v
retrieving revision 1.154
diff -c -r1.154 nbtree.c
*** src/backend/access/nbtree/nbtree.c    5 Jan 2007 22:19:23 -0000    1.154
--- src/backend/access/nbtree/nbtree.c    19 May 2007 16:23:58 -0000
***************
*** 223,229 ****

      _bt_doinsert(rel, itup, checkUnique, heapRel);

!     pfree(itup);

      PG_RETURN_BOOL(true);
  }
--- 223,278 ----

      _bt_doinsert(rel, itup, checkUnique, heapRel);

!     PG_RETURN_BOOL(true);
! }
!
! /*
!  *    btprepareinsert() -- find the best place in the heap to put a new tuple.
!  *
!  *        This uses the same logic as btinsert to find the place where the index
!  *        tuple would go if this was a btinsert call.
!  */
! Datum
! btprepareinsert(PG_FUNCTION_ARGS)
! {
!     Relation    rel = (Relation) PG_GETARG_POINTER(0);
!     Datum       *values = (Datum *) PG_GETARG_POINTER(1);
!     bool       *isnull = (bool *) PG_GETARG_POINTER(2);
!     Relation    heapRel = (Relation) PG_GETARG_POINTER(3);
!     bool        checkUnique = PG_GETARG_BOOL(4);
!     void      **opaquePtr = (void **) PG_GETARG_POINTER(5);
!     IndexTuple    itup;
!     BlockNumber suggestion;
!
!     /* generate an index tuple */
!     itup = index_form_tuple(RelationGetDescr(rel), values, isnull);
!
!     suggestion = _bt_prepareinsert(rel, itup, checkUnique, heapRel,
!                                   (BTInsertInfo *) opaquePtr);
!
!     PG_RETURN_UINT32(suggestion);
! }
!
! /*
!  *    btfinishinsert() -- finish insert
!  */
! Datum
! btfinishinsert(PG_FUNCTION_ARGS)
! {
!     ItemPointer ht_ctid = (ItemPointer) PG_GETARG_POINTER(0);
!     BTInsertInfo opaque = (void *) PG_GETARG_POINTER(1);
!
!     opaque->itup->t_tid = *ht_ctid;
!
!     _bt_finishinsert(opaque->rel,
!                      opaque->heapRel,
!                      opaque->index_is_unique,
!                      opaque->itup,
!                      opaque->itup_scankey,
!                      opaque->stack,
!                      opaque->buf);
!
!     pfree(opaque);

      PG_RETURN_BOOL(true);
  }
Index: src/backend/commands/copy.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/commands/copy.c,v
retrieving revision 1.283
diff -c -r1.283 copy.c
*** src/backend/commands/copy.c    27 Apr 2007 22:05:46 -0000    1.283
--- src/backend/commands/copy.c    19 May 2007 17:14:53 -0000
***************
*** 2109,2115 ****
                  ExecConstraints(resultRelInfo, slot, estate);

              /* OK, store the tuple and create index entries for it */
!             heap_insert(cstate->rel, tuple, mycid, use_wal, use_fsm);

              if (resultRelInfo->ri_NumIndices > 0)
                  ExecInsertIndexTuples(slot, &(tuple->t_self), estate, false);
--- 2109,2116 ----
                  ExecConstraints(resultRelInfo, slot, estate);

              /* OK, store the tuple and create index entries for it */
!             heap_insert(cstate->rel, tuple, mycid, use_wal, use_fsm,
!                         InvalidBlockNumber);

              if (resultRelInfo->ri_NumIndices > 0)
                  ExecInsertIndexTuples(slot, &(tuple->t_self), estate, false);
Index: src/backend/executor/execMain.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/executor/execMain.c,v
retrieving revision 1.293
diff -c -r1.293 execMain.c
*** src/backend/executor/execMain.c    27 Apr 2007 22:05:47 -0000    1.293
--- src/backend/executor/execMain.c    19 May 2007 16:24:01 -0000
***************
*** 53,58 ****
--- 53,59 ----
  #include "utils/lsyscache.h"
  #include "utils/memutils.h"

+ bool cluster_inserts = true; /* GUC */

  typedef struct evalPlanQual
  {
***************
*** 869,876 ****
--- 870,879 ----
      resultRelInfo->ri_RangeTableIndex = resultRelationIndex;
      resultRelInfo->ri_RelationDesc = resultRelationDesc;
      resultRelInfo->ri_NumIndices = 0;
+     resultRelInfo->ri_ClusterIndex = -1;
      resultRelInfo->ri_IndexRelationDescs = NULL;
      resultRelInfo->ri_IndexRelationInfo = NULL;
+     resultRelInfo->ri_PreparedInsertOpaque = NULL;
      /* make a copy so as not to depend on relcache info not changing... */
      resultRelInfo->ri_TrigDesc = CopyTriggerDesc(resultRelationDesc->trigdesc);
      if (resultRelInfo->ri_TrigDesc)
***************
*** 1353,1358 ****
--- 1356,1362 ----
      ResultRelInfo *resultRelInfo;
      Relation    resultRelationDesc;
      Oid            newId;
+     BlockNumber suggestedBlock;

      /*
       * get the heap tuple out of the tuple table slot, making sure we have a
***************
*** 1401,1406 ****
--- 1405,1417 ----
      if (resultRelationDesc->rd_att->constr)
          ExecConstraints(resultRelInfo, slot, estate);

+     /* Ask the access method of the clustered index for the
+      * best place to put the new tuple */
+     if (cluster_inserts)
+         suggestedBlock = ExecPrepareIndexInsert(slot, estate);
+     else
+         suggestedBlock = InvalidBlockNumber;
+
      /*
       * insert the tuple
       *
***************
*** 1409,1415 ****
       */
      newId = heap_insert(resultRelationDesc, tuple,
                          estate->es_snapshot->curcid,
!                         true, true);

      IncrAppended();
      (estate->es_processed)++;
--- 1420,1426 ----
       */
      newId = heap_insert(resultRelationDesc, tuple,
                          estate->es_snapshot->curcid,
!                         true, true, suggestedBlock);

      IncrAppended();
      (estate->es_processed)++;
***************
*** 2600,2606 ****
                  tuple,
                  estate->es_snapshot->curcid,
                  estate->es_into_relation_use_wal,
!                 false);            /* never any point in using FSM */

      /* We know this is a newly created relation, so there are no indexes */

--- 2611,2618 ----
                  tuple,
                  estate->es_snapshot->curcid,
                  estate->es_into_relation_use_wal,
!                 false,             /* never any point in using FSM */
!                 InvalidBlockNumber);

      /* We know this is a newly created relation, so there are no indexes */

Index: src/backend/executor/execUtils.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/executor/execUtils.c,v
retrieving revision 1.147
diff -c -r1.147 execUtils.c
*** src/backend/executor/execUtils.c    27 Feb 2007 01:11:25 -0000    1.147
--- src/backend/executor/execUtils.c    19 May 2007 18:22:33 -0000
***************
*** 31,36 ****
--- 31,37 ----
   *        ExecOpenIndices            \
   *        ExecCloseIndices         | referenced by InitPlan, EndPlan,
   *        ExecInsertIndexTuples    /  ExecInsert, ExecUpdate
+  *        ExecPrepareIndexInsert    referenced by ExecInsert
   *
   *        RegisterExprContextCallback    Register function shutdown callback
   *        UnregisterExprContextCallback  Deregister function shutdown callback
***************
*** 902,907 ****
--- 903,909 ----
      IndexInfo **indexInfoArray;

      resultRelInfo->ri_NumIndices = 0;
+     resultRelInfo->ri_ClusterIndex = -1;

      /* fast path if no indexes */
      if (!RelationGetForm(resultRelation)->relhasindex)
***************
*** 941,946 ****
--- 943,953 ----
          /* extract index key information from the index's pg_index info */
          ii = BuildIndexInfo(indexDesc);

+         /* Remember which index is the clustered one.
+          * It's used to call the amprepareinsert method on inserts. */
+         if (indexDesc->rd_index->indisclustered)
+             resultRelInfo->ri_ClusterIndex = i;
+
          relationDescs[i] = indexDesc;
          indexInfoArray[i] = ii;
          i++;
***************
*** 1007,1012 ****
--- 1014,1021 ----
      ExprContext *econtext;
      Datum        values[INDEX_MAX_KEYS];
      bool        isnull[INDEX_MAX_KEYS];
+     int            clusterIndex;
+     bool        preparedInsert;

      /*
       * Get information from the result relation info structure.
***************
*** 1016,1021 ****
--- 1025,1049 ----
      relationDescs = resultRelInfo->ri_IndexRelationDescs;
      indexInfoArray = resultRelInfo->ri_IndexRelationInfo;
      heapRelation = resultRelInfo->ri_RelationDesc;
+     clusterIndex = resultRelInfo->ri_ClusterIndex;
+     preparedInsert = resultRelInfo->ri_PreparedInsertOpaque != NULL;
+
+     /*
+      * If the insert to the clustering index was already prepared,
+      * finish it.
+      */
+     if (preparedInsert)
+     {
+         index_finishinsert(relationDescs[clusterIndex],
+                            tupleid,
+                            resultRelInfo->ri_PreparedInsertOpaque);
+         resultRelInfo->ri_PreparedInsertOpaque = NULL;
+
+         /*
+          * keep track of index inserts for debugging
+          */
+         IncrIndexInserted();
+     }

      /*
       * We will use the EState's per-tuple context for evaluating predicates
***************
*** 1036,1041 ****
--- 1064,1072 ----
          if (relationDescs[i] == NULL)
              continue;

+         if (preparedInsert && i == clusterIndex)
+             continue; /* insert to clustered index was already handled above */
+
          indexInfo = indexInfoArray[i];

          /* Check for partial index */
***************
*** 1090,1095 ****
--- 1121,1196 ----
      }
  }

+ /* ----------------------------------------------------------------
+  *        ExecPrepareIndexInsert
+  *
+  *        This routine asks the index am where a new heap tuple
+  *        should be placed.
+  * ----------------------------------------------------------------
+  */
+ BlockNumber
+ ExecPrepareIndexInsert(TupleTableSlot *slot,
+                        EState *estate)
+ {
+     ResultRelInfo *resultRelInfo;
+     int            clusterIndex;
+     Relation    relationDesc;
+     Relation    heapRelation;
+     ExprContext *econtext;
+     Datum        values[INDEX_MAX_KEYS];
+     bool        isnull[INDEX_MAX_KEYS];
+     IndexInfo  *indexInfo;
+
+     /*
+      * Get information from the result relation info structure.
+      */
+     resultRelInfo = estate->es_result_relation_info;
+     clusterIndex = resultRelInfo->ri_ClusterIndex;
+
+     if (clusterIndex == -1)
+         return InvalidBlockNumber; /* there was no clustered index */
+
+     heapRelation = resultRelInfo->ri_RelationDesc;
+     relationDesc = resultRelInfo->ri_IndexRelationDescs[clusterIndex];
+     indexInfo = resultRelInfo->ri_IndexRelationInfo[clusterIndex];
+
+     if (!OidIsValid(relationDesc->rd_am->amprepareinsert))
+         return InvalidBlockNumber; /* the indexam doesn't support the
+                                     * two-phase insert API */
+
+     /* You can't cluster on a partial index */
+     Assert(indexInfo->ii_Predicate == NIL);
+
+     /*
+      * We will use the EState's per-tuple context for evaluating
+      * index expressions (creating it if it's not already there).
+      */
+     econtext = GetPerTupleExprContext(estate);
+
+     /* Arrange for econtext's scan tuple to be the tuple under test */
+     econtext->ecxt_scantuple = slot;
+
+     /*
+      * FormIndexDatum fills in its values and isnull parameters with the
+      * appropriate values for the column(s) of the index.
+      */
+     FormIndexDatum(indexInfo,
+                    slot,
+                    estate,
+                    values,
+                    isnull);
+
+     /*
+      * The index AM does the rest.
+      */
+     return index_prepareinsert(relationDesc,    /* index relation */
+                  values,    /* array of index Datums */
+                  isnull,    /* null flags */
+                  heapRelation,
+                  relationDesc->rd_index->indisunique,
+                  &resultRelInfo->ri_PreparedInsertOpaque);
+ }
+
  /*
   * UpdateChangedParamSet
   *        Add changed parameters to a plan node's chgParam set
Index: src/backend/utils/misc/guc.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/utils/misc/guc.c,v
retrieving revision 1.391
diff -c -r1.391 guc.c
*** src/backend/utils/misc/guc.c    8 May 2007 16:33:51 -0000    1.391
--- src/backend/utils/misc/guc.c    19 May 2007 16:24:17 -0000
***************
*** 99,104 ****
--- 99,105 ----
  #define MS_PER_D (1000 * 60 * 60 * 24)

  /* XXX these should appear in other modules' header files */
+ extern bool cluster_inserts;
  extern bool Log_disconnections;
  extern int    CommitDelay;
  extern int    CommitSiblings;
***************
*** 427,432 ****
--- 428,441 ----
  static struct config_bool ConfigureNamesBool[] =
  {
      {
+         {"cluster_inserts", PGC_USERSET, DEVELOPER_OPTIONS,
+             gettext_noop("Tries to maintain cluster order on inserts."),
+             NULL
+         },
+         &cluster_inserts,
+         true, NULL, NULL
+     },
+     {
          {"enable_seqscan", PGC_USERSET, QUERY_TUNING_METHOD,
              gettext_noop("Enables the planner's use of sequential-scan plans."),
              NULL
Index: src/include/access/genam.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/access/genam.h,v
retrieving revision 1.66
diff -c -r1.66 genam.h
*** src/include/access/genam.h    5 Jan 2007 22:19:50 -0000    1.66
--- src/include/access/genam.h    19 May 2007 16:24:26 -0000
***************
*** 93,98 ****
--- 93,106 ----
               ItemPointer heap_t_ctid,
               Relation heapRelation,
               bool check_uniqueness);
+ extern BlockNumber index_prepareinsert(Relation indexRelation,
+              Datum *values, bool *isnull,
+              Relation heapRelation,
+              bool check_uniqueness,
+              void **opaque);
+ extern bool index_finishinsert(Relation indexRelation,
+              ItemPointer heap_t_ctid,
+              void *opaque);

  extern IndexScanDesc index_beginscan(Relation heapRelation,
                  Relation indexRelation,
Index: src/include/access/heapam.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/access/heapam.h,v
retrieving revision 1.123
diff -c -r1.123 heapam.h
*** src/include/access/heapam.h    8 Apr 2007 01:26:33 -0000    1.123
--- src/include/access/heapam.h    19 May 2007 16:24:26 -0000
***************
*** 157,163 ****
  extern void setLastTid(const ItemPointer tid);

  extern Oid heap_insert(Relation relation, HeapTuple tup, CommandId cid,
!             bool use_wal, bool use_fsm);
  extern HTSU_Result heap_delete(Relation relation, ItemPointer tid,
              ItemPointer ctid, TransactionId *update_xmax,
              CommandId cid, Snapshot crosscheck, bool wait);
--- 157,163 ----
  extern void setLastTid(const ItemPointer tid);

  extern Oid heap_insert(Relation relation, HeapTuple tup, CommandId cid,
!             bool use_wal, bool use_fsm, BlockNumber suggestedblk);
  extern HTSU_Result heap_delete(Relation relation, ItemPointer tid,
              ItemPointer ctid, TransactionId *update_xmax,
              CommandId cid, Snapshot crosscheck, bool wait);
Index: src/include/access/nbtree.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/access/nbtree.h,v
retrieving revision 1.113
diff -c -r1.113 nbtree.h
*** src/include/access/nbtree.h    11 Apr 2007 20:47:38 -0000    1.113
--- src/include/access/nbtree.h    19 May 2007 16:24:26 -0000
***************
*** 508,517 ****
--- 508,540 ----
  extern Datum btbulkdelete(PG_FUNCTION_ARGS);
  extern Datum btvacuumcleanup(PG_FUNCTION_ARGS);
  extern Datum btoptions(PG_FUNCTION_ARGS);
+ extern Datum btprepareinsert(PG_FUNCTION_ARGS);
+ extern Datum btfinishinsert(PG_FUNCTION_ARGS);
+
+ /* Filled in by _bt_prepareinsert */
+ typedef struct BTInsertInfoData
+ {
+     Relation rel;
+     Relation heapRel;
+     bool index_is_unique;
+     IndexTuple itup;
+     ScanKey itup_scankey;
+     Buffer buf; /* pinned, not locked */
+     BTStack stack;
+ } BTInsertInfoData;
+
+ typedef BTInsertInfoData *BTInsertInfo;

  /*
   * prototypes for functions in nbtinsert.c
   */
+ extern BlockNumber _bt_prepareinsert(Relation rel, IndexTuple itup,
+                                      bool index_is_unique, Relation heapRel,
+                                      BTInsertInfo *opaquePtr);
+ extern void _bt_finishinsert(Relation rel, Relation heapRel,
+                              bool check_uniqueness,
+                              IndexTuple itup, ScanKey itup_scankey,
+                              BTStack stack, Buffer buf);
  extern void _bt_doinsert(Relation rel, IndexTuple itup,
               bool index_is_unique, Relation heapRel);
  extern Buffer _bt_getstackbuf(Relation rel, BTStack stack, int access);
Index: src/include/catalog/pg_am.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/catalog/pg_am.h,v
retrieving revision 1.51
diff -c -r1.51 pg_am.h
*** src/include/catalog/pg_am.h    6 Apr 2007 22:33:43 -0000    1.51
--- src/include/catalog/pg_am.h    19 May 2007 16:42:48 -0000
***************
*** 66,71 ****
--- 66,73 ----
      regproc        amvacuumcleanup;    /* post-VACUUM cleanup function */
      regproc        amcostestimate; /* estimate cost of an indexscan */
      regproc        amoptions;        /* parse AM-specific parameters */
+     regproc        amprepareinsert;    /* get desired insert location on heap */
+     regproc        amfinishinsert;    /* finish a prepared insert operation */
  } FormData_pg_am;

  /* ----------------
***************
*** 79,85 ****
   *        compiler constants for pg_am
   * ----------------
   */
! #define Natts_pg_am                        24
  #define Anum_pg_am_amname                1
  #define Anum_pg_am_amstrategies            2
  #define Anum_pg_am_amsupport            3
--- 81,87 ----
   *        compiler constants for pg_am
   * ----------------
   */
! #define Natts_pg_am                        26
  #define Anum_pg_am_amname                1
  #define Anum_pg_am_amstrategies            2
  #define Anum_pg_am_amsupport            3
***************
*** 104,125 ****
  #define Anum_pg_am_amvacuumcleanup        22
  #define Anum_pg_am_amcostestimate        23
  #define Anum_pg_am_amoptions            24

  /* ----------------
   *        initial contents of pg_am
   * ----------------
   */

! DATA(insert OID = 403 (  btree    5 1 t t t t t t f t btinsert btbeginscan btgettuple btgetmulti btrescan btendscan btmarkpos btrestrpos btbuild btbulkdelete btvacuumcleanup btcostestimate btoptions ));
  DESCR("b-tree index access method");
  #define BTREE_AM_OID 403
! DATA(insert OID = 405 (  hash    1 1 f f f f f f f f hashinsert hashbeginscan hashgettuple hashgetmulti hashrescan hashendscan hashmarkpos hashrestrpos hashbuild hashbulkdelete hashvacuumcleanup hashcostestimate hashoptions ));
  DESCR("hash index access method");
  #define HASH_AM_OID 405
! DATA(insert OID = 783 (  gist    0 7 f f t t t t t t gistinsert gistbeginscan gistgettuple gistgetmulti gistrescan gistendscan gistmarkpos gistrestrpos gistbuild gistbulkdelete gistvacuumcleanup gistcostestimate gistoptions ));
  DESCR("GiST index access method");
  #define GIST_AM_OID 783
! DATA(insert OID = 2742 (  gin    0 4 f f f f f f t f gininsert ginbeginscan gingettuple gingetmulti ginrescan ginendscan ginmarkpos ginrestrpos ginbuild ginbulkdelete ginvacuumcleanup gincostestimate ginoptions ));
  DESCR("GIN index access method");
  #define GIN_AM_OID 2742

--- 106,129 ----
  #define Anum_pg_am_amvacuumcleanup        22
  #define Anum_pg_am_amcostestimate        23
  #define Anum_pg_am_amoptions            24
+ #define Anum_pg_am_amprepareinsert        25
+ #define Anum_pg_am_amfinishinsert        26

  /* ----------------
   *        initial contents of pg_am
   * ----------------
   */

! DATA(insert OID = 403 (  btree    5 1 t t t t t t f t btinsert btbeginscan btgettuple btgetmulti btrescan btendscan btmarkpos btrestrpos btbuild btbulkdelete btvacuumcleanup btcostestimate btoptions btprepareinsert btfinishinsert ));
  DESCR("b-tree index access method");
  #define BTREE_AM_OID 403
! DATA(insert OID = 405 (  hash    1 1 f f f f f f f f hashinsert hashbeginscan hashgettuple hashgetmulti hashrescan hashendscan hashmarkpos hashrestrpos hashbuild hashbulkdelete hashvacuumcleanup hashcostestimate hashoptions - - ));
  DESCR("hash index access method");
  #define HASH_AM_OID 405
! DATA(insert OID = 783 (  gist    0 7 f f t t t t t t gistinsert gistbeginscan gistgettuple gistgetmulti gistrescan gistendscan gistmarkpos gistrestrpos gistbuild gistbulkdelete gistvacuumcleanup gistcostestimate gistoptions - - ));
  DESCR("GiST index access method");
  #define GIST_AM_OID 783
! DATA(insert OID = 2742 (  gin    0 4 f f f f f f t f gininsert ginbeginscan gingettuple gingetmulti ginrescan ginendscan ginmarkpos ginrestrpos ginbuild ginbulkdelete ginvacuumcleanup gincostestimate ginoptions - - ));
  DESCR("GIN index access method");
  #define GIN_AM_OID 2742

Index: src/include/catalog/pg_proc.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/catalog/pg_proc.h,v
retrieving revision 1.455
diff -c -r1.455 pg_proc.h
*** src/include/catalog/pg_proc.h    8 May 2007 18:56:48 -0000    1.455
--- src/include/catalog/pg_proc.h    19 May 2007 17:20:23 -0000
***************
*** 688,693 ****
--- 688,697 ----
  DESCR("btree(internal)");
DATA(insert OID = 2785 (  btoptions           PGNSP PGUID 12 1 0 f f t f s 2 17 "1009 16" _null_ _null_ _null_ btoptions - _null_ ));
  DESCR("btree(internal)");
+ DATA(insert OID = 5433 (  btprepareinsert   PGNSP PGUID 12 1 0 f f t f v 6 23 "2281 2281 2281 2281 2281 2281" _null_ _null_ _null_ btprepareinsert - _null_ ));
+ DESCR("btree(internal)");
+ DATA(insert OID = 5430 (  btfinishinsert   PGNSP PGUID 12 1 0 f f t f v 2 16 "2281 2281" _null_ _null_ _null_ btfinishinsert - _null_ ));
+ DESCR("btree(internal)");

DATA(insert OID = 339 (  poly_same           PGNSP PGUID 12 1 0 f f t f i 2 16 "604 604" _null_ _null_ _null_ poly_same - _null_ ));
  DESCR("same as?");
Index: src/include/executor/executor.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/executor/executor.h,v
retrieving revision 1.139
diff -c -r1.139 executor.h
*** src/include/executor/executor.h    27 Feb 2007 01:11:25 -0000    1.139
--- src/include/executor/executor.h    19 May 2007 16:24:27 -0000
***************
*** 276,281 ****
--- 276,282 ----
  extern void ExecCloseIndices(ResultRelInfo *resultRelInfo);
  extern void ExecInsertIndexTuples(TupleTableSlot *slot, ItemPointer tupleid,
                        EState *estate, bool is_vacuum);
+ extern BlockNumber ExecPrepareIndexInsert(TupleTableSlot *slot, EState *estate);

  extern void RegisterExprContextCallback(ExprContext *econtext,
                              ExprContextCallbackFunction function,
Index: src/include/nodes/execnodes.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/nodes/execnodes.h,v
retrieving revision 1.174
diff -c -r1.174 execnodes.h
*** src/include/nodes/execnodes.h    17 May 2007 19:35:08 -0000    1.174
--- src/include/nodes/execnodes.h    19 May 2007 16:24:27 -0000
***************
*** 264,269 ****
--- 264,271 ----
   *        NumIndices                # of indices existing on result relation
   *        IndexRelationDescs        array of relation descriptors for indices
   *        IndexRelationInfo        array of key/attr info for indices
+  *        ClusterIndex            index into the IndexRelationInfo array of the
+  *                                clustered index, or -1 if there's none
   *        TrigDesc                triggers to be fired, if any
   *        TrigFunctions            cached lookup info for trigger functions
   *        TrigInstrument            optional runtime measurements for triggers
***************
*** 280,291 ****
--- 282,296 ----
      int            ri_NumIndices;
      RelationPtr ri_IndexRelationDescs;
      IndexInfo **ri_IndexRelationInfo;
+     int         ri_ClusterIndex;
      TriggerDesc *ri_TrigDesc;
      FmgrInfo   *ri_TrigFunctions;
      struct Instrumentation *ri_TrigInstrument;
      List      **ri_ConstraintExprs;
      JunkFilter *ri_junkFilter;
      ProjectionInfo *ri_projectReturning;
+
+     void        *ri_PreparedInsertOpaque;
  } ResultRelInfo;

  /* ----------------
Index: src/include/utils/rel.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/utils/rel.h,v
retrieving revision 1.100
diff -c -r1.100 rel.h
*** src/include/utils/rel.h    29 Mar 2007 00:15:39 -0000    1.100
--- src/include/utils/rel.h    19 May 2007 16:24:29 -0000
***************
*** 117,122 ****
--- 117,124 ----
      FmgrInfo    amvacuumcleanup;
      FmgrInfo    amcostestimate;
      FmgrInfo    amoptions;
+     FmgrInfo    amprepareinsert;
+     FmgrInfo    amfinishinsert;
  } RelationAmInfo;



Re: Maintaining cluster order on insert

From
Bruce Momjian
Date:
Your patch has been added to the PostgreSQL unapplied patches list at:

    http://momjian.postgresql.org/cgi-bin/pgpatches

It will be applied as soon as one of the PostgreSQL committers reviews
and approves it.

---------------------------------------------------------------------------


Heikki Linnakangas wrote:
> Jaime Casanova wrote:
> > On 5/18/07, Heikki Linnakangas <heikki@enterprisedb.com> wrote:
> >> Jaime Casanova wrote:
> >> >
> >> > the patch doesn't apply in cvs... you'll need to update it...
> >>
> >> Oh, here you are.
> >>
> >> The implementation has changed a bit since August. I thought I had
> >> submitted an updated version in the winter but couldn't find it. Anyway,
> >> I updated and dusted off the source tree, tidied up the comments a
> >> little bit, and fixed some inconsistencies in pg_proc entries that made
> >> opr_sanity to fail.
> >>
> >
> > this one doesn't apply either... there are problems with nbtinsert.c and
> > pg_am.h
>
> Ah, sorry about that. For some reason my source tree was checked out
> from the 8.2 branch, instead of CVS HEAD.
>
> Here you are. Thanks for looking at this!
>
> --
>    Heikki Linnakangas
>    EnterpriseDB   http://www.enterprisedb.com



--
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: Maintaining cluster order on insert

From
"Pavan Deolasee"
Date:


On 5/19/07, Heikki Linnakangas <heikki@enterprisedb.com> wrote:
> Ah, sorry about that. For some reason my source tree was checked out
> from the 8.2 branch, instead of CVS HEAD.

I looked at the patch. Not that I am very comfortable with this part
of the code, but nevertheless here are my comments:


I am wondering if we can optimize _bt_moveright() code by remembering
the PageLSN of the page before releasing the READ LOCK. If the page
is indeed changed since we last saw it, the PageLSN must be different
than what we had remembered and gives us a hint that we might need
to move right.
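
The idea above can be sketched outside the backend. This is a toy model, not PostgreSQL's actual buffer or nbtree code; the `Page` struct and function names here are invented for illustration. The point is only that an unchanged LSN proves the page was not modified between visits, so the move-right check can be skipped:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model of a page carrying an LSN that is bumped on every change. */
typedef struct Page
{
    uint64_t    lsn;
} Page;

static void
page_modify(Page *p)
{
    p->lsn++;                   /* any WAL-logged change advances the LSN */
}

/*
 * Remember the LSN observed while holding the read lock.  On revisit,
 * an unchanged LSN means nothing was inserted or split in the meantime,
 * so we need not consider moving right; a changed LSN is the hint that
 * we might have to.
 */
static bool
must_recheck_moveright(const Page *p, uint64_t remembered_lsn)
{
    return p->lsn != remembered_lsn;
}
```

Usage would be: stash `page->lsn` before releasing the read lock, and compare on the next visit.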


!           /* Page was full. Release lock and pin and get another block
!            * as if suggested_blk was not given.
!            */
!           LockBuffer(suggested_buf, BUFFER_LOCK_UNLOCK);
!           ReleaseBuffer(suggested_buf);

Maybe you want to use UnlockReleaseBuffer() or _bt_relbuf() just as a shorthand.
Btw, I wonder why _bt_relbuf() exists, and even so, why does it take a "Relation"
argument?

Thanks,
Pavan

--

EnterpriseDB     http://www.enterprisedb.com

Re: Maintaining cluster order on insert

From
"Pavan Deolasee"
Date:


On 5/21/07, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:


On 5/19/07, Heikki Linnakangas <heikki@enterprisedb.com> wrote:

Ah, sorry about that. For some reason my source tree was checked out
from the 8.2 branch, instead of CVS HEAD.


I looked at the patch. Not that I am very comfortable with this part
of the code, but nevertheless here are my comments:


Another problem that I noticed with the patch is that it disregards
the fillfactor while inserting the tuples. ISTM that this is a correct
behavior when a tuple is being inserted in the block to preserve cluster
ordering. But when the tuple is being inserted as a normal operation,
IMO we should respect the fillfactor.

The easiest way to reproduce this is to insert tuples sorted on the
cluster index key.

Thanks,
Pavan


--

EnterpriseDB     http://www.enterprisedb.com

Re: Maintaining cluster order on insert

From
Heikki Linnakangas
Date:
Pavan Deolasee wrote:
> On 5/21/07, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
>>
>>
>>
>> On 5/19/07, Heikki Linnakangas <heikki@enterprisedb.com > wrote:
>> >
>> >
>> > Ah, sorry about that. For some reason my source tree was checked out
>> > from the 8.2 branch, instead of CVS HEAD.
>> >
>> >
>> I looked at the patch. Not that I am very comfortable with this part
>> of the code, but nevertheless here are my comments:
>>
>>
> Another problem that I noticed with the patch is that it disregards
> the fillfactor while inserting the tuples. ISTM that this is a correct
> behavior when a tuple is being inserted in the block to preserve cluster
> ordering. But when the tuple is being inserted as a normal operation,
> IMO we should respect the fillfactor.
>
> The easiest way to reproduce this is to insert tuples sorted on the
> cluster index key.

Actually it does respect the fillfactor when the block suggested by
indexam is full. When you insert tuples sorted on the cluster index key,
the suggested page after the first tuple on the new page is always the
last page. Try inserting in random order instead.

IOW it's working as designed. But maybe it's not the desired behavior.
Should we have a special case and always respect the fillfactor when
inserting to the last page of the heap?
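
The decision being debated can be sketched as follows. This is a toy model with invented names, not the real logic in RelationGetBufferForTuple: a normal insert reserves the (100 - fillfactor)% of the page that fillfactor sets aside, while an insert aimed at the page the index am suggested only requires that the tuple physically fits. The proposed special case treats the last heap page like a normal insert even when it was the suggested page:

```c
#include <assert.h>
#include <stdbool.h>

#define BLCKSZ 8192             /* default PostgreSQL block size */

/*
 * Toy free-space decision.  is_suggested: the block came from the index
 * am's suggestion; is_last_page: the block is the heap's last page.
 */
static bool
page_has_room(int free_bytes, int tuple_len, int fillfactor,
              bool is_suggested, bool is_last_page)
{
    /*
     * Cluster-preserving insert to a suggested page: ignore fillfactor,
     * just check that the tuple fits -- except on the last page, where
     * the proposed special case falls through to the normal rule.
     */
    if (is_suggested && !is_last_page)
        return free_bytes >= tuple_len;

    /* Normal insert: leave the fillfactor-reserved fraction untouched. */
    int         reserved = BLCKSZ * (100 - fillfactor) / 100;

    return free_bytes - tuple_len >= reserved;
}
```

With fillfactor 90, a page with 1000 free bytes accepts a 400-byte tuple as a suggested target but rejects it as a normal (or last-page) target, since 819 bytes stay reserved.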

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Maintaining cluster order on insert

From
"Jim C. Nasby"
Date:
On Mon, May 21, 2007 at 10:48:59AM +0100, Heikki Linnakangas wrote:
> >Another problem that I noticed with the patch is that it disregards
> >the fillfactor while inserting the tuples. ISTM that this is a correct
> >behavior when a tuple is being inserted in the block to preserve cluster
> >ordering. But when the tuple is being inserted as a normal operation,
> >IMO we should respect the fillfactor.
> >
> >The easiest way to reproduce this is to insert tuples sorted on the
> >cluster index key.
>
> Actually it does respect the fillfactor when the block suggested by
> indexam is full. When you insert tuples sorted on the cluster index key,
> the suggested page after the first tuple on the new page is always the
> last page. Try inserting in random order instead.
>
> IOW it's working as designed. But maybe it's not the desired behavior.
> Should we have a special case and always respect the fillfactor when
> inserting to the last page of the heap?

I think that would be following with least surprise.
--
Jim Nasby                                      decibel@decibel.org
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)


Re: Maintaining cluster order on insert

From
"Jaime Casanova"
Date:
On 5/27/07, Jim C. Nasby <decibel@decibel.org> wrote:
> On Mon, May 21, 2007 at 10:48:59AM +0100, Heikki Linnakangas wrote:
> >
> > IOW it's working as designed. But maybe it's not the desired behavior.
> > Should we have a special case and always respect the fillfactor when
> > inserting to the last page of the heap?
>
> I think that would be following with least surprise.

What's the status of this patch? Are we waiting for an update?

AFAIU, it's not fair to say that the patch maintains cluster order...
it just tries to keep similar rows on the same page if possible (it's
not the same thing)... if it can't, then it simply inserts at the last
page as usual, but we have wasted time in the attempt...

so the real question is if there is any performance win on this...
have you some numbers?

another question: if the fillfactor is 100%, then it is a complete waste
of time to look for a suggested block. maybe we could check for that?

--
regards,
Jaime Casanova

"Programming today is a race between software engineers striving to
build bigger and better idiot-proof programs and the universe trying
to produce bigger and better idiots.
So far, the universe is winning."
                                       Richard Cook

Re: Maintaining cluster order on insert

From
Tom Lane
Date:
"Jaime Casanova" <systemguards@gmail.com> writes:
> another question: if the fillfactor is 100% then is a complete waste
> of time to look for a suggested block. maybe we could check for that?

No, it isn't, since the page might have been vacuumed since it was last
filled up.

            regards, tom lane

Re: Maintaining cluster order on insert

From
"Jaime Casanova"
Date:
On 6/16/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Jaime Casanova" <systemguards@gmail.com> writes:
> > another question: if the fillfactor is 100% then is a complete waste
> > of time to look for a suggested block. maybe we could check for that?
>
> No, it isn't, since the page might have been vacuumed since it was last
> filled up.
>
>                         regards, tom lane
>

ahh... that vacuum thing!!! yeah!!! ;)

--
regards,
Jaime Casanova

"Programming today is a race between software engineers striving to
build bigger and better idiot-proof programs and the universe trying
to produce bigger and better idiots.
So far, the universe is winning."
                                       Richard Cook

Re: Maintaining cluster order on insert

From
Heikki Linnakangas
Date:
Jaime Casanova wrote:
> On 5/27/07, Jim C. Nasby <decibel@decibel.org> wrote:
>> On Mon, May 21, 2007 at 10:48:59AM +0100, Heikki Linnakangas wrote:
>> >
>> > IOW it's working as designed. But maybe it's not the desired behavior.
>> > Should we have a special case and always respect the fillfactor when
>> > inserting to the last page of the heap?
>>
>> I think that would be following with least surprise.
>
> What's the status of this patch? are we waiting an update?
>
> AFAIU, it's not fair to say that the patch maintain cluster order...
> it just try to keep similar rows on the same page if possible (it's
> not the same thing)... if it can't then it simply insert at the last
> page as usual but we have wasted time in the try...

That's right. Or more accurately, if there's no room on the same page,
it uses the FSM and if it still can't find room, it inserts at the last
page.
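
That fallback chain can be sketched as a toy function. The names and the `free_space[]` array standing in for the free space map are invented for illustration; the real code path is RelationGetBufferForTuple:

```c
#include <assert.h>

/*
 * Toy model of the insert-target fallback described above:
 *   1. try the block the index am suggested, if it has room;
 *   2. else ask the FSM (modelled here as a linear scan of free_space[])
 *      for any block with enough room;
 *   3. else extend the relation with a fresh page at the end.
 */
static int
choose_insert_block(const int *free_space, int nblocks,
                    int suggested, int needed)
{
    /* 1. the suggested page, if valid and roomy */
    if (suggested >= 0 && suggested < nblocks &&
        free_space[suggested] >= needed)
        return suggested;

    /* 2. any page the FSM knows has room */
    for (int blk = 0; blk < nblocks; blk++)
    {
        if (free_space[blk] >= needed)
            return blk;
    }

    /* 3. give up and append: the new last page */
    return nblocks;
}
```

Note that steps 1 and 2 each cost a page read if the page turns out full, which is the "wasted time in the attempt" concern raised above.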

> so the real question is if there is any performance win on this...
> have you some numbers?

Hmm. I ran a small test with random inserts/deletes back in August. The
conclusion was that the table loses clustering a lot slower, but I can't
find the results now.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Maintaining cluster order on insert

From
Tom Lane
Date:
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> The implementation has changed a bit since August. I thought I had
> submitted an updated version in the winter but couldn't find it. Anyway,
> I updated and dusted off the source tree, tidied up the comments a
> little bit, and fixed some inconsistencies in pg_proc entries that made
> opr_sanity to fail.

I started looking at this patch.  My first reaction is that I liked last
August's API (an independent "suggestblock" call) a lot better.  I think
trying to hold an index page lock across the heap insert is an extremely
bad idea; it will hurt concurrency and possibly cause deadlocks
(especially for unique indexes).

The other question is why is execMain involved in this?  That makes the
design nonfunctional for tuples inserted in any other way than through
the main executor --- COPY for instance.  Also, if this is successful,
I could see using it on system catalogs eventually.  I'm inclined to
think that the right design is for heap_insert to call amsuggestblock
for itself, or maybe even push that down to RelationGetBufferForTuple.
(Note: having heap_insert contain logic that duplicates
RelationGetBufferForTuple's is another bit of bad design here, but
that's at least correctable locally.)  Now the difficulty in that is
that the heapam.c routines don't have hold of any data structure
containing index knowledge ... but they do have hold of the Relation
structure for the heap.  I suggest making RelationGetIndexList() cache
the OID of the clustered index (if any) in relcache entries, and add
RelationGetClusterIndex much like RelationGetOidIndex, and then
heap_insert can use that.

            regards, tom lane

Re: Maintaining cluster order on insert

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> Heikki Linnakangas <heikki@enterprisedb.com> writes:
>> The implementation has changed a bit since August. I thought I had
>> submitted an updated version in the winter but couldn't find it. Anyway,
>> I updated and dusted off the source tree, tidied up the comments a
>> little bit, and fixed some inconsistencies in pg_proc entries that made
>> opr_sanity to fail.
>
> I started looking at this patch.  My first reaction is that I liked last
> August's API (an independent "suggestblock" call) a lot better.  I think
> trying to hold an index page lock across the heap insert is an extremely
> bad idea; it will hurt concurrency and possibly cause deadlocks
> (especially for unique indexes).

The index page is not kept locked across the calls. Just pinned.

The reason for switching to the new API instead of the amsuggestblock
API is CPU overhead. It avoids constructing the IndexTuple twice and
descending the tree twice.

Clustering is mainly useful when the table doesn't fit in cache, so one
could argue that if you care about clustering you're most likely I/O
bound and don't care about the CPU overhead that much. Nevertheless,
avoiding it seems like a good idea to me.

The amsuggestblock API is simpler, though. We might choose it on those
grounds alone.

> The other question is why is execMain involved in this?  That makes the
> design nonfunctional for tuples inserted in any other way than through
> the main executor --- COPY for instance.  Also, if this is successful,
> I could see using it on system catalogs eventually.  I'm inclined to
> think that the right design is for heap_insert to call amsuggestblock
> for itself, or maybe even push that down to RelationGetBufferForTuple.
> (Note: having heap_insert contain logic that duplicates
> RelationGetBufferForTuple's is another bit of bad design here, but
> that's at least correctable locally.)  Now the difficulty in that is
> that the heapam.c routines don't have hold of any data structure
> containing index knowledge ... but they do have hold of the Relation
> structure for the heap.  I suggest making RelationGetIndexList() cache
> the OID of the clustered index (if any) in relcache entries, and add
> RelationGetClusterIndex much like RelationGetOidIndex, and then
> heap_insert can use that.

Hmm. My first reaction to that is that having heap_insert reach into an
index is a modularity violation. It wouldn't be too hard to make changes
to COPY similar to what I did in the executor.

I doubt it's very useful for system catalogs; they should never grow
very large, and should stay mostly cached anyway.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Maintaining cluster order on insert

From
Gregory Stark
Date:
"Heikki Linnakangas" <heikki@enterprisedb.com> writes:

> The reason for switching to the new API instead of the amsuggestblock API is
> CPU overhead. It avoids constructing the IndexTuple twice and descending the
> tree twice.

I wonder if it's possible to finesse this. Have the suggestblock function
remember the last block it suggested and somehow recognize when it's given the
same tuple to index. Keeping the block pinned would still be the sticky point
though.


--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com


Re: Maintaining cluster order on insert

From
Tom Lane
Date:
[ back to this thread... ]

Heikki Linnakangas <heikki@enterprisedb.com> writes:
> Tom Lane wrote:
>> The other question is why is execMain involved in this?  That makes the
>> design nonfunctional for tuples inserted in any other way than through
>> the main executor --- COPY for instance.  Also, if this is successful,
>> I could see using it on system catalogs eventually.  I'm inclined to
>> think that the right design is for heap_insert to call amsuggestblock
>> for itself, or maybe even push that down to RelationGetBufferForTuple.

> Hmm. My first reaction on that is that having heap_insert reach into an
> index is a modularity violation. It wouldn't be too hard to make similar
> changes to COPY that I did to the executor.

Well, this entire patch is a modularity violation by definition.
Bringing the executor into it just makes the damage worse, by involving
relatively high-level code in the communication between two relatively
low-level bits of code, and ensuring that we'll have to involve still
other bits of high-level code in future if we want to make the behavior
work in more scenarios.

There is a problem with my suggestion of relying on the relcache:
that would only give you the index relation's OID, not a handle to an
open Relation for it.  I'm not sure this is a killer, however, as the
cost of an index_open/index_close cycle isn't really much more than a
dynahash search if the index is already open and locked by some upper
code level.

            regards, tom lane

Re: Maintaining cluster order on insert

From
Tom Lane
Date:
[ back to the cluster-order patch ]

Awhile back, Heikki Linnakangas <heikki@enterprisedb.com> wrote:
> The performance characteristics of this patch hasn't been thoroughly
> discussed yet. The reason why you want to cluster your tables is to
> speed up SELECTs that return a bunch of tuples with similar values, for
> example range queries. The reason for keeping them clustered on inserts
> is to reduce the need to run CLUSTER as often.

> It doesn't come without a cost, however. In the worst case, there never
> is room for new inserts on pages, and each insert needs to do one extra
> I/O to fetch the optimal heap page where the insert should go, see that
> there's no room, and then insert somewhere else. Using a non-zero
> fillfactor helps, but even when there is room on the page, it's often
> cheaper to just append to the end of the table and running CLUSTER at
> night for example, than do random access to insert to the "right" pages
> in the heap.

I looked through the thread and could not find any clear evidence that
anyone had done any performance testing of this patch at all, so I
hacked together a crude test that alternates between inserting/deleting
a random subset of rows and SELECTing a range of rows.  The results
do not look very good: while there is detectable improvement in the
SELECT performance, it's not much, and it comes at a *very* sizable
penalty in INSERT performance.

Attached is a test program, which I ran against today's CVS HEAD with
and without the v8 version of the patch.  The test program sets up
a clustered million-row table with a row width chosen to fit about
20 rows per page.  It then iterates a loop of:

    * delete a random 2% of the table
    * vacuum to recover space
    * insert a random 2% of the table
    * select (about) 1000 consecutively-numbered rows
    * select all the rows (this is just a cross check that the
      number of rows isn't changing too much)

What you would hope to see as the benefit of the patch is that the time
for the range SELECT degrades more slowly as more of the table is
replaced.  Ignoring the first SELECT as being a startup transient, it
looks like HEAD degrades from about 3 msec to 6 msec over 10 iterations
(20% replacement of the table), whereas with the patch it's about 3 msec
to about 4 and a half.  However, the INSERT steps went from around 20
sec each to about twice that.

I used just the following nondefault parameter settings:

shared_buffers = 10000                  # min 128kB or max_connections*16kB
maintenance_work_mem = 160MB            # min 1MB
max_fsm_pages = 2048000                 # min max_fsm_relations*16, 6 bytes each
wal_buffers = 640kB                     # min 32kB
checkpoint_segments = 30                # in logfile segments, min 1, 16MB each
enable_bitmapscan = off
enable_seqscan = off
autovacuum = off                        # enable autovacuum subprocess?

which for the most part are just 10x the factory defaults.  The hardware
is just a Dell x86_64 workstation with crappy IDE disk, so maybe things
would look better elsewhere, but it's all I have to work with.

Considering that the patch is really severely ugly from a modularity and
layering standpoint, I'm now inclined to reject it.  AFAICT this test
case is showing the patch at its best possible advantage; under
real-world conditions the benefit would be less.  It doesn't look to me
like the gain is worth the loss of system understandability and
maintainability that the patch would impose.

            regards, tom lane

/*
 * derived from testlibpq3.c
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include "libpq-fe.h"


#define TABLESIZE 1000000        /* rows in table */
#define ITERS 10


static void
exit_nicely(PGconn *conn)
{
    PQfinish(conn);
    exit(1);
}

static char *
elapsed_time(struct timeval start_t, struct timeval elapse_t)
{
    static char buf[64];

    if (elapse_t.tv_usec < start_t.tv_usec)
    {
        elapse_t.tv_sec--;
        elapse_t.tv_usec += 1000000;
    }

    sprintf(buf, "%3ld.%06ld", (long) (elapse_t.tv_sec - start_t.tv_sec),
           (long) (elapse_t.tv_usec - start_t.tv_usec));
    return buf;
}

static void
do_cmd(PGconn *conn, const char *cmd)
{
    PGresult   *res;
    struct timeval start_t;
    struct timeval elapse_t;

    gettimeofday(&start_t, NULL);
    res = PQexec(conn, cmd);
    gettimeofday(&elapse_t, NULL);
    if (PQresultStatus(res) != PGRES_COMMAND_OK)
    {
        fprintf(stderr, "%s\nCommand failed: %s", cmd, PQerrorMessage(conn));
        PQclear(res);
        exit_nicely(conn);
    }
    printf("executed %-.6s in %s sec\n", cmd,
           elapsed_time(start_t, elapse_t));
    fflush(stdout);
    PQclear(res);
}

static void
do_select(PGconn *conn, const char *cmd)
{
    PGresult   *res;
    struct timeval start_t;
    struct timeval elapse_t;

    gettimeofday(&start_t, NULL);
    res = PQexec(conn, cmd);
    gettimeofday(&elapse_t, NULL);
    if (PQresultStatus(res) != PGRES_TUPLES_OK)
    {
        fprintf(stderr, "%s\nCommand failed: %s", cmd, PQerrorMessage(conn));
        PQclear(res);
        exit_nicely(conn);
    }
    printf("retrieved %4s tuples in %s sec\n", PQgetvalue(res, 0, 0),
           elapsed_time(start_t, elapse_t));
    fflush(stdout);
    PQclear(res);
}

int
main(int argc, char **argv)
{
    const char *conninfo;
    PGconn       *conn;
    PGresult   *res;
    char        cmd[1024];
    int            i;

    /*
     * If the user supplies a parameter on the command line, use it as the
     * conninfo string; otherwise default to setting dbname=postgres and using
     * environment variables or defaults for all other connection parameters.
     */
    if (argc > 1)
        conninfo = argv[1];
    else
        conninfo = "dbname = postgres";

    /* Make a connection to the database */
    conn = PQconnectdb(conninfo);

    /* Check to see that the backend connection was successfully made */
    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "Connection to database failed: %s",
                PQerrorMessage(conn));
        exit_nicely(conn);
    }

    /* Ignore any error while dropping old table */
    res = PQexec(conn, "DROP TABLE testtab");
    PQclear(res);

    /* Set up test table */
    do_cmd(conn, "CREATE TABLE testtab(k int, d text)");
    sprintf(cmd, "INSERT INTO testtab "
            "SELECT x, repeat('x',350) FROM generate_series(1,%d) x",
            TABLESIZE);
    do_cmd(conn, cmd);
    do_cmd(conn, "CREATE INDEX testtabi ON testtab(k)");
    do_cmd(conn, "CLUSTER testtab USING testtabi");
    do_cmd(conn, "VACUUM ANALYZE testtab");

    for (i = 0; i < ITERS; i++)
    {
        int            st;

        sprintf(cmd, "DELETE FROM testtab WHERE random() < 1.0/50");
        do_cmd(conn, cmd);
        do_cmd(conn, "VACUUM testtab");
        sprintf(cmd, "INSERT INTO testtab "
                "SELECT (random() * %d)::int, repeat('x',350) "
                "FROM generate_series(1,%d)",
                TABLESIZE, TABLESIZE/50);
        do_cmd(conn, cmd);
        st = ((double) random() * (TABLESIZE - 1000)) / 0x7FFFFFFF;
        sprintf(cmd, "SELECT count(*) FROM testtab "
                "WHERE k BETWEEN %d AND %d",
                st, st + 1000);
        do_select(conn, cmd);
        do_select(conn, "SELECT count(*) FROM testtab");
    }

    /* close the connection to the database and cleanup */
    PQfinish(conn);

    return 0;
}
executed CREATE in   0.066563 sec
executed INSERT in  40.465653 sec
executed CREATE in   9.152698 sec
executed CLUSTE in  20.036375 sec
executed VACUUM in   1.440232 sec
executed DELETE in  14.770937 sec
executed VACUUM in  10.663301 sec
executed INSERT in   2.449248 sec
retrieved  995 tuples in   0.170685 sec
retrieved 999971 tuples in   0.648237 sec
executed DELETE in  33.760709 sec
executed VACUUM in  28.762174 sec
executed INSERT in  12.212027 sec
retrieved  999 tuples in   0.003139 sec
retrieved 999642 tuples in   0.646765 sec
executed DELETE in  30.541053 sec
executed VACUUM in  13.204475 sec
executed INSERT in  17.972502 sec
retrieved  985 tuples in   0.003621 sec
retrieved 999612 tuples in   0.623749 sec
executed DELETE in  14.911813 sec
executed VACUUM in  27.443921 sec
executed INSERT in  19.125950 sec
retrieved 1002 tuples in   0.004900 sec
retrieved 999784 tuples in   0.667716 sec
executed DELETE in  22.651369 sec
executed VACUUM in  10.743926 sec
executed INSERT in  21.631076 sec
retrieved  987 tuples in   0.004566 sec
retrieved 999711 tuples in   0.632185 sec
executed DELETE in  11.587629 sec
executed VACUUM in  15.278964 sec
executed INSERT in  23.725325 sec
retrieved 1011 tuples in   0.005960 sec
retrieved 999624 tuples in   0.680135 sec
executed DELETE in   3.905152 sec
executed VACUUM in  42.848288 sec
executed INSERT in  18.619609 sec
retrieved 1006 tuples in   0.007106 sec
retrieved 999479 tuples in   0.678316 sec
executed DELETE in  27.288273 sec
executed VACUUM in   9.329839 sec
executed INSERT in  23.361110 sec
retrieved  983 tuples in   0.005997 sec
retrieved 999354 tuples in   0.615004 sec
executed DELETE in  20.416503 sec
executed VACUUM in  17.537463 sec
executed INSERT in  19.781416 sec
retrieved 1021 tuples in   0.006894 sec
retrieved 999688 tuples in   0.670918 sec
executed DELETE in  14.063273 sec
executed VACUUM in  17.439971 sec
executed INSERT in  17.381126 sec
retrieved  987 tuples in   0.006664 sec
retrieved 999930 tuples in   0.632494 sec
executed CREATE in   0.086862 sec
executed INSERT in  50.746362 sec
executed CREATE in  12.115655 sec
executed CLUSTE in  33.656341 sec
executed VACUUM in   4.306563 sec
executed DELETE in  18.062664 sec
executed VACUUM in  28.487570 sec
executed INSERT in  25.638022 sec
retrieved  998 tuples in   1.498475 sec
retrieved 1000019 tuples in   0.624082 sec
executed DELETE in  30.211317 sec
executed VACUUM in  24.147135 sec
executed INSERT in  40.759404 sec
retrieved 1006 tuples in   0.002711 sec
retrieved 1000184 tuples in   0.984822 sec
executed DELETE in  31.616131 sec
executed VACUUM in  22.383567 sec
executed INSERT in  36.174291 sec
retrieved  990 tuples in   0.002840 sec
retrieved 1000275 tuples in   0.723260 sec
executed DELETE in  34.279871 sec
executed VACUUM in  34.855060 sec
executed INSERT in  36.652868 sec
retrieved  990 tuples in   0.003554 sec
retrieved 1000376 tuples in   0.715215 sec
executed DELETE in  37.396236 sec
executed VACUUM in  17.721296 sec
executed INSERT in  32.413756 sec
retrieved  974 tuples in   0.003498 sec
retrieved 1000383 tuples in   0.723111 sec
executed DELETE in  44.696337 sec
executed VACUUM in  13.850628 sec
executed INSERT in  47.649557 sec
retrieved 1002 tuples in   0.004165 sec
retrieved 1000554 tuples in   0.722704 sec
executed DELETE in  17.854653 sec
executed VACUUM in  52.446585 sec
executed INSERT in  41.293512 sec
retrieved 1038 tuples in   0.004169 sec
retrieved 1000422 tuples in   0.692400 sec
executed DELETE in  32.102991 sec
executed VACUUM in  22.018916 sec
executed INSERT in  44.296194 sec
retrieved  991 tuples in   0.004636 sec
retrieved 1000713 tuples in   0.739701 sec
executed DELETE in  31.678825 sec
executed VACUUM in  12.643271 sec
executed INSERT in  53.179401 sec
retrieved 1015 tuples in   0.004745 sec
retrieved 1000797 tuples in   0.708603 sec
executed DELETE in  35.682182 sec
executed VACUUM in  42.612243 sec
executed INSERT in  45.726663 sec
retrieved 1017 tuples in   0.004753 sec
retrieved 1000755 tuples in   0.696441 sec

Re: Maintaining cluster order on insert

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> What you would hope to see as the benefit of the patch is that the time
> for the range SELECT degrades more slowly as more of the table is
> replaced.  Ignoring the first SELECT as being a startup transient, it
> looks like HEAD degrades from about 3 msec to 6 msec over 10 iterations
> (20% replacement of the table), whereas with the patch it's about 3 msec
> to about 4 and a half.  However, the INSERT steps went from around 20
> sec each to about twice that.

Clustering in general isn't very important on a table that fits in cache.

Also note that the overhead of descending the b-tree twice on INSERTs
hurts the most in a CPU bound test like that. The overhead of the
two-phase method I proposed should be lower.

Test 1:
> executed CREATE in   0.066563 sec
> executed INSERT in  40.465653 sec
> executed CREATE in   9.152698 sec
> executed CLUSTE in  20.036375 sec
> executed VACUUM in   1.440232 sec
Test 2:
> executed CREATE in   0.086862 sec
> executed INSERT in  50.746362 sec
> executed CREATE in  12.115655 sec
> executed CLUSTE in  33.656341 sec
> executed VACUUM in   4.306563 sec

Why is there such a big difference in all these initialization steps as
well? Is it just random noise?

I played a bit with that test program, results from my laptop attached.
I used a version patched with the latest patch I submitted, because
that's what I had readily available. I used the same patched binaries in
both test runs, I just didn't CLUSTER the table in the other run. The
main difference to your results is that the DELETE, VACUUM and INSERT
operations are much faster both with and without the patch. Most INSERTs
for example took < 1 s, and in your results they took > 15 s. Any idea why?

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com
executed CREATE in   0.098872 sec
executed INSERT in  36.958824 sec
executed CREATE in   7.320412 sec
executed CLUSTE in  22.579578 sec
executed VACUUM in   0.809287 sec
executed DELETE in   8.567730 sec
executed VACUUM in   5.194736 sec
executed INSERT in   0.882852 sec
executing select: SELECT count(*) FROM testtab WHERE k BETWEEN 839347 AND 840347
retrieved  993 tuples in   0.239210 sec
executing select: SELECT count(*) FROM testtab
retrieved 999898 tuples in   0.417749 sec
executed DELETE in   5.398222 sec
executed VACUUM in   0.949003 sec
executed INSERT in   0.771617 sec
executing select: SELECT count(*) FROM testtab WHERE k BETWEEN 393988 AND 394988
retrieved 1015 tuples in   0.001261 sec
executing select: SELECT count(*) FROM testtab
retrieved 1000038 tuples in   0.373199 sec
executed DELETE in   5.613083 sec
executed VACUUM in   1.289501 sec
executed INSERT in   1.434903 sec
executing select: SELECT count(*) FROM testtab WHERE k BETWEEN 782316 AND 783316
retrieved 1011 tuples in   0.001723 sec
executing select: SELECT count(*) FROM testtab
retrieved 999826 tuples in   0.353616 sec
executed DELETE in   3.729799 sec
executed VACUUM in   1.406940 sec
executed INSERT in   1.169133 sec
executing select: SELECT count(*) FROM testtab WHERE k BETWEEN 797641 AND 798641
retrieved  991 tuples in   0.001118 sec
executing select: SELECT count(*) FROM testtab
retrieved 999731 tuples in   0.340721 sec
executed DELETE in   1.448537 sec
executed VACUUM in   0.918642 sec
executed INSERT in   0.471951 sec
executing select: SELECT count(*) FROM testtab WHERE k BETWEEN 910735 AND 911735
retrieved 1029 tuples in   0.001174 sec
executing select: SELECT count(*) FROM testtab
retrieved 999773 tuples in   0.334880 sec
executed DELETE in   0.972467 sec
executed VACUUM in   0.925557 sec
executed INSERT in   0.645214 sec
executing select: SELECT count(*) FROM testtab WHERE k BETWEEN 197353 AND 198353
retrieved  975 tuples in   0.001076 sec
executing select: SELECT count(*) FROM testtab
retrieved 999751 tuples in   0.334343 sec
executed DELETE in   0.966780 sec
executed VACUUM in   0.923394 sec
executed INSERT in   0.468693 sec
executing select: SELECT count(*) FROM testtab WHERE k BETWEEN 334887 AND 335887
retrieved  983 tuples in   0.001381 sec
executing select: SELECT count(*) FROM testtab
retrieved 999801 tuples in   0.335184 sec
executed DELETE in   0.792456 sec
executed VACUUM in   0.934451 sec
executed INSERT in   0.435028 sec
executing select: SELECT count(*) FROM testtab WHERE k BETWEEN 767461 AND 768461
retrieved  998 tuples in   0.001279 sec
executing select: SELECT count(*) FROM testtab
retrieved 999854 tuples in   0.338010 sec
executed DELETE in   0.553884 sec
executed VACUUM in   0.933886 sec
executed INSERT in   0.435980 sec
executing select: SELECT count(*) FROM testtab WHERE k BETWEEN 277496 AND 278496
retrieved 1017 tuples in   0.001257 sec
executing select: SELECT count(*) FROM testtab
retrieved 999923 tuples in   0.333156 sec
executed DELETE in   0.518371 sec
executed VACUUM in   0.932207 sec
executed INSERT in  18.031924 sec
executing select: SELECT count(*) FROM testtab WHERE k BETWEEN 553415 AND 554415
retrieved 1012 tuples in   0.001280 sec
executing select: SELECT count(*) FROM testtab
retrieved 1000135 tuples in   0.343191 sec
executed CREATE in   0.014632 sec
executed INSERT in  40.509772 sec
executed CREATE in   4.485714 sec
executed VACUUM in   0.396134 sec
executed DELETE in  15.366893 sec
executed VACUUM in   4.004228 sec
executed INSERT in   1.794740 sec
executing select: SELECT count(*) FROM testtab WHERE k BETWEEN 839347 AND 840347
retrieved 1008 tuples in   0.001182 sec
executing select: SELECT count(*) FROM testtab
retrieved 1000088 tuples in   0.314738 sec
executed DELETE in  19.191114 sec
executed VACUUM in   0.862771 sec
executed INSERT in   0.595838 sec
executing select: SELECT count(*) FROM testtab WHERE k BETWEEN 393988 AND 394988
retrieved 1010 tuples in   0.001081 sec
executing select: SELECT count(*) FROM testtab
retrieved 1000243 tuples in   0.312002 sec
executed DELETE in   4.152384 sec
executed VACUUM in   0.876611 sec
executed INSERT in   0.748715 sec
executing select: SELECT count(*) FROM testtab WHERE k BETWEEN 782316 AND 783316
retrieved 1006 tuples in   0.001168 sec
executing select: SELECT count(*) FROM testtab
retrieved 1000469 tuples in   0.313215 sec
executed DELETE in   1.860138 sec
executed VACUUM in   0.887234 sec
executed INSERT in   0.381926 sec
executing select: SELECT count(*) FROM testtab WHERE k BETWEEN 797641 AND 798641
retrieved 1001 tuples in   0.001170 sec
executing select: SELECT count(*) FROM testtab
retrieved 1000399 tuples in   0.313861 sec
executed DELETE in   1.203454 sec
executed VACUUM in   0.894531 sec
executed INSERT in   0.414772 sec
executing select: SELECT count(*) FROM testtab WHERE k BETWEEN 910735 AND 911735
retrieved 1005 tuples in   0.001192 sec
executing select: SELECT count(*) FROM testtab
retrieved 1000324 tuples in   0.313666 sec
executed DELETE in   1.103641 sec
executed VACUUM in   0.911423 sec
executed INSERT in   0.579772 sec
executing select: SELECT count(*) FROM testtab WHERE k BETWEEN 197353 AND 198353
retrieved 1012 tuples in   0.001281 sec
executing select: SELECT count(*) FROM testtab
retrieved 1000150 tuples in   0.314685 sec
executed DELETE in   0.757131 sec
executed VACUUM in   0.904153 sec
executed INSERT in   0.412712 sec
executing select: SELECT count(*) FROM testtab WHERE k BETWEEN 334887 AND 335887
retrieved 1003 tuples in   0.001255 sec
executing select: SELECT count(*) FROM testtab
retrieved 1000248 tuples in   0.342517 sec
executed DELETE in   0.572931 sec
executed VACUUM in   0.910369 sec
executed INSERT in   0.701230 sec
executing select: SELECT count(*) FROM testtab WHERE k BETWEEN 767461 AND 768461
retrieved 1022 tuples in   0.001361 sec
executing select: SELECT count(*) FROM testtab
retrieved 1000353 tuples in   0.313835 sec
executed DELETE in   0.529490 sec
executed VACUUM in   0.913697 sec
executed INSERT in   0.407656 sec
executing select: SELECT count(*) FROM testtab WHERE k BETWEEN 277496 AND 278496
retrieved 1031 tuples in   0.001368 sec
executing select: SELECT count(*) FROM testtab
retrieved 1000435 tuples in   0.314925 sec
executed DELETE in   4.871267 sec
executed VACUUM in   9.618490 sec
executed INSERT in   2.496079 sec
executing select: SELECT count(*) FROM testtab WHERE k BETWEEN 553415 AND 554415
retrieved  987 tuples in   0.001641 sec
executing select: SELECT count(*) FROM testtab
retrieved 1000648 tuples in   0.326064 sec
max_connections = 100            # (change requires restart)
shared_buffers = 700MB            # min 128kB or max_connections*16kB
maintenance_work_mem = 160MB        # min 1MB
max_fsm_pages = 204800            # min max_fsm_relations*16, 6 bytes each
full_page_writes = on            # recover from partial page writes
wal_buffers = 640kB            # min 32kB
checkpoint_segments = 30        # in logfile segments, min 1, 16MB each
checkpoint_timeout = 30min        # range 30s-1h
checkpoint_warning = 600s        # 0 is off
autovacuum = off            # enable autovacuum subprocess?
datestyle = 'iso, dmy'
lc_messages = 'en_GB.UTF-8'            # locale for system error message
lc_monetary = 'en_GB.UTF-8'            # locale for monetary formatting
lc_numeric = 'en_GB.UTF-8'            # locale for number formatting
lc_time = 'en_GB.UTF-8'                # locale for time formatting

Re: Maintaining cluster order on insert

From
Gregory Stark
Date:
"Tom Lane" <tgl@sss.pgh.pa.us> writes:

>     * delete a random 2% of the table
>     * vacuum to recover space
>     * insert a random 2% of the table
>     * select (about) 1000 consecutively-numbered rows
>     * select all the rows (this is just a cross check that the
>       number of rows isn't changing too much)
>
> What you would hope to see as the benefit of the patch is that the time
> for the range SELECT degrades more slowly as more of the table is
> replaced.  Ignoring the first SELECT as being a startup transient, it
> looks like HEAD degrades from about 3 msec to 6 msec over 10 iterations
> (20% replacement of the table), whereas with the patch it's about 3 msec
> to about 4 and a half.  However, the INSERT steps went from around 20
> sec each to about twice that.

I would suggest:

a) continuing the test until you have 100% replacement.

b) defining the tables with a fill-factor large enough to hold at least one
   extra tuple

c) considering this patch alongside GII where it is especially important.
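
Point (b) is simple arithmetic: for the suggested page to still have room for a
replacement tuple, the fillfactor must leave at least one tuple's worth of free
space per page. A rough sketch of that calculation (illustrative only: it
ignores the page header and alignment padding, and assumes the default
8192-byte BLCKSZ):

```python
PAGE_SIZE = 8192  # default BLCKSZ, assumed

def max_fillfactor(tuple_bytes, reserve_tuples=1):
    """Largest fillfactor (percent) that still leaves room on a page for
    `reserve_tuples` heap tuples plus their 4-byte line pointers."""
    need = reserve_tuples * (tuple_bytes + 4)
    return (100 * (PAGE_SIZE - need)) // PAGE_SIZE
```

So for narrow tuples even fillfactor 99 reserves a slot, while wide tuples need
proportionally more headroom.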

It's pretty serious what you're suggesting since it means that we'll basically
never have a real cluster feature. I would sure hope we're missing something
and there's a way to make this work usefully.
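
The quoted test cycle, and the effect the patch is meant to have on it, can be
mimicked with a toy model (purely illustrative: PAGE_CAP and the workload sizes
are invented, and the placement rules only loosely imitate amsuggestblock and
the FSM, not the actual patch code):

```python
import random

PAGE_CAP = 100  # hypothetical tuples per heap page

def simulate(use_suggestion, n=20000, rounds=5, seed=7):
    """One number per run: heap pages touched by a 1000-key range scan
    after several delete-2% / vacuum / insert-2% cycles."""
    random.seed(seed)
    npages = n // PAGE_CAP
    page_of = {k: k // PAGE_CAP for k in range(n)}  # freshly CLUSTERed heap
    fill = [PAGE_CAP] * npages
    for _ in range(rounds):
        # DELETE 2% + VACUUM: drop random keys, remember the freed space
        for k in random.sample(sorted(page_of), n // 50):
            fill[page_of.pop(k)] -= 1
        # INSERT 2%: put the missing keys back one at a time
        for k in random.sample([k for k in range(n) if k not in page_of], n // 50):
            target = None
            if use_suggestion:
                # amsuggestblock analogue: heap page of the nearest existing key
                for d in range(1, n):
                    if k - d in page_of:
                        target = page_of[k - d]
                        break
                    if k + d in page_of:
                        target = page_of[k + d]
                        break
                if target is not None and fill[target] >= PAGE_CAP:
                    target = None  # suggested page is full: ignore the hint
            if target is None:
                # FSM analogue: any page with free space, ignoring cluster order
                target = random.choice([p for p in range(npages) if fill[p] < PAGE_CAP])
            page_of[k] = target
            fill[target] += 1
    lo = n // 2  # the range SELECT: pages touched by 1000 consecutive keys
    return len({page_of[k] for k in range(lo, lo + 1000) if k in page_of})
```

With the hint the range stays close to its original ~10 pages; without it the
touched-page count climbs every cycle, which is the SELECT degradation Tom
measured.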

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com


Re: Maintaining cluster order on insert

From
Heikki Linnakangas
Date:
Gregory Stark wrote:
> It's pretty serious what you're suggesting since it means that we'll basically
> never have a real cluster feature. I would sure hope we're missing something
> and there's a way to make this work usefully.

Another approach would be an online CLUSTER command. That means there'll
be a lot of churn when tuples need to be moved back and forth, along
with updating indexes, but it'd take the overhead out of the critical path.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Maintaining cluster order on insert

From
Greg Smith
Date:
On Mon, 9 Jul 2007, Tom Lane wrote:

> The hardware is just a Dell x86_64 workstation with crappy IDE disk, so
> maybe things would look better elsewhere, but it's all I have to work
> with.

Do you have write-caching turned off on the drive so INSERTs are being
rate-limited by WAL syncs?  Trying to characterize the class of setup
you're using.

The part that seemed curious to me about your results in the unpatched
version is why the first INSERT takes 2.4 seconds, while the second takes
12.2, and later ones settle in the 17-23 s range.  I could understand the 12-23 s
variation, but that the first one fires off in 2.4 s seems kind of fishy;
why so fast?  Is there something that's just fitting in memory in that
case that's just missing in the patched version?

results-head:
...
executed DELETE in  14.770937 sec
executed VACUUM in  10.663301 sec
executed INSERT in   2.449248 sec (1st)
...
executed INSERT in  12.212027 sec (2nd)


results-patch:
...
executed DELETE in  18.062664 sec
executed VACUUM in  28.487570 sec
executed INSERT in  25.638022 sec (1st)
...
executed INSERT in  40.759404 sec (2nd)

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Maintaining cluster order on insert

From
Tom Lane
Date:
Gregory Stark <stark@enterprisedb.com> writes:
> It's pretty serious what you're suggesting since it means that we'll
> basically never have a real cluster feature.

It seems entirely likely that this is not the way to go about "real
clustering".

            regards, tom lane

Re: Maintaining cluster order on insert

From
Tom Lane
Date:
Greg Smith <gsmith@gregsmith.com> writes:
> On Mon, 9 Jul 2007, Tom Lane wrote:
>> The hardware is just a Dell x86_64 workstation with crappy IDE disk, so
>> maybe things would look better elsewhere, but it's all I have to work
>> with.

> Do you have write-caching turned off on the drive so INSERTs are being
> rate-limited by WAL syncs?  Trying to characterize the class of setup
> you're using.

It's just desktop-grade junk :-(.  Feel free to repeat the test on
something more serious.

> The part that seemed curious to me about your results in the unpatched
> version is why the first INSERT takes 2.4 seconds, while the second takes
> 12.2 then later ones settle from 17-23 s.  I could understand the 12-23
> variation, but that the first one fires off in 2.4 seems kind of fishy;

The test seemed to be mostly I/O bound according to vmstat.  I supposed
that it was fast up until the kernel's available space was full of dirty
pages, and then subsequent writes would be constrained by actual disk
write bandwidth.  Anyway the numbers seemed relatively consistent after
the setup and first test cycle, so I think the part I was actually
trying to draw conclusions from was probably real enough.

            regards, tom lane

Re: Maintaining cluster order on insert

From
Greg Smith
Date:
On Tue, 10 Jul 2007, Tom Lane wrote:

> It's just desktop-grade junk :-(.  Feel free to repeat the test on
> something more serious.

Right, but even such junk can be set up such that the disks honor commits,
just wanted to confirm you didn't go out of your way to do that--sounds
like you didn't.  My mini-lab at home has two PG test systems, one with a
real write cache, the other has an IDE drive with the cache turned off so
commits actually wait to hit the platter.  The painful slowness of writes
in that situation keeps me grounded as to how much data integrity actually
costs if you're not accelerating it, and the differences in benchmarks
between the two systems (which are otherwise a similar class of hardware)
can be informative.

> Anyway the numbers seemed relatively consistent after the setup and
> first test cycle, so I think the part I was actually trying to draw
> conclusions from was probably real enough.

Agreed, just figuring out the test ramp-up situation, and your explanation
for that quirk sounds reasonable.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Maintaining cluster order on insert

From
Bruce Momjian
Date:
This idea has been rejected due to the poor performance results reported
later in the thread.

---------------------------------------------------------------------------

Heikki Linnakangas wrote:
> While thinking about index-organized-tables and similar ideas, it
> occurred to me that there's some low-hanging-fruit: maintaining cluster
> order on inserts by trying to place new heap tuples close to other
> similar tuples. That involves asking the index am where on the heap the
> new tuple should go, and trying to insert it there before using the FSM.
> Using the new fillfactor parameter makes it more likely that there's
> room on the page. We don't worry about the order within the page.
>
> The API I'm thinking of introduces a new optional index am function,
> amsuggestblock (suggestions for a better name are welcome). It gets the
> same parameters as aminsert, and returns the heap block number that
> would be the optimal place to put the new tuple. It'd be called from
> ExecInsert before inserting the heap tuple, and the suggestion is passed
> on to heap_insert and RelationGetBufferForTuple.
>
> I wrote a little patch to implement this for btree, attached.
>
> This could be optimized by changing the existing aminsert API, because
> as it is, an insert will have to descend the btree twice. Once in
> amsuggestblock and then in aminsert. amsuggestblock could keep the right
> index page pinned so aminsert could locate it quicker. But I wanted to
> keep this simple for now. Another improvement might be to allow
> amsuggestblock to return a list of suggestions, but that makes it more
> expensive to insert if there isn't room in the suggested pages, since
> heap_insert will have to try them all before giving up.
>
> Comments regarding the general idea or the patch? There should probably
> be an index option to turn the feature on and off. You'll want to turn it
> off when you first load a table, and turn it on after CLUSTER to keep it
> clustered.
>
> Since there's been discussion on keeping the TODO list more up-to-date,
> I hereby officially claim the "Automatically maintain clustering on a
> table" TODO item :). Feel free to bombard me with requests for status
> reports. And just to be clear, I'm not trying to sneak this into 8.2
> anymore, this is 8.3 stuff.
>
> I won't be implementing a background daemon described on the TODO item,
> since that would essentially be an online version of CLUSTER. Which sure
> would be nice, but that's a different story.
>
> - Heikki
>

[ text/x-patch is unsupported, treating like TEXT/PLAIN ]

> Index: doc/src/sgml/catalogs.sgml
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/doc/src/sgml/catalogs.sgml,v
> retrieving revision 2.129
> diff -c -r2.129 catalogs.sgml
> *** doc/src/sgml/catalogs.sgml    31 Jul 2006 20:08:55 -0000    2.129
> --- doc/src/sgml/catalogs.sgml    8 Aug 2006 16:17:21 -0000
> ***************
> *** 499,504 ****
> --- 499,511 ----
>         <entry>Function to parse and validate reloptions for an index</entry>
>        </row>
>
> +      <row>
> +       <entry><structfield>amsuggestblock</structfield></entry>
> +       <entry><type>regproc</type></entry>
> +       <entry><literal><link linkend="catalog-pg-proc"><structname>pg_proc</structname></link>.oid</literal></entry>
> +       <entry>Get the best place in the heap to put a new tuple</entry>
> +      </row>
> +
>       </tbody>
>      </tgroup>
>     </table>
> Index: doc/src/sgml/indexam.sgml
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/doc/src/sgml/indexam.sgml,v
> retrieving revision 2.16
> diff -c -r2.16 indexam.sgml
> *** doc/src/sgml/indexam.sgml    31 Jul 2006 20:08:59 -0000    2.16
> --- doc/src/sgml/indexam.sgml    8 Aug 2006 17:15:25 -0000
> ***************
> *** 391,396 ****
> --- 391,414 ----
>      <function>amoptions</> to test validity of options settings.
>     </para>
>
> +   <para>
> + <programlisting>
> + BlockNumber
> + amsuggestblock (Relation indexRelation,
> +                 Datum *values,
> +                 bool *isnull,
> +                 Relation heapRelation);
> + </programlisting>
> +    Gets the optimal place in the heap for a new tuple. The parameters
> +    correspond to the parameters of <literal>aminsert</literal>.
> +    This function is called on the clustered index before a new tuple
> +    is inserted to the heap, and it should choose the optimal insertion
> +    target page on the heap in such a manner that the heap stays as close
> +    as possible to the index order.
> +    <literal>amsuggestblock</literal> can return InvalidBlockNumber if
> +    the index am doesn't have a suggestion.
> +   </para>
> +
>    </sect1>
>
>    <sect1 id="index-scanning">
> Index: src/backend/access/heap/heapam.c
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/heap/heapam.c,v
> retrieving revision 1.218
> diff -c -r1.218 heapam.c
> *** src/backend/access/heap/heapam.c    31 Jul 2006 20:08:59 -0000    1.218
> --- src/backend/access/heap/heapam.c    8 Aug 2006 16:17:21 -0000
> ***************
> *** 1325,1330 ****
> --- 1325,1335 ----
>    * use_fsm is passed directly to RelationGetBufferForTuple, which see for
>    * more info.
>    *
> +  * suggested_blk can be set by the caller to hint heap_insert which
> +  * block would be the best place to put the new tuple in. heap_insert can
> +  * ignore the suggestion, if there's not enough room on that block.
> +  * InvalidBlockNumber means no preference.
> +  *
>    * The return value is the OID assigned to the tuple (either here or by the
>    * caller), or InvalidOid if no OID.  The header fields of *tup are updated
>    * to match the stored tuple; in particular tup->t_self receives the actual
> ***************
> *** 1333,1339 ****
>    */
>   Oid
>   heap_insert(Relation relation, HeapTuple tup, CommandId cid,
> !             bool use_wal, bool use_fsm)
>   {
>       TransactionId xid = GetCurrentTransactionId();
>       HeapTuple    heaptup;
> --- 1338,1344 ----
>    */
>   Oid
>   heap_insert(Relation relation, HeapTuple tup, CommandId cid,
> !             bool use_wal, bool use_fsm, BlockNumber suggested_blk)
>   {
>       TransactionId xid = GetCurrentTransactionId();
>       HeapTuple    heaptup;
> ***************
> *** 1386,1392 ****
>
>       /* Find buffer to insert this tuple into */
>       buffer = RelationGetBufferForTuple(relation, heaptup->t_len,
> !                                        InvalidBuffer, use_fsm);
>
>       /* NO EREPORT(ERROR) from here till changes are logged */
>       START_CRIT_SECTION();
> --- 1391,1397 ----
>
>       /* Find buffer to insert this tuple into */
>       buffer = RelationGetBufferForTuple(relation, heaptup->t_len,
> !                                        InvalidBuffer, use_fsm, suggested_blk);
>
>       /* NO EREPORT(ERROR) from here till changes are logged */
>       START_CRIT_SECTION();
> ***************
> *** 1494,1500 ****
>   Oid
>   simple_heap_insert(Relation relation, HeapTuple tup)
>   {
> !     return heap_insert(relation, tup, GetCurrentCommandId(), true, true);
>   }
>
>   /*
> --- 1499,1506 ----
>   Oid
>   simple_heap_insert(Relation relation, HeapTuple tup)
>   {
> !     return heap_insert(relation, tup, GetCurrentCommandId(), true,
> !                        true, InvalidBlockNumber);
>   }
>
>   /*
> ***************
> *** 2079,2085 ****
>           {
>               /* Assume there's no chance to put heaptup on same page. */
>               newbuf = RelationGetBufferForTuple(relation, heaptup->t_len,
> !                                                buffer, true);
>           }
>           else
>           {
> --- 2085,2092 ----
>           {
>               /* Assume there's no chance to put heaptup on same page. */
>               newbuf = RelationGetBufferForTuple(relation, heaptup->t_len,
> !                                                buffer, true,
> !                                                InvalidBlockNumber);
>           }
>           else
>           {
> ***************
> *** 2096,2102 ****
>                    */
>                   LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
>                   newbuf = RelationGetBufferForTuple(relation, heaptup->t_len,
> !                                                    buffer, true);
>               }
>               else
>               {
> --- 2103,2110 ----
>                    */
>                   LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
>                   newbuf = RelationGetBufferForTuple(relation, heaptup->t_len,
> !                                                    buffer, true,
> !                                                    InvalidBlockNumber);
>               }
>               else
>               {
> Index: src/backend/access/heap/hio.c
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/heap/hio.c,v
> retrieving revision 1.63
> diff -c -r1.63 hio.c
> *** src/backend/access/heap/hio.c    3 Jul 2006 22:45:37 -0000    1.63
> --- src/backend/access/heap/hio.c    9 Aug 2006 18:03:01 -0000
> ***************
> *** 93,98 ****
> --- 93,100 ----
>    *    any committed data of other transactions.  (See heap_insert's comments
>    *    for additional constraints needed for safe usage of this behavior.)
>    *
> +  *    If the caller has a suggestion, it's passed in suggestedBlock.
> +  *
>    *    We always try to avoid filling existing pages further than the fillfactor.
>    *    This is OK since this routine is not consulted when updating a tuple and
>    *    keeping it on the same page, which is the scenario fillfactor is meant
> ***************
> *** 103,109 ****
>    */
>   Buffer
>   RelationGetBufferForTuple(Relation relation, Size len,
> !                           Buffer otherBuffer, bool use_fsm)
>   {
>       Buffer        buffer = InvalidBuffer;
>       Page        pageHeader;
> --- 105,112 ----
>    */
>   Buffer
>   RelationGetBufferForTuple(Relation relation, Size len,
> !                           Buffer otherBuffer, bool use_fsm,
> !                           BlockNumber suggestedBlock)
>   {
>       Buffer        buffer = InvalidBuffer;
>       Page        pageHeader;
> ***************
> *** 135,142 ****
>           otherBlock = InvalidBlockNumber;        /* just to keep compiler quiet */
>
>       /*
> !      * We first try to put the tuple on the same page we last inserted a tuple
> !      * on, as cached in the relcache entry.  If that doesn't work, we ask the
>        * shared Free Space Map to locate a suitable page.  Since the FSM's info
>        * might be out of date, we have to be prepared to loop around and retry
>        * multiple times.    (To insure this isn't an infinite loop, we must update
> --- 138,147 ----
>           otherBlock = InvalidBlockNumber;        /* just to keep compiler quiet */
>
>       /*
> !      * We first try to put the tuple on the page suggested by the caller, if
> !      * any. Then we try to put the tuple on the same page we last inserted a
> !      * tuple on, as cached in the relcache entry. If that doesn't work, we
> !      * ask the
>        * shared Free Space Map to locate a suitable page.  Since the FSM's info
>        * might be out of date, we have to be prepared to loop around and retry
>        * multiple times.    (To insure this isn't an infinite loop, we must update
> ***************
> *** 144,152 ****
>        * not to be suitable.)  If the FSM has no record of a page with enough
>        * free space, we give up and extend the relation.
>        *
> !      * When use_fsm is false, we either put the tuple onto the existing target
> !      * page or extend the relation.
>        */
>       if (len + saveFreeSpace <= MaxTupleSize)
>           targetBlock = relation->rd_targblock;
>       else
> --- 149,167 ----
>        * not to be suitable.)  If the FSM has no record of a page with enough
>        * free space, we give up and extend the relation.
>        *
> !      * When use_fsm is false, we skip the fsm lookup if neither the suggested
> !      * nor the cached last insertion page has enough room, and extend the
> !      * relation.
> !      *
> !      * The fillfactor is taken into account when calculating the free space
> !      * on the cached target block, and when using the FSM. The suggested page
> !      * is used whenever there's enough room in it, regardless of the fillfactor,
> !      * because that's exactly the purpose the space is reserved for in the
> !      * first place.
>        */
> +     if (suggestedBlock != InvalidBlockNumber)
> +         targetBlock = suggestedBlock;
> +     else
>       if (len + saveFreeSpace <= MaxTupleSize)
>           targetBlock = relation->rd_targblock;
>       else
> ***************
> *** 219,224 ****
> --- 234,244 ----
>            */
>           pageHeader = (Page) BufferGetPage(buffer);
>           pageFreeSpace = PageGetFreeSpace(pageHeader);
> +
> +         /* If we're trying the suggested block, don't care about fillfactor */
> +         if (targetBlock == suggestedBlock && len <= pageFreeSpace)
> +             return buffer;
> +
>           if (len + saveFreeSpace <= pageFreeSpace)
>           {
>               /* use this page as future insert target, too */
> ***************
> *** 241,246 ****
> --- 261,275 ----
>               ReleaseBuffer(buffer);
>           }
>
> +         /* If we just tried the suggested block, try the cached target
> +          * block next, before consulting the FSM. */
> +         if(suggestedBlock == targetBlock)
> +         {
> +             targetBlock = relation->rd_targblock;
> +             suggestedBlock = InvalidBlockNumber;
> +             continue;
> +         }
> +
>           /* Without FSM, always fall out of the loop and extend */
>           if (!use_fsm)
>               break;
> Index: src/backend/access/index/genam.c
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/index/genam.c,v
> retrieving revision 1.58
> diff -c -r1.58 genam.c
> *** src/backend/access/index/genam.c    31 Jul 2006 20:08:59 -0000    1.58
> --- src/backend/access/index/genam.c    8 Aug 2006 16:17:21 -0000
> ***************
> *** 259,261 ****
> --- 259,275 ----
>
>       pfree(sysscan);
>   }
> +
> + /*
> +  * This is a dummy implementation of amsuggestblock, to be used for index
> +  * access methods that don't or can't support it. It just returns
> +  * InvalidBlockNumber, which means "no preference".
> +  *
> +  * This is probably not the best place for this function, but it doesn't
> +  * fit naturally anywhere else either.
> +  */
> + Datum
> + dummysuggestblock(PG_FUNCTION_ARGS)
> + {
> +     PG_RETURN_UINT32(InvalidBlockNumber);
> + }
> Index: src/backend/access/index/indexam.c
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/index/indexam.c,v
> retrieving revision 1.94
> diff -c -r1.94 indexam.c
> *** src/backend/access/index/indexam.c    31 Jul 2006 20:08:59 -0000    1.94
> --- src/backend/access/index/indexam.c    8 Aug 2006 16:17:21 -0000
> ***************
> *** 18,23 ****
> --- 18,24 ----
>    *        index_rescan    - restart a scan of an index
>    *        index_endscan    - end a scan
>    *        index_insert    - insert an index tuple into a relation
> +  *        index_suggestblock    - get desired insert location for a heap tuple
>    *        index_markpos    - mark a scan position
>    *        index_restrpos    - restore a scan position
>    *        index_getnext    - get the next tuple from a scan
> ***************
> *** 202,207 ****
> --- 203,237 ----
>                                         BoolGetDatum(check_uniqueness)));
>   }
>
> + /* ----------------
> +  *        index_suggestblock - get desired insert location for a heap tuple
> +  *
> +  * The returned BlockNumber is the *heap* page that is the best place
> +  * to insert the given tuple to, according to the index am. The best
> +  * place is usually one that maintains the cluster order.
> +  * ----------------
> +  */
> + BlockNumber
> + index_suggestblock(Relation indexRelation,
> +                    Datum *values,
> +                    bool *isnull,
> +                    Relation heapRelation)
> + {
> +     FmgrInfo   *procedure;
> +
> +     RELATION_CHECKS;
> +     GET_REL_PROCEDURE(amsuggestblock);
> +
> +     /*
> +      * have the am's suggestblock proc do all the work.
> +      */
> +     return DatumGetUInt32(FunctionCall4(procedure,
> +                                       PointerGetDatum(indexRelation),
> +                                       PointerGetDatum(values),
> +                                       PointerGetDatum(isnull),
> +                                       PointerGetDatum(heapRelation)));
> + }
> +
>   /*
>    * index_beginscan - start a scan of an index with amgettuple
>    *
> Index: src/backend/access/nbtree/nbtinsert.c
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/nbtree/nbtinsert.c,v
> retrieving revision 1.142
> diff -c -r1.142 nbtinsert.c
> *** src/backend/access/nbtree/nbtinsert.c    25 Jul 2006 19:13:00 -0000    1.142
> --- src/backend/access/nbtree/nbtinsert.c    9 Aug 2006 17:51:33 -0000
> ***************
> *** 146,151 ****
> --- 146,221 ----
>   }
>
>   /*
> +  *    _bt_suggestblock() -- Find the heap block of the closest index tuple.
> +  *
> +  * The logic to find the target should match _bt_doinsert, otherwise
> +  * we'll be making bad suggestions.
> +  */
> + BlockNumber
> + _bt_suggestblock(Relation rel, IndexTuple itup, Relation heapRel)
> + {
> +     int            natts = rel->rd_rel->relnatts;
> +     OffsetNumber offset;
> +     Page        page;
> +     BTPageOpaque opaque;
> +
> +     ScanKey        itup_scankey;
> +     BTStack        stack;
> +     Buffer        buf;
> +     IndexTuple    curitup;
> +     BlockNumber suggestion = InvalidBlockNumber;
> +
> +     /* we need an insertion scan key to do our search, so build one */
> +     itup_scankey = _bt_mkscankey(rel, itup);
> +
> +     /* find the first page containing this key */
> +     stack = _bt_search(rel, natts, itup_scankey, false, &buf, BT_READ);
> +     if(!BufferIsValid(buf))
> +     {
> +         /* The index was completely empty. No suggestion then. */
> +         return InvalidBlockNumber;
> +     }
> +     /* we don't need the stack, so free it right away */
> +     _bt_freestack(stack);
> +
> +     page = BufferGetPage(buf);
> +     opaque = (BTPageOpaque) PageGetSpecialPointer(page);
> +
> +     /* Find the location in the page where the new index tuple would go to. */
> +
> +     offset = _bt_binsrch(rel, buf, natts, itup_scankey, false);
> +     if (offset > PageGetMaxOffsetNumber(page))
> +     {
> +         /* _bt_binsrch returned pointer to end-of-page. It means that
> +          * there was no equal items on the page, and the new item should
> +          * be inserted as the last tuple of the page. There could be equal
> +          * items on the next page, however.
> +          *
> +          * At the moment, we just ignore the potential equal items on the
> +          * right, and pretend there isn't any. We could instead walk right
> +          * to the next page to check that, but let's keep it simple for now.
> +          */
> +         offset = OffsetNumberPrev(offset);
> +     }
> +     if(offset < P_FIRSTDATAKEY(opaque))
> +     {
> +         /* We landed on an empty page. We could step left or right until
> +          * we find some items, but let's keep it simple for now.
> +          */
> +     } else {
> +         /* We're now positioned at the index tuple that we're interested in. */
> +
> +         curitup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offset));
> +         suggestion = ItemPointerGetBlockNumber(&curitup->t_tid);
> +     }
> +
> +     _bt_relbuf(rel, buf);
> +     _bt_freeskey(itup_scankey);
> +
> +     return suggestion;
> + }
> +
> + /*
>    *    _bt_check_unique() -- Check for violation of unique index constraint
>    *
>    * Returns InvalidTransactionId if there is no conflict, else an xact ID
> Index: src/backend/access/nbtree/nbtree.c
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/nbtree/nbtree.c,v
> retrieving revision 1.149
> diff -c -r1.149 nbtree.c
> *** src/backend/access/nbtree/nbtree.c    10 May 2006 23:18:39 -0000    1.149
> --- src/backend/access/nbtree/nbtree.c    9 Aug 2006 18:04:02 -0000
> ***************
> *** 228,233 ****
> --- 228,265 ----
>   }
>
>   /*
> +  *    btsuggestblock() -- find the best place in the heap to put a new tuple.
> +  *
> +  *        This uses the same logic as btinsert to find the place where the index
> +  *        tuple would go if this was a btinsert call.
> +  *
> +  *        There's room for improvement here. An insert operation will descend
> +  *        the tree twice, first by btsuggestblock, then by btinsert. Things
> +  *        might have changed in between, so that the heap tuple is actually
> +  *        not inserted in the optimal page, but since this is just an
> +  *        optimization, it's ok if it happens sometimes.
> +  */
> + Datum
> + btsuggestblock(PG_FUNCTION_ARGS)
> + {
> +     Relation    rel = (Relation) PG_GETARG_POINTER(0);
> +     Datum       *values = (Datum *) PG_GETARG_POINTER(1);
> +     bool       *isnull = (bool *) PG_GETARG_POINTER(2);
> +     Relation    heapRel = (Relation) PG_GETARG_POINTER(3);
> +     IndexTuple    itup;
> +     BlockNumber suggestion;
> +
> +     /* generate an index tuple */
> +     itup = index_form_tuple(RelationGetDescr(rel), values, isnull);
> +
> +     suggestion = _bt_suggestblock(rel, itup, heapRel);
> +
> +     pfree(itup);
> +
> +     PG_RETURN_UINT32(suggestion);
> + }
> +
> + /*
>    *    btgettuple() -- Get the next tuple in the scan.
>    */
>   Datum
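
The generic index_suggestblock() wrapper that btsuggestblock plugs into (declared in the genam.h hunk further down) isn't included in this excerpt. Presumably it follows the pattern of index_insert() in indexam.c; the sketch below is my reading of what it would look like, not code from the patch, and the RELATION_CHECKS/GET_REL_PROCEDURE usage is assumed from the neighboring indexam.c routines:

```
/* Sketch only: hypothetical body for src/backend/access/index/indexam.c,
 * modelled on index_insert(); not part of the quoted patch. */
BlockNumber
index_suggestblock(Relation indexRelation,
                   Datum *values, bool *isnull,
                   Relation heapRelation)
{
    FmgrInfo   *procedure;

    RELATION_CHECKS;
    GET_REL_PROCEDURE(amsuggestblock);

    /* The AM returns the heap block it considers the best placement,
     * or InvalidBlockNumber if it has no opinion. */
    return DatumGetUInt32(FunctionCall4(procedure,
                                        PointerGetDatum(indexRelation),
                                        PointerGetDatum(values),
                                        PointerGetDatum(isnull),
                                        PointerGetDatum(heapRelation)));
}
```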
> Index: src/backend/executor/execMain.c
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/executor/execMain.c,v
> retrieving revision 1.277
> diff -c -r1.277 execMain.c
> *** src/backend/executor/execMain.c    31 Jul 2006 01:16:37 -0000    1.277
> --- src/backend/executor/execMain.c    8 Aug 2006 16:17:21 -0000
> ***************
> *** 892,897 ****
> --- 892,898 ----
>       resultRelInfo->ri_RangeTableIndex = resultRelationIndex;
>       resultRelInfo->ri_RelationDesc = resultRelationDesc;
>       resultRelInfo->ri_NumIndices = 0;
> +     resultRelInfo->ri_ClusterIndex = -1;
>       resultRelInfo->ri_IndexRelationDescs = NULL;
>       resultRelInfo->ri_IndexRelationInfo = NULL;
>       /* make a copy so as not to depend on relcache info not changing... */
> ***************
> *** 1388,1394 ****
>           heap_insert(estate->es_into_relation_descriptor, tuple,
>                       estate->es_snapshot->curcid,
>                       estate->es_into_relation_use_wal,
> !                     false);        /* never any point in using FSM */
>           /* we know there are no indexes to update */
>           heap_freetuple(tuple);
>           IncrAppended();
> --- 1389,1396 ----
>           heap_insert(estate->es_into_relation_descriptor, tuple,
>                       estate->es_snapshot->curcid,
>                       estate->es_into_relation_use_wal,
> !                     false, /* never any point in using FSM */
> !                     InvalidBlockNumber);
>           /* we know there are no indexes to update */
>           heap_freetuple(tuple);
>           IncrAppended();
> ***************
> *** 1419,1424 ****
> --- 1421,1427 ----
>       ResultRelInfo *resultRelInfo;
>       Relation    resultRelationDesc;
>       Oid            newId;
> +     BlockNumber suggestedBlock;
>
>       /*
>        * get the heap tuple out of the tuple table slot, making sure we have a
> ***************
> *** 1467,1472 ****
> --- 1470,1479 ----
>       if (resultRelationDesc->rd_att->constr)
>           ExecConstraints(resultRelInfo, slot, estate);
>
> +     /* Ask the index am of the clustered index for the
> +      * best place to put it */
> +     suggestedBlock = ExecSuggestBlock(slot, estate);
> +
>       /*
>        * insert the tuple
>        *
> ***************
> *** 1475,1481 ****
>        */
>       newId = heap_insert(resultRelationDesc, tuple,
>                           estate->es_snapshot->curcid,
> !                         true, true);
>
>       IncrAppended();
>       (estate->es_processed)++;
> --- 1482,1488 ----
>        */
>       newId = heap_insert(resultRelationDesc, tuple,
>                           estate->es_snapshot->curcid,
> !                         true, true, suggestedBlock);
>
>       IncrAppended();
>       (estate->es_processed)++;
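
As an inline illustration of the placement policy this hunk wires into ExecInsert (honor the clustered index's suggested block if it still has room, otherwise fall back to the FSM), here is a small standalone C model. ToyFsm, choose_block, and TOY_INVALID_BLOCK are invented names for illustration only; the real code path goes through heap_insert and RelationGetBufferForTuple.

```c
#include <stddef.h>

#define TOY_INVALID_BLOCK ((unsigned) -1)

/* Toy free-space map: free bytes per heap block. */
typedef struct
{
    unsigned    nblocks;
    unsigned    free_bytes[8];
} ToyFsm;

/*
 * Mirror of the patched policy: if the clustered index suggested a
 * block and that block can still hold the tuple, use it; otherwise
 * fall back to the first block the FSM says has enough room.
 */
static unsigned
choose_block(const ToyFsm *fsm, unsigned suggested, unsigned tuple_len)
{
    if (suggested != TOY_INVALID_BLOCK &&
        suggested < fsm->nblocks &&
        fsm->free_bytes[suggested] >= tuple_len)
        return suggested;

    for (unsigned b = 0; b < fsm->nblocks; b++)
        if (fsm->free_bytes[b] >= tuple_len)
            return b;

    return TOY_INVALID_BLOCK;   /* caller would extend the relation */
}
```

Note how a full suggested page degrades gracefully: the suggestion is simply ignored and the insert behaves exactly as before the patch, which is why a stale suggestion from a concurrent split is harmless.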
> Index: src/backend/executor/execUtils.c
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/executor/execUtils.c,v
> retrieving revision 1.139
> diff -c -r1.139 execUtils.c
> *** src/backend/executor/execUtils.c    4 Aug 2006 21:33:36 -0000    1.139
> --- src/backend/executor/execUtils.c    9 Aug 2006 18:05:05 -0000
> ***************
> *** 31,36 ****
> --- 31,37 ----
>    *        ExecOpenIndices            \
>    *        ExecCloseIndices         | referenced by InitPlan, EndPlan,
>    *        ExecInsertIndexTuples    /  ExecInsert, ExecUpdate
> +  *        ExecSuggestBlock        Referenced by ExecInsert
>    *
>    *        RegisterExprContextCallback    Register function shutdown callback
>    *        UnregisterExprContextCallback  Deregister function shutdown callback
> ***************
> *** 874,879 ****
> --- 875,881 ----
>       IndexInfo **indexInfoArray;
>
>       resultRelInfo->ri_NumIndices = 0;
> +     resultRelInfo->ri_ClusterIndex = -1;
>
>       /* fast path if no indexes */
>       if (!RelationGetForm(resultRelation)->relhasindex)
> ***************
> *** 913,918 ****
> --- 915,925 ----
>           /* extract index key information from the index's pg_index info */
>           ii = BuildIndexInfo(indexDesc);
>
> +         /* Remember which index is the clustered one; it's used to
> +          * call the amsuggestblock method on inserts */
> +         if (indexDesc->rd_index->indisclustered)
> +             resultRelInfo->ri_ClusterIndex = i;
> +
>           relationDescs[i] = indexDesc;
>           indexInfoArray[i] = ii;
>           i++;
> ***************
> *** 1062,1067 ****
> --- 1069,1137 ----
>       }
>   }
>
> + /* ----------------------------------------------------------------
> +  *        ExecSuggestBlock
> +  *
> +  *        This routine asks the index am where a new heap tuple
> +  *        should be placed.
> +  * ----------------------------------------------------------------
> +  */
> + BlockNumber
> + ExecSuggestBlock(TupleTableSlot *slot,
> +                  EState *estate)
> + {
> +     ResultRelInfo *resultRelInfo;
> +     int            i;
> +     Relation    relationDesc;
> +     Relation    heapRelation;
> +     ExprContext *econtext;
> +     Datum        values[INDEX_MAX_KEYS];
> +     bool        isnull[INDEX_MAX_KEYS];
> +     IndexInfo  *indexInfo;
> +
> +     /*
> +      * Get information from the result relation info structure.
> +      */
> +     resultRelInfo = estate->es_result_relation_info;
> +     i = resultRelInfo->ri_ClusterIndex;
> +     if (i == -1)
> +         return InvalidBlockNumber; /* there was no clustered index */
> +
> +     heapRelation = resultRelInfo->ri_RelationDesc;
> +     relationDesc = resultRelInfo->ri_IndexRelationDescs[i];
> +     indexInfo = resultRelInfo->ri_IndexRelationInfo[i];
> +
> +     /* You can't cluster on a partial index */
> +     Assert(indexInfo->ii_Predicate == NIL);
> +
> +     /*
> +      * We will use the EState's per-tuple context for evaluating
> +      * index expressions (creating it if it's not already there).
> +      */
> +     econtext = GetPerTupleExprContext(estate);
> +
> +     /* Arrange for econtext's scan tuple to be the tuple under test */
> +     econtext->ecxt_scantuple = slot;
> +
> +     /*
> +      * FormIndexDatum fills in its values and isnull parameters with the
> +      * appropriate values for the column(s) of the index.
> +      */
> +     FormIndexDatum(indexInfo,
> +                    slot,
> +                    estate,
> +                    values,
> +                    isnull);
> +
> +     /*
> +      * The index AM does the rest.
> +      */
> +     return index_suggestblock(relationDesc,    /* index relation */
> +                  values,    /* array of index Datums */
> +                  isnull,    /* null flags */
> +                  heapRelation);
> + }
> +
>   /*
>    * UpdateChangedParamSet
>    *        Add changed parameters to a plan node's chgParam set
> Index: src/include/access/genam.h
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/access/genam.h,v
> retrieving revision 1.65
> diff -c -r1.65 genam.h
> *** src/include/access/genam.h    31 Jul 2006 20:09:05 -0000    1.65
> --- src/include/access/genam.h    9 Aug 2006 17:53:44 -0000
> ***************
> *** 93,98 ****
> --- 93,101 ----
>                ItemPointer heap_t_ctid,
>                Relation heapRelation,
>                bool check_uniqueness);
> + extern BlockNumber index_suggestblock(Relation indexRelation,
> +              Datum *values, bool *isnull,
> +              Relation heapRelation);
>
>   extern IndexScanDesc index_beginscan(Relation heapRelation,
>                   Relation indexRelation,
> ***************
> *** 123,128 ****
> --- 126,133 ----
>   extern FmgrInfo *index_getprocinfo(Relation irel, AttrNumber attnum,
>                     uint16 procnum);
>
> + extern Datum dummysuggestblock(PG_FUNCTION_ARGS);
> +
>   /*
>    * index access method support routines (in genam.c)
>    */
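
The dummysuggestblock() fallback that the hash, GiST, and GIN pg_am rows point at below is declared here but its body isn't in this excerpt. Presumably it just declines to make a suggestion, along these lines (a sketch of the likely body, not code from the patch):

```
/* Sketch: a no-op amsuggestblock for AMs that cannot suggest a block. */
Datum
dummysuggestblock(PG_FUNCTION_ARGS)
{
    PG_RETURN_UINT32(InvalidBlockNumber);
}
```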
> Index: src/include/access/heapam.h
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/access/heapam.h,v
> retrieving revision 1.114
> diff -c -r1.114 heapam.h
> *** src/include/access/heapam.h    3 Jul 2006 22:45:39 -0000    1.114
> --- src/include/access/heapam.h    8 Aug 2006 16:17:21 -0000
> ***************
> *** 156,162 ****
>   extern void setLastTid(const ItemPointer tid);
>
>   extern Oid heap_insert(Relation relation, HeapTuple tup, CommandId cid,
> !             bool use_wal, bool use_fsm);
>   extern HTSU_Result heap_delete(Relation relation, ItemPointer tid,
>               ItemPointer ctid, TransactionId *update_xmax,
>               CommandId cid, Snapshot crosscheck, bool wait);
> --- 156,162 ----
>   extern void setLastTid(const ItemPointer tid);
>
>   extern Oid heap_insert(Relation relation, HeapTuple tup, CommandId cid,
> !             bool use_wal, bool use_fsm, BlockNumber suggestedblk);
>   extern HTSU_Result heap_delete(Relation relation, ItemPointer tid,
>               ItemPointer ctid, TransactionId *update_xmax,
>               CommandId cid, Snapshot crosscheck, bool wait);
> Index: src/include/access/hio.h
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/access/hio.h,v
> retrieving revision 1.32
> diff -c -r1.32 hio.h
> *** src/include/access/hio.h    13 Jul 2006 17:47:01 -0000    1.32
> --- src/include/access/hio.h    8 Aug 2006 16:17:21 -0000
> ***************
> *** 21,26 ****
>   extern void RelationPutHeapTuple(Relation relation, Buffer buffer,
>                        HeapTuple tuple);
>   extern Buffer RelationGetBufferForTuple(Relation relation, Size len,
> !                     Buffer otherBuffer, bool use_fsm);
>
>   #endif   /* HIO_H */
> --- 21,26 ----
>   extern void RelationPutHeapTuple(Relation relation, Buffer buffer,
>                        HeapTuple tuple);
>   extern Buffer RelationGetBufferForTuple(Relation relation, Size len,
> !                     Buffer otherBuffer, bool use_fsm, BlockNumber suggestedblk);
>
>   #endif   /* HIO_H */
> Index: src/include/access/nbtree.h
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/access/nbtree.h,v
> retrieving revision 1.103
> diff -c -r1.103 nbtree.h
> *** src/include/access/nbtree.h    7 Aug 2006 16:57:57 -0000    1.103
> --- src/include/access/nbtree.h    8 Aug 2006 16:17:21 -0000
> ***************
> *** 467,472 ****
> --- 467,473 ----
>   extern Datum btbulkdelete(PG_FUNCTION_ARGS);
>   extern Datum btvacuumcleanup(PG_FUNCTION_ARGS);
>   extern Datum btoptions(PG_FUNCTION_ARGS);
> + extern Datum btsuggestblock(PG_FUNCTION_ARGS);
>
>   /*
>    * prototypes for functions in nbtinsert.c
> ***************
> *** 476,481 ****
> --- 477,484 ----
>   extern Buffer _bt_getstackbuf(Relation rel, BTStack stack, int access);
>   extern void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
>                     BTStack stack, bool is_root, bool is_only);
> + extern BlockNumber _bt_suggestblock(Relation rel, IndexTuple itup,
> +              Relation heapRel);
>
>   /*
>    * prototypes for functions in nbtpage.c
> Index: src/include/catalog/pg_am.h
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/catalog/pg_am.h,v
> retrieving revision 1.46
> diff -c -r1.46 pg_am.h
> *** src/include/catalog/pg_am.h    31 Jul 2006 20:09:05 -0000    1.46
> --- src/include/catalog/pg_am.h    8 Aug 2006 16:17:21 -0000
> ***************
> *** 65,70 ****
> --- 65,71 ----
>       regproc        amvacuumcleanup;    /* post-VACUUM cleanup function */
>       regproc        amcostestimate; /* estimate cost of an indexscan */
>       regproc        amoptions;        /* parse AM-specific parameters */
> +     regproc        amsuggestblock;    /* suggest a block where to put heap tuple */
>   } FormData_pg_am;
>
>   /* ----------------
> ***************
> *** 78,84 ****
>    *        compiler constants for pg_am
>    * ----------------
>    */
> ! #define Natts_pg_am                        23
>   #define Anum_pg_am_amname                1
>   #define Anum_pg_am_amstrategies            2
>   #define Anum_pg_am_amsupport            3
> --- 79,85 ----
>    *        compiler constants for pg_am
>    * ----------------
>    */
> ! #define Natts_pg_am                        24
>   #define Anum_pg_am_amname                1
>   #define Anum_pg_am_amstrategies            2
>   #define Anum_pg_am_amsupport            3
> ***************
> *** 102,123 ****
>   #define Anum_pg_am_amvacuumcleanup        21
>   #define Anum_pg_am_amcostestimate        22
>   #define Anum_pg_am_amoptions            23
>
>   /* ----------------
>    *        initial contents of pg_am
>    * ----------------
>    */
>
> ! DATA(insert OID = 403 (  btree    5 1 1 t t t t f t btinsert btbeginscan btgettuple btgetmulti btrescan btendscan btmarkpos btrestrpos btbuild btbulkdelete btvacuumcleanup btcostestimate btoptions ));
>   DESCR("b-tree index access method");
>   #define BTREE_AM_OID 403
> ! DATA(insert OID = 405 (  hash    1 1 0 f f f f f f hashinsert hashbeginscan hashgettuple hashgetmulti hashrescan hashendscan hashmarkpos hashrestrpos hashbuild hashbulkdelete hashvacuumcleanup hashcostestimate hashoptions ));
>   DESCR("hash index access method");
>   #define HASH_AM_OID 405
> ! DATA(insert OID = 783 (  gist    100 7 0 f t t t t t gistinsert gistbeginscan gistgettuple gistgetmulti gistrescan gistendscan gistmarkpos gistrestrpos gistbuild gistbulkdelete gistvacuumcleanup gistcostestimate gistoptions ));
>   DESCR("GiST index access method");
>   #define GIST_AM_OID 783
> ! DATA(insert OID = 2742 (  gin    100 4 0 f f f f t f gininsert ginbeginscan gingettuple gingetmulti ginrescan ginendscan ginmarkpos ginrestrpos ginbuild ginbulkdelete ginvacuumcleanup gincostestimate ginoptions ));
>   DESCR("GIN index access method");
>   #define GIN_AM_OID 2742
>
> --- 103,125 ----
>   #define Anum_pg_am_amvacuumcleanup        21
>   #define Anum_pg_am_amcostestimate        22
>   #define Anum_pg_am_amoptions            23
> + #define Anum_pg_am_amsuggestblock        24
>
>   /* ----------------
>    *        initial contents of pg_am
>    * ----------------
>    */
>
> ! DATA(insert OID = 403 (  btree    5 1 1 t t t t f t btinsert btbeginscan btgettuple btgetmulti btrescan btendscan btmarkpos btrestrpos btbuild btbulkdelete btvacuumcleanup btcostestimate btoptions btsuggestblock ));
>   DESCR("b-tree index access method");
>   #define BTREE_AM_OID 403
> ! DATA(insert OID = 405 (  hash    1 1 0 f f f f f f hashinsert hashbeginscan hashgettuple hashgetmulti hashrescan hashendscan hashmarkpos hashrestrpos hashbuild hashbulkdelete hashvacuumcleanup hashcostestimate hashoptions dummysuggestblock ));
>   DESCR("hash index access method");
>   #define HASH_AM_OID 405
> ! DATA(insert OID = 783 (  gist    100 7 0 f t t t t t gistinsert gistbeginscan gistgettuple gistgetmulti gistrescan gistendscan gistmarkpos gistrestrpos gistbuild gistbulkdelete gistvacuumcleanup gistcostestimate gistoptions dummysuggestblock ));
>   DESCR("GiST index access method");
>   #define GIST_AM_OID 783
> ! DATA(insert OID = 2742 (  gin    100 4 0 f f f f t f gininsert ginbeginscan gingettuple gingetmulti ginrescan ginendscan ginmarkpos ginrestrpos ginbuild ginbulkdelete ginvacuumcleanup gincostestimate ginoptions dummysuggestblock ));
>   DESCR("GIN index access method");
>   #define GIN_AM_OID 2742
>
> Index: src/include/catalog/pg_proc.h
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/catalog/pg_proc.h,v
> retrieving revision 1.420
> diff -c -r1.420 pg_proc.h
> *** src/include/catalog/pg_proc.h    6 Aug 2006 03:53:44 -0000    1.420
> --- src/include/catalog/pg_proc.h    9 Aug 2006 18:06:44 -0000
> ***************
> *** 682,687 ****
> --- 682,689 ----
>   DESCR("btree(internal)");
>   DATA(insert OID = 2785 (  btoptions           PGNSP PGUID 12 f f t f s 2 17 "1009 16" _null_ _null_ _null_ btoptions - _null_ ));
>   DESCR("btree(internal)");
> + DATA(insert OID = 2852 (  btsuggestblock   PGNSP PGUID 12 f f t f v 4 23 "2281 2281 2281 2281" _null_ _null_ _null_ btsuggestblock - _null_ ));
> + DESCR("btree(internal)");
>
>   DATA(insert OID = 339 (  poly_same           PGNSP PGUID 12 f f t f i 2 16 "604 604" _null_ _null_ _null_ poly_same - _null_ ));
>   DESCR("same as?");
> ***************
> *** 3936,3941 ****
> --- 3938,3946 ----
>   DATA(insert OID = 2749 (  arraycontained       PGNSP PGUID 12 f f t f i 2 16 "2277 2277" _null_ _null_ _null_ arraycontained - _null_ ));
>   DESCR("anyarray contained");
>
> + DATA(insert OID = 2853 (  dummysuggestblock   PGNSP PGUID 12 f f t f v 4 23 "2281 2281 2281 2281" _null_ _null_ _null_ dummysuggestblock - _null_ ));
> + DESCR("dummy amsuggestblock implementation (internal)");
> +
>   /*
>    * Symbolic values for provolatile column: these indicate whether the result
>    * of a function is dependent *only* on the values of its explicit arguments,
> Index: src/include/executor/executor.h
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/executor/executor.h,v
> retrieving revision 1.128
> diff -c -r1.128 executor.h
> *** src/include/executor/executor.h    4 Aug 2006 21:33:36 -0000    1.128
> --- src/include/executor/executor.h    8 Aug 2006 16:17:21 -0000
> ***************
> *** 271,276 ****
> --- 271,277 ----
>   extern void ExecCloseIndices(ResultRelInfo *resultRelInfo);
>   extern void ExecInsertIndexTuples(TupleTableSlot *slot, ItemPointer tupleid,
>                         EState *estate, bool is_vacuum);
> + extern BlockNumber ExecSuggestBlock(TupleTableSlot *slot, EState *estate);
>
>   extern void RegisterExprContextCallback(ExprContext *econtext,
>                               ExprContextCallbackFunction function,
> Index: src/include/nodes/execnodes.h
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/nodes/execnodes.h,v
> retrieving revision 1.158
> diff -c -r1.158 execnodes.h
> *** src/include/nodes/execnodes.h    4 Aug 2006 21:33:36 -0000    1.158
> --- src/include/nodes/execnodes.h    8 Aug 2006 16:17:21 -0000
> ***************
> *** 257,262 ****
> --- 257,264 ----
>    *        NumIndices                # of indices existing on result relation
>    *        IndexRelationDescs        array of relation descriptors for indices
>    *        IndexRelationInfo        array of key/attr info for indices
> +  *        ClusterIndex            index into the IndexRelationInfo array of the
> +  *                                clustered index, or -1 if there is none
>    *        TrigDesc                triggers to be fired, if any
>    *        TrigFunctions            cached lookup info for trigger functions
>    *        TrigInstrument            optional runtime measurements for triggers
> ***************
> *** 272,277 ****
> --- 274,280 ----
>       int            ri_NumIndices;
>       RelationPtr ri_IndexRelationDescs;
>       IndexInfo **ri_IndexRelationInfo;
> +     int         ri_ClusterIndex;
>       TriggerDesc *ri_TrigDesc;
>       FmgrInfo   *ri_TrigFunctions;
>       struct Instrumentation *ri_TrigInstrument;
> Index: src/include/utils/rel.h
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/utils/rel.h,v
> retrieving revision 1.91
> diff -c -r1.91 rel.h
> *** src/include/utils/rel.h    3 Jul 2006 22:45:41 -0000    1.91
> --- src/include/utils/rel.h    8 Aug 2006 16:17:21 -0000
> ***************
> *** 116,121 ****
> --- 116,122 ----
>       FmgrInfo    amvacuumcleanup;
>       FmgrInfo    amcostestimate;
>       FmgrInfo    amoptions;
> +     FmgrInfo    amsuggestblock;
>   } RelationAmInfo;
>
>

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +