From 482386008d03013e525fd4024a1dc9f376eceb52 Mon Sep 17 00:00:00 2001 From: Peter Geoghegan Date: Mon, 1 Oct 2018 15:51:53 -0700 Subject: [PATCH v8 3/6] Pick nbtree split points discerningly. Add infrastructure to determine where the earliest difference appears among a pair of tuples enclosing a candidate split point. Use this within _bt_findsplitloc() to weigh how effective suffix truncation will be at each candidate split point. This is primarily useful because it maximizes the effectiveness of suffix truncation, without noticeably affecting the balance of free space within each half of the split. _bt_findsplitloc() is also taught to care about the case where there are many duplicates, making it hard to find a distinguishing split point. _bt_findsplitloc() may even conclude that it isn't possible to avoid filling a page entirely with duplicates, in which case it packs pages full of duplicates very tightly. The number of cycles added is not very noticeable, which is important, since _bt_findsplitloc() is run while an exclusive (leaf page) buffer lock is held. We avoid using authoritative insertion scankey comparisons, unlike suffix truncation proper. This patch is required to credibly assess anything about the performance of the patch series. Applying the patches up to and including this patch in the series is sufficient to see much better space utilization and space reuse with cases where many duplicates are inserted. (These are cases that result in searches for free space among many pages full of duplicates, where the search inevitably "gets tired" on the master branch [1].)
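As a rough illustration of the central idea (not part of the patch; all names here are invented for the sketch), a candidate split point's "penalty" can be modeled as the 1-based number of the first attribute that distinguishes the tuple to the left of the split from the tuple to the right -- every later attribute is a candidate for suffix truncation:

```c
#include <assert.h>

/*
 * Toy model of a multi-attribute index tuple: nkeyatts integer key
 * attributes.  (The real code compares datums using opclass support
 * functions; this sketch only illustrates the shape of the comparison.)
 */
typedef struct ToyTuple
{
    int nkeyatts;
    int atts[8];
} ToyTuple;

/*
 * Return the 1-based attribute number of the first attribute that
 * distinguishes lastleft from firstright.  Attributes after that one can
 * be truncated away from the new pivot tuple.  If no key attribute
 * differs, return nkeyatts + 1, modeling the case where a heap TID
 * tie-breaker would have to be appended.
 */
static int
toy_split_penalty(const ToyTuple *lastleft, const ToyTuple *firstright)
{
    int attnum;

    for (attnum = 1; attnum <= lastleft->nkeyatts; attnum++)
    {
        if (lastleft->atts[attnum - 1] != firstright->atts[attnum - 1])
            return attnum;
    }
    return lastleft->nkeyatts + 1;
}
```

A penalty of nkeyatts + 1 models the outcome the split logic works hardest to avoid: no key attribute distinguishes the two halves, so only an appended heap TID could.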
[1] https://postgr.es/m/CAH2-Wzmf0fvVhU+SSZpGW4Qe9t--j_DmXdX3it5JcdB8FF2EsA@mail.gmail.com --- src/backend/access/nbtree/README | 66 ++- src/backend/access/nbtree/nbtinsert.c | 638 +++++++++++++++++++++++--- src/backend/access/nbtree/nbtutils.c | 78 ++++ src/include/access/nbtree.h | 8 +- 4 files changed, 719 insertions(+), 71 deletions(-) diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README index 75cb1d1e22..6f7297b522 100644 --- a/src/backend/access/nbtree/README +++ b/src/backend/access/nbtree/README @@ -165,9 +165,9 @@ Lehman and Yao assume fixed-size keys, but we must deal with variable-size keys. Therefore there is not a fixed maximum number of keys per page; we just stuff in as many as will fit. When we split a page, we try to equalize the number of bytes, not items, assigned to -each of the resulting pages. Note we must include the incoming item in -this calculation, otherwise it is possible to find that the incoming -item doesn't fit on the split page where it needs to go! +pages (though suffix truncation is also considered). Note we must include +the incoming item in this calculation, otherwise it is possible to find +that the incoming item doesn't fit on the split page where it needs to go! The Deletion Algorithm ---------------------- @@ -669,6 +669,66 @@ variable-length types, such as text. An opclass support function could manufacture the shortest possible key value that still correctly separates each half of a leaf page split. +There are sophisticated criteria for choosing a leaf page split point. The +general idea is to make suffix truncation effective without unduly +influencing the balance of space for each half of the page split. The +choice of leaf split point can be thought of as a choice among points +"between" items on the page to be split, at least if you pretend that the +incoming tuple was placed on the page already, without provoking a split.
+The split point between two index tuples with differences that appear as +early as possible allows us to truncate away as many attributes as +possible. + +Obviously suffix truncation is valuable because it makes pivot tuples +smaller, which delays splits of internal pages, but that isn't the only +reason why it's effective. There are cases where suffix truncation can +leave a B-Tree significantly smaller in size than it would have otherwise +been, without actually making any pivot tuple smaller due to restrictions +relating to alignment. The criteria for choosing a leaf page split point +for suffix truncation are often also predictive of future space utilization. +Furthermore, even truncation that doesn't make pivot tuples smaller still +prevents pivot tuples from being more restrictive than truly necessary in +how they describe which values belong on which leaf pages. + +While it's not possible to correctly perform suffix truncation during +internal page splits, it's still useful to be discriminating when splitting +an internal page. Among split points within an acceptable range of the +optimal fillfactor-wise split point, the one that implies the smallest +downlink to insert in the parent is chosen. This idea also comes +from the Prefix B-Tree paper. This process has much in common with what +happens at the leaf level to make suffix truncation effective. The overall +effect is that suffix truncation tends to produce smaller and less +discriminating pivot tuples, especially early in the lifetime of the index, +while biasing internal page splits makes the earlier, less discriminating +pivot tuples end up in the root page, delaying root page splits. + +With v4 B-Trees, every tuple at the leaf level must be individually +locatable by an insertion scankey that's fully filled-out by +_bt_mkscankey(). Heap TID is treated as a tie-breaker key attribute to +make this work.
Suffix truncation must occasionally make a pivot tuple +*larger* than the leaf tuple that it's based on, since a heap TID must be +appended when nothing else distinguishes each side of a leaf split. This +is not represented in the same way as it is at the leaf level (we must +append an additional attribute), since pivot tuples already use the generic +IndexTuple fields to describe which child page they point to, and how many +attributes are in the pivot tuple. Adding a heap TID attribute during a +leaf page split should only occur when there is an entire page full of +duplicates, though, since the logic for selecting a split point will do all +it can to avoid this outcome --- it may apply "many duplicates" mode, or +"single value" mode. + +Avoiding appending a heap TID to a pivot tuple is about much more than just +saving a single MAXALIGN() quantum in each of the pages that store the new +pivot. It's worth going out of our way to avoid having a single value (or +composition of key values) span two leaf pages when that isn't truly +necessary, since if that's allowed to happen every point index scan will +have to visit both pages. It also makes it less likely that VACUUM will be +able to perform page deletion on either page. Finally, it's not unheard of +for unique indexes to have pages full of duplicates in the event of extreme +contention (which appears as buffer lock contention) --- this is also +ameliorated. These are all examples of how "false sharing" across B-Tree +pages can cause performance problems. 
+ Notes About Data Representation ------------------------------- diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c index 318cbd3551..0e37b8b23a 100644 --- a/src/backend/access/nbtree/nbtinsert.c +++ b/src/backend/access/nbtree/nbtinsert.c @@ -28,25 +28,44 @@ /* Minimum tree height for application of fastpath optimization */ #define BTREE_FASTPATH_MIN_LEVEL 2 +/* _bt_findsplitloc limits on suffix truncation split interval */ +#define MAX_LEAF_SPLIT_POINTS 9 +#define MAX_INTERNAL_SPLIT_POINTS 3 + +typedef enum +{ + /* strategy to use for a call to FindSplitData */ + SPLIT_DEFAULT, /* give some weight to truncation */ + SPLIT_MANY_DUPLICATES, /* find minimally distinguishing point */ + SPLIT_SINGLE_VALUE /* leave left page almost empty */ +} SplitMode; + +typedef struct +{ + /* FindSplitData candidate split */ + int delta; /* size delta */ + bool newitemonleft; /* new item on left or right of split */ + OffsetNumber firstright; /* split point */ +} SplitPoint; typedef struct { /* context data for _bt_checksplitloc */ + SplitMode mode; /* strategy for deciding split point */ Size newitemsz; /* size of new item to be inserted */ - int fillfactor; /* needed when splitting rightmost page */ + double fillfactor; /* needed for weighted splits */ + int goodenough; bool is_leaf; /* T if splitting a leaf page */ - bool is_rightmost; /* T if splitting a rightmost page */ + bool is_weighted; /* T if weighted (e.g. rightmost) split */ OffsetNumber newitemoff; /* where the new item is to be inserted */ + bool hikeyheaptid; /* T if high key will likely get heap TID */ int leftspace; /* space available for items on left page */ int rightspace; /* space available for items on right page */ int olddataitemstotal; /* space taken by old items */ - bool have_split; /* found a valid split? 
*/ - - /* these fields valid only if have_split is true */ - bool newitemonleft; /* new item on left or right of best split */ - OffsetNumber firstright; /* best split point */ - int best_delta; /* best size delta so far */ + int maxsplits; /* Maximum number of splits */ + int nsplits; /* Current number of splits */ + SplitPoint *splits; /* Sorted by delta */ } FindSplitData; @@ -76,12 +95,21 @@ static Buffer _bt_split(Relation rel, Buffer buf, Buffer cbuf, static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf, BTStack stack, bool is_root, bool is_only); static OffsetNumber _bt_findsplitloc(Relation rel, Page page, - OffsetNumber newitemoff, - Size newitemsz, - bool *newitemonleft); -static void _bt_checksplitloc(FindSplitData *state, + SplitMode mode, OffsetNumber newitemoff, + Size newitemsz, IndexTuple newitem, bool *newitemonleft); +static int _bt_checksplitloc(FindSplitData *state, OffsetNumber firstoldonright, bool newitemonleft, int dataitemstoleft, Size firstoldonrightsz); +static OffsetNumber _bt_bestsplitloc(Relation rel, Page page, + FindSplitData *state, + int perfectpenalty, + OffsetNumber newitemoff, + IndexTuple newitem, bool *newitemonleft); +static int _bt_perfect_penalty(Relation rel, Page page, FindSplitData *state, + OffsetNumber newitemoff, IndexTuple newitem, + SplitMode *secondmode); +static int _bt_split_penalty(Relation rel, Page page, OffsetNumber newitemoff, + IndexTuple newitem, SplitPoint *split, bool is_leaf); static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup, OffsetNumber itup_off); static bool _bt_isequal(TupleDesc itupdesc, BTScanInsert itup_scankey, @@ -990,8 +1018,8 @@ _bt_insertonpg(Relation rel, BlockNumberIsValid(RelationGetTargetBlock(rel)))); /* Choose the split point */ - firstright = _bt_findsplitloc(rel, page, - newitemoff, itemsz, + firstright = _bt_findsplitloc(rel, page, SPLIT_DEFAULT, + newitemoff, itemsz, itup, &newitemonleft); /* @@ -1687,6 +1715,30 @@ _bt_split(Relation rel, Buffer 
buf, Buffer cbuf, OffsetNumber firstright, * for it, we might find ourselves with too little room on the page that * it needs to go into!) * + * We also give some weight to suffix truncation in deciding a split point + * on leaf pages. We try to select a point where a distinguishing attribute + * appears earlier in the new high key for the left side of the split, in + * order to maximize the number of trailing attributes that can be truncated + * away. Initially, only candidate split points that imply an acceptable + * balance of free space on each side are considered. This is even useful + * with pages that only have a single (non-TID) attribute, since it's + * helpful to avoid appending an explicit heap TID attribute to the new + * pivot tuple (high key/downlink) when it cannot actually be truncated. + * Note that it is always assumed that caller goes on to perform truncation, + * even with pg_upgrade'd indexes where that isn't actually the case. There + * is still a modest benefit to choosing a split location while weighing + * suffix truncation: the resulting (untruncated) pivot tuples are + * nevertheless more predictive of future space utilization. + * + * We do all we can to avoid having to append a heap TID in the new high + * key. We may have to call ourselves recursively in many duplicates mode. + * This happens when a heap TID would otherwise be appended, but the page + * isn't completely full of logical duplicates (there may be as few as two + * distinct values). Many duplicates mode has no hard requirements for + * space utilization, though it still keeps the use of space balanced as a + * non-binding secondary goal. This significantly improves fan-out in + * practice, at least with most affected workloads. + * * If the page is the rightmost page on its level, we instead try to arrange * to leave the left split page fillfactor% full.
In this way, when we are * inserting successively increasing keys (consider sequences, timestamps, @@ -1695,6 +1747,16 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright, * This is the same as nbtsort.c produces for a newly-created tree. Note * that leaf and nonleaf pages use different fillfactors. * + * If called recursively in single value mode, we also try to arrange to + * leave the left split page fillfactor% full, though we arrange to use a + * fillfactor that's even more left-heavy than the fillfactor used for + * rightmost pages. This greatly helps with space management in cases where + * tuples with the same attribute values span multiple pages. Newly + * inserted duplicates will tend to have higher heap TID values, so we'll + * end up splitting to the right in the manner of ascending insertions of + * monotonically increasing values. See nbtree/README for more information + * about suffix truncation, and how a split point is chosen. + * * We are passed the intended insert position of the new tuple, expressed as * the offsetnumber of the tuple it must go in front of. (This could be * maxoff+1 if the tuple is to go at the end.) 
@@ -1725,8 +1787,10 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright, static OffsetNumber _bt_findsplitloc(Relation rel, Page page, + SplitMode mode, OffsetNumber newitemoff, Size newitemsz, + IndexTuple newitem, bool *newitemonleft) { BTPageOpaque opaque; @@ -1736,15 +1800,16 @@ _bt_findsplitloc(Relation rel, FindSplitData state; int leftspace, rightspace, - goodenough, olddataitemstotal, - olddataitemstoleft; + olddataitemstoleft, + perfectpenalty; bool goodenoughfound; + SplitPoint splits[MAX_LEAF_SPLIT_POINTS]; + SplitMode secondmode; + OffsetNumber finalfirstright; opaque = (BTPageOpaque) PageGetSpecialPointer(page); - - /* Passed-in newitemsz is MAXALIGNED but does not include line pointer */ - newitemsz += sizeof(ItemIdData); + maxoff = PageGetMaxOffsetNumber(page); /* Total free space available on a btree page, after fixed overhead */ leftspace = rightspace = @@ -1762,18 +1827,60 @@ _bt_findsplitloc(Relation rel, /* Count up total space in data items without actually scanning 'em */ olddataitemstotal = rightspace - (int) PageGetExactFreeSpace(page); - state.newitemsz = newitemsz; + /* Passed-in newitemsz is MAXALIGNED but does not include line pointer */ + state.mode = mode; + state.newitemsz = newitemsz + sizeof(ItemIdData); + state.hikeyheaptid = (mode == SPLIT_SINGLE_VALUE); state.is_leaf = P_ISLEAF(opaque); - state.is_rightmost = P_RIGHTMOST(opaque); - state.have_split = false; + state.is_weighted = P_RIGHTMOST(opaque); if (state.is_leaf) - state.fillfactor = RelationGetFillFactor(rel, - BTREE_DEFAULT_FILLFACTOR); + { + if (state.mode != SPLIT_SINGLE_VALUE) + { + /* Only used on rightmost page */ + state.fillfactor = RelationGetFillFactor(rel, + BTREE_DEFAULT_FILLFACTOR) / 100.0; + } + else + { + state.fillfactor = BTREE_SINGLEVAL_FILLFACTOR / 100.0; + state.is_weighted = true; + } + } else - state.fillfactor = BTREE_NONLEAF_FILLFACTOR; - state.newitemonleft = false; /* these just to keep compiler quiet */ - state.firstright = 
0; - state.best_delta = 0; + { + Assert(state.mode == SPLIT_DEFAULT); + /* Only used on rightmost page */ + state.fillfactor = BTREE_NONLEAF_FILLFACTOR / 100.0; + } + + /* + * Set limits on the split interval/number of candidate split points as + * appropriate. The "Prefix B-Trees" paper refers to this as sigma l for + * leaf splits and sigma b for internal ("branch") splits. It's hard to + * provide a theoretical justification for the size of the split interval, + * though it's clear that a small split interval improves space + * utilization. + * + * (Also set interval for case when we split a page that has many + * duplicates, or split a page that's entirely full of tuples of a single + * value. Future locality of access is prioritized over short-term space + * utilization in these cases.) + */ + if (!state.is_leaf) + state.maxsplits = MAX_INTERNAL_SPLIT_POINTS; + else if (state.mode == SPLIT_DEFAULT) + state.maxsplits = Min(Max(3, maxoff * 0.05), MAX_LEAF_SPLIT_POINTS); + else if (state.mode == SPLIT_MANY_DUPLICATES) + state.maxsplits = maxoff + 2; + else + state.maxsplits = 1; + state.nsplits = 0; + if (state.mode != SPLIT_MANY_DUPLICATES) + state.splits = splits; + else + state.splits = palloc(sizeof(SplitPoint) * state.maxsplits); + state.leftspace = leftspace; state.rightspace = rightspace; state.olddataitemstotal = olddataitemstotal; @@ -1782,13 +1889,15 @@ _bt_findsplitloc(Relation rel, /* * Finding the best possible split would require checking all the possible * split points, because of the high-key and left-key special cases. - * That's probably more work than it's worth; instead, stop as soon as we - * find a "good-enough" split, where good-enough is defined as an - * imbalance in free space of no more than pagesize/16 (arbitrary...) This - * should let us stop near the middle on most pages, instead of plowing to - * the end. 
+ * That's probably more work than it's worth outside of many duplicates + * mode; instead, stop as soon as we find sufficiently-many "good-enough" + * splits, where good-enough is defined as an imbalance in free space of + * no more than pagesize/16 (arbitrary...) This should let us stop near + * the middle on most pages, instead of plowing to the end. Many + * duplicates mode must consider all possible choices, and so does not use + * this threshold for anything. */ - goodenough = leftspace / 16; + state.goodenough = leftspace / 16; /* * Scan through the data items and calculate space usage for a split at @@ -1796,13 +1905,13 @@ _bt_findsplitloc(Relation rel, */ olddataitemstoleft = 0; goodenoughfound = false; - maxoff = PageGetMaxOffsetNumber(page); for (offnum = P_FIRSTDATAKEY(opaque); offnum <= maxoff; offnum = OffsetNumberNext(offnum)) { Size itemsz; + int delta; itemid = PageGetItemId(page, offnum); itemsz = MAXALIGN(ItemIdGetLength(itemid)) + sizeof(ItemIdData); @@ -1811,28 +1920,35 @@ _bt_findsplitloc(Relation rel, * Will the new item go to left or right of split? */ if (offnum > newitemoff) - _bt_checksplitloc(&state, offnum, true, - olddataitemstoleft, itemsz); + delta = _bt_checksplitloc(&state, offnum, true, + olddataitemstoleft, itemsz); else if (offnum < newitemoff) - _bt_checksplitloc(&state, offnum, false, - olddataitemstoleft, itemsz); + delta = _bt_checksplitloc(&state, offnum, false, + olddataitemstoleft, itemsz); else { /* need to try it both ways! 
*/ - _bt_checksplitloc(&state, offnum, true, - olddataitemstoleft, itemsz); + (void) _bt_checksplitloc(&state, offnum, true, + olddataitemstoleft, itemsz); - _bt_checksplitloc(&state, offnum, false, - olddataitemstoleft, itemsz); + delta = _bt_checksplitloc(&state, offnum, false, + olddataitemstoleft, itemsz); } - /* Abort scan once we find a good-enough choice */ - if (state.have_split && state.best_delta <= goodenough) - { + /* Record when good-enough choice found */ + if (state.nsplits > 0 && state.splits[0].delta <= state.goodenough) goodenoughfound = true; + + /* + * Abort scan once we've found a good-enough choice, and reach the + * point where we stop finding new good-enough choices. Don't do this + * in many duplicates mode, though, since that has to be completely + * exhaustive. + */ + if (goodenoughfound && state.mode != SPLIT_MANY_DUPLICATES && + delta > state.goodenough) break; - } olddataitemstoleft += itemsz; } @@ -1842,19 +1958,50 @@ _bt_findsplitloc(Relation rel, * the old items go to the left page and the new item goes to the right * page. */ - if (newitemoff > maxoff && !goodenoughfound) + if (newitemoff > maxoff && + (!goodenoughfound || state.mode == SPLIT_MANY_DUPLICATES)) _bt_checksplitloc(&state, newitemoff, false, olddataitemstotal, 0); /* * I believe it is not possible to fail to find a feasible split, but just * in case ... */ - if (!state.have_split) + if (state.nsplits == 0) elog(ERROR, "could not find a feasible split point for index \"%s\"", RelationGetRelationName(rel)); - *newitemonleft = state.newitemonleft; - return state.firstright; + /* + * Search among acceptable split points for the entry with the lowest + * penalty. See _bt_split_penalty() for the definition of penalty. The + * goal here is to increase fan-out, by choosing a split point which is + * amenable to being made smaller by suffix truncation, or is already + * small. + * + * First find lowest possible penalty among acceptable split points -- the + * "perfect" penalty. 
This will be passed to _bt_bestsplitloc() if it + * determines that candidate split points are good enough to finish + * default mode split. Perfect penalty saves _bt_bestsplitloc() + * additional work around calculating penalties. + */ + perfectpenalty = _bt_perfect_penalty(rel, page, &state, newitemoff, + newitem, &secondmode); + + /* Start second pass over page if _bt_perfect_penalty() told us to */ + if (secondmode != SPLIT_DEFAULT) + return _bt_findsplitloc(rel, page, secondmode, newitemoff, newitemsz, + newitem, newitemonleft); + + /* + * Search among acceptable split points for the entry that has the lowest + * penalty, and thus maximizes fan-out. Sets *newitemonleft for us. + */ + finalfirstright = _bt_bestsplitloc(rel, page, &state, perfectpenalty, + newitemoff, newitem, newitemonleft); + /* Be tidy */ + if (state.splits != splits) + pfree(state.splits); + + return finalfirstright; } /* @@ -1869,8 +2016,11 @@ _bt_findsplitloc(Relation rel, * * olddataitemstoleft is the total size of all old items to the left of * firstoldonright. + * + * Returns delta between space that will be left free on left and right side + * of split. */ -static void +static int _bt_checksplitloc(FindSplitData *state, OffsetNumber firstoldonright, bool newitemonleft, @@ -1878,7 +2028,8 @@ _bt_checksplitloc(FindSplitData *state, Size firstoldonrightsz) { int leftfree, - rightfree; + rightfree, + leftleafheaptidsz; Size firstrightitemsz; bool newitemisfirstonright; @@ -1898,15 +2049,38 @@ _bt_checksplitloc(FindSplitData *state, /* * The first item on the right page becomes the high key of the left page; - * therefore it counts against left space as well as right space. When + * therefore it counts against left space as well as right space (we + * cannot assume that suffix truncation will make it any smaller). When * index has included attributes, then those attributes of left page high * key will be truncated leaving that page with slightly more free space. 
* However, that shouldn't affect our ability to find valid split - * location, because anyway split location should exists even without high - * key truncation. + * location, since we err in the direction of being pessimistic about free + * space on the left half. Besides, even when suffix truncation of + * non-TID attributes occurs, there often won't be an entire MAXALIGN() + * quantum in pivot space savings. */ leftfree -= firstrightitemsz; + /* + * Assume that suffix truncation cannot avoid adding a heap TID to the + * left half's new high key when splitting at the leaf level. Don't let + * this impact the balance of free space in the common case where adding a + * heap TID is considered very unlikely, though, since there is no reason + * to accept a likely-suboptimal split. + * + * When adding a heap TID seems likely, then actually factor that into the + * delta calculation, rather than just having it as a constraint on + * whether or not a split is acceptable. + */ + leftleafheaptidsz = 0; + if (state->is_leaf) + { + if (!state->hikeyheaptid) + leftleafheaptidsz = sizeof(ItemPointerData); + else + leftfree -= (int) sizeof(ItemPointerData); + } + /* account for the new item */ if (newitemonleft) leftfree -= (int) state->newitemsz; @@ -1922,20 +2096,23 @@ _bt_checksplitloc(FindSplitData *state, (int) (MAXALIGN(sizeof(IndexTupleData)) + sizeof(ItemIdData)); /* - * If feasible split point, remember best delta. + * If feasible split point with lower delta than that of most marginal + * split point so far, or we haven't run out of space for split points, + * remember it. */ - if (leftfree >= 0 && rightfree >= 0) + if (leftfree - leftleafheaptidsz >= 0 && rightfree >= 0) { int delta; - if (state->is_rightmost) + if (state->is_weighted) { /* - * If splitting a rightmost page, try to put (100-fillfactor)% of - * free space on left page. See comments for _bt_findsplitloc.
+ * If splitting a rightmost page, or in single value mode, try to + * put (100-fillfactor)% of free space on left page. See comments + * for _bt_findsplitloc. */ delta = (state->fillfactor * leftfree) - - ((100 - state->fillfactor) * rightfree); + - ((1.0 - state->fillfactor) * rightfree); } else { @@ -1945,14 +2122,341 @@ _bt_checksplitloc(FindSplitData *state, if (delta < 0) delta = -delta; - if (!state->have_split || delta < state->best_delta) + /* + * Optimization: Don't recognize differences among marginal split + * points that are unlikely to end up being used anyway. + * + * We cannot do this in many duplicates mode, because that hurts cases + * where there are a small number of available distinguishing split + * points, and consistently picking the least worst choice among them + * matters. (e.g., a non-unique index whose leaf pages each contain a + * small number of distinct values, with each value duplicated a + * uniform number of times.) + */ + if (delta > state->goodenough && state->mode != SPLIT_MANY_DUPLICATES) + delta = state->goodenough + 1; + if (state->nsplits < state->maxsplits || + delta < state->splits[state->nsplits - 1].delta) { - state->have_split = true; - state->newitemonleft = newitemonleft; - state->firstright = firstoldonright; - state->best_delta = delta; + SplitPoint newsplit; + int j; + + newsplit.delta = delta; + newsplit.newitemonleft = newitemonleft; + newsplit.firstright = firstoldonright; + + /* + * Make space at the end of the state array for new candidate + * split point if we haven't already reached the maximum number of + * split points. + */ + if (state->nsplits < state->maxsplits) + state->nsplits++; + + /* + * Replace the final item in the nsplits-wise array. The final + * item is either a garbage still-uninitialized entry, or the most + * marginal real entry when we already have as many split points + * as we're willing to consider. 
+ */ + for (j = state->nsplits - 1; + j > 0 && state->splits[j - 1].delta > newsplit.delta; + j--) + { + state->splits[j] = state->splits[j - 1]; + } + state->splits[j] = newsplit; + } + + return delta; + } + + return INT_MAX; +} + +/* + * Subroutine to find the "best" split point among an array of acceptable + * candidate split points that split without there being an excessively high + * delta between the space left free on the left and right halves. The "best" + * split point is the split point with the lowest penalty, which is an + * abstract idea whose definition varies depending on whether we're splitting + * at the leaf level, or an internal level. See _bt_split_penalty() for the + * definition. + * + * "perfectpenalty" is assumed to be the lowest possible penalty among + * candidate split points. This allows us to return early without wasting + * cycles on calculating the first differing attribute for all candidate + * splits when that clearly cannot improve our choice. This optimization is + * important for several common cases, including insertion into a primary key + * index on an auto-incremented or monotonically increasing integer column. + * + * We return the index of the first existing tuple that should go on the + * righthand page, plus a boolean indicating if new item is on left of split + * point. 
+ */ +static OffsetNumber +_bt_bestsplitloc(Relation rel, + Page page, + FindSplitData *state, + int perfectpenalty, + OffsetNumber newitemoff, + IndexTuple newitem, + bool *newitemonleft) +{ + int bestpenalty, + lowsplit; + + /* No point calculating penalties in trivial cases */ + if (perfectpenalty == INT_MAX || state->nsplits == 1) + { + *newitemonleft = state->splits[0].newitemonleft; + return state->splits[0].firstright; + } + + bestpenalty = INT_MAX; + lowsplit = 0; + for (int i = 0; i < state->nsplits; i++) + { + int penalty; + + penalty = _bt_split_penalty(rel, page, newitemoff, newitem, + state->splits + i, state->is_leaf); + + if (penalty <= perfectpenalty) + { + bestpenalty = penalty; + lowsplit = i; + break; + } + + if (penalty < bestpenalty) + { + bestpenalty = penalty; + lowsplit = i; } } + + *newitemonleft = state->splits[lowsplit].newitemonleft; + return state->splits[lowsplit].firstright; +} + +/* + * Subroutine to find the lowest possible penalty for any acceptable candidate + * split point. This may be lower than any real penalty for any of the + * candidate split points, in which case the optimization is ineffective. + * Split penalties are generally discrete rather than continuous, so an + * actually-obtainable penalty is common. + * + * This is also a convenient point to decide to either finish splitting + * the page using the default strategy, or, alternatively, to do a second pass + * over page using a different strategy. (This only happens with leaf pages.) 
+ */ +static int +_bt_perfect_penalty(Relation rel, Page page, FindSplitData *state, + OffsetNumber newitemoff, IndexTuple newitem, + SplitMode *secondmode) +{ + ItemId itemid; + OffsetNumber center; + IndexTuple leftmost, + rightmost; + int perfectpenalty; + + /* Assume that a second pass over page won't be required for now */ + *secondmode = SPLIT_DEFAULT; + + /* + * There is a much smaller number of candidate split points when + * splitting an internal page, so we can afford to be exhaustive. Only + * give up when the pivot that will be inserted into the parent is as + * small as possible. + */ + if (!state->is_leaf) + return MAXALIGN(sizeof(IndexTupleData) + 1); + + /* + * During a many duplicates pass over page, we settle for a "perfect" + * split point that merely avoids appending a heap TID in new pivot. + * Appending a heap TID is harmful enough to fan-out that it's worth + * avoiding at all costs, but it doesn't make sense to go to those lengths + * to also be able to truncate an extra, earlier attribute. + */ + if (state->mode == SPLIT_MANY_DUPLICATES) + return IndexRelationGetNumberOfKeyAttributes(rel); + else if (state->mode == SPLIT_SINGLE_VALUE) + return INT_MAX; + + /* + * Complicated though common case -- leaf page default mode split. + * + * Iterate from the end of split array to the start, in search of the + * firstright-wise leftmost and rightmost entries among acceptable split + * points. The split point with the lowest delta is at the start of the + * array. It is deemed to be the split point whose firstright offset is + * at the center. Split points with firstright offsets at both the left + * and right extremes among acceptable split points will be found at the + * end of caller's array.
+ */ + leftmost = NULL; + rightmost = NULL; + center = state->splits[0].firstright; + + /* + * Leaf split points can be thought of as points _between_ tuples on the + * original unsplit page image, at least if you pretend that the incoming + * tuple is already on the page to be split (imagine that the original + * unsplit page actually had enough space to fit the incoming tuple). The + * rightmost tuple is the tuple that is immediately to the right of a + * split point that is itself rightmost. Likewise, the leftmost tuple is + * the tuple to the left of the leftmost split point. It's important that + * many duplicates mode has every opportunity to avoid picking a split + * point that requires that suffix truncation append a heap TID to new + * pivot tuple. + * + * When there are very few candidates, no sensible comparison can be made + * here, resulting in caller selecting lowest delta/the center split point + * by default. Typically, leftmost and rightmost tuples will be located + * almost immediately. 
+     */
+    perfectpenalty = IndexRelationGetNumberOfKeyAttributes(rel);
+    for (int j = state->nsplits - 1; j > 1; j--)
+    {
+        SplitPoint *split = state->splits + j;
+
+        if (!leftmost && split->firstright <= center)
+        {
+            if (split->newitemonleft && newitemoff == split->firstright)
+                leftmost = newitem;
+            else
+            {
+                itemid = PageGetItemId(page,
+                                       OffsetNumberPrev(split->firstright));
+                leftmost = (IndexTuple) PageGetItem(page, itemid);
+            }
+        }
+
+        if (!rightmost && split->firstright >= center)
+        {
+            if (!split->newitemonleft && newitemoff == split->firstright)
+                rightmost = newitem;
+            else
+            {
+                itemid = PageGetItemId(page, split->firstright);
+                rightmost = (IndexTuple) PageGetItem(page, itemid);
+            }
+        }
+
+        if (leftmost && rightmost)
+        {
+            Assert(leftmost != rightmost);
+            perfectpenalty = _bt_leave_natts_fast(rel, leftmost, rightmost);
+            break;
+        }
+    }
+
+    /*
+     * Work out which type of second pass the caller must perform when even a
+     * "perfect" penalty fails to avoid appending a heap TID to the new pivot
+     * tuple.
+     */
+    if (perfectpenalty > IndexRelationGetNumberOfKeyAttributes(rel))
+    {
+        BTPageOpaque opaque;
+        OffsetNumber maxoff;
+        int         outerpenalty;
+
+        opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+        maxoff = PageGetMaxOffsetNumber(page);
+
+        if (P_FIRSTDATAKEY(opaque) == newitemoff)
+            leftmost = newitem;
+        else
+        {
+            itemid = PageGetItemId(page, P_FIRSTDATAKEY(opaque));
+            leftmost = (IndexTuple) PageGetItem(page, itemid);
+        }
+
+        if (newitemoff > maxoff)
+            rightmost = newitem;
+        else
+        {
+            itemid = PageGetItemId(page, maxoff);
+            rightmost = (IndexTuple) PageGetItem(page, itemid);
+        }
+
+        Assert(leftmost != rightmost);
+        outerpenalty = _bt_leave_natts_fast(rel, leftmost, rightmost);
+
+        /*
+         * If the page has many duplicates but is not entirely full of
+         * duplicates, a many duplicates mode pass will be performed.  If the
+         * page is entirely full of duplicates, a single value mode pass will
+         * be performed.
+         *
+         * Caller should avoid a single value mode pass when the incoming
+         * tuple doesn't sort highest among items on the page, though.
+         * Instead, we instruct caller to continue with the original default
+         * mode split, since an out-of-order new duplicate item predicts
+         * further inserts towards the left/middle of the original page's
+         * keyspace.  Evenly sharing space among each half of the split
+         * avoids pathological performance.
+         */
+        if (outerpenalty > IndexRelationGetNumberOfKeyAttributes(rel))
+        {
+            if (maxoff < newitemoff)
+                *secondmode = SPLIT_SINGLE_VALUE;
+            else
+            {
+                perfectpenalty = INT_MAX;
+                *secondmode = SPLIT_DEFAULT;
+            }
+        }
+        else
+            *secondmode = SPLIT_MANY_DUPLICATES;
+    }
+
+    return perfectpenalty;
+}
+
+/*
+ * Subroutine to find the penalty for a caller's candidate split point.
+ *
+ * On leaf pages, the penalty is the attribute number that distinguishes each
+ * side of a split.  It's the last attribute that needs to be included in the
+ * new high key for the left page.  It can be greater than the number of key
+ * attributes in cases where a heap TID needs to be appended during
+ * truncation.
+ *
+ * On internal pages, the penalty is simply the size of the first item on the
+ * right half of the split (excluding ItemId overhead), which becomes the new
+ * high key for the left page.  Internal page splits always use default mode.
+ */
+static int
+_bt_split_penalty(Relation rel, Page page, OffsetNumber newitemoff,
+                  IndexTuple newitem, SplitPoint *split, bool is_leaf)
+{
+    ItemId      itemid;
+    IndexTuple  lastleft;
+    IndexTuple  firstright;
+
+    if (split->newitemonleft && newitemoff == split->firstright)
+        lastleft = newitem;
+    else
+    {
+        itemid = PageGetItemId(page, OffsetNumberPrev(split->firstright));
+        lastleft = (IndexTuple) PageGetItem(page, itemid);
+    }
+
+    if (!split->newitemonleft && newitemoff == split->firstright)
+        firstright = newitem;
+    else
+    {
+        itemid = PageGetItemId(page, split->firstright);
+        firstright = (IndexTuple) PageGetItem(page, itemid);
+    }
+
+    if (!is_leaf)
+        return IndexTupleSize(firstright);
+
+    Assert(lastleft != firstright);
+    return _bt_leave_natts_fast(rel, lastleft, firstright);
 }
 
 /*
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 629066fcf9..449b5bc63b 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -22,6 +22,7 @@
 #include "access/relscan.h"
 #include "miscadmin.h"
 #include "utils/array.h"
+#include "utils/datum.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
@@ -2323,6 +2324,83 @@ _bt_leave_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	return leavenatts;
 }
 
+/*
+ * _bt_leave_natts_fast - fast, approximate variant of _bt_leave_natts.
+ *
+ * This is exported so that a candidate split point can have its effect on
+ * suffix truncation inexpensively evaluated ahead of time when finding a
+ * split location.  A naive bitwise approach to datum comparisons is used to
+ * save cycles.  This is inherently approximate, but usually provides the
+ * same answer as the authoritative approach that _bt_leave_natts takes,
+ * since the vast majority of types in Postgres cannot be equal according to
+ * any available opclass unless they're bitwise equal.
+ *
+ * Testing has shown that an approach involving treating the tuple as a
+ * decomposed binary string would work almost as well as the approach taken
+ * here.  It would also be faster.  It might actually be necessary to go that
+ * way in the future, if suffix truncation is made sophisticated enough to
+ * truncate at a finer granularity (i.e. truncate within an attribute, rather
+ * than just truncating away whole attributes).  The current approach isn't
+ * markedly slower, since it works particularly well with the "perfect
+ * penalty" optimization (there are fewer, more expensive calls here).  It
+ * also works with INCLUDE indexes (indexes with non-key attributes) without
+ * any special effort.
+ *
+ * This can return a number of attributes that is one greater than the number
+ * of key attributes for the index relation.  This indicates that the caller
+ * must use a heap TID as a unique-ifier in the new pivot tuple.
+ */
+int
+_bt_leave_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
+{
+    TupleDesc   itupdesc = RelationGetDescr(rel);
+    int         keysz = IndexRelationGetNumberOfKeyAttributes(rel);
+    int         leavenatts;
+
+    /*
+     * Using authoritative comparisons makes no difference in almost all
+     * cases.  However, there are a small number of shipped opclasses where
+     * there might occasionally be an inconsistency between the answers given
+     * by this function and _bt_leave_natts().  This includes numeric_ops,
+     * since display scale might vary among logically equal datums.
+     * Case-insensitive collations may also be interesting.
+     *
+     * This is assumed to be okay, since there is no risk that inequality
+     * will look like equality.  Suffix truncation may be less effective than
+     * it could be in these narrow cases, but it should be impossible for the
+     * caller to spuriously perform a second pass to find a split location,
+     * where evenly splitting the page is given secondary importance.
+     */
+#ifdef AUTHORITATIVE_COMPARE_TEST
+    return _bt_leave_natts(rel, lastleft, firstright);
+#endif
+
+    leavenatts = 1;
+    for (int attnum = 1; attnum <= keysz; attnum++)
+    {
+        Datum       datum1,
+                    datum2;
+        bool        isNull1,
+                    isNull2;
+        Form_pg_attribute att;
+
+        datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
+        datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+        att = TupleDescAttr(itupdesc, attnum - 1);
+
+        if (isNull1 != isNull2)
+            break;
+
+        if (!isNull1 &&
+            !datumIsEqual(datum1, datum2, att->attbyval, att->attlen))
+            break;
+
+        leavenatts++;
+    }
+
+    return leavenatts;
+}
+
 /*
  * _bt_check_natts() -- Verify tuple has expected number of attributes.
  *
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 30340e9c02..995fb8cc8d 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -144,11 +144,15 @@ typedef struct BTMetaPageData
  * For pages above the leaf level, we use a fixed 70% fillfactor.
  * The fillfactor is applied during index build and when splitting
  * a rightmost page; when splitting non-rightmost pages we try to
- * divide the data equally.
+ * divide the data equally.  When splitting a page that's entirely
+ * filled with a single value (duplicates), the leaf-page
+ * fillfactor is overridden, and is applied regardless of whether
+ * the page is a rightmost page.
  */
 #define BTREE_MIN_FILLFACTOR		10
 #define BTREE_DEFAULT_FILLFACTOR	90
 #define BTREE_NONLEAF_FILLFACTOR	70
+#define BTREE_SINGLEVAL_FILLFACTOR	99
 
 /*
  * In general, the btree code tries to localize its knowledge about
@@ -706,6 +710,8 @@ extern bool btproperty(Oid index_oid, int attno, bool *res, bool *isnull);
 extern IndexTuple _bt_truncate(Relation rel, IndexTuple lastleft,
 			   IndexTuple firstright, bool build);
+extern int	_bt_leave_natts_fast(Relation rel, IndexTuple lastleft,
+				 IndexTuple firstright);
 extern bool _bt_check_natts(Relation rel, Page page, OffsetNumber offnum);
 extern void _bt_check_third_page(Relation rel, Relation heap,
 				 Page page, IndexTuple newtup);
-- 
2.17.1
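The bitwise "fast penalty" idea behind _bt_leave_natts_fast can be illustrated outside the patch. The sketch below is hypothetical scaffolding, not PostgreSQL code: KEYSZ, FakeTuple, and leave_natts_fast are invented names, and tuples are modeled as fixed-width by-value attributes. It shows the core loop shape: walk the key attributes left to right, stop at the first bitwise difference, and report that attribute number; a result of KEYSZ + 1 corresponds to the case where suffix truncation would have to append a heap TID tiebreaker.

```c
#include <assert.h>
#include <string.h>

#define KEYSZ 3                 /* assumed number of key attributes */

/* Hypothetical stand-in for an index tuple: KEYSZ by-value attributes */
typedef struct FakeTuple
{
    int         attrs[KEYSZ];
} FakeTuple;

/*
 * Return the number of the first attribute that distinguishes lastleft from
 * firstright, comparing bitwise (roughly what datumIsEqual() does for
 * by-value types).  A result of KEYSZ + 1 means every key attribute is
 * bitwise equal, so a heap TID tiebreaker would be needed.
 */
static int
leave_natts_fast(const FakeTuple *lastleft, const FakeTuple *firstright)
{
    int         leavenatts = 1;

    for (int attnum = 1; attnum <= KEYSZ; attnum++)
    {
        if (memcmp(&lastleft->attrs[attnum - 1],
                   &firstright->attrs[attnum - 1],
                   sizeof(int)) != 0)
            break;              /* first distinguishing attribute found */
        leavenatts++;
    }

    return leavenatts;
}
```

A candidate split point whose enclosing tuple pair yields a low value here is a good one, since more suffix attributes can then be truncated away from the new pivot tuple.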