Thread: Removing "long int"-related limit on hash table sizes

Removing "long int"-related limit on hash table sizes

From
Tom Lane
Date:
Per the discussion at [1], users on Windows are seeing nasty performance
losses in v13/v14 (compared to prior releases) for hash aggregations that
required somewhat more than 2GB in the prior releases.  That's because
they spill to disk where they did not before.  The easy answer of "raise
hash_mem_multiplier" doesn't help, because on Windows the product of
work_mem and hash_mem_multiplier is clamped to 2GB, thanks to the ancient
decision to do a lot of memory-space-related calculations in "long int",
which is only 32 bits on Win64.
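
(To see the arithmetic concretely, here is an illustrative sketch, not
code from the tree: work_mem and the pseudo-GUC hash_mem are ints
holding kilobytes, and the traditional byte-count idiom is

    hash_table_bytes = hash_mem * 1024L;

With a 32-bit "long", no product past 2^31-1 bytes is representable,
which is why guc.c caps work_mem at MAX_KILOBYTES (INT_MAX / 1024 on
such platforms), and why get_hash_mem() applied the same ~2GB cap to
the work_mem * hash_mem_multiplier product.)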

While I don't personally have the interest to fix that altogether,
it does seem like we've got a performance regression that we ought
to do something about immediately.  So I took a look at getting rid of
this restriction for calculations associated with hash_mem_multiplier,
and it doesn't seem to be too bad.  I propose the attached patch.
(This is against HEAD; there are minor conflicts in v13 and v14.)

A couple of notes:

* I did not change most of the comments referring to "hash_mem",
even though that's not really a thing anymore.  They seem readable
enough anyway, and I failed to think of a reasonably-short substitute.

* We should drop get_hash_mem() altogether in HEAD and maybe v14.
I figure we'd better leave it available in v13, though, in case
any outside code is using it.
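
(For such outside code, the migration is mechanical.  A sketch of the
before and after, not part of the patch itself:

    /* old style: kB-valued int, clamped so value * 1024 fits in a long */
    long        space = get_hash_mem() * 1024L;

    /* new style: byte-valued size_t, clamped only to SIZE_MAX */
    size_t      space = get_hash_memory_limit();
)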

Comments?

            regards, tom lane

[1]
https://www.postgresql.org/message-id/flat/MN2PR15MB25601E80A9B6D1BA6F592B1985E39%40MN2PR15MB2560.namprd15.prod.outlook.com

diff --git a/src/backend/executor/execGrouping.c b/src/backend/executor/execGrouping.c
index 5fd0b26cbc..c11427a1f6 100644
--- a/src/backend/executor/execGrouping.c
+++ b/src/backend/executor/execGrouping.c
@@ -165,14 +165,16 @@ BuildTupleHashTableExt(PlanState *parent,
 {
     TupleHashTable hashtable;
     Size        entrysize = sizeof(TupleHashEntryData) + additionalsize;
-    int            hash_mem = get_hash_mem();
+    Size        hash_mem_limit;
     MemoryContext oldcontext;
     bool        allow_jit;

     Assert(nbuckets > 0);

     /* Limit initial table size request to not more than hash_mem */
-    nbuckets = Min(nbuckets, (long) ((hash_mem * 1024L) / entrysize));
+    hash_mem_limit = get_hash_memory_limit() / entrysize;
+    if (nbuckets > hash_mem_limit)
+        nbuckets = hash_mem_limit;

     oldcontext = MemoryContextSwitchTo(metacxt);

diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 914b02ceee..39bea204d1 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -1802,15 +1802,15 @@ hash_agg_set_limits(double hashentrysize, double input_groups, int used_bits,
 {
     int            npartitions;
     Size        partition_mem;
-    int            hash_mem = get_hash_mem();
+    Size        hash_mem_limit = get_hash_memory_limit();

     /* if not expected to spill, use all of hash_mem */
-    if (input_groups * hashentrysize < hash_mem * 1024L)
+    if (input_groups * hashentrysize <= hash_mem_limit)
     {
         if (num_partitions != NULL)
             *num_partitions = 0;
-        *mem_limit = hash_mem * 1024L;
-        *ngroups_limit = *mem_limit / hashentrysize;
+        *mem_limit = hash_mem_limit;
+        *ngroups_limit = hash_mem_limit / hashentrysize;
         return;
     }

@@ -1835,10 +1835,10 @@ hash_agg_set_limits(double hashentrysize, double input_groups, int used_bits,
      * minimum number of partitions, so we aren't going to dramatically exceed
      * work mem anyway.
      */
-    if (hash_mem * 1024L > 4 * partition_mem)
-        *mem_limit = hash_mem * 1024L - partition_mem;
+    if (hash_mem_limit > 4 * partition_mem)
+        *mem_limit = hash_mem_limit - partition_mem;
     else
-        *mem_limit = hash_mem * 1024L * 0.75;
+        *mem_limit = hash_mem_limit * 0.75;

     if (*mem_limit > hashentrysize)
         *ngroups_limit = *mem_limit / hashentrysize;
@@ -1992,32 +1992,36 @@ static int
 hash_choose_num_partitions(double input_groups, double hashentrysize,
                            int used_bits, int *log2_npartitions)
 {
-    Size        mem_wanted;
-    int            partition_limit;
+    Size        hash_mem_limit = get_hash_memory_limit();
+    double        partition_limit;
+    double        mem_wanted;
+    double        dpartitions;
     int            npartitions;
     int            partition_bits;
-    int            hash_mem = get_hash_mem();

     /*
      * Avoid creating so many partitions that the memory requirements of the
      * open partition files are greater than 1/4 of hash_mem.
      */
     partition_limit =
-        (hash_mem * 1024L * 0.25 - HASHAGG_READ_BUFFER_SIZE) /
+        (hash_mem_limit * 0.25 - HASHAGG_READ_BUFFER_SIZE) /
         HASHAGG_WRITE_BUFFER_SIZE;

     mem_wanted = HASHAGG_PARTITION_FACTOR * input_groups * hashentrysize;

     /* make enough partitions so that each one is likely to fit in memory */
-    npartitions = 1 + (mem_wanted / (hash_mem * 1024L));
+    dpartitions = 1 + (mem_wanted / hash_mem_limit);
+
+    if (dpartitions > partition_limit)
+        dpartitions = partition_limit;

-    if (npartitions > partition_limit)
-        npartitions = partition_limit;
+    if (dpartitions < HASHAGG_MIN_PARTITIONS)
+        dpartitions = HASHAGG_MIN_PARTITIONS;
+    if (dpartitions > HASHAGG_MAX_PARTITIONS)
+        dpartitions = HASHAGG_MAX_PARTITIONS;

-    if (npartitions < HASHAGG_MIN_PARTITIONS)
-        npartitions = HASHAGG_MIN_PARTITIONS;
-    if (npartitions > HASHAGG_MAX_PARTITIONS)
-        npartitions = HASHAGG_MAX_PARTITIONS;
+    /* HASHAGG_MAX_PARTITIONS limit makes this safe */
+    npartitions = (int) dpartitions;

     /* ceil(log2(npartitions)) */
     partition_bits = my_log2(npartitions);
@@ -2030,7 +2034,7 @@ hash_choose_num_partitions(double input_groups, double hashentrysize,
         *log2_npartitions = partition_bits;

     /* number of partitions will be a power of two */
-    npartitions = 1L << partition_bits;
+    npartitions = 1 << partition_bits;

     return npartitions;
 }
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index c5f2d1d22b..bc4e36c555 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -675,15 +675,12 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 {
     int            tupsize;
     double        inner_rel_bytes;
-    long        bucket_bytes;
-    long        hash_table_bytes;
-    long        skew_table_bytes;
-    long        max_pointers;
-    long        mppow2;
+    size_t        hash_table_bytes;
+    size_t        bucket_bytes;
+    size_t        max_pointers;
     int            nbatch = 1;
     int            nbuckets;
     double        dbuckets;
-    int            hash_mem = get_hash_mem();

     /* Force a plausible relation size if no info */
     if (ntuples <= 0.0)
@@ -700,9 +697,9 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
     inner_rel_bytes = ntuples * tupsize;

     /*
-     * Target in-memory hashtable size is hash_mem kilobytes.
+     * Compute in-memory hashtable size limit from GUCs.
      */
-    hash_table_bytes = hash_mem * 1024L;
+    hash_table_bytes = get_hash_memory_limit();

     /*
      * Parallel Hash tries to use the combined hash_mem of all workers to
@@ -710,7 +707,14 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
      * per worker and tries to process batches in parallel.
      */
     if (try_combined_hash_mem)
-        hash_table_bytes += hash_table_bytes * parallel_workers;
+    {
+        /* Careful, this could overflow size_t */
+        double        newlimit;
+
+        newlimit = (double) hash_table_bytes * (double) (parallel_workers + 1);
+        newlimit = Min(newlimit, (double) SIZE_MAX);
+        hash_table_bytes = (size_t) newlimit;
+    }

     *space_allowed = hash_table_bytes;

@@ -730,9 +734,12 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
      */
     if (useskew)
     {
-        skew_table_bytes = hash_table_bytes * SKEW_HASH_MEM_PERCENT / 100;
+        size_t        bytes_per_mcv;
+        size_t        skew_mcvs;

         /*----------
+         * Compute number of MCVs we could hold in hash_table_bytes
+         *
          * Divisor is:
          * size of a hash tuple +
          * worst-case size of skewBucket[] per MCV +
@@ -740,12 +747,26 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
          * size of skew bucket struct itself
          *----------
          */
-        *num_skew_mcvs = skew_table_bytes / (tupsize +
-                                             (8 * sizeof(HashSkewBucket *)) +
-                                             sizeof(int) +
-                                             SKEW_BUCKET_OVERHEAD);
-        if (*num_skew_mcvs > 0)
-            hash_table_bytes -= skew_table_bytes;
+        bytes_per_mcv = tupsize +
+            (8 * sizeof(HashSkewBucket *)) +
+            sizeof(int) +
+            SKEW_BUCKET_OVERHEAD;
+        skew_mcvs = hash_table_bytes / bytes_per_mcv;
+
+        /*
+         * Now scale by SKEW_HASH_MEM_PERCENT (we do it in this order so as
+         * not to worry about size_t overflow in the multiplication)
+         */
+        skew_mcvs = skew_mcvs * SKEW_HASH_MEM_PERCENT / 100;
+
+        /* Now clamp to integer range */
+        skew_mcvs = Min(skew_mcvs, INT_MAX);
+
+        *num_skew_mcvs = (int) skew_mcvs;
+
+        /* Reduce hash_table_bytes by the amount needed for the skew table */
+        if (skew_mcvs > 0)
+            hash_table_bytes -= skew_mcvs * bytes_per_mcv;
     }
     else
         *num_skew_mcvs = 0;
@@ -753,22 +774,20 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
     /*
      * Set nbuckets to achieve an average bucket load of NTUP_PER_BUCKET when
      * memory is filled, assuming a single batch; but limit the value so that
-     * the pointer arrays we'll try to allocate do not exceed hash_mem nor
-     * MaxAllocSize.
+     * the pointer arrays we'll try to allocate do not exceed hash_table_bytes
+     * nor MaxAllocSize.
      *
      * Note that both nbuckets and nbatch must be powers of 2 to make
      * ExecHashGetBucketAndBatch fast.
      */
-    max_pointers = *space_allowed / sizeof(HashJoinTuple);
+    max_pointers = hash_table_bytes / sizeof(HashJoinTuple);
     max_pointers = Min(max_pointers, MaxAllocSize / sizeof(HashJoinTuple));
     /* If max_pointers isn't a power of 2, must round it down to one */
-    mppow2 = 1L << my_log2(max_pointers);
-    if (max_pointers != mppow2)
-        max_pointers = mppow2 / 2;
+    max_pointers = pg_prevpower2_size_t(max_pointers);

     /* Also ensure we avoid integer overflow in nbatch and nbuckets */
     /* (this step is redundant given the current value of MaxAllocSize) */
-    max_pointers = Min(max_pointers, INT_MAX / 2);
+    max_pointers = Min(max_pointers, INT_MAX / 2 + 1);

     dbuckets = ceil(ntuples / NTUP_PER_BUCKET);
     dbuckets = Min(dbuckets, max_pointers);
@@ -776,7 +795,7 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
     /* don't let nbuckets be really small, though ... */
     nbuckets = Max(nbuckets, 1024);
     /* ... and force it to be a power of 2. */
-    nbuckets = 1 << my_log2(nbuckets);
+    nbuckets = pg_nextpower2_32(nbuckets);

     /*
      * If there's not enough space to store the projected number of tuples and
@@ -786,10 +805,10 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
     if (inner_rel_bytes + bucket_bytes > hash_table_bytes)
     {
         /* We'll need multiple batches */
-        long        lbuckets;
+        size_t        sbuckets;
         double        dbatch;
         int            minbatch;
-        long        bucket_size;
+        size_t        bucket_size;

         /*
          * If Parallel Hash with combined hash_mem would still need multiple
@@ -813,10 +832,10 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
          * overhead for the hash code, pointer to the next tuple, etc.
          */
         bucket_size = (tupsize * NTUP_PER_BUCKET + sizeof(HashJoinTuple));
-        lbuckets = 1L << my_log2(hash_table_bytes / bucket_size);
-        lbuckets = Min(lbuckets, max_pointers);
-        nbuckets = (int) lbuckets;
-        nbuckets = 1 << my_log2(nbuckets);
+        sbuckets = pg_nextpower2_size_t(hash_table_bytes / bucket_size);
+        sbuckets = Min(sbuckets, max_pointers);
+        nbuckets = (int) sbuckets;
+        nbuckets = pg_nextpower2_32(nbuckets);
         bucket_bytes = nbuckets * sizeof(HashJoinTuple);

         /*
@@ -1097,14 +1116,12 @@ ExecParallelHashIncreaseNumBatches(HashJoinTable hashtable)
                 /* Figure out how many batches to use. */
                 if (hashtable->nbatch == 1)
                 {
-                    int            hash_mem = get_hash_mem();
-
                     /*
                      * We are going from single-batch to multi-batch.  We need
                      * to switch from one large combined memory budget to the
                      * regular hash_mem budget.
                      */
-                    pstate->space_allowed = hash_mem * 1024L;
+                    pstate->space_allowed = get_hash_memory_limit();

                     /*
                      * The combined hash_mem of all participants wasn't
@@ -1113,7 +1130,7 @@ ExecParallelHashIncreaseNumBatches(HashJoinTable hashtable)
                      * insufficient.  So try two batches per participant,
                      * rounded up to a power of two.
                      */
-                    new_nbatch = 1 << my_log2(pstate->nparticipants * 2);
+                    new_nbatch = pg_nextpower2_32(pstate->nparticipants * 2);
                 }
                 else
                 {
@@ -1152,7 +1169,7 @@ ExecParallelHashIncreaseNumBatches(HashJoinTable hashtable)
                                    MaxAllocSize / sizeof(dsa_pointer_atomic));
                     new_nbuckets = (int) dbuckets;
                     new_nbuckets = Max(new_nbuckets, 1024);
-                    new_nbuckets = 1 << my_log2(new_nbuckets);
+                    new_nbuckets = pg_nextpower2_32(new_nbuckets);
                     dsa_free(hashtable->area, old_batch0->buckets);
                     hashtable->batches[0].shared->buckets =
                         dsa_allocate(hashtable->area,
@@ -3372,39 +3389,46 @@ ExecParallelHashTuplePrealloc(HashJoinTable hashtable, int batchno, size_t size)
 }

 /*
- * Get a hash_mem value by multiplying the work_mem GUC's value by the
- * hash_mem_multiplier GUC's value.
- *
- * Returns a work_mem style KB value that hash-based nodes (including but not
- * limited to hash join) use in place of work_mem.  This is subject to the
- * same restrictions as work_mem itself.  (There is no such thing as the
- * hash_mem GUC, but it's convenient for our callers to pretend that there
- * is.)
+ * Calculate the limit on how much memory can be used by Hash and similar
+ * plan types.  This is work_mem times hash_mem_multiplier, and is
+ * expressed in bytes.
  *
- * Exported for use by the planner, as well as other hash-based executor
+ * Exported for use by the planner, as well as other hash-like executor
  * nodes.  This is a rather random place for this, but there is no better
  * place.
  */
+size_t
+get_hash_memory_limit(void)
+{
+    double        mem_limit;
+
+    /* Do initial calculation in double arithmetic */
+    mem_limit = (double) work_mem * hash_mem_multiplier * 1024.0;
+
+    /* Clamp in case it doesn't fit in size_t */
+    mem_limit = Min(mem_limit, (double) SIZE_MAX);
+
+    return (size_t) mem_limit;
+}
+
+/*
+ * Convert the hash memory limit to an integer number of kilobytes,
+ * that is something comparable to work_mem.  Like work_mem, we clamp
+ * the result to ensure that multiplying it by 1024 fits in a long int.
+ *
+ * This is deprecated since it may understate the actual memory limit.
+ * It is unused in core and will eventually be removed.
+ */
 int
 get_hash_mem(void)
 {
-    double        hash_mem;
-
-    Assert(hash_mem_multiplier >= 1.0);
+    size_t        mem_limit = get_hash_memory_limit();

-    hash_mem = (double) work_mem * hash_mem_multiplier;
+    /* Remove the kilobyte factor */
+    mem_limit /= 1024;

-    /*
-     * guc.c enforces a MAX_KILOBYTES limitation on work_mem in order to
-     * support the assumption that raw derived byte values can be stored in
-     * 'long' variables.  The returned hash_mem value must also meet this
-     * assumption.
-     *
-     * We clamp the final value rather than throw an error because it should
-     * be possible to set work_mem and hash_mem_multiplier independently.
-     */
-    if (hash_mem < MAX_KILOBYTES)
-        return (int) hash_mem;
+    /* Clamp to MAX_KILOBYTES, like work_mem */
+    mem_limit = Min(mem_limit, (size_t) MAX_KILOBYTES);

-    return MAX_KILOBYTES;
+    return (int) mem_limit;
 }
diff --git a/src/backend/executor/nodeMemoize.c b/src/backend/executor/nodeMemoize.c
index 2fde4ebce6..bec588b3a0 100644
--- a/src/backend/executor/nodeMemoize.c
+++ b/src/backend/executor/nodeMemoize.c
@@ -905,7 +905,7 @@ ExecInitMemoize(Memoize *node, EState *estate, int eflags)
     mstate->mem_used = 0;

     /* Limit the total memory consumed by the cache to this */
-    mstate->mem_limit = get_hash_mem() * 1024L;
+    mstate->mem_limit = get_hash_memory_limit();

     /* A memory context dedicated for the cache */
     mstate->tableContext = AllocSetContextCreate(CurrentMemoryContext,
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index b54cf34a8e..30c8595f76 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -2438,7 +2438,7 @@ cost_memoize_rescan(PlannerInfo *root, MemoizePath *mpath,
     Cost        total_cost;

     /* available cache space */
-    hash_mem_bytes = get_hash_mem() * 1024L;
+    hash_mem_bytes = get_hash_memory_limit();

     /*
      * Set the number of bytes each cache entry should consume in the cache.
@@ -3860,7 +3860,6 @@ final_cost_hashjoin(PlannerInfo *root, HashPath *path,
     Cost        run_cost = workspace->run_cost;
     int            numbuckets = workspace->numbuckets;
     int            numbatches = workspace->numbatches;
-    int            hash_mem;
     Cost        cpu_per_tuple;
     QualCost    hash_qual_cost;
     QualCost    qp_qual_cost;
@@ -3986,10 +3985,8 @@ final_cost_hashjoin(PlannerInfo *root, HashPath *path,
      * that way, so it will be unable to drive the batch size below hash_mem
      * when this is true.)
      */
-    hash_mem = get_hash_mem();
     if (relation_byte_size(clamp_row_est(inner_path_rows * innermcvfreq),
-                           inner_path->pathtarget->width) >
-        (hash_mem * 1024L))
+                           inner_path->pathtarget->width) > get_hash_memory_limit())
         startup_cost += disable_cost;

     /*
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 1868c4eff4..86816ffe19 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -3668,7 +3668,7 @@ consider_groupingsets_paths(PlannerInfo *root,
                             double dNumGroups)
 {
     Query       *parse = root->parse;
-    int            hash_mem = get_hash_mem();
+    Size        hash_mem_limit = get_hash_memory_limit();

     /*
      * If we're not being offered sorted input, then only consider plans that
@@ -3734,7 +3734,7 @@ consider_groupingsets_paths(PlannerInfo *root,
          * with.  Override hash_mem in that case; otherwise, we'll rely on the
          * sorted-input case to generate usable mixed paths.
          */
-        if (hashsize > hash_mem * 1024L && gd->rollups)
+        if (hashsize > hash_mem_limit && gd->rollups)
             return;                /* nope, won't fit */

         /*
@@ -3853,7 +3853,7 @@ consider_groupingsets_paths(PlannerInfo *root,
     {
         List       *rollups = NIL;
         List       *hash_sets = list_copy(gd->unsortable_sets);
-        double        availspace = (hash_mem * 1024.0);
+        double        availspace = hash_mem_limit;
         ListCell   *lc;

         /*
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index b5a61f3933..c9f7a09d10 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -724,7 +724,6 @@ static bool
 subplan_is_hashable(Plan *plan)
 {
     double        subquery_size;
-    int            hash_mem = get_hash_mem();

     /*
      * The estimated size of the subquery result must fit in hash_mem. (Note:
@@ -734,7 +733,7 @@ subplan_is_hashable(Plan *plan)
      */
     subquery_size = plan->plan_rows *
         (MAXALIGN(plan->plan_width) + MAXALIGN(SizeofHeapTupleHeader));
-    if (subquery_size > hash_mem * 1024L)
+    if (subquery_size > get_hash_memory_limit())
         return false;

     return true;
@@ -749,7 +748,6 @@ static bool
 subpath_is_hashable(Path *path)
 {
     double        subquery_size;
-    int            hash_mem = get_hash_mem();

     /*
      * The estimated size of the subquery result must fit in hash_mem. (Note:
@@ -759,7 +757,7 @@ subpath_is_hashable(Path *path)
      */
     subquery_size = path->rows *
         (MAXALIGN(path->pathtarget->width) + MAXALIGN(SizeofHeapTupleHeader));
-    if (subquery_size > hash_mem * 1024L)
+    if (subquery_size > get_hash_memory_limit())
         return false;

     return true;
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index 037dfaacfd..e9256a2d4d 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -1019,7 +1019,7 @@ choose_hashed_setop(PlannerInfo *root, List *groupClauses,
                     const char *construct)
 {
     int            numGroupCols = list_length(groupClauses);
-    int            hash_mem = get_hash_mem();
+    Size        hash_mem_limit = get_hash_memory_limit();
     bool        can_sort;
     bool        can_hash;
     Size        hashentrysize;
@@ -1055,13 +1055,11 @@ choose_hashed_setop(PlannerInfo *root, List *groupClauses,
      */
     hashentrysize = MAXALIGN(input_path->pathtarget->width) + MAXALIGN(SizeofMinimalTupleHeader);

-    if (hashentrysize * dNumGroups > hash_mem * 1024L)
+    if (hashentrysize * dNumGroups > hash_mem_limit)
         return false;

     /*
-     * See if the estimated cost is no more than doing it the other way.  We
-     * deliberately give the hash case more memory when hash_mem exceeds
-     * standard work mem (i.e. when hash_mem_multiplier exceeds 1.0).
+     * See if the estimated cost is no more than doing it the other way.
      *
      * We need to consider input_plan + hashagg versus input_plan + sort +
      * group.  Note that the actual result plan might involve a SetOp or
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 0c94cbe767..41cbf328c4 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1794,9 +1794,8 @@ create_unique_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
          * planner.c).
          */
         int            hashentrysize = subpath->pathtarget->width + 64;
-        int            hash_mem = get_hash_mem();

-        if (hashentrysize * pathnode->path.rows > hash_mem * 1024L)
+        if (hashentrysize * pathnode->path.rows > get_hash_memory_limit())
         {
             /*
              * We should not try to hash.  Hack the SpecialJoinInfo to
diff --git a/src/backend/storage/ipc/shm_mq.c b/src/backend/storage/ipc/shm_mq.c
index 446f20df46..91a7093e03 100644
--- a/src/backend/storage/ipc/shm_mq.c
+++ b/src/backend/storage/ipc/shm_mq.c
@@ -727,11 +727,7 @@ shm_mq_receive(shm_mq_handle *mqh, Size *nbytesp, void **datap, bool nowait)
              * Increase size to the next power of 2 that's >= nbytes, but
              * limit to MaxAllocSize.
              */
-#if SIZEOF_SIZE_T == 4
-            newbuflen = pg_nextpower2_32(nbytes);
-#else
-            newbuflen = pg_nextpower2_64(nbytes);
-#endif
+            newbuflen = pg_nextpower2_size_t(nbytes);
             newbuflen = Min(newbuflen, MaxAllocSize);

             if (mqh->mqh_buffer != NULL)
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 4dc343cbc5..3f155ce4f8 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -486,6 +486,7 @@ extern bool BackupInProgress(void);
 extern void CancelBackup(void);

 /* in executor/nodeHash.c */
+extern size_t get_hash_memory_limit(void);
 extern int    get_hash_mem(void);

 #endif                            /* MISCADMIN_H */
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index f9b77ec278..086bd08132 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -137,7 +137,7 @@ pg_rightmost_one_pos64(uint64 word)

 /*
  * pg_nextpower2_32
- *        Returns the next highest power of 2 of 'num', or 'num', if it's
+ *        Returns the next higher power of 2 above 'num', or 'num' if it's
  *        already a power of 2.
  *
  * 'num' mustn't be 0 or be above PG_UINT32_MAX / 2 + 1.
@@ -160,7 +160,7 @@ pg_nextpower2_32(uint32 num)

 /*
  * pg_nextpower2_64
- *        Returns the next highest power of 2 of 'num', or 'num', if it's
+ *        Returns the next higher power of 2 above 'num', or 'num' if it's
  *        already a power of 2.
  *
  * 'num' mustn't be 0 or be above PG_UINT64_MAX / 2  + 1.
@@ -181,6 +181,52 @@ pg_nextpower2_64(uint64 num)
     return ((uint64) 1) << (pg_leftmost_one_pos64(num) + 1);
 }

+/*
+ * pg_nextpower2_size_t
+ *        Returns the next higher power of 2 above 'num', for a size_t input.
+ */
+#if SIZEOF_SIZE_T == 4
+#define pg_nextpower2_size_t(num) pg_nextpower2_32(num)
+#else
+#define pg_nextpower2_size_t(num) pg_nextpower2_64(num)
+#endif
+
+/*
+ * pg_prevpower2_32
+ *        Returns the next lower power of 2 below 'num', or 'num' if it's
+ *        already a power of 2.
+ *
+ * 'num' mustn't be 0.
+ */
+static inline uint32
+pg_prevpower2_32(uint32 num)
+{
+    return ((uint32) 1) << pg_leftmost_one_pos32(num);
+}
+
+/*
+ * pg_prevpower2_64
+ *        Returns the next lower power of 2 below 'num', or 'num' if it's
+ *        already a power of 2.
+ *
+ * 'num' mustn't be 0.
+ */
+static inline uint64
+pg_prevpower2_64(uint64 num)
+{
+    return ((uint64) 1) << pg_leftmost_one_pos64(num);
+}
+
+/*
+ * pg_prevpower2_size_t
+ *        Returns the next lower power of 2 below 'num', for a size_t input.
+ */
+#if SIZEOF_SIZE_T == 4
+#define pg_prevpower2_size_t(num) pg_prevpower2_32(num)
+#else
+#define pg_prevpower2_size_t(num) pg_prevpower2_64(num)
+#endif
+
 /*
  * pg_ceil_log2_32
  *        Returns equivalent of ceil(log2(num))

Re: Removing "long int"-related limit on hash table sizes

From
Andres Freund
Date:
Hi,

On 2021-07-23 17:15:24 -0400, Tom Lane wrote:
> Per the discussion at [1], users on Windows are seeing nasty performance
> losses in v13/v14 (compared to prior releases) for hash aggregations that
> required somewhat more than 2GB in the prior releases.

Ugh :(.


> That's because they spill to disk where they did not before.  The easy
> answer of "raise hash_mem_multiplier" doesn't help, because on Windows
> the product of work_mem and hash_mem_multiplier is clamped to 2GB,
> thanks to the ancient decision to do a lot of memory-space-related
> calculations in "long int", which is only 32 bits on Win64.

We really ought to just remove every single use of long.  As Thomas
quipped on Twitter at some point, "long is the asbestos of C".  I think
we've incurred far more cost from weird workarounds for the difference
in long width between Windows and everything else than removing all use
of it outright would incur.

And perhaps once we've done that, we should experiment with putting
__attribute__((deprecated)) on long, but conditionalize it so it's only
used for building PG internal stuff, and doesn't leak into pg_config
output.  Perhaps it'll be too painful due to external headers, but it
seems worth trying.
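
(A rough, untested sketch of what that experiment could look like;
PG_FORBID_LONG is a made-up guard symbol here, and this leans on the
assumption that GCC's poison pragma, which operates on preprocessor
tokens, catches the keyword too:

    /* hypothetical addition to c.h, kept out of installed headers */
    #if defined(PG_FORBID_LONG) && defined(__GNUC__)
    #pragma GCC poison long
    #endif

The rub is that the poison fires on every later occurrence of the
token, including inside any system headers included afterwards.)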

But obviously that doesn't help with the issue in the release branches.


> While I don't personally have the interest to fix that altogether,
> it does seem like we've got a performance regression that we ought
> to do something about immediately.  So I took a look at getting rid of
> this restriction for calculations associated with hash_mem_multiplier,
> and it doesn't seem to be too bad.  I propose the attached patch.
> (This is against HEAD; there are minor conflicts in v13 and v14.)

Hm. I wonder if we would avoid some overflow dangers on 32bit systems if
we made get_hash_memory_limit() and the relevant variables 64 bit,
rather than 32bit / size_t.  E.g.

> @@ -700,9 +697,9 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
>      inner_rel_bytes = ntuples * tupsize;
>
>      /*
> -     * Target in-memory hashtable size is hash_mem kilobytes.
> +     * Compute in-memory hashtable size limit from GUCs.
>       */
> -    hash_table_bytes = hash_mem * 1024L;
> +    hash_table_bytes = get_hash_memory_limit();
>
>      /*
>       * Parallel Hash tries to use the combined hash_mem of all workers to
> @@ -710,7 +707,14 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
>       * per worker and tries to process batches in parallel.
>       */
>      if (try_combined_hash_mem)
> -        hash_table_bytes += hash_table_bytes * parallel_workers;
> +    {
> +        /* Careful, this could overflow size_t */
> +        double        newlimit;
> +
> +        newlimit = (double) hash_table_bytes * (double) (parallel_workers + 1);
> +        newlimit = Min(newlimit, (double) SIZE_MAX);
> +        hash_table_bytes = (size_t) newlimit;
> +    }

Wouldn't need to be as careful, I think?



> @@ -740,12 +747,26 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
>           * size of skew bucket struct itself
>           *----------
>           */
> -        *num_skew_mcvs = skew_table_bytes / (tupsize +
> -                                             (8 * sizeof(HashSkewBucket *)) +
> -                                             sizeof(int) +
> -                                             SKEW_BUCKET_OVERHEAD);
> -        if (*num_skew_mcvs > 0)
> -            hash_table_bytes -= skew_table_bytes;
> +        bytes_per_mcv = tupsize +
> +            (8 * sizeof(HashSkewBucket *)) +
> +            sizeof(int) +
> +            SKEW_BUCKET_OVERHEAD;
> +        skew_mcvs = hash_table_bytes / bytes_per_mcv;
> +
> +        /*
> +         * Now scale by SKEW_HASH_MEM_PERCENT (we do it in this order so as
> +         * not to worry about size_t overflow in the multiplication)
> +         */
> +        skew_mcvs = skew_mcvs * SKEW_HASH_MEM_PERCENT / 100;

I always have to think about the evaluation order of things like this
(it's left to right for these), so I'd prefer parens around the
multiplication. I did wonder briefly whether the SKEW_HASH_MEM_PERCENT /
100 just evaluates to 0...
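
(For the record: binary * and / have equal precedence and associate
left to right, so with SKEW_HASH_MEM_PERCENT at its value of 2 the two
readings differ like this:

    skew_mcvs * 2 / 100        /* (skew_mcvs * 2) / 100, as intended */
    skew_mcvs * (2 / 100)      /* skew_mcvs * 0 == 0, integer division */
)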


Looks like a good idea to me.

Greetings,

Andres Freund



Re: Removing "long int"-related limit on hash table sizes

From
Bruce Momjian
Date:
On Sat, Jul 24, 2021 at 06:25:53PM -0700, Andres Freund wrote:
> > That's because they spill to disk where they did not before.  The easy
> > answer of "raise hash_mem_multiplier" doesn't help, because on Windows
> > the product of work_mem and hash_mem_multiplier is clamped to 2GB,
> > thanks to the ancient decision to do a lot of memory-space-related
> > calculations in "long int", which is only 32 bits on Win64.
> 
> We really ought to just remove every single use of long. As Thomas
> quipped on twitter at some point, "long is the asbestos of C". I think
> we've incurred far more cost due to weird workarounds to deal with the
> difference in long width between windows and everything else, than just
> removing all use of it outright would incur.

+1

As I understand it, making long of undetermined length was to allow
choosing a data type that _might_ be longer than int if the
compiler/OS/CPU was optimized for that, but at this point such
optimizations just don't seem to make sense, and we know every(?) CPU
supports long long, so why not go for something concrete?  Do we really
want our feature limits to be determined by whether we have an
optimized type longer than int?

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: Removing "long int"-related limit on hash table sizes

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2021-07-23 17:15:24 -0400, Tom Lane wrote:
>> That's because they spill to disk where they did not before.  The easy
>> answer of "raise hash_mem_multiplier" doesn't help, because on Windows
>> the product of work_mem and hash_mem_multiplier is clamped to 2GB,
>> thanks to the ancient decision to do a lot of memory-space-related
>> calculations in "long int", which is only 32 bits on Win64.

> We really ought to just remove every single use of long.

I have no objection to that as a long-term goal.  But I'm not volunteering
to do all the work, and in any case it wouldn't be a back-patchable fix.
I feel that we do need to do something about this performance regression
in v13.

> Hm. I wonder if we would avoid some overflow dangers on 32bit systems if
> we made get_hash_memory_limit() and the relevant variables 64 bit,
> rather than 32bit / size_t.  E.g.

No, I don't like that.  Using size_t for memory-size variables is good
discipline.  Moreover, I'm not convinced that even with 64-bit ints,
overflow would be impossible in all the places I fixed here.  They're
multiplying several potentially very large values (one of which
is a float).  I think this is just plain sloppy coding, independently
of which bit-width you choose to be sloppy in.
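
(For instance, hash_agg_set_limits computes input_groups * hashentrysize
with input_groups a double: a hypothetical estimate of 1e15 groups at
10kB per entry is 1e19, already beyond INT64_MAX at roughly 9.2e18, so
the guard must be a comparison done in double and clamped before any
conversion, whatever integer width is chosen.)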

>> +        skew_mcvs = skew_mcvs * SKEW_HASH_MEM_PERCENT / 100;

> I always have to think about the evaluation order of things like this
> (it's left to right for these), so I'd prefer parens around the
> multiplication. I did wonder briefly whether the SKEW_HASH_MEM_PERCENT /
> 100 just evaluates to 0...

OK, will do.  I see your point, because I'd sort of instinctively
wanted to write that as
        skew_mcvs *= SKEW_HASH_MEM_PERCENT / 100;
which of course would not work: with *= the right-hand side is
evaluated as a unit, so the integer division would have yielded zero.

Thanks for looking at the code.

            regards, tom lane



Re: Removing "long int"-related limit on hash table sizes

From
Ranier Vilela
Date:
On Sun, Jul 25, 2021 at 1:28 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2021-07-23 17:15:24 -0400, Tom Lane wrote:
> >> That's because they spill to disk where they did not before.  The easy
> >> answer of "raise hash_mem_multiplier" doesn't help, because on Windows
> >> the product of work_mem and hash_mem_multiplier is clamped to 2GB,
> >> thanks to the ancient decision to do a lot of memory-space-related
> >> calculations in "long int", which is only 32 bits on Win64.
>
> > We really ought to just remove every single use of long.
>
> I have no objection to that as a long-term goal.  But I'm not volunteering
> to do all the work, and in any case it wouldn't be a back-patchable fix.

I'm a volunteer, if you want to work together.
I think int64 is in most cases the counterpart of *long* on Windows.

regards,
Ranier Vilela

Re: Removing "long int"-related limit on hash table sizes

From
Tom Lane
Date:
Ranier Vilela <ranier.vf@gmail.com> writes:
> I think int64 is in most cases the counterpart of *long* on Windows.

I'm not particularly on board with s/long/int64/g as a universal
solution.  I think that most of these usages are concerned with
memory sizes and would be better off as "size_t".  We might need
int64 in places where we're concerned with sums of memory usage
across processes, or where the value needs to be allowed to be
negative.  So it'll take case-by-case analysis to do it right.

BTW, one aspect of this that I'm unsure how to tackle is the
common usage of "L" constants; in particular, "work_mem * 1024L"
is a really common idiom that we'll need to get rid of.  Not sure
that grep will be a useful aid for finding those.

            regards, tom lane



Re: Removing "long int"-related limit on hash table sizes

From
Ranier Vilela
Date:
On Sun, Jul 25, 2021 at 3:53 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Ranier Vilela <ranier.vf@gmail.com> writes:
> > I think int64 is in most cases the counterpart of *long* on Windows.
>
> I'm not particularly on board with s/long/int64/g as a universal
> solution.

Sure, not a universal solution, I mean a starting point.
When I look for a type that is signed and 8 bytes on Windows, I only see int64.

> I think that most of these usages are concerned with
> memory sizes and would be better off as "size_t".

Ok, but let's not forget that size_t is unsigned.

> We might need
> int64 in places where we're concerned with sums of memory usage
> across processes, or where the value needs to be allowed to be
> negative.  So it'll take case-by-case analysis to do it right.

Sure.

> BTW, one aspect of this that I'm unsure how to tackle is the
> common usage of "L" constants; in particular, "work_mem * 1024L"
> is a really common idiom that we'll need to get rid of.  Not sure
> that grep will be a useful aid for finding those.
I can see 30 matches in the head tree. (grep -d "1024L" *.c)

File backend\access\gin\ginfast.c:
        if (metadata->nPendingPages * GIN_PAGE_FREESIZE > cleanupSize * 1024L)
                         (accum.allocatedMemory >= workMemory * 1024L)))
Is that a good place to start?

Or something simpler?
src/backend/access/hash/hash.c has one *long*.

regards,
Ranier Vilela

Re: Removing "long int"-related limit on hash table sizes

From
Michael Paquier
Date:
On Sun, Jul 25, 2021 at 12:28:04PM -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
>> We really ought to just remove every single use of long.
>
> I have no objection to that as a long-term goal.  But I'm not volunteering
> to do all the work, and in any case it wouldn't be a back-patchable fix.
> I feel that we do need to do something about this performance regression
> in v13.

Another idea may be to be more aggressive in c.h?  A tweak there would
be dirtier than marking long as deprecated, but that would be less
invasive.  Any of that is not back-patchable, of course...

> No, I don't like that.  Using size_t for memory-size variables is good
> discipline.  Moreover, I'm not convinced that even with 64-bit ints,
> overflow would be impossible in all the places I fixed here.  They're
> multiplying several potentially very large values (one of which
> is a float).  I think this is just plain sloppy coding, independently
> of which bit-width you choose to be sloppy in.

Yeah, using size_t where adapted is usually a good idea.
--
Michael


Re: Removing "long int"-related limit on hash table sizes

From
Alvaro Herrera
Date:
On 2021-Jul-25, Ranier Vilela wrote:

> > BTW, one aspect of this that I'm unsure how to tackle is the
> > common usage of "L" constants; in particular, "work_mem * 1024L"
> > is a really common idiom that we'll need to get rid of.  Not sure
> > that grep will be a useful aid for finding those.
> >
> I can see 30 matches in the head tree. (grep -d "1024L" *.c)

git grep '[0-9]L\>' -- *.[chyl]
shows some more constants.

-- 
Álvaro Herrera           39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/



Re: Removing "long int"-related limit on hash table sizes

From
ilmari@ilmari.org (Dagfinn Ilmari Mannsåker)
Date:
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:

> On 2021-Jul-25, Ranier Vilela wrote:
>
>> > BTW, one aspect of this that I'm unsure how to tackle is the
>> > common usage of "L" constants; in particular, "work_mem * 1024L"
>> > is a really common idiom that we'll need to get rid of.  Not sure
>> > that grep will be a useful aid for finding those.
>> >
>> I can see 30 matches in the head tree. (grep -d "1024L" *.c)
>
> grep grep '[0-9]L\>' -- *.[chyl]
> shows some more constants.

git grep -Eiw '(0x[0-9a-f]+|[0-9]+)U?LL?' -- *.[chyl]

gives about a hundred more hits.

We also have the (U)INT64CONST() macros, which are about two-thirds
as common as the U?LL? suffixes.
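
For anyone unfamiliar, those macros exist so that a literal's width
never silently depends on "long", e.g.

    int64        val = INT64CONST(4294967296);    /* rather than 4294967296LL */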

- ilmari



Re: Removing "long int"-related limit on hash table sizes

From
Tom Lane
Date:
ilmari@ilmari.org (Dagfinn Ilmari Mannsåker) writes:
> We also have the (U)INT64CONST() macros, which are about two-thirds
> as common as the U?LL? suffixes.

Yeah.  Ideally we'd forbid direct use of the suffixes and insist
you go through those macros, but I don't know of any way that
we could enforce such a coding rule, short of grepping the tree
periodically.

            regards, tom lane



Re: Removing "long int"-related limit on hash table sizes

From
Alvaro Herrera
Date:
On 2021-Jul-26, Tom Lane wrote:

> ilmari@ilmari.org (Dagfinn Ilmari Mannsåker) writes:
> > We also have the (U)INT64CONST() macros, which are about two-thirds
> > as common as the U?LL? suffixes.
> 
> Yeah.  Ideally we'd forbid direct use of the suffixes and insist
> you go through those macros, but I don't know of any way that
> we could enforce such a coding rule, short of grepping the tree
> periodically.

IIRC we have one buildfarm member that warns us about perlcritic; maybe
this is just another setup of that sort.

(Personally I run the perlcritic check in my local commit-verifying
script before pushing.)

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"XML!" Exclaimed C++.  "What are you doing here? You're not a programming
language."
"Tell that to the people who use me," said XML.
https://burningbird.net/the-parable-of-the-languages/



Re: Removing "long int"-related limit on hash table sizes

From
Andres Freund
Date:
Hi,

On 2021-07-26 11:38:41 +0900, Michael Paquier wrote:
> On Sun, Jul 25, 2021 at 12:28:04PM -0400, Tom Lane wrote:
> > Andres Freund <andres@anarazel.de> writes:
> >> We really ought to just remove every single use of long.
> > 
> > I have no objection to that as a long-term goal.  But I'm not volunteering
> > to do all the work, and in any case it wouldn't be a back-patchable fix.
> > I feel that we do need to do something about this performance regression
> > in v13.
> 
> Another idea may be to be more aggressive in c.h?  A tweak there would
> be dirtier than marking long as deprecated, but that would be less
> invasive.  Any of that is not backpatchable, of course..

Hard to see how that could work - plenty system headers use long...

Greetings,

Andres Freund