Re: Removing "long int"-related limit on hash table sizes - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Removing "long int"-related limit on hash table sizes
Date
Msg-id 20210725012553.fntqmqnipikg3mnx@alap3.anarazel.de
In response to Removing "long int"-related limit on hash table sizes  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Removing "long int"-related limit on hash table sizes  (Bruce Momjian <bruce@momjian.us>)
Re: Removing "long int"-related limit on hash table sizes  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
Hi,

On 2021-07-23 17:15:24 -0400, Tom Lane wrote:
> Per the discussion at [1], users on Windows are seeing nasty performance
> losses in v13/v14 (compared to prior releases) for hash aggregations that
> required somewhat more than 2GB in the prior releases.

Ugh :(.


> That's because they spill to disk where they did not before.  The easy
> answer of "raise hash_mem_multiplier" doesn't help, because on Windows
> the product of work_mem and hash_mem_multiplier is clamped to 2GB,
> thanks to the ancient decision to do a lot of memory-space-related
> calculations in "long int", which is only 32 bits on Win64.

We really ought to just remove every single use of long. As Thomas
quipped on Twitter at some point, "long is the asbestos of C". I think
we've incurred far more cost from the weird workarounds for the
difference in long width between Windows and everything else than just
removing all use of it outright ever would.

And perhaps once we've done that, we should experiment with putting
__attribute__((deprecated)) on long, conditionalized so it's only
applied when building PG-internal stuff and doesn't leak into pg_config
output. Perhaps it'll be too painful due to external headers, but it
seems worth trying.
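
Something like the following rough sketch is what I have in mind
(untested; the guard macro and typedef name are made up, and combined
types like "unsigned long" or "long long" would still need special
handling, which may be exactly where it gets painful):

/*
 * Attach a deprecation warning to uses of "long" in our own translation
 * units.  The attribute can't be applied to the keyword itself, so route
 * it through a deprecated typedef and a macro.  PG_BUILDING_INTERNALS is
 * a stand-in for whatever guard keeps this out of installed headers and
 * pg_config output.
 */
#if defined(PG_BUILDING_INTERNALS) && defined(__GNUC__)
typedef long pg_long_do_not_use
            __attribute__((deprecated("use int64/size_t instead of long")));
#define long pg_long_do_not_use
#endif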

But obviously that doesn't help with the issue in the release branches.


> While I don't personally have the interest to fix that altogether,
> it does seem like we've got a performance regression that we ought
> to do something about immediately.  So I took a look at getting rid of
> this restriction for calculations associated with hash_mem_multiplier,
> and it doesn't seem to be too bad.  I propose the attached patch.
> (This is against HEAD; there are minor conflicts in v13 and v14.)

Hm. I wonder if we would avoid some overflow dangers on 32-bit systems
if we made get_hash_memory_limit() and the relevant variables 64-bit,
rather than 32-bit / size_t.  E.g.

> @@ -700,9 +697,9 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
>      inner_rel_bytes = ntuples * tupsize;
>
>      /*
> -     * Target in-memory hashtable size is hash_mem kilobytes.
> +     * Compute in-memory hashtable size limit from GUCs.
>       */
> -    hash_table_bytes = hash_mem * 1024L;
> +    hash_table_bytes = get_hash_memory_limit();
>
>      /*
>       * Parallel Hash tries to use the combined hash_mem of all workers to
> @@ -710,7 +707,14 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
>       * per worker and tries to process batches in parallel.
>       */
>      if (try_combined_hash_mem)
> -        hash_table_bytes += hash_table_bytes * parallel_workers;
> +    {
> +        /* Careful, this could overflow size_t */
> +        double        newlimit;
> +
> +        newlimit = (double) hash_table_bytes * (double) (parallel_workers + 1);
> +        newlimit = Min(newlimit, (double) SIZE_MAX);
> +        hash_table_bytes = (size_t) newlimit;
> +    }

Wouldn't need to be as careful, I think?
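
To sketch what I mean (untested, and the function name is made up; it
assumes the existing work_mem and hash_mem_multiplier GUCs):

static uint64
get_hash_memory_limit64(void)
{
	/*
	 * work_mem is at most INT_MAX kilobytes and hash_mem_multiplier at
	 * most 1000 (IIRC), so the product always fits comfortably in 64
	 * bits; no clamping against SIZE_MAX needed, even where size_t is
	 * only 32 bits wide.
	 */
	return (uint64) ((double) work_mem * hash_mem_multiplier * 1024.0);
}

With that, the parallel_workers multiplication above wouldn't need to
detour through double either, I think.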



> @@ -740,12 +747,26 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
>           * size of skew bucket struct itself
>           *----------
>           */
> -        *num_skew_mcvs = skew_table_bytes / (tupsize +
> -                                             (8 * sizeof(HashSkewBucket *)) +
> -                                             sizeof(int) +
> -                                             SKEW_BUCKET_OVERHEAD);
> -        if (*num_skew_mcvs > 0)
> -            hash_table_bytes -= skew_table_bytes;
> +        bytes_per_mcv = tupsize +
> +            (8 * sizeof(HashSkewBucket *)) +
> +            sizeof(int) +
> +            SKEW_BUCKET_OVERHEAD;
> +        skew_mcvs = hash_table_bytes / bytes_per_mcv;
> +
> +        /*
> +         * Now scale by SKEW_HASH_MEM_PERCENT (we do it in this order so as
> +         * not to worry about size_t overflow in the multiplication)
> +         */
> +        skew_mcvs = skew_mcvs * SKEW_HASH_MEM_PERCENT / 100;

I always have to think about the evaluation order of things like this
(it's left to right for these), so I'd prefer parens around the
multiplication. I did wonder briefly whether the SKEW_HASH_MEM_PERCENT /
100 just evaluates to 0...
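
I.e., with parens the intended grouping is unmistakable (assuming
SKEW_HASH_MEM_PERCENT stays a small integer constant, 2 IIRC):

	skew_mcvs = (skew_mcvs * SKEW_HASH_MEM_PERCENT) / 100;	/* intended */
	skew_mcvs = skew_mcvs * (SKEW_HASH_MEM_PERCENT / 100);	/* 2/100 == 0 */

The former is what the unparenthesized version already does, of course;
it just takes a moment's squinting to see it.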


Looks like a good idea to me.

Greetings,

Andres Freund


