Re: Optimize numeric multiplication for one and two base-NBASE digit multiplicands. - Mailing list pgsql-hackers

From Joel Jacobson
Subject Re: Optimize numeric multiplication for one and two base-NBASE digit multiplicands.
Date
Msg-id 9ff90ff9-5ac1-4529-b2f4-f93b37734b9c@app.fastmail.com
Whole thread Raw
In response to Re: Optimize numeric multiplication for one and two base-NBASE digit multiplicands.  (Dagfinn Ilmari Mannsåker <ilmari@ilmari.org>)
Responses Re: Optimize numeric multiplication for one and two base-NBASE digit multiplicands.
List pgsql-hackers
On Mon, Jul 1, 2024, at 14:25, Dagfinn Ilmari Mannsåker wrote:
> div_var() also has an optimisation for 3- and 4-digit operands under
> HAVE_INT128 (added in commit 0aa38db56bf), would that make sense in
> mul_var() too?

I considered it, but it only gives a marginal speed-up on Intel Core i9-14900K,
and is actually slower on Apple M3 Max.
Not really sure why. Maybe the code I tried can be optimized further:

```
#ifdef HAVE_INT128
    /*
     * If var1 and var2 are up to four digits, their product will fit in
     * an int128 can be computed directly, which is significantly faster.
     */
    if (var2ndigits <= 4)
    {
        int128        product = 0;

        switch (var1ndigits)
        {
            case 1:
                product = var1digits[0];
                break;
            case 2:
                product = var1digits[0] * NBASE + var1digits[1];
                break;
            case 3:
                product = ((int128) var1digits[0] * NBASE + var1digits[1])
                        * NBASE + var1digits[2];
                break;
            case 4:
                product = (((int128) var1digits[0] * NBASE + var1digits[1])
                        * NBASE + var1digits[2]) * NBASE + var1digits[3];
                break;
        }

        switch (var2ndigits)
        {
            case 1:
                product *= var2digits[0];
                break;
            case 2:
                product *= var2digits[0] * NBASE + var2digits[1];
                break;
            case 3:
                product = ((int128) var2digits[0] * NBASE + var2digits[1])
                        * NBASE + var2digits[2];
                break;
            case 4:
                product = (((int128) var2digits[0] * NBASE + var2digits[1])
                        * NBASE + var2digits[2]) * NBASE + var2digits[3];
                break;
        }

        alloc_var(result, res_ndigits);
        res_digits = result->digits;
        for (i = res_ndigits - 1; i >= 0; i--)
        {
            res_digits[i] = product % NBASE;
            product /= NBASE;
        }
        Assert(product == 0);

        /*
         * Finally, round the result to the requested precision.
         */
        result->weight = res_weight;
        result->sign = res_sign;

        /* Round to target rscale (and set result->dscale) */
        round_var(result, rscale);

        /* Strip leading and trailing zeroes */
        strip_var(result);

        return;
    }
#endif
```

Benchmark 1, testing 2 ndigits * 2 ndigits:

SELECT
   timeit.pretty_time(total_time_a / 1e6 / executions,3) AS execution_time_a,
   timeit.pretty_time(total_time_b / 1e6 / executions,3) AS execution_time_b,
   total_time_a::numeric/total_time_b AS performance_ratio
FROM timeit.cmp(
   'numeric_mul',
   'numeric_mul_patched',
   input_values := ARRAY[
      '11112222',
      '33334444'
   ],
   min_time := 1000000,
   timeout := '10 s'
);

Apple M3 Max:

 execution_time_a | execution_time_b | performance_ratio
------------------+------------------+--------------------
 32.2 ns          | 20.5 ns          | 1.5700112246809388
(1 row)

Intel Core i9-14900K:

 execution_time_a | execution_time_b | performance_ratio
------------------+------------------+--------------------
 30.2 ns          | 21.4 ns          | 1.4113042510107371
(1 row)

So 57% and 41% faster.

Benchmark 2, testing 4 ndigits * 4 ndigits:

SELECT
   timeit.pretty_time(total_time_a / 1e6 / executions,3) AS execution_time_a,
   timeit.pretty_time(total_time_b / 1e6 / executions,3) AS execution_time_b,
   total_time_a::numeric/total_time_b AS performance_ratio
FROM timeit.cmp(
   'numeric_mul',
   'numeric_mul_patched',
   input_values := ARRAY[
      '1111222233334444',
      '5555666677778888'
   ],
   min_time := 1000000,
   timeout := '10 s'
);

Apple M3 Max:

 execution_time_a | execution_time_b |   performance_ratio
------------------+------------------+------------------------
 41.9 ns          | 51.3 ns          | 0.81733655797170943614
(1 row)

Intel Core i9-14900K:

 execution_time_a | execution_time_b | performance_ratio
------------------+------------------+--------------------
 40 ns            | 38 ns            | 1.0515610914706320
(1 row)

So 18% slower on Apple M3 Max and just 5% faster on Intel Core i9-14900K.

/Joel



pgsql-hackers by date:

Previous
From: Yugo NAGATA
Date:
Subject: psql: Add leakproof field to \dAo+ meta-command results
Next
From: Yugo NAGATA
Date:
Subject: Re: PATCH: Add query for operators unusable with RLS to documentation