Re: Optimize numeric multiplication for one and two base-NBASE digit multiplicands. - Mailing list pgsql-hackers
| From | Joel Jacobson |
|---|---|
| Subject | Re: Optimize numeric multiplication for one and two base-NBASE digit multiplicands. |
| Date | |
| Msg-id | 9ff90ff9-5ac1-4529-b2f4-f93b37734b9c@app.fastmail.com Whole thread Raw |
| In response to | Re: Optimize numeric multiplication for one and two base-NBASE digit multiplicands. (Dagfinn Ilmari Mannsåker <ilmari@ilmari.org>) |
| Responses |
Re: Optimize numeric multiplication for one and two base-NBASE digit multiplicands.
|
| List | pgsql-hackers |
On Mon, Jul 1, 2024, at 14:25, Dagfinn Ilmari Mannsåker wrote:
> div_var() also has an optimisation for 3- and 4-digit operands under
> HAVE_INT128 (added in commit 0aa38db56bf), would that make sense in
> mul_var() too?
I considered it, but it only gives a marginal speed-up on Intel Core i9-14900K,
and is actually slower on Apple M3 Max.
Not really sure why. Maybe the code I tried can be optimized further:
```
#ifdef HAVE_INT128
/*
* If var1 and var2 are up to four digits, their product will fit in
* an int128 can be computed directly, which is significantly faster.
*/
if (var2ndigits <= 4)
{
int128 product = 0;
switch (var1ndigits)
{
case 1:
product = var1digits[0];
break;
case 2:
product = var1digits[0] * NBASE + var1digits[1];
break;
case 3:
product = ((int128) var1digits[0] * NBASE + var1digits[1])
* NBASE + var1digits[2];
break;
case 4:
product = (((int128) var1digits[0] * NBASE + var1digits[1])
* NBASE + var1digits[2]) * NBASE + var1digits[3];
break;
}
switch (var2ndigits)
{
case 1:
product *= var2digits[0];
break;
case 2:
product *= var2digits[0] * NBASE + var2digits[1];
break;
case 3:
product = ((int128) var2digits[0] * NBASE + var2digits[1])
* NBASE + var2digits[2];
break;
case 4:
product = (((int128) var2digits[0] * NBASE + var2digits[1])
* NBASE + var2digits[2]) * NBASE + var2digits[3];
break;
}
alloc_var(result, res_ndigits);
res_digits = result->digits;
for (i = res_ndigits - 1; i >= 0; i--)
{
res_digits[i] = product % NBASE;
product /= NBASE;
}
Assert(product == 0);
/*
* Finally, round the result to the requested precision.
*/
result->weight = res_weight;
result->sign = res_sign;
/* Round to target rscale (and set result->dscale) */
round_var(result, rscale);
/* Strip leading and trailing zeroes */
strip_var(result);
return;
}
#endif
```
Benchmark 1, testing 2 ndigits * 2 ndigits:
SELECT
timeit.pretty_time(total_time_a / 1e6 / executions,3) AS execution_time_a,
timeit.pretty_time(total_time_b / 1e6 / executions,3) AS execution_time_b,
total_time_a::numeric/total_time_b AS performance_ratio
FROM timeit.cmp(
'numeric_mul',
'numeric_mul_patched',
input_values := ARRAY[
'11112222',
'33334444'
],
min_time := 1000000,
timeout := '10 s'
);
Apple M3 Max:
execution_time_a | execution_time_b | performance_ratio
------------------+------------------+--------------------
32.2 ns | 20.5 ns | 1.5700112246809388
(1 row)
Intel Core i9-14900K:
execution_time_a | execution_time_b | performance_ratio
------------------+------------------+--------------------
30.2 ns | 21.4 ns | 1.4113042510107371
(1 row)
So 57% and 41% faster.
Benchmark 2, testing 4 ndigits * 4 ndigits:
SELECT
timeit.pretty_time(total_time_a / 1e6 / executions,3) AS execution_time_a,
timeit.pretty_time(total_time_b / 1e6 / executions,3) AS execution_time_b,
total_time_a::numeric/total_time_b AS performance_ratio
FROM timeit.cmp(
'numeric_mul',
'numeric_mul_patched',
input_values := ARRAY[
'1111222233334444',
'5555666677778888'
],
min_time := 1000000,
timeout := '10 s'
);
Apple M3 Max:
execution_time_a | execution_time_b | performance_ratio
------------------+------------------+------------------------
41.9 ns | 51.3 ns | 0.81733655797170943614
(1 row)
Intel Core i9-14900K:
execution_time_a | execution_time_b | performance_ratio
------------------+------------------+--------------------
40 ns | 38 ns | 1.0515610914706320
(1 row)
So 18% slower on Apple M3 Max and just 5% faster on Intel Core i9-14900K.
/Joel
pgsql-hackers by date: