Re: Optimize numeric multiplication for one and two base-NBASE digit multiplicands. - Mailing list pgsql-hackers
From | Joel Jacobson |
---|---|
Subject | Re: Optimize numeric multiplication for one and two base-NBASE digit multiplicands. |
Date | |
Msg-id | 9ff90ff9-5ac1-4529-b2f4-f93b37734b9c@app.fastmail.com Whole thread Raw |
In response to | Re: Optimize numeric multiplication for one and two base-NBASE digit multiplicands. (Dagfinn Ilmari Mannsåker <ilmari@ilmari.org>) |
Responses |
Re: Optimize numeric multiplication for one and two base-NBASE digit multiplicands.
|
List | pgsql-hackers |
On Mon, Jul 1, 2024, at 14:25, Dagfinn Ilmari Mannsåker wrote: > div_var() also has an optimisation for 3- and 4-digit operands under > HAVE_INT128 (added in commit 0aa38db56bf), would that make sense in > mul_var() too? I considered it, but it only gives a marginal speed-up on Intel Core i9-14900K, and is actually slower on Apple M3 Max. Not really sure why. Maybe the code I tried can be optimized further: ``` #ifdef HAVE_INT128 /* * If var1 and var2 are up to four digits, their product will fit in * an int128 can be computed directly, which is significantly faster. */ if (var2ndigits <= 4) { int128 product = 0; switch (var1ndigits) { case 1: product = var1digits[0]; break; case 2: product = var1digits[0] * NBASE + var1digits[1]; break; case 3: product = ((int128) var1digits[0] * NBASE + var1digits[1]) * NBASE + var1digits[2]; break; case 4: product = (((int128) var1digits[0] * NBASE + var1digits[1]) * NBASE + var1digits[2]) * NBASE + var1digits[3]; break; } switch (var2ndigits) { case 1: product *= var2digits[0]; break; case 2: product *= var2digits[0] * NBASE + var2digits[1]; break; case 3: product = ((int128) var2digits[0] * NBASE + var2digits[1]) * NBASE + var2digits[2]; break; case 4: product = (((int128) var2digits[0] * NBASE + var2digits[1]) * NBASE + var2digits[2]) * NBASE + var2digits[3]; break; } alloc_var(result, res_ndigits); res_digits = result->digits; for (i = res_ndigits - 1; i >= 0; i--) { res_digits[i] = product % NBASE; product /= NBASE; } Assert(product == 0); /* * Finally, round the result to the requested precision. */ result->weight = res_weight; result->sign = res_sign; /* Round to target rscale (and set result->dscale) */ round_var(result, rscale); /* Strip leading and trailing zeroes */ strip_var(result); return; } #endif ``` Benchmark 1, testing 2 ndigits * 2 ndigits: SELECT timeit.pretty_time(total_time_a / 1e6 / executions,3) AS execution_time_a, timeit.pretty_time(total_time_b / 1e6 / executions,3) AS execution_time_b, total_time_a::numeric/total_time_b AS performance_ratio FROM timeit.cmp( 'numeric_mul', 'numeric_mul_patched', input_values := ARRAY[ '11112222', '33334444' ], min_time := 1000000, timeout := '10 s' ); Apple M3 Max: execution_time_a | execution_time_b | performance_ratio ------------------+------------------+-------------------- 32.2 ns | 20.5 ns | 1.5700112246809388 (1 row) Intel Core i9-14900K: execution_time_a | execution_time_b | performance_ratio ------------------+------------------+-------------------- 30.2 ns | 21.4 ns | 1.4113042510107371 (1 row) So 57% and 41% faster. Benchmark 2, testing 4 ndigits * 4 ndigits: SELECT timeit.pretty_time(total_time_a / 1e6 / executions,3) AS execution_time_a, timeit.pretty_time(total_time_b / 1e6 / executions,3) AS execution_time_b, total_time_a::numeric/total_time_b AS performance_ratio FROM timeit.cmp( 'numeric_mul', 'numeric_mul_patched', input_values := ARRAY[ '1111222233334444', '5555666677778888' ], min_time := 1000000, timeout := '10 s' ); Apple M3 Max: execution_time_a | execution_time_b | performance_ratio ------------------+------------------+------------------------ 41.9 ns | 51.3 ns | 0.81733655797170943614 (1 row) Intel Core i9-14900K: execution_time_a | execution_time_b | performance_ratio ------------------+------------------+-------------------- 40 ns | 38 ns | 1.0515610914706320 (1 row) So 18% slower on Apple M3 Max and just 5% faster on Intel Core i9-14900K. /Joel
pgsql-hackers by date: