Re: Optimize numeric multiplication for one and two base-NBASE digit multiplicands. - Mailing list pgsql-hackers

From Joel Jacobson
Subject Re: Optimize numeric multiplication for one and two base-NBASE digit multiplicands.
Date
Msg-id 58e5e7d2-7ad8-40b4-9b76-a5c3049346e5@app.fastmail.com
In response to Re: Optimize numeric multiplication for one and two base-NBASE digit multiplicands.  (Dean Rasheed <dean.a.rasheed@gmail.com>)
Responses Re: Optimize numeric multiplication for one and two base-NBASE digit multiplicands.
List pgsql-hackers
On Fri, Jul 5, 2024, at 17:41, Dean Rasheed wrote:
> On Fri, 5 Jul 2024 at 12:56, Joel Jacobson <joel@compiler.org> wrote:
>>
>> Interesting you got so bad bench results for v6-mul_var_int64.patch
>> for var1ndigits=4, that patch is actually the winner on AMD Ryzen 9 7950X3D.
>
> Interesting.

I remeasured just to be sure, and yes, it was the winner among the previous patches,
but the new v7 beats it.

>> On Intel Core i9-14900K the winner is v6-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch.
>
> That must be random noise, since
> v6-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch doesn't
> invoke mul_var_small() for 4-digit inputs.

Yes, something was off with the HEAD measurements for that one.
I remeasured and got almost identical results (as expected)
between HEAD and v6-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
for 4-digit inputs.

>> On Apple M3 Max, HEAD is the winner.
>
> Importantly, mul_var_int64() is around 1.25x slower there, and it was
> even worse on my machine.
>
> Attached is a v7 mul_var_small() patch adding 4-digit support. For me,
> this gives a nice speedup:
>
> SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4;
> Time: 5617.150 ms (00:05.617)  -- HEAD
> Time: 8203.081 ms (00:08.203)  -- v6-mul_var_int64.patch
> Time: 4750.212 ms (00:04.750)  -- v7-mul_var_small.patch
>
> The other advantage, of course, is that it doesn't require 128-bit
> integer support.

Very nice, v7-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
is now the winner on all my CPUs:

-- Apple M3 Max

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4; -- HEAD
Time: 3574.865 ms (00:03.575)
Time: 3573.678 ms (00:03.574)
Time: 3576.953 ms (00:03.577)
Time: 3580.536 ms (00:03.581)
Time: 3589.007 ms (00:03.589)

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4; -- v7-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
Time: 3110.171 ms (00:03.110)
Time: 3098.558 ms (00:03.099)
Time: 3105.873 ms (00:03.106)
Time: 3104.484 ms (00:03.104)
Time: 3109.035 ms (00:03.109)

-- Intel Core i9-14900K

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4; -- HEAD
Time: 3751.767 ms (00:03.752)
Time: 3745.916 ms (00:03.746)
Time: 3742.542 ms (00:03.743)
Time: 3746.139 ms (00:03.746)
Time: 3745.493 ms (00:03.745)

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4; -- v6-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
Time: 3747.640 ms (00:03.748)
Time: 3747.231 ms (00:03.747)
Time: 3747.965 ms (00:03.748)
Time: 3748.309 ms (00:03.748)
Time: 3746.498 ms (00:03.746)

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4; -- v7-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
Time: 3417.924 ms (00:03.418)
Time: 3417.088 ms (00:03.417)
Time: 3415.708 ms (00:03.416)
Time: 3415.453 ms (00:03.415)
Time: 3419.566 ms (00:03.420)

-- AMD Ryzen 9 7950X3D

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4; -- HEAD
Time: 3970.131 ms (00:03.970)
Time: 3924.335 ms (00:03.924)
Time: 3927.863 ms (00:03.928)
Time: 3924.761 ms (00:03.925)
Time: 3926.290 ms (00:03.926)

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4; -- v6-add-mul_var_int64.patch
Time: 3874.769 ms (00:03.875)
Time: 3858.071 ms (00:03.858)
Time: 3836.698 ms (00:03.837)
Time: 3871.388 ms (00:03.871)
Time: 3844.907 ms (00:03.845)

SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4; -- v7-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
Time: 3397.846 ms (00:03.398)
Time: 3398.050 ms (00:03.398)
Time: 3395.279 ms (00:03.395)
Time: 3393.285 ms (00:03.393)
Time: 3402.570 ms (00:03.403)
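
To condense the five runs per configuration above into one number per machine, a small helper along these lines could be used. It is only a sketch: `mean_ms` and `pct_faster` are names made up for illustration, and the timings have to be typed in by hand from the psql output.

```c
#include <stddef.h>

/* Arithmetic mean of n timings, in milliseconds. */
static double
mean_ms(const double *runs, size_t n)
{
    double sum = 0.0;

    for (size_t i = 0; i < n; i++)
        sum += runs[i];
    return sum / (double) n;
}

/*
 * Percentage by which the mean patched time is lower than the mean
 * baseline time. Positive means the patch is faster.
 */
static double
pct_faster(const double *base, const double *patched, size_t n)
{
    return 100.0 * (1.0 - mean_ms(patched, n) / mean_ms(base, n));
}
```

Calling `pct_faster` with the HEAD runs as `base` and the v7 runs as `patched` gives the relative reduction in execution time for one machine.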

Code-wise, I think it's now very nice and clear, with just enough comments.

Also nice to see that the var1ndigits=4 case isn't much more complex
than var1ndigits=3, since it follows the same pattern.
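
For readers without the patch at hand, the pattern can be illustrated roughly as follows. This is a minimal standalone sketch, not the code from the patch: `mul_small_sketch` is a hypothetical name, digits are assumed to be stored most significant first in base NBASE = 10000, a generic loop stands in for whatever per-case code the patch uses, and sign, weight, and rounding handling are left out.

```c
#include <stdint.h>

#define NBASE 10000             /* assumed digit base, as in the subject line */
typedef int16_t NumericDigit;   /* one base-NBASE digit, 0..9999 */

/*
 * Hypothetical helper, not the patch's mul_var_small(): multiply var1
 * (n1 digits, 1 <= n1 <= 4) by var2 (n2 digits), most significant digit
 * first, writing exactly n1 + n2 digits to res.
 */
static void
mul_small_sketch(const NumericDigit *var1, int n1,
                 const NumericDigit *var2, int n2,
                 NumericDigit *res)
{
    uint32_t carry = 0;

    /* Walk the result digits from least to most significant. */
    for (int k = n1 + n2 - 1; k >= 0; k--)
    {
        uint32_t term = carry;

        /* res[k] collects var1[i] * var2[j] for all i + j + 1 == k. */
        for (int i = 0; i < n1; i++)
        {
            int j = k - 1 - i;

            if (j >= 0 && j < n2)
                term += (uint32_t) var1[i] * (uint32_t) var2[j];
        }

        res[k] = (NumericDigit) (term % NBASE);
        carry = term / NBASE;
    }
}
```

With at most four products of two digits below 10000 per result digit, plus a small carry, `term` stays well below 2^32, which is consistent with the point quoted above that no 128-bit integer support is needed. Going from three to four digits in var1 only adds one more product per result digit.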

Regards,
Joel


