Hi,
> Also, are you still seeing the same improvement with the __builtin_clz
> as your inline asm implementation?
In my benchmark program, the fls implementation and the inline asm
implementation perform a little differently.
However, in pgbench both give almost the same improvement.
Here are the results of my benchmark.
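For context, here is a minimal sketch of the kind of computation being compared, assuming the hot spot is the mapping from an allocation size to a free-list index used by AllocSetAlloc and AllocSetFree. This is not the patch itself: the names free_index_loop and free_index_clz and the value MINBITS = 3 are assumptions for illustration, and __builtin_clz is a GCC/Clang builtin rather than standard C.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define MINBITS 3   /* assumption: smallest chunk is 1 << 3 = 8 bytes */

/* Loop version: count the bits needed to hold (size - 1) >> MINBITS. */
static int free_index_loop(size_t size)
{
    int idx = 0;

    if (size > 0) {
        size = (size - 1) >> MINBITS;
        while (size != 0) {
            idx++;
            size >>= 1;
        }
    }
    return idx;
}

/*
 * Bit-scan version: same result without the loop.
 * Only valid for sizes that fit in an unsigned int.
 */
static int free_index_clz(size_t size)
{
    unsigned int t;

    if (size <= ((size_t) 1 << MINBITS))
        return 0;
    t = (unsigned int) ((size - 1) >> MINBITS);
    assert(t != 0);     /* __builtin_clz(0) is undefined */
    return (int) (sizeof(unsigned int) * CHAR_BIT) - __builtin_clz(t);
}
```

Both functions return the same index for every size from 0 to 8192; for example, a 9-byte request maps to index 1 in either version.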
Xeon(Core architecture)
bytes      :     4     8    16    32    64   128   256   512  1024   mix
original   : 0.780 0.790 0.820 0.870 0.930 0.980 1.040 1.080 1.140 0.910
inline asm : 0.320 0.180 0.190 0.180 0.190 0.180 0.190 0.180 0.190 0.170
fls        : 0.270 0.260 0.290 0.290 0.290 0.290 0.290 0.300 0.290 0.380
Xeon(P4 architecture)
bytes      :     4     8    16    32    64   128   256   512  1024   mix
original   : 0.520 0.520 0.670 0.780 0.950 1.000 1.060 1.190 1.250 0.940
inline asm : 0.610 0.530 0.530 0.520 0.520 0.540 0.540 0.580 0.540 0.600
fls        : 0.390 0.370 0.780 0.780 0.780 0.790 0.780 0.780 0.780 0.520
pgbench result (measured by oprofile)
CPU: Xeon(P4 architecture)
test program: pgbench -c 1 -t 50000 (fsync=off)
original
samples  %        symbol name
66854    6.6725   AllocSetAlloc
11817    1.1794   AllocSetFree

inline asm
samples  %        symbol name
47610    4.9333   AllocSetAlloc
6248     0.6474   AllocSetFree

fls
samples  %        symbol name
48779    4.9954   AllocSetAlloc
7648     0.7832   AllocSetFree
Best regards,
---
Atsushi Ogawa