Re: Enabling Checksums - Mailing list pgsql-hackers

From Ants Aasma
Subject Re: Enabling Checksums
Date
Msg-id CA+CSw_su1fopLNBz1NAfkSNw4_=gv+5pf0KdLQmpvuKW1Q4v+Q@mail.gmail.com
Whole thread Raw
In response to Re: Enabling Checksums  (Jeff Davis <pgsql@j-davis.com>)
Responses Re: Enabling Checksums  (Jeff Davis <pgsql@j-davis.com>)
Re: Enabling Checksums  (Jeff Davis <pgsql@j-davis.com>)
Re: Enabling Checksums  (Bruce Momjian <bruce@momjian.us>)
List pgsql-hackers
On Fri, Mar 22, 2013 at 3:04 AM, Jeff Davis <pgsql@j-davis.com> wrote:
> I've been following your analysis and testing, and it looks like there
> are still at least three viable approaches:
>
> 1. Some variant of Fletcher
> 2. Some variant of CRC32
> 3. Some SIMD-based checksum
>
> Each of those has some open implementation questions, as well. If we
> settle on one of those approaches, we don't necessarily need the fastest
> implementation right away. I might even argue that the first patch to be
> committed should be a simple implementation of whatever algorithm we
> choose, and then optimization should be done in a separate patch (if it
> is tricky to get right).

+1 on correct first, fast second.

> Of course, it's hard to settle on the general algorithm to use without
> knowing the final performance numbers. So right now I'm in somewhat of a
> holding pattern until we settle on something.

For performance the K8 results gave me confidence that we have a
reasonably good overview what the performance is like for the class of
CPU's that PostgreSQL is likely to run on. I don't think there is
anything left to optimize there, all algorithms are pretty close to
maximum theoretical performance. Still, benchmarks on AMD's Bulldozer
arch and maybe on some non-x86 machines (Power, Itanium, Sparc) would
be very welcome to ensure that I haven't missed anything.

To see real world performance numbers I dumped the algorithms on top
of the checksums patch. I set up postgres with 32MB shared buffers,
and ran with concurrency 4 select only pgbench and a worst case
workload, results are median of 5 1-minute runs. I used fletcher as it
was in the checksums patch without unrolling. Unrolling would cut the
performance hit by a third or so.

The worst case workload is set up using
CREATE TABLE sparse (id serial primary key, v text) WITH (fillfactor=10);
INSERT INTO sparse (v) SELECT REPEAT('x', 1000) FROM generate_series(1,100000);
VACUUM ANALYZE sparse;

The test query itself is a simple SELECT count(v) FROM sparse;

Results for the worst case workload:
No checksums:       tps = 14.710519
Fletcher checksums: tps = 10.825564 (1.359x slowdown)
CRC checksums:      tps =  5.844995 (2.517x slowdown)
SIMD checksums:     tps = 14.062388 (1.046x slowdown)

Results for pgbench scale 100:
No checksums:       tps = 56623.819783
Fletcher checksums: tps = 55282.222687 (1.024x slowdown)
CRC Checksums:      tps = 50571.324795 (1.120x slowdown)
SIMD Checksums:     tps = 56608.888985 (1.000x slowdown)

So to conclude, the 3 approaches:

CRC:
Time to checksum 8192 bytes:   12'000 - 16'000 cycles best case without special hardware    1'200 cycles with hardware
(newIntel only) 
Code size: 131 bytes
* Can calculate arbitrary number of bytes per invocation, state is 4
bytes. Implementation can be shared with WAL.
* Quite slow without hardware acceleration.
* Software implementation requires a 8kB table for calculation or it
will be even slower. Quite likely to fall out of cache.
* If we wish to use hardware acceleration then the polynomial should
be switched to Castagnoli. I think the old polynomial needs to stay as
the values seem to be stored in indexes by tsvector compression and
multibyte trigrams. (not 100% sure, just skimmed the code)
* Error detection of 32bit Castagnoli CRC is known to be good, the
effect of truncating to 16 bits is not analyzed yet.

Fletcher:
Time to checksum 8192 bytes:    2'600 cycles +- 100
Code size: 170 bytes unrolled
* Very simple implementation for optimal speed.
* Needs to calculate 4 bytes at a time, requires 8 bytes of state.
Implementation that can work for WAL would be tricky but not
impossible. Probably wouldn't share code.
* Should give good enough error detection with suitable choice for
final recombination.

SIMD Checksums:
Time to checksum 8192 bytes:      730 cycles for processors with 128bit SIMD units     1830 cycles for processors with
64bitSIMD units 
Code size: 436 bytes
* Requires vectorization, intrinsics or ASM for decent performance.
* Needs to calculate 128bytes at a time, requires 128 bytes of state.
Using for anything other than summing fixed size blocks looks tricky.
* Loosely based on Fowler-Noll-Vo and should have reasonably good
error detection capabilities.

Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de



pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: Strange Windows problem, lock_timeout test request
Next
From: Merlin Moncure
Date:
Subject: Re: JSON Function Bike Shedding