Re: Enabling Checksums - Mailing list pgsql-hackers

From Ants Aasma
Subject Re: Enabling Checksums
Date
Msg-id CA+CSw_vGGpbomdXOnXdkPaAbxmF4RXJF4Mo3L-AkFi9ZHSUqhQ@mail.gmail.com
Whole thread Raw
In response to Re: Enabling Checksums  (Jeff Davis <pgsql@j-davis.com>)
Responses Re: Enabling Checksums  (Andres Freund <andres@2ndquadrant.com>)
List pgsql-hackers
On Fri, Mar 22, 2013 at 7:35 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> On Fri, 2013-03-22 at 17:09 +0200, Ants Aasma wrote:
>> For performance the K8 results gave me confidence that we have a
>> reasonably good overview what the performance is like for the class of
>> CPU's that PostgreSQL is likely to run on. I don't think there is
>> anything left to optimize there, all algorithms are pretty close to
>> maximum theoretical performance.
>
> Great work!

Thanks.

>> The worst case workload is set up using
>> CREATE TABLE sparse (id serial primary key, v text) WITH (fillfactor=10);
>> INSERT INTO sparse (v) SELECT REPEAT('x', 1000) FROM generate_series(1,100000);
>> VACUUM ANALYZE sparse;
>>
>> The test query itself is a simple SELECT count(v) FROM sparse;
>>
>> Results for the worst case workload:
>> No checksums:       tps = 14.710519
>> Fletcher checksums: tps = 10.825564 (1.359x slowdown)
>> CRC checksums:      tps =  5.844995 (2.517x slowdown)
>> SIMD checksums:     tps = 14.062388 (1.046x slowdown)
>
> I assume this is in the "bad region" identified by Greg, where there is
> no disk activity, but shared_buffers is small, leading to a lot of
> movement between the OS cache and shared buffers?
>
> What do you mean by TPS exactly? If the select query is writing hint
> bits, then you wouldn't be able to repeat it because they are already
> set. So are you repeating the creation/loading of the table, as well?

The table is created once, size is 800MB with one hinted tuple per
page. Shared buffers is set to 32MB, machine is Intel Core i5-2500K
with 16GB of memory (2 memory channels, 1333MHz, overheads are likely
to be larger with faster memory). This is the worst case workload for
in-memory workload that doesn't fit into shared_buffers as almost no
work other than swapping buffer pages in is done. I think things like
bitmap heap scans might show similar characteristics.

>> Results for pgbench scale 100:
>> No checksums:       tps = 56623.819783
>> Fletcher checksums: tps = 55282.222687 (1.024x slowdown)
>> CRC Checksums:      tps = 50571.324795 (1.120x slowdown)
>> SIMD Checksums:     tps = 56608.888985 (1.000x slowdown)
>>
>> So to conclude, the 3 approaches:
>
> Great analysis. Still a tough choice.
>
> One thing that might be interesting is to look at doing SIMD for both
> data and WAL. I wonder if that would be a noticeable speedup for WAL
> full-page writes? That would give greater justification for the extra
> work it will take (intrinsics/ASM), and it would be a nice win for
> non-checksum users.

Andres showed that switching out the existing CRC for zlib's would
result in 8-30% increase in INSERT-SELECT speed
(http://www.postgresql.org/message-id/201005202227.49990.andres@anarazel.de)
with the speeded up CRC still showing up as 10% of the profile. So I
guess another 5% speedup by doing the CRC 8 bytes at a time instead of
the used 4. And another couple % by using Fletcher or SIMD.

> I also notice that http://en.wikipedia.org/wiki/Fowler%E2%80%93Noll%E2%
> 80%93Vo_hash_function explicitly mentions adapting FNV to a smaller
> size. That gives me a little more confidence. Do you have other links we
> should read about this approach, or possible weaknesses?

It mentions that one should use 32bit FNV and fold it down to 16bit
via xor. This doesn't work here because SSE2 doesn't have pmulld
(SSE4.1). I have taken some liberties here by actually doing 64 16bit
FNV like operations in parallel and then doing an FNV like combination
of them at the end. However the choices there are concerned with good
hashing performance, while for checksums it should matter much even if
the average error detection rate goes from 99.998% to 99.99% as long
as common error scenarios don't match up with the collisions. If
decide to go this route we should definitely research what the
effectiveness consequences here are and what are good choices for the
prime values used. On the face of it multiply by prime and add/xor
looks like it provides pretty good mixing, resists transposed
sequences, zeroing out blocks. The worst case seems to be bit errors.
As far as I can see, this implementation should detect all single bit
errors, but if one of the bit errors is on MSB, a second single error
in MSB will cancel it out. I haven't done the math but it should still
work out as better than 99% chance to detect random 2 bit errors.

On Fri, Mar 22, 2013 at 8:00 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> On Fri, 2013-03-22 at 17:09 +0200, Ants Aasma wrote:
>> So to conclude, the 3 approaches:
>
> One other question: assuming that the algorithms use the full 16-bit
> space, is there a good way to avoid zero without skewing the result? Can
> we do something like un-finalize (after we figure out that it's zero),
> compute in an extra salt value, and then re-finalize? That might work
> for Fletcher; but I don't think that works for CRC or Fowler-Noll-Vo
> because the final value is the same as the state.

Taking the Fletcher or CRC32 result modulo 65521 (largest prime <
16bits) only gives a very slight skew that shouldn't really matter for
all practical purposes. For the SIMD FNV implementation we can just
reduce the 64 16bit values down to 4, concat them together to a single
64bit number (by just skipping the last two reduction steps) and take
a modulo from that.

Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de



pgsql-hackers by date:

Previous
From: Atri Sharma
Date:
Subject: Re: Page replacement algorithm in buffer cache
Next
From: Kevin Grittner
Date:
Subject: Re: Materialized view assertion failure in HEAD