Re: WAL CPU overhead/optimization (was Master-slave visibility order) - Mailing list pgsql-hackers

From Ants Aasma
Subject Re: WAL CPU overhead/optimization (was Master-slave visibility order)
Date
Msg-id CA+CSw_tz5-ErTgj6SWghiTVEOx9r63=bVnTB9WEgEjqqwb68nQ@mail.gmail.com
In response to WAL CPU overhead/optimization (was Master-slave visibility order)  (Andres Freund <andres@2ndquadrant.com>)
Responses Re: WAL CPU overhead/optimization (was Master-slave visibility order)  (Andres Freund <andres@2ndquadrant.com>)
List pgsql-hackers
On Fri, Aug 30, 2013 at 1:30 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-08-30 01:10:40 +0300, Ants Aasma wrote:
>> On Fri, Aug 30, 2013 at 12:33 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> > FWIW, WAL is still the major bottleneck for INSERT heavy workloads. The
>> > per CPU overhead actually minimally increased (at least in my tests), it
>> > just scales noticeably better than before.
>>
>> Interesting. Do you have any insight what is behind the CPU overhead?
>> Maybe the solution is to make WAL insertion cheap enough to not
>> matter. That won't be easy, but neither are the alternatives.
>
> Funnily by far the biggest thing I have seen in benchmarks is the CRC32
> computation. I plan to brush up my ~3 year old CRC32 reimplementation
> patch sometime, but afair you had a much better one?
>
> I have some doubts about weakening the hash function by also using FNV
> or similar here, so I'd first like to try how much of a difference a
> better CRC32 implementation can make with the current XLogInsert()
> implementation.

The CRC32 implementations differ mostly in the number of table lookups
that are done in parallel. PostgreSQL does 1 lookup, IIRC the zlib
implementation does 4, and Intel has a paper that recommends going up to
8. The tradeoff is that each level requires a 4KB lookup table, so for
small records the additional cache misses will probably kill any
speedup.
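
To make the "lookups in parallel" idea concrete, here is a minimal sketch
of slice-by-1 next to slice-by-4. It assumes the standard reflected CRC-32
polynomial (0xEDB88320, the one zlib uses), not PostgreSQL's actual WAL CRC
code, and all function and table names are made up for illustration. The
four 256-entry tables in this sketch take 4 KB in total.

```c
#include <stddef.h>
#include <stdint.h>

/* crc_table[t][i] = CRC state after byte i followed by t zero bytes.
 * Assumption: reflected CRC-32 polynomial 0xEDB88320 (as in zlib). */
static uint32_t crc_table[4][256];

static void crc32_init_tables(void)
{
    for (uint32_t i = 0; i < 256; i++) {
        uint32_t c = i;
        for (int k = 0; k < 8; k++)
            c = (c & 1) ? (c >> 1) ^ 0xEDB88320u : c >> 1;
        crc_table[0][i] = c;
    }
    /* Each further table advances the previous one by one more zero byte. */
    for (uint32_t i = 0; i < 256; i++) {
        for (int t = 1; t < 4; t++) {
            uint32_t prev = crc_table[t - 1][i];
            crc_table[t][i] = (prev >> 8) ^ crc_table[0][prev & 0xFF];
        }
    }
}

/* Slice-by-1: one dependent table lookup per input byte. */
static uint32_t crc32_slice1(const unsigned char *p, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    while (len--)
        crc = (crc >> 8) ^ crc_table[0][(crc ^ *p++) & 0xFF];
    return crc ^ 0xFFFFFFFFu;
}

/* Slice-by-4: four independent lookups per 4 input bytes. */
static uint32_t crc32_slice4(const unsigned char *p, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    while (len >= 4) {
        /* Assemble the word byte by byte so the result does not
         * depend on host endianness or alignment. */
        crc ^= (uint32_t)p[0] | (uint32_t)p[1] << 8 |
               (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
        crc = crc_table[3][crc & 0xFF] ^
              crc_table[2][(crc >> 8) & 0xFF] ^
              crc_table[1][(crc >> 16) & 0xFF] ^
              crc_table[0][crc >> 24];
        p += 4;
        len -= 4;
    }
    /* Tail bytes fall back to the byte-at-a-time loop. */
    while (len--)
        crc = (crc >> 8) ^ crc_table[0][(crc ^ *p++) & 0xFF];
    return crc ^ 0xFFFFFFFFu;
}
```

The four lookups inside the main loop do not depend on each other, so the
CPU can overlap them. The price is that the working set grows with the
number of tables, which is where the cache-miss concern for small records
comes from.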

A quick overview of the hot-cache, large-buffer performance of a few
interesting options:
crc32 slice-by-1: 0.148 bytes/cycle
crc32 slice-by-4: 0.392 bytes/cycle
crc32 slice-by-8: 0.654 bytes/cycle
crc32c instruction pipelined by 3: 6.8 bytes/cycle (number from Intel's paper)
FNV 1 byte at a time version: 0.333 bytes/cycle
md5: 0.159 bytes/cycle
Murmur3A: 1.019 bytes/cycle
CityHash64: 4.246 bytes/cycle
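
For reference, a byte-at-a-time FNV looks roughly like the sketch below. The
numbers above don't say which variant or width was measured, so this
assumes 32-bit FNV-1 (multiply, then xor) with the published offset basis
and prime; the function name is made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: 32-bit FNV-1 (multiply, then xor), one byte per iteration. */
uint32_t fnv1_32(const unsigned char *p, size_t len)
{
    uint32_t h = 2166136261u;      /* FNV-1 32-bit offset basis */
    while (len--) {
        h *= 16777619u;            /* FNV 32-bit prime */
        h ^= *p++;                 /* mix in the next input byte */
    }
    return h;
}
```

Every iteration needs the result of the previous multiply, so the loop is
bound by multiply latency rather than throughput. That would be consistent
with the roughly 3 cycles per byte (0.333 bytes/cycle) measured above.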

CityHash64 actually looks pretty good; there are no known hash quality
issues. Compared to CRC, the only weakening is that single-bit errors
are not guaranteed to be 100% detected. There's also the issue that
only a 64-bit implementation exists, but I'm sure this can be resolved
(if necessary, we can just use Murmur3 on 32-bit).

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de


