Re: WAL CPU overhead/optimization (was Master-slave visibility order) - Mailing list pgsql-hackers

From Andres Freund
Subject Re: WAL CPU overhead/optimization (was Master-slave visibility order)
Date
Msg-id 20130830000243.GH4283@awork2.anarazel.de
In response to Re: WAL CPU overhead/optimization (was Master-slave visibility order)  (Ants Aasma <ants@cybertec.at>)
Responses Re: WAL CPU overhead/optimization (was Master-slave visibility order)
List pgsql-hackers
On 2013-08-30 02:53:54 +0300, Ants Aasma wrote:
> On Fri, Aug 30, 2013 at 1:30 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > On 2013-08-30 01:10:40 +0300, Ants Aasma wrote:
> >> On Fri, Aug 30, 2013 at 12:33 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> > FWIW, WAL is still the major bottleneck for INSERT heavy workloads. The
> >> > per CPU overhead actually minimally increased (at least in my tests), it
> >> > just scales noticeably better than before.
> >>
> >> Interesting. Do you have any insight what is behind the CPU overhead?
> >> Maybe the solution is to make WAL insertion cheap enough to not
> >> matter. That won't be easy, but neither are the alternatives.
> >
> > Funnily by far the biggest thing I have seen in benchmarks is the CRC32
> > computation. I plan to brush up my ~3 year old CRC32 reimplementation
> > patch sometime, but afair you had a much better one?
> >
> > I have some doubts about weakening the hash function by also using FNV
> > or similar here, so I'd first like to try how much of a difference a
> > better CRC32 implementation can make with the current XLogInsert()
> > implementation.
> 
> The CRC32 implementations mostly differ by the number of lookups that
> are done in parallel. PostgreSQL does 1 lookup, IIRC the zlib
> implementation does 4, Intel has a paper that recommends going up to
> 8. The tradeoff is that each level requires a 4KB lookup table - for
> small records the additional cache misses will probably kill any
> speedup.
> 
> A quick overview of the hot cache large buffer performance of a few
> interesting options:
> [interesting data]

I am not sure "hot cache large buffer performance" is really the
interesting case. Most XLogInsert() calls are pretty small in the
common workloads. I vaguely recall trying 8 parallel lookups and getting
worse performance on many workloads, but that might have been a problem
of my implementation.
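
For readers unfamiliar with what "lookups done in parallel" means here, the
following is a minimal sketch contrasting a 1-lookup (byte-at-a-time) CRC32
with a 4-lookup ("slicing-by-4") variant. It is not PostgreSQL's code and not
either of the patches mentioned in this thread. It assumes the standard
reflected CRC-32 polynomial 0xEDB88320 as used by zlib, and all function and
table names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Four tables of 256 32-bit entries: 1 KB each, 4 KB total in this sketch. */
static uint32_t crc_table[4][256];

/* Build the tables at runtime (assumed polynomial: reflected 0xEDB88320). */
static void crc32_init_tables(void)
{
    for (uint32_t i = 0; i < 256; i++) {
        uint32_t c = i;
        for (int k = 0; k < 8; k++)
            c = (c & 1) ? (c >> 1) ^ 0xEDB88320u : c >> 1;
        crc_table[0][i] = c;
    }
    /* Table t holds the CRC of byte i followed by t zero bytes. */
    for (uint32_t i = 0; i < 256; i++) {
        for (int t = 1; t < 4; t++) {
            uint32_t prev = crc_table[t - 1][i];
            crc_table[t][i] = (prev >> 8) ^ crc_table[0][prev & 0xFF];
        }
    }
}

/* One table lookup per input byte; touches only crc_table[0]. */
static uint32_t crc32_bytewise(const unsigned char *p, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    while (len--)
        crc = (crc >> 8) ^ crc_table[0][(crc ^ *p++) & 0xFF];
    return crc ^ 0xFFFFFFFFu;
}

/* Four independent lookups per 4 input bytes; touches all four tables. */
static uint32_t crc32_slice4(const unsigned char *p, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    while (len >= 4) {
        /* Assemble the word byte by byte so the result is endian-neutral. */
        crc ^= (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
               ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
        crc = crc_table[3][crc & 0xFF] ^
              crc_table[2][(crc >> 8) & 0xFF] ^
              crc_table[1][(crc >> 16) & 0xFF] ^
              crc_table[0][crc >> 24];
        p += 4;
        len -= 4;
    }
    /* Remaining 0-3 bytes fall back to the byte-at-a-time loop. */
    while (len--)
        crc = (crc >> 8) ^ crc_table[0][(crc ^ *p++) & 0xFF];
    return crc ^ 0xFFFFFFFFu;
}
```

Both functions return the same value for the same input after
crc32_init_tables() has been called. The sliced variant breaks the
per-byte dependency chain, but a short record with cold tables pays for
cache misses across four tables instead of one, which is the small-record
concern raised above.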

The reason I'd like to go for a faster CRC32 implementation as a first
step is that it's easy. Easy to verify, easy to analyze, easy to
back out. I personally don't have enough interest/time in the 9.4 cycle
to pursue conversion to a different algorithm (I find the idea of using
different ones on 32/64bit pretty bad), but I obviously won't stop
somebody else ;)

Greetings,

Andres Freund

-- 
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


