Re: fallocate / posix_fallocate for new WAL file creation (etc...) - Mailing list pgsql-hackers

From Jon Nelson
Subject Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date
Msg-id CAKuK5J2juNZRwsiR0+S4iNQLCmCWFErNzkD7SJeCPPJY+_zefA@mail.gmail.com
In response to Re: fallocate / posix_fallocate for new WAL file creation (etc...)  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: fallocate / posix_fallocate for new WAL file creation (etc...)  (Andres Freund <andres@2ndquadrant.com>)
Re: fallocate / posix_fallocate for new WAL file creation (etc...)  (Greg Smith <greg@2ndQuadrant.com>)
List pgsql-hackers
On Tue, May 28, 2013 at 9:19 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, May 28, 2013 at 10:15 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> On 2013-05-28 10:03:58 -0400, Robert Haas wrote:
>>> On Sat, May 25, 2013 at 2:55 PM, Jon Nelson <jnelson+pgsql@jamponi.net> wrote:
>>> >> The biggest thing missing from this submission is information about what
>>> >> performance testing you did.  Ideally performance patches are submitted with
>>> >> enough information for a reviewer to duplicate the same test the author did,
>>> >> as well as hard before/after performance numbers from your test system.  It
>>> >> often turns out to be tricky to duplicate a performance gain, and being able to run
>>> >> the same test used for initial development eliminates a lot of the problems.
>>> >
>>> > This has been a bit of a struggle. While it's true that WAL file
>>> > creation doesn't happen with great frequency, and while it's also true
>>> > that - with strace and other tests - it can be proven that
>>> > fallocate(16MB) is much quicker than writing zeroes by hand,
>>> > proving that in the larger context of a running install has been
>>> > challenging.
>>>
>>> It's nice to be able to test things in the context of a running
>>> install, but sometimes a microbenchmark is just as good.  I mean, if
>>> posix_fallocate() is faster, then it's just faster, right?
>>
>> Well, it's a bit more complex than that. Fallocate doesn't actually
>> initialize the disk space in most filesystems, just marks it as
>> allocated and zeroed which is one of the reasons it can be noticeably
>> faster. But that can make the runtime overhead of writing to those pages
>> higher.
>
> Maybe it would be good to measure that impact.  Something like this:
>
> 1. Write 16MB of zeroes to an empty file in the same size chunks we're
> currently using (8kB?).  Time that.  Rewrite the file with real data.
> Time that.
> 2. posix_fallocate() an empty file out to 16MB.  Time that.  Rewrite
> the file with real data.  Time that.
>
> Personally, I have trouble believing that writing 16MB of zeroes by
> hand is "better" than telling the OS to do it for us.  If that's so,
> the OS is just stupid, because it ought to be able to optimize the
> crap out of that compared to anything we can do.  Of course, it is
> more than possible that the OS is in fact stupid.  But I'd like to
> hope not.

I wrote a little C program to do something very similar to that (which
I hope to post later today).
It opens a new file, fallocates 16MB, and calls fdatasync.  Then it loops
10 times:  it seeks to the start of the file, writes 16MB of ones, and
calls fdatasync.
Then it closes and removes the file, re-opens it, and this time writes
out 16MB of zeroes, calls fdatasync, and then does the same loop as
above. The program times the process from file open to file unlink,
inclusive.
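
For reference, a procedure like the one described above might be
sketched roughly as follows. This is not the program mentioned in this
message, only a minimal illustration under stated assumptions: Linux
with glibc, posix_fallocate() rather than fallocate(), 8kB write
chunks, and 0xFF bytes standing in for "ones". The names
time_segment, fill_and_sync, CHUNK, and WAL_BENCH_MAIN are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

#define CHUNK 8192  /* assumed write size; the thread only guesses "8kB?" */

/* Rewind, write `size` bytes of `byte` in CHUNK-sized pieces, then fdatasync. */
static int fill_and_sync(int fd, off_t size, unsigned char byte)
{
    unsigned char buf[CHUNK];

    memset(buf, byte, sizeof buf);
    if (lseek(fd, 0, SEEK_SET) < 0)
        return -1;
    for (off_t done = 0; done < size; done += CHUNK)
        if (write(fd, buf, CHUNK) != CHUNK)
            return -1;
    return fdatasync(fd);
}

/*
 * Create the file, initialize it (posix_fallocate or zero-fill), overwrite it
 * `rewrites` times, then close and unlink it.  Returns the elapsed seconds
 * from open to unlink inclusive, or -1.0 on error.
 */
double time_segment(const char *path, int use_fallocate, off_t size, int rewrites)
{
    struct timespec t0, t1;
    int rc;

    assert(size > 0 && size % CHUNK == 0);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    int fd = open(path, O_CREAT | O_TRUNC | O_WRONLY, 0600);
    if (fd < 0)
        return -1.0;
    if (use_fallocate) {
        rc = posix_fallocate(fd, 0, size);  /* 0 on success, else an errno value */
        if (rc == 0)
            rc = fdatasync(fd);
    } else {
        rc = fill_and_sync(fd, size, 0x00);
    }
    for (int i = 0; rc == 0 && i < rewrites; i++)
        rc = fill_and_sync(fd, size, 0xFF);  /* "ones" assumed to mean 0xFF bytes */
    close(fd);
    unlink(path);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    if (rc != 0)
        return -1.0;
    return (double)(t1.tv_sec - t0.tv_sec) + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}

#ifdef WAL_BENCH_MAIN
int main(void)
{
    const off_t seg = 16 * 1024 * 1024;  /* 16MB, as in the thread */

    printf("posix_fallocate: %.3f s\n", time_segment("walbench.tmp", 1, seg, 10));
    printf("zero-fill:       %.3f s\n", time_segment("walbench.tmp", 0, seg, 10));
    return 0;
}
#endif
```

Compiling with -DWAL_BENCH_MAIN builds a standalone binary that times
both variants. Timings depend heavily on the filesystem and storage,
so several runs on the target system would be needed before drawing
conclusions.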

The results - for me - are pretty consistent: using fallocate is
12-13% quicker than writing out zeroes. I used fdatasync twice to
(attempt to) mimic what the WAL writer does.

--
Jon


