Thread: mmap for zeroing WAL log

mmap for zeroing WAL log

From: Tom Lane
[ redirected to pgsql-hackers instead of -patches ]

Matthew Kirkwood <matthew@hairy.beasts.org> writes:
> On Sat, 24 Feb 2001, Bruce Momjian wrote:
>> I am confused why mmap() is better than writing to a real file.

> It isn't, except that it allows initialising the logfile in
> one syscall, without first allocating and zeroing (and hence
> dirtying) 16Mb of memory.

Uh, the existing code does not zero 16Mb of memory... it zeroes
8K and then writes that block repeatedly.  It's possible that the
overhead of a syscall for each 8K block is significant, but on the
other hand writing a block at a time is a heavily used and heavily
optimized path in all Unixen.  It's at least as plausible that the
mmap-as-source-of-zeroes path will be slower!

I think this is worth looking into, but I'm very far from being
sold on it...
        regards, tom lane


Re: mmap for zeroing WAL log

From: Tom Lane
Matthew Kirkwood <matthew@hairy.beasts.org> writes:
> I had assumed that the overhead would come from synchronous
> metadata incurring writes of at least the inode, block bitmap
> and probably an indirect block for each syscall.

No Unix that I've ever heard of forces metadata to disk after each
"write" call; anyone who tried it would have abysmal performance.
That's what fsync and the syncer daemon are for.
        regards, tom lane


Re: mmap for zeroing WAL log

From: Matthew Kirkwood
On Sat, 24 Feb 2001, Tom Lane wrote:

> >> I am confused why mmap() is better than writing to a real file.
> 
> > It isn't, except that it allows initialising the logfile in
> > one syscall, without first allocating and zeroing (and hence
> > dirtying) 16Mb of memory.
> 
> Uh, the existing code does not zero 16Mb of memory... it zeroes
> 8K and then writes that block repeatedly.

See the "one syscall" bit above.

> It's possible that the overhead of a syscall for each 8K block is
> significant,

I had assumed that the overhead would come from synchronous
metadata incurring writes of at least the inode, block bitmap
and probably an indirect block for each syscall.

> but on the other hand writing a block at a time is a heavily used and
> heavily optimized path in all Unixen.  It's at least as plausible that
> the mmap-as-source-of-zeroes path will be slower!

Results:

On Linux/ext2, it appears good for a gain of 3-5% for log
creations (via a fairly minimal test program).

On FreeBSD 4.1-RELEASE/ffs (with all of sync/async/softupdates)
it is a couple of percent worse in elapsed time, but consumes
around a third more system CPU time (12sec vs 9sec on one test
system).

I am awaiting numbers from reiserfs but, for now, it looks like
I am far from vindicated.

Matthew.
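
[Editor's sketch] The test program behind these numbers is not shown in the thread. One way elapsed time and system CPU time could be measured separately is sketched below, using gettimeofday() and getrusage(). The struct, the function names, and the stand-in workload create_empty_file are all hypothetical.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/resource.h>
#include <sys/time.h>
#include <unistd.h>

struct timing {
    double elapsed_sec;   /* wall-clock time */
    double system_sec;    /* CPU time spent in the kernel */
};

/* Difference between two timevals, in seconds. */
static double tv_diff(struct timeval a, struct timeval b)
{
    return (double) (b.tv_sec - a.tv_sec) + (b.tv_usec - a.tv_usec) / 1e6;
}

/* Stand-in workload: create an empty file and remove it again. */
int create_empty_file(const char *path)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    if (close(fd) != 0)
        return -1;
    return unlink(path);
}

/* Run fn(arg) the given number of times and record both timings. */
int time_runs(int (*fn)(const char *), const char *arg, int iterations,
              struct timing *out)
{
    struct timeval start, end;
    struct rusage before, after;

    if (gettimeofday(&start, NULL) != 0 ||
        getrusage(RUSAGE_SELF, &before) != 0)
        return -1;

    for (int i = 0; i < iterations; i++) {
        if (fn(arg) != 0)
            return -1;
    }

    if (gettimeofday(&end, NULL) != 0 ||
        getrusage(RUSAGE_SELF, &after) != 0)
        return -1;

    out->elapsed_sec = tv_diff(start, end);
    out->system_sec = tv_diff(before.ru_stime, after.ru_stime);
    return 0;
}
```

Passing a log-creation function in place of create_empty_file would give the two figures reported above: elapsed time, and system CPU time of the kind quoted as 12sec vs 9sec.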



Re: mmap for zeroing WAL log

From: Matthew Kirkwood
On Tue, 27 Feb 2001, Tom Lane wrote:

> Matthew Kirkwood <matthew@hairy.beasts.org> writes:
> > I had assumed that the overhead would come from synchronous
> > metadata incurring writes of at least the inode, block bitmap
> > and probably an indirect block for each syscall.
>
> No Unix that I've ever heard of forces metadata to disk after each
> "write" call; anyone who tried it would have abysmal performance.
> That's what fsync and the syncer daemon are for.

My understanding was that that's exactly what ffs' synchronous
metadata writes do.

Am I missing something here?  Do they just schedule I/O, but
return without waiting for its completion?

Matthew.