Re: AW: WAL does not recover gracefully from out-of-disk-space - Mailing list pgsql-hackers

From Tom Lane
Subject Re: AW: WAL does not recover gracefully from out-of-disk-space
Date
Msg-id 6717.984150433@sss.pgh.pa.us
In response to AW: WAL does not recover gracefully from out-of-disk-space  (Zeugswetter Andreas SB <ZeugswetterA@wien.spardat.at>)
List pgsql-hackers
Zeugswetter Andreas SB  <ZeugswetterA@wien.spardat.at> writes:
> Even with true fdatasync it's not obviously good for performance - it takes
> too long time to write 16Mb files and fills OS buffer cache with trash-:(
>> 
>> True.  But at least the write is (hopefully) being done at a
>> non-performance-critical time.

> So you have non critical time every five minutes ?
> Those platforms that don't have fdatasync won't profit anyway.

Yes they will; you're forgetting the cost of updating filesystem metadata.

Suppose that we do not preallocate the log files.  Each WAL fsync will
require a write of the added data block(s), plus a write of at least one
indirect block to record the allocation of new blocks to the file, plus
a write of the file's inode, plus a write of the cylinder group's
free-space bitmap.  It takes extremely lucky placement of the file and
indirect blocks to achieve less than four seeks per WAL block written.
Total cost to write a 16MB file: roughly eight thousand seeks (2048
blocks at four seeks each), assuming 8K block size.  Even if we consider
it safe to use fdatasync in this scenario, it will save only one of the
four seeks, since the indirect block and free-space bitmap *must* be
updated regardless.

Now consider the preallocation approach.  In the preallocation phase,
we write like mad and then fsync the file ONCE.  This means *one* write
of each affected data block, indirect block, freespace map block, and
the inode, versus one write of each data block and circa two thousand
writes of each of the others.  Furthermore the kernel is free to schedule these
writes in some reasonable fashion, and so we may hope that something
less than two thousand seeks will be used to do it.
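
A minimal sketch of the preallocation phase described above, assuming a
POSIX system.  The function name prealloc_segment and the constants are
hypothetical, chosen for illustration; this is not the actual backend
code.  It zero-fills a 16MB segment in 8K blocks with no sync between
writes, then issues a single fsync.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define SEG_SIZE   (16 * 1024 * 1024)  /* 16MB log segment, as in the text */
#define BLOCK_SIZE 8192                /* 8K block size, as in the text */

/*
 * Hypothetical helper: create a zero-filled log segment and fsync it once.
 * Returns an open file descriptor positioned at offset 0, or -1 on error.
 */
int prealloc_segment(const char *path)
{
    char  zeros[BLOCK_SIZE];
    off_t written;
    int   fd;

    memset(zeros, 0, sizeof(zeros));

    fd = open(path, O_RDWR | O_CREAT | O_EXCL, 0600);
    if (fd < 0)
        return -1;

    /* "Write like mad": no sync between blocks, so the kernel is free
     * to schedule the physical writes as it sees fit. */
    for (written = 0; written < SEG_SIZE; written += BLOCK_SIZE)
    {
        if (write(fd, zeros, BLOCK_SIZE) != BLOCK_SIZE)
        {
            /* Most likely out of disk space; remove the partial file. */
            close(fd);
            unlink(path);
            return -1;
        }
    }

    /* A single fsync flushes the file's data and its metadata. */
    if (fsync(fd) != 0 || lseek(fd, 0, SEEK_SET) != 0)
    {
        close(fd);
        unlink(path);
        return -1;
    }
    return fd;
}
```

A side effect of this shape is that an out-of-disk-space failure shows
up here, during the fill, rather than later in the middle of a commit.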

Then we come to the phase of actually writing the file.  No indirect
block or freespace bitmap updates will occur.  On a machine that
implements fdatasync, we write data blocks and nothing else.  One
seek per block written, possibly no seeks if the layout is good.
Even if we don't have fdatasync, it's only two seeks per block written
(the block and the inode only).  So, at worst four thousand seeks in
this phase, at best much less than two thousand.
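
The write phase could look roughly like the following sketch.  The name
log_write_block and the HAVE_FDATASYNC symbol are assumptions made for
illustration (the latter stands for whatever a configure test would
define).  Because the block being overwritten is already allocated, the
file does not grow, and fdatasync, where available, need not flush the
inode's timestamps.

```c
#include <sys/types.h>
#include <unistd.h>

#define BLOCK_SIZE 8192   /* 8K block size, as in the text */

/*
 * Hypothetical helper: overwrite one block of an already-allocated
 * segment and force it to disk.  Returns 0 on success, -1 on error.
 */
int log_write_block(int fd, off_t blockno, const char *buf)
{
    off_t offset = blockno * BLOCK_SIZE;

    /* Overwrite in place; no new blocks are allocated to the file. */
    if (pwrite(fd, buf, BLOCK_SIZE, offset) != BLOCK_SIZE)
        return -1;

#ifdef HAVE_FDATASYNC            /* assumed configure-time symbol */
    return fdatasync(fd);        /* data only: one seek per block */
#else
    return fsync(fd);            /* data plus inode: two seeks per block */
#endif
}
```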

Bottom line is that it should take fewer seeks overall to do it this
way, even on a machine without fdatasync, and even if we don't get to
count any benefit from doing a large part of the work outside the
critical path of transaction commit.
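
The seek counts above can be tallied with a small model.  This is only
the back-of-envelope arithmetic from this message, not a measurement.
The function names are hypothetical, and the fill phase is charged a
pessimistic one seek per data block, ignoring the handful of one-time
metadata writes.

```c
#define SEG_BYTES   (16L * 1024 * 1024)  /* 16MB segment */
#define BLOCK_BYTES 8192L                /* 8K blocks */

/* Number of data blocks in one segment: 2048. */
long blocks_per_segment(void)
{
    return SEG_BYTES / BLOCK_BYTES;
}

/*
 * No preallocation: each synced block costs a data write, an indirect
 * block write, a free-space bitmap write, and an inode write.
 * fdatasync saves only the inode write.
 */
long seeks_no_prealloc(int have_fdatasync)
{
    return blocks_per_segment() * (have_fdatasync ? 3 : 4);
}

/*
 * Preallocation, worst case: one seek per block during the fill
 * (pessimistic assumption), then one seek per block with fdatasync
 * or two (data plus inode) without it.
 */
long seeks_prealloc_worst(int have_fdatasync)
{
    long fill  = blocks_per_segment();
    long write = blocks_per_segment() * (have_fdatasync ? 1 : 2);

    return fill + write;
}
```

Under these assumptions a machine without fdatasync comes out at 6144
seeks with preallocation against 8192 without it, and a machine with
fdatasync at 4096 against 6144.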

Also, given that modern systems *do* have fdatasync, I do not see why
we should not optimize for that case.

It is true that prezeroing the file will tend to fill the kernel's disk
cache with entirely useless blocks.  I don't know of any portable way
around that, but even an unportable way might be worth #ifdefing in on
platforms where it works.  Does anyone know a way of suppressing caching
of outgoing blocks, or flushing them from the kernel's cache right away?
        regards, tom lane

