From: Andres Freund
Subject: Re: POC: Cleaning up orphaned files using undo logs
Msg-id: 20190727022740.5ccut3sfrqdd5ec6@alap3.anarazel.de
In response to: Re: POC: Cleaning up orphaned files using undo logs (Thomas Munro <thomas.munro@gmail.com>)
List: pgsql-hackers

Hi,

On 2019-07-25 17:51:33 +1200, Thomas Munro wrote:
> 1.  WAL's use of fdatasync():  The reason we fill and then fsync()
> newly created WAL files up front is because we want to make sure the
> blocks are definitely on disk.  The comment doesn't spell out exactly
> why the author considered later fdatasync() calls to be insufficient,
> but they were: it was many years after commit 33cc5d8a4d0d that Linux
> ext3/4 filesystems began flushing file size changes to disk in
> fdatasync()[1][2].  I don't know if its original behaviour was
> intentional or not.  So, if you didn't use the bigger fsync() hammer
> on that OS, you might lose the end of a recently extended file in a
> power failure even though fdatasync() had returned success.
> 
> By my reading of POSIX, that shouldn't be necessary on a conforming
> implementation of fdatasync(), and that was fixed years ago in Linux.
> I'm not proposing any changes there, and I'm not proposing to take
> advantage of that in the new code.  I'm pointing out that we
> don't have to worry about that for these undo segments, because they
> are already flushed with fsync(), not fdatasync().

> (To understand POSIX's descriptions of fsync() and fdatasync() you
> have to find the meanings of "Synchronized I/O Data Integrity
> Completion" and "Synchronized I/O File Integrity Completion" elsewhere
> in the spec.  TL;DR: fdatasync() is only allowed to skip flushing
> attributes like the modified time, it's not allowed to skip flushing a
> file size change since that would interfere with retrieving the data.)

Note that there are very good performance reasons for trying to avoid
metadata changes at e.g. commit time. They're commonly journaled at the
FS level, which can add a good chunk of IO and synchronization to an
operation that we commonly want to be as fast as possible. Basically,
you often at least double the amount of synchronous writes.
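
To make that concrete, here's a minimal standalone sketch (plain POSIX
C, not actual PostgreSQL code; file name and sizes are made up) of why
pre-sizing a segment helps: once the file already has its final size, a
commit-time fdatasync() is a pure data flush, whereas extending the
file at commit time would additionally force a synchronous journal
write for the size change.

#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define SEG_SIZE (16 * 1024 * 1024)
#define BLKSZ 8192

int main(void)
{
    char zeroes[BLKSZ];
    char rec[BLKSZ];
    int fd = open("segment.demo", O_CREAT | O_EXCL | O_RDWR, 0600);

    if (fd < 0)
        exit(1);

    /* Segment creation: pay the size-change journaling once, up
     * front, flushed with the bigger fsync() hammer. */
    memset(zeroes, 0, sizeof(zeroes));
    for (off_t off = 0; off < SEG_SIZE; off += BLKSZ)
        if (pwrite(fd, zeroes, BLKSZ, off) != BLKSZ)
            exit(1);
    if (fsync(fd) != 0)
        exit(1);

    /* Commit time: the size no longer changes, so fdatasync() can
     * skip metadata and issue a single synchronous data write. */
    memset(rec, 'x', sizeof(rec));
    if (pwrite(fd, rec, BLKSZ, 0) != BLKSZ || fdatasync(fd) != 0)
        exit(1);

    close(fd);
    return 0;
}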

And in a potential future where we use async direct IO, writes that
change the file size take considerably slower codepaths and add a lot
of synchronization.

I suspect that's much more likely to be the reason for the
preallocation in 33cc5d8a4d0d than avoiding an ext* bug (I doubt the
bug you reference existed back then; IIUC it didn't apply to ext2, and
ext3 was introduced after 33cc5d8a4d0d).


> 2.  Time of reservation:  Although they don't call fsync(), regular
> relations and these new undo files still write zeroes up front
> (respectively, for a new block and for a new segment).  One reason for
> that is that most popular filesystems reserve space at write time, so
> you'll get ENOSPC when trying to allocate undo space, and that's a
> non-fatal ERROR.  If we deferred until writing back buffer contents,
> we might get file holes, and deferred ENOSPC is much harder to report
> to users and for users to deal with.

FWIW, I don't quite buy the file-hole part: we could zero the hole at
write-back time (and be no worse off than today, except that the work
might be done by somebody who didn't cause the extension), or, even
better, just look up the buffers between the FS end of the relation and
the block currently being written, and write them out in order, as
sketched below.
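
Something along these lines, as a standalone sketch of the simpler
zero-the-hole variant (write_block_no_holes() is an invented name, and
real code would write the intervening *buffers* where they exist
rather than plain zeroes):

#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

#define BLKSZ 8192

/* Before writing 'target_blk' past the current EOF, zero-fill the
 * intervening blocks in ascending order, so the file never contains
 * a hole and ENOSPC surfaces here, synchronously. */
static int
write_block_no_holes(int fd, const char *buf, off_t target_blk)
{
    struct stat st;
    char zeroes[BLKSZ];

    if (fstat(fd, &st) != 0)
        return -1;
    memset(zeroes, 0, sizeof(zeroes));

    for (off_t blk = st.st_size / BLKSZ; blk < target_blk; blk++)
        if (pwrite(fd, zeroes, BLKSZ, blk * BLKSZ) != BLKSZ)
            return -1;

    return pwrite(fd, buf, BLKSZ, target_blk * BLKSZ) == BLKSZ ? 0 : -1;
}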

The point about deferred ENOSPC being harder to report to users is
obviously true regardless of that.



> BTW we could probably use posix_fallocate() instead of writing zeroes;
> I think Andres mentioned that recently.  I see also that someone tried
> that for WAL and it got reverted back in 2013 (commit
> b1892aaeaaf34d8d1637221fc1cbda82ac3fcd71, I didn't try to hunt down
> the discussion).

IIRC the problem from back then was that while the space is reserved at
the FS level, the actual blocks don't contain zeroes at that time,
which means that:

a) Small writes need to write more, because the surrounding data also
   needs to be zeroed (annoying but not terrible).

b) Writes into the fallocated but not yet written range IIRC
   effectively cause metadata writes: while the "allocated" file ending
   doesn't change anymore, the new "non-zero data written to" file
   ending does need to be journaled to disk before an f[data]sync;
   otherwise you could end up with the old value after a crash, and
   would read spurious zeroes (see the sketch below).

   That's quite bad.
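
As a standalone illustration of (b), assuming an ext4-style
implementation of unwritten extents (file name and sizes invented):

#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    char buf[8192];
    int fd = open("undo.demo", O_CREAT | O_EXCL | O_WRONLY, 0600);

    if (fd < 0)
        exit(1);

    /* Reserves the blocks (any ENOSPC happens here), but on most
     * filesystems only marks the extent "unwritten"; its contents
     * logically read as zeroes. */
    if (posix_fallocate(fd, 0, 1024 * 1024) != 0)
        exit(1);

    /* This converts part of the unwritten extent to written... */
    memset(buf, 'x', sizeof(buf));
    if (pwrite(fd, buf, sizeof(buf), 0) != (ssize_t) sizeof(buf))
        exit(1);

    /* ...so even this "data only" flush has to journal the extent
     * state change; otherwise a crash could leave the extent marked
     * unwritten and the data would read back as spurious zeroes. */
    if (fdatasync(fd) != 0)
        exit(1);

    close(fd);
    return 0;
}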

Those don't necessarily apply to e.g. extending relations, as we don't
granularly fsync them. Although even there the performance picture is
mixed: it helps a lot in certain workloads, but there are others where
it mildly regresses performance on ext4. Not sure why yet; possibly
it's due to more heavyweight locking needed when later changing the
"non-zero size", or it's the additional metadata changes. I suspect
those would be mostly gone if we didn't write back blocks in random
order under memory pressure.

Note that neither of those means that it's a bad idea to
posix_fallocate() and *then* write zeroes, when initializing. On
several filesystems that's likely to result in more optimally sized
filesystem extents, reducing fragmentation. And without an intervening
f[data]sync, there's not much additional metadata journaling. Although
that's less of an issue on some newer filesystems, IIRC (due to delayed
allocation).
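
A sketch of that initialization pattern (init_segment() and the sizes
are invented for illustration, not the actual undo/WAL segment code):

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define SEG_SIZE (1024 * 1024)
#define BLKSZ 8192

static int
init_segment(int fd)
{
    char zeroes[BLKSZ];

    /* Reserve the full size in one call, giving the filesystem a
     * chance to pick fewer, larger extents than growing the file
     * write by write would. */
    if (posix_fallocate(fd, 0, SEG_SIZE) != 0)
        return -1;

    /* Then overwrite with real zeroes, converting the unwritten
     * extents once, now.  With no f[data]sync in between, this adds
     * little extra metadata journaling. */
    memset(zeroes, 0, sizeof(zeroes));
    for (off_t off = 0; off < SEG_SIZE; off += BLKSZ)
        if (pwrite(fd, zeroes, BLKSZ, off) != BLKSZ)
            return -1;

    return fsync(fd);
}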

Greetings,

Andres Freund


