Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS - Mailing list pgsql-hackers

From Craig Ringer
Subject Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Date
Msg-id CAMsr+YFRh38B7XdE7BKheVDY1eBWtmb8e+NfwRO8kTou4H9WKw@mail.gmail.com
In response to Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS  (Bruce Momjian <bruce@momjian.us>)
List pgsql-hackers
On 18 April 2018 at 19:46, Bruce Momjian <bruce@momjian.us> wrote:

> So, if sync mode passes the write to NFS, and NFS pre-reserves write
> space, and throws an error on reservation failure, that means that NFS
> will not corrupt a cluster on out-of-space errors.

Yeah. I need to verify that with a concrete test case.

The thing is that write() is allowed to be asynchronous anyway. Most
file systems choose to implement eager reservation of space, but it's
not mandated. AFAICS that's largely a historical accident to keep
applications happy, because FSes used to *allocate* the space at
write() time too, and when they moved to delayed allocations, apps
tended to break too easily unless they at least reserved space. NFS
would have to do a round-trip on write() to reserve space.

The Linux man pages (http://man7.org/linux/man-pages/man2/write.2.html) say:

"
       A successful return from write() does not make any guarantee that
       data has been committed to disk.  On some filesystems, including NFS,
       it does not even guarantee that space has successfully been reserved
       for the data.  In this case, some errors might be delayed until a
       future write(2), fsync(2), or even close(2).  The only way to be sure
       is to call fsync(2) after you are done writing all your data.
"

... and I'm inclined to believe it when it refuses to make guarantees.
Especially lately.
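To make that concrete, here's a rough sketch (mine, not something from
the thread) of the only pattern the man page actually stands behind:
check the return value of fsync() - and close() - rather than trusting
write() alone. The file name and sizes are arbitrary.

/* Minimal sketch: a successful write() proves nothing about durability;
 * errors the kernel has deferred may only surface at fsync() or even
 * close(), so both must be checked. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("datafile", O_WRONLY | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    const char buf[8192] = {0};
    if (write(fd, buf, sizeof(buf)) != (ssize_t) sizeof(buf)) {
        perror("write");        /* may or may not fire on NFS / thin storage */
        return 1;
    }

    /* A deferred ENOSPC or EIO, if any, tends to show up here ... */
    if (fsync(fd) != 0) {
        perror("fsync");        /* this is the error you must not ignore */
        return 1;
    }

    /* ... or, on some filesystems, even here. */
    if (close(fd) != 0) {
        perror("close");
        return 1;
    }
    return 0;
}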

> So, what about thin provisioning?  I can understand sharing _free_ space
> among file systems

Most thin provisioning is done at the block level, not file system
level. So the FS is usually unaware it's on a thin-provisioned volume.
Usually the whole kernel is unaware, because the thin provisioning is
done on the SAN end or by a hypervisor. But the same sort of thing may
be done via LVM - see lvmthin. For example, you might create 100
different 1TB ext4 FSes, each on a 1TB iSCSI volume, all backed by a
SAN with only 50TB of concrete physical capacity. The SAN does block
mapping and only allocates a new storage chunk to a given volume once
the FS has written to the blocks in the chunks already allocated to it.
It may also do things like block de-duplication, compression of storage
chunks that aren't written to for a while, etc.

The idea is that when the SAN's actual physically allocated storage
gets to 40TB it starts telling you to go buy another rack of storage
so you don't run out. You don't have to resize volumes, resize file
systems, etc. All the storage space admin is centralized on the SAN
and storage team, and your sysadmins, DBAs and app devs are none the
wiser. You buy storage when you need it, not when the DBA demands a
200% free space margin just in case. Whether or not you agree
with this philosophy or think it's sensible is kind of moot, because
it's an extremely widespread model, and servers you work on may well
be backed by thin provisioned storage _even if you don't know it_.

Think of it as a bit like VM overcommit, for storage. You can malloc()
as much memory as you like and everything's fine until you try to
actually use it. Then you go to dirty a page, no free pages are
available, and *boom*.
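As a toy illustration of the analogy (mine; the 64GB figure is
arbitrary, and actually running this on a real box may well wake the
OOM killer):

/* With VM overcommit, malloc() hands out address space optimistically;
 * the failure comes later, when the pages are first dirtied and no
 * physical memory is left to back them. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    size_t sz = (size_t) 64 * 1024 * 1024 * 1024;   /* a 64GB "promise" */

    char *p = malloc(sz);
    if (p == NULL) {            /* with overcommit on, this rarely fails */
        perror("malloc");
        return 1;
    }

    memset(p, 0xff, sz);        /* touching the pages is where the *boom* is */

    free(p);
    return 0;
}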

The thing is, the SAN (or LVM) doesn't have any idea about the FS's
internal in-memory free space counter and its space reservations. Nor
does it understand any FS metadata. All it cares about is "has this
LBA ever been written to by the FS?". If so, it must make sure backing
storage for it exists. If not, it won't bother.

Most FSes only touch the blocks on dirty writeback, or sometimes
lazily as part of delayed allocation. So if your SAN is running out of
space and there's 100MB free, each of your 100 FSes may have
decremented its freelist by 2MB and be happily promising more space to
apps on write() because, well, as far as they know they're only 50%
full. When they all do dirty writeback and flush to storage, kaboom,
there's nowhere to put some of the data.

I don't know if posix_fallocate is a sufficient safeguard either.
You'd have to actually force writes to each page through to the
backing storage to know for sure the space existed. Yes, the docs say

"
       After a
       successful call to posix_fallocate(), subsequent writes to bytes in
       the specified range are guaranteed not to fail because of lack of
       disk space.
"

... but they're speaking from the filesystem's perspective. If the FS
doesn't dirty and flush the actual blocks, a thin provisioned storage
system won't know.
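If you really wanted a belt-and-braces guarantee, it would have to look
something like the sketch below - a hypothetical force_allocate()
helper of mine, not anything PostgreSQL does - which reserves the range
with posix_fallocate() and then dirties and flushes every block so the
thin layer is forced to back it:

/* Reserve space at the FS level, then write and flush every block so a
 * thin-provisioned backing store has to actually commit storage for it. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int force_allocate(int fd, off_t len)
{
    /* Step 1: FS-level reservation. posix_fallocate() returns an error
     * number directly rather than setting errno. */
    int rc = posix_fallocate(fd, 0, len);
    if (rc != 0) {
        fprintf(stderr, "posix_fallocate: %s\n", strerror(rc));
        return -1;
    }

    /* Step 2: dirty each block so the SAN/LVM thin layer has to map real
     * storage behind it, then flush and check for deferred errors. */
    char block[4096] = {0};
    for (off_t off = 0; off < len; off += (off_t) sizeof(block)) {
        if (pwrite(fd, block, sizeof(block), off) != (ssize_t) sizeof(block)) {
            perror("pwrite");
            return -1;
        }
    }

    if (fsync(fd) != 0) {
        perror("fsync");
        return -1;
    }
    return 0;
}

Of course, eagerly writing every block largely defeats the point of thin
provisioning in the first place, and some arrays reportedly detect
all-zero writes, so even this might not be airtight.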

It's reasonable enough to throw up our hands in this case and say
"your setup is crazy, you're breaking the rules, don't do that". The
truth is they AREN'T breaking the rules, but we can disclaim support
for such configurations anyway.

After all, we tell people not to use Linux's VM overcommit too. How's
that working for you? I see it enabled on the great majority of
systems I work with, and some people are very reluctant to turn it off
because they don't want to have to add swap.

If someone has a 50TB SAN and wants to allow for unpredictable space
use expansion between various volumes, and we say "you can't do that,
go buy a 100TB SAN instead" ... that's not going to go down too well
either. Often we can actually say "make sure the 5TB volume PostgreSQL
is using is eagerly provisioned, and expand it as needed using online
resize. We don't care about the rest of the SAN."

I guarantee you that when you create a 100GB EBS volume on AWS EC2,
you don't get 100GB of storage preallocated. AWS are probably pretty
good about not running out of backing store, though.


There _are_ file systems optimised for thin provisioning, too. But
that's more commonly done by having them zero deallocated space so the
thin provisioning system knows it can return it to the free pool, and
these days things like DISCARD provide much of that signalling in a
standard way.
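For example (my illustration, not a recommendation): on Linux you can
punch a hole in a file with fallocate(), and on filesystems and block
stacks that support it the deallocation can be passed down as a discard
that a thin-provisioning layer can use to reclaim the chunk:

/* Deallocate a file range while keeping the file's apparent size.
 * Whether this actually reaches the thin pool as a discard depends on
 * the filesystem, mount options and block stack. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/falloc.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("datafile", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    /* Punch out the first 1MB of the file. */
    if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                  0, 1 << 20) != 0)
        perror("fallocate(PUNCH_HOLE)");   /* EOPNOTSUPP if unsupported */

    close(fd);
    return 0;
}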



-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

