Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS - Mailing list pgsql-hackers
From | Craig Ringer
Subject | Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Msg-id | CAMsr+YFRh38B7XdE7BKheVDY1eBWtmb8e+NfwRO8kTou4H9WKw@mail.gmail.com
In response to | Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS (Bruce Momjian <bruce@momjian.us>)
Responses | Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
 | Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
List | pgsql-hackers
On 18 April 2018 at 19:46, Bruce Momjian <bruce@momjian.us> wrote:

> So, if sync mode passes the write to NFS, and NFS pre-reserves write
> space, and throws an error on reservation failure, that means that NFS
> will not corrupt a cluster on out-of-space errors.

Yeah. I need to verify that in a concrete test case.

The thing is that write() is allowed to be asynchronous anyway. Most file systems choose to implement eager reservation of space, but it's not mandated. AFAICS that's largely a historical accident to keep applications happy: file systems used to *allocate* the space at write() time too, and when they moved to delayed allocation, apps tended to break too easily unless space was at least reserved. NFS would have to do a round trip on every write() to reserve space.

The Linux man pages (http://man7.org/linux/man-pages/man2/write.2.html) say:

"
A successful return from write() does not make any guarantee that data has been committed to disk. On some filesystems, including NFS, it does not even guarantee that space has successfully been reserved for the data. In this case, some errors might be delayed until a future write(2), fsync(2), or even close(2). The only way to be sure is to call fsync(2) after you are done writing all your data.
"

... and I'm inclined to believe it when it refuses to make guarantees. Especially lately.

> So, what about thin provisioning? I can understand sharing _free_ space
> among file systems

Most thin provisioning is done at the block level, not the file system level, so the FS is usually unaware that it's on a thin-provisioned volume. Usually the whole kernel is unaware, because the thin provisioning is done on the SAN end or by a hypervisor; the same sort of thing can also be done with LVM - see lvmthin.

For example, you might create 100 different 1TB ext4 FSes, each on a 1TB iSCSI volume, all backed by a SAN with a total of 50TB of actual physical capacity. The SAN does the block mapping and only allocates a new storage chunk to a given volume once the FS has written to every free block in the previously allocated chunk. It may also do things like block de-duplication, compression of storage chunks that haven't been written to in a while, etc.

The idea is that when the SAN's actual physically allocated storage reaches 40TB, it starts telling you to go buy another rack of storage so you don't run out. You don't have to resize volumes, resize file systems, etc. All the storage space administration is centralized with the SAN and the storage team, and your sysadmins, DBAs and app devs are none the wiser. You buy storage when you need it, not when the DBA demands a 200% free space margin just in case.

Whether or not you agree with this philosophy or think it's sensible is kind of moot, because it's an extremely widespread model, and servers you work on may well be backed by thin-provisioned storage _even if you don't know it_.

Think of it as a bit like VM overcommit, but for storage. You can malloc() as much memory as you like and everything's fine until you try to actually use it. Then you go to dirty a page, no free pages are available, and *boom*.

The thing is, the SAN (or LVM) doesn't have any idea about the FS's internal in-memory free space counter and its space reservations, nor does it understand any FS metadata. All it cares about is "has this LBA ever been written to by the FS?" If so, it must make sure backing storage for it exists; if not, it won't bother. Most FSes only touch the blocks on dirty writeback, or sometimes lazily as part of delayed allocation.
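To make the man page's warning above concrete, here's a minimal sketch (illustrative only; the path and buffer size are invented): check write(), then fsync(), then close(), because the error can surface at any of those steps.

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    static const char buf[8192];        /* pretend this is real data */
    const char *p = buf;
    ssize_t remaining = sizeof(buf);

    int fd = open("/tmp/durability-demo", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    while (remaining > 0)
    {
        ssize_t n = write(fd, p, remaining);
        if (n < 0)
        {
            if (errno == EINTR)
                continue;
            perror("write");            /* may or may not fire on ENOSPC */
            return 1;
        }
        p += n;
        remaining -= n;
    }

    /* The error can be deferred until here, or even until close(). */
    if (fsync(fd) != 0) { perror("fsync"); return 1; }
    if (close(fd) != 0) { perror("close"); return 1; }

    return 0;
}

Even this only tells you what the kernel and filesystem will admit to; it says nothing about what a thin-provisioned layer further down will do, which is the next problem.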
Back to the thin-provisioning case: if your SAN is running out of space and there's 100MB free, each of your 100 FSes may have decremented its freelist by 2MB and be happily promising more space to apps on write() because, as far as they know, they're only 50% full. When they all do dirty writeback and flush to storage - kaboom, there's nowhere to put some of the data.

I don't know if posix_fallocate is a sufficient safeguard either. You'd have to actually force writes to each page through to the backing storage to know for sure the space exists. Yes, the docs say:

"
After a successful call to posix_fallocate(), subsequent writes to bytes in the specified range are guaranteed not to fail because of lack of disk space.
"

... but they're speaking from the filesystem's perspective. If the FS doesn't dirty and flush the actual blocks, a thin-provisioned storage system underneath won't know.

It's reasonable enough to throw up our hands in this case and say "your setup is crazy, you're breaking the rules, don't do that". The truth is they AREN'T breaking the rules, but we can disclaim support for such configurations anyway. After all, we tell people not to use Linux's VM overcommit too. How's that working out? I see it enabled on the great majority of systems I work with, and some people are very reluctant to turn it off because they don't want to have to add swap.

If someone has a 50TB SAN and wants to allow for unpredictable space-use growth across various volumes, and we say "you can't do that, go buy a 100TB SAN instead" ... that's not going to go down too well either. Often we can actually say "make sure the 5TB volume PostgreSQL is using is eagerly provisioned, and expand it as needed using online resize; we don't care about the rest of the SAN".

I guarantee you that when you create a 100GB EBS volume on AWS EC2, you don't get 100GB of storage preallocated. AWS is probably pretty good about not running out of backing store, though.

There _are_ file systems optimised for thin provisioning too, but that's more commonly done by having them zero deallocated space so the thin-provisioning layer knows it can return it to the free pool, and these days things like DISCARD provide much of that signalling in a standard way.
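For illustration, here's what "force writes to each page through to the backing storage" would actually look like (a sketch only; the path and sizes are invented). posix_fallocate() satisfies the filesystem, but only dirtying and flushing every block forces a thin-provisioned layer underneath to map real space - which of course throws away the cheapness that makes posix_fallocate attractive in the first place.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define FILE_SIZE  (64 * 1024 * 1024)   /* 64MB, just for the example */
#define BLOCK_SIZE (8 * 1024)

int main(void)
{
    int fd = open("/tmp/fallocate-demo", O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    /* Filesystem-level reservation: guarantees no ENOSPC from the FS itself. */
    int rc = posix_fallocate(fd, 0, FILE_SIZE);
    if (rc != 0) { fprintf(stderr, "posix_fallocate: %s\n", strerror(rc)); return 1; }

    /* But only actually writing the blocks forces a thin-provisioned volume
     * underneath to allocate real backing storage for them. */
    char block[BLOCK_SIZE] = {0};
    for (off_t off = 0; off < FILE_SIZE; off += BLOCK_SIZE)
    {
        if (pwrite(fd, block, BLOCK_SIZE, off) != (ssize_t) BLOCK_SIZE)
        {
            perror("pwrite");
            return 1;
        }
    }

    if (fsync(fd) != 0) { perror("fsync"); return 1; }   /* the moment of truth */
    if (close(fd) != 0) { perror("close"); return 1; }
    return 0;
}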
--
Craig Ringer   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services