Re: Postgres, fsync, and OSs (specifically linux) - Mailing list pgsql-hackers
From: Thomas Munro
Subject: Re: Postgres, fsync, and OSs (specifically linux)
Date:
Msg-id: CAEepm=04ZCG_8N3m61kXZP-7Ecr02HUNNG-QsAhwyFLim4su2g@mail.gmail.com
In response to: Re: Postgres, fsync, and OSs (specifically linux) (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers
On Wed, May 23, 2018 at 8:02 AM, Andres Freund <andres@anarazel.de> wrote:
> [patches]

Hi Andres,

Obviously there is more work to be done here, but the basic idea in your clone-fd-checkpointer branch as of today seems OK to me. I think Craig and I both had similar ideas (sending file descriptors that have an old enough f_wb_err), but we thought rlimit management was going to be hard (shared memory counters + deduplication, bleugh). You made it look easy. Nice work.

First, let me describe in my own words what's going on, mostly to make sure I really understand this:

1. We adopt a new fd lifecycle that is carefully tailored to avoid error loss on Linux, and can't hurt on other OSes. By sending the file descriptor used for every write to the checkpointer, we guarantee that (1) the inode stays pinned (the file is "in flight" in the socket, even if the sender closes it before the checkpointer receives it), so Linux won't be tempted to throw away the precious information in i_mapping->wb_err, and (2) the checkpointer finishes up with a file descriptor that points to the very same "struct file" with the f_wb_err value that was originally sampled before the write, by the sender. So we can't miss any write-back errors. Wahey! However, errors are still reported only once, so we probably need to panic.

Hmm... Is there any way that the *sender* could finish up in file_check_and_advance_wb_err() for the same struct file, before the checkpointer manages to call fsync() on its dup'd fd? I don't immediately see how (it looks like you have to call one of the various sync functions to reach that, and your patch removes the fallback just-call-FileSync()-myself code from register_dirty_segment()).
I guess it would be bad if, say, close() were to do that in some future Linux release, because then we'd have no race-free way to tell the checkpointer that the file is borked before it runs fsync() and potentially gets an OK response and reports a successful checkpoint (even if we panicked, with sufficiently bad timing it might manage to report a successful checkpoint).

2. In order to make sure that we don't exceed our soft limit on the number of file descriptors per process, you use a simple backend-local counter in the checkpointer, on the theory that we don't care about fds (or, technically, the files they point to) waiting in the socket buffer; we care only about how many the checkpointer has actually received but not yet closed. As long as we make sure there is space for at least one more before we read one message, everything should be fine. Good and simple.

One reason I thought this would be harder is that I had no model of how RLIMIT_NOFILE would apply when you start flinging fds around between processes (considering there can be moments when neither end has the file open), so I feared the worst and thought we would need to maintain a counter in shared memory and have senders and receiver cooperate to respect it. My intuition that this was murky and required pessimism wasn't too far from the truth, actually: apparently the kernel didn't do a great job of accounting for that until a patch[1] landed for CVE-2013-4312. The behaviour in older releases is super lax, so no problem there. The behaviour from 4.9 onward (or is it 4.4.1?) is that you get a separate per-user RLIMIT_NOFILE allowance for in-flight fds. So effectively the sender doesn't have to worry about fds it has sent but closed, and the receiver doesn't have to care about fds it hasn't received yet, so your counting scheme seems OK.
As for exceeding RLIMIT_NOFILE with in-flight fds, it's at least bounded by the fact that the socket would block/EWOULDBLOCK if the receiver isn't draining it fast enough, and can only hold a small and finite amount of data and thus file descriptors, so we can probably just ignore that. If you did manage to exceed it, I think you'd find out about that with ETOOMANYREFS at send time (curiously absent from the sendmsg() man page, but there in black and white in the source for unix_attach_fds()), and then you'd just raise an error (currently FATAL in your patch). I have no idea how the rlimit for SCM-ified files works on other Unixoid systems, though.

Some actual code feedback:

+            if (entry->syncfds[forknum][segno] == -1)
+            {
+                char *path = mdpath(entry->rnode, forknum, segno);
+                open_fsync_queue_files++;
+                /* caller must have reserved entry */
+                entry->syncfds[forknum][segno] =
+                    FileOpenForFd(fd, path);
+                pfree(path);
+            }
+            else
+            {
+                /*
+                 * File is already open. Have to keep the older fd, errors
+                 * might only be reported to it, thus close the one we just
+                 * got.
+                 *
+                 * XXX: check for errrors.
+                 */
+                close(fd);
+            }

Wait... it's not guaranteed to be older in open() time, is it? It's just older in sendmsg() time. Those might be different:

A: open("base/42/1234")
A: write()
kernel: inode->i_mapping->wb_err++
B: open("base/42/1234")
B: write()
B: sendmsg()
A: sendmsg()
C: recvmsg() /* clone B's fd */
C: recvmsg() /* clone A's fd, throw it away because we already have B's */
C: fsync()

I think that would eat an error under the 4.13 code. I think it might be OK under the new 4.17 code, because the new "must report it to at least one caller" thing would save the day. So now I'm wondering if that's getting backported to the 4.14 long term kernel. Aha, yes it has been already[2]. So...
if you don't have to worry about kernels without that patch, I suspect the exact ordering doesn't matter anymore, as long as *someone* held the file open at all times between write() and fsync() to keep the inode around, which this patch achieves. (And of course no other process sets the ERRSEQ_SEEN flag, for example some other kind of backend or random other program that opened our data file and called one of the sync syscalls, or any future syscalls that start calling file_check_and_advance_wb_err().)

+    /*
+     * FIXME: do DuplicateHandle dance for windows - can that work
+     * trivially?
+     */

I don't know, I guess something like CreatePipe() and then write_duplicate_handle()? And some magic (maybe our own LWLock) to allow atomic messages? A more interesting question is: how will you cap the number of file handles you send through that pipe? On that OS you call DuplicateHandle() to fling handles into another process's handle table directly. Then you send the handle number as plain old data to the other process via carrier pigeon, smoke signals, a pipe etc. That's interesting because the handle allocation is asynchronous from the point of view of the receiver. Unlike the Unix case, where the receiver can count handles and make sure there is space for one more before it reads a potentially-SCM-containing message, here the *senders* will somehow need to make sure they don't create too many in the receiving process. I guess that would involve a shared counter, and a strategy for what to do when the number is too high (probably just wait). Hmm. I wonder if that would be a safer approach on all operating systems.

+    if (socketpair(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0, fsync_fds) < 0)
...
+    size = sendmsg(sock, &msg, 0);

Here you are relying on atomic sending and receiving of whole messages over a stream socket.
For example, in Linux's unix_stream_sendmsg() it's going to be chopped into buffers of size (sk->sk_sndbuf >> 1) - 64 (I have no clue how big that is) that are appended one at a time to the receiving socket's queue, with no locking in between, so concurrently sent messages can be interleaved with yours. How big is big enough for that to happen? There doesn't seem to be an equivalent of PIPE_BUF for Unix domain sockets. These tiny messages are almost certainly safe, but I wonder if we should be using SOCK_SEQPACKET instead of SOCK_STREAM? Might be a less theoretical problem if we switch to variable-sized messages containing file paths, as you once contemplated in an off-list chat.

Presumably for EXEC_BACKEND we'll need to open PGDATA/something/something/socket or similar.

+        if (returnCode < 0)
+        {
+            /* XXX: decide on policy */
+            bms_add_member(entry->requests[forknum], segno);

Obviously this is pseudocode (it doesn't even keep the return value), but just BTW, I think that if we decide not to PANIC unconditionally on any kind of fsync() failure, we definitely can't use bms_add_member() here (it might fail to allocate, and then we'd forget the segment, raise an error and never try again). It's got to be PANIC or no-fail code (like the patch I proposed in another thread).

+SendFsyncRequest(CheckpointerRequest *request, int fd)
...
+     * Don't think short reads will ever happen in realistic
...
+        ereport(FATAL, (errmsg("could not receive fsync request: %m")));

Short *writes*, could not *send*.

+     * back, as that'd lead to loosing error reporting guarantees on

s/loosing/losing/

[1] https://github.com/torvalds/linux/commit/712f4aad406bb1ed67f3f98d04c044191f0ff593
[2] https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/lib/errseq.c?h=linux-4.14.y&id=0799a0ea96e4923f52f85fe315b62e9176a3319c

--
Thomas Munro
http://www.enterprisedb.com