Re: Postgres, fsync, and OSs (specifically linux) - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: Postgres, fsync, and OSs (specifically linux)
Msg-id CAEepm=2WSPP03-20XHpxohSd2UyG_dvw5zWS1v7Eas8Rd=5e4A@mail.gmail.com
In response to Re: Postgres, fsync, and OSs (specifically linux)  (Thomas Munro <thomas.munro@enterprisedb.com>)
Responses Re: Postgres, fsync, and OSs (specifically linux)
List pgsql-hackers
On Sun, Jul 29, 2018 at 6:14 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> As a way of poking this thread, here are some more thoughts.

I am keen to move this forward, not only because it is something we
need to get fixed, but also because I have some other pending patches
in this area and I want this sorted out first.

Here are some small fix-up patches for Andres's patchset:

1.  Use FD_CLOEXEC instead of the non-portable Linuxism SOCK_CLOEXEC.

2.  Fix the self-deadlock hazard reported by Dmitry Dolgov.  Instead
of the checkpointer trying to send itself a CKPT_REQUEST_SYN message
through the socket (whose buffer may be full), I included the
ckpt_started counter in all messages.  When AbsorbAllFsyncRequests()
drains the socket, it stops at messages with the current ckpt_started
value.

3.  Handle postmaster death while waiting.

4.  I discovered that macOS would occasionally return EMSGSIZE for
sendmsg(), but treating that just like EAGAIN seems to work the next
time around.  I couldn't make that happen on FreeBSD (I mention that
because the implementation is somehow related).  So handle that weird
case on macOS only for now.
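
To make items 1 and 4 concrete, here is a minimal sketch, not the code
from the patches.  It assumes a Unix-domain SOCK_STREAM socketpair; the
helper names set_cloexec(), make_socketpair() and
send_error_is_retryable() are made up for illustration.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: mark fd close-on-exec using plain POSIX fcntl(). */
static int
set_cloexec(int fd)
{
    int         flags = fcntl(fd, F_GETFD);

    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFD, flags | FD_CLOEXEC);
}

/*
 * Hypothetical helper: create the socket pair without passing the
 * Linux-specific SOCK_CLOEXEC flag, then set FD_CLOEXEC on both ends.
 */
static int
make_socketpair(int fds[2])
{
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) < 0)
        return -1;
    if (set_cloexec(fds[0]) < 0 || set_cloexec(fds[1]) < 0)
    {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    return 0;
}

/*
 * Hypothetical helper: should a failed sendmsg() just be retried later?
 * EMSGSIZE is treated like EAGAIN on macOS only, as described above.
 */
static bool
send_error_is_retryable(int err)
{
    if (err == EAGAIN || err == EWOULDBLOCK)
        return true;
#ifdef __APPLE__
    if (err == EMSGSIZE)
        return true;
#endif
    return false;
}
```

Unlike SOCK_CLOEXEC, the separate fcntl() call is not atomic with
socket creation, so a sketch like this is only reasonable if nothing
can fork and exec between the two calls.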

Testing on other Unixoid systems would be useful.  The case that
produced occasional EMSGSIZE on macOS was: shared_buffers=1MB,
max_files_per_process=32, installcheck-parallel.  According to the man
pages, EMSGSIZE implies an error in the client code, but I don't see one.

(I also tried to use SOCK_SEQPACKET instead of SOCK_STREAM, but it's
not supported on macOS.  I also tried to use SOCK_DGRAM, but that
produced occasional ENOBUFS errors and retrying didn't immediately
succeed, leading to busy syscall churn.  This is all rather
unsatisfying, since SOCK_STREAM is not guaranteed by any standard to
be atomic, and we're writing messages from many backends into the
socket so we're assuming atomicity.  I don't have a better idea that
is portable.)

There are a couple of FIXMEs remaining, and I am aware of three more problems:

* Andres mentioned to me off-list that there may be a deadlock risk
where the checkpointer gets stuck waiting for an IO lock.  I'm going
to look into that.
* Windows.  Patch soon.
* The ordering problem that I mentioned earlier: the patchset wants to
keep the *oldest* fd, but what it actually keeps is the oldest one it
has received, which is not necessarily the same thing.  An
idea Andres and I discussed is to use a shared atomic counter to
assign a number to all file descriptors just before their first write,
and send that along with it to the checkpointer.  Patch soon.
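
A rough sketch of the counter idea from the last item, using C11
atomics for brevity.  Real code would presumably keep the counter in
shared memory and use PostgreSQL's own pg_atomic_* API; FileState,
assign_open_seq() and older_of() are hypothetical names, not anything
in the patchset.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Assumption: stands in for a counter living in shared memory. */
static _Atomic uint64_t next_open_seq = 1;

/* Hypothetical per-descriptor state kept by whoever opened the file. */
typedef struct FileState
{
    int         fd;
    uint64_t    open_seq;       /* 0 means "not written to yet" */
} FileState;

/*
 * Call just before the first write through this descriptor.  The number
 * is assigned once and would be sent to the checkpointer with the fd.
 */
static uint64_t
assign_open_seq(FileState *f)
{
    if (f->open_seq == 0)
        f->open_seq = atomic_fetch_add(&next_open_seq, 1);
    return f->open_seq;
}

/*
 * Checkpointer side: of two descriptors for the same file, keep the one
 * with the smaller sequence number, regardless of arrival order.
 */
static const FileState *
older_of(const FileState *a, const FileState *b)
{
    return (a->open_seq <= b->open_seq) ? a : b;
}
```

With this, a descriptor that was first written earlier wins even if
its message reaches the checkpointer later than a newer one.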

-- 
Thomas Munro
http://www.enterprisedb.com
