Thread: Multithreaded SIGPIPE race in libpq on Solaris

Multithreaded SIGPIPE race in libpq on Solaris

From: Thomas Munro
Hello,

A while back someone showed me a program blocking in libpq 9.2 on
Solaris 11, inside sigwait called by pq_reset_sigpipe.  It had
happened a couple of times before during a period of
instability/crashing with a particular DB (a commercial PostgreSQL
derivative, but the client was using regular libpq).  This was a very
busy multithreaded client where each thread had its own connection.
My theory is that if two connections accessed by different threads get
shut down around the same time, there is a race scenario where each of
them fails to write to its socket, sees errno == EPIPE and then sees a
pending SIGPIPE with sigpending(), but only one thread returns from
sigwait() due to signal merging.

We never saw the problem again after we made the following change:

--- a/src/interfaces/libpq/fe-secure.c
+++ b/src/interfaces/libpq/fe-secure.c
@@ -450,7 +450,6 @@ void
 pq_reset_sigpipe(sigset_t *osigset, bool sigpipe_pending, bool got_epipe)
 {
 	int			save_errno = SOCK_ERRNO;
-	int			signo;
 	sigset_t	sigset;
 
 	/* Clear SIGPIPE only if none was pending */
@@ -460,11 +459,13 @@ pq_reset_sigpipe(sigset_t *osigset, bool sigpipe_pending, bool got_epipe)
 			sigismember(&sigset, SIGPIPE))
 		{
 			sigset_t	sigpipe_sigset;
+			siginfo_t	siginfo;
+			struct timespec timeout = {0, 0};
 
 			sigemptyset(&sigpipe_sigset);
 			sigaddset(&sigpipe_sigset, SIGPIPE);
 
 			/* Discard pending SIGPIPE */
-			sigwait(&sigpipe_sigset, &signo);
+			sigtimedwait(&sigpipe_sigset, &siginfo, &timeout);
 		}
 	}

Does this make any sense?

Best regards,
Thomas Munro



Re: Multithreaded SIGPIPE race in libpq on Solaris

From: Tom Lane
Thomas Munro <munro@ip9.org> writes:
> My theory is that if two connections accessed by different threads get
> shut down around the same time, there is a race scenario where each of
> them fails to write to its socket, sees errno == EPIPE and then sees a
> pending SIGPIPE with sigpending(), but only one thread returns from
> sigwait() due to signal merging.

Hm, that does sound like it could be a problem, if the platform fails
to track pending SIGPIPE on a per-thread basis.

> We never saw the problem again after we made the following change:
> ...
> Does this make any sense?

I don't think that patch would fix the problem if it's real.  It would
prevent libpq from hanging up when it's trying to throw away a pending
SIGPIPE, but the fundamental issue is that that action could cause a
SIGPIPE that's meant for some other thread to get lost; and that other
thread isn't necessarily doing anything with libpq.

I don't claim to be an expert on this stuff, but I had the idea that
multithreaded environments were supposed to track signal state per-thread
not just per-process, precisely because of issues like this.
        regards, tom lane



Re: Multithreaded SIGPIPE race in libpq on Solaris

From: Thomas Munro
On 28 August 2014 23:45, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I don't claim to be an expert on this stuff, but I had the idea that
> multithreaded environments were supposed to track signal state per-thread
> not just per-process, precisely because of issues like this.

After some googling, I found reply #3 in
https://community.oracle.com/thread/1950900?start=0&tstart=0 and
various other sources which say that in Solaris version 10 they
changed SIGPIPE delivery from per process (as specified by UNIX98) to
per thread (as specified by POSIX:2001).  But we are on version 11, so
my theory doesn't look great.  (Though 9 is probably still in use out
there somewhere...)

I also found this article:
http://krokisplace.blogspot.co.uk/2010/02/suppressing-sigpipe-in-library.html

The author recommends an approach nearly identical to the PostgreSQL
approach, except s/he says:  "to do this we use sigtimedwait() with
zero timeout; this is to avoid blocking in a scenario where malicious
user sent SIGPIPE manually to a whole process: in this case we will
see it pending, but other thread may handle it before we had a
[chance] to wait for it".

Maybe we have malicious users sending signals to processes.  It seems
more likely, though, that the crashing database triggered this somehow,
perhaps in combination with something else the client app was doing,
although I can't think what could eat another thread's SIGPIPE in
between the sigpending and sigwait syscalls.

Best regards,
Thomas Munro



Re: Multithreaded SIGPIPE race in libpq on Solaris

From: Thomas Munro
On 29 August 2014 01:04, Thomas Munro <munro@ip9.org> wrote:
> On 28 August 2014 23:45, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I don't claim to be an expert on this stuff, but I had the idea that
>> multithreaded environments were supposed to track signal state per-thread
>> not just per-process, precisely because of issues like this.
>
> After some googling, I found reply #3 in
> https://community.oracle.com/thread/1950900?start=0&tstart=0 and
> various other sources which say that in Solaris version 10 they
> changed SIGPIPE delivery from per process (as specified by UNIX98) to
> per thread (as specified by POSIX:2001).  But we are on version 11, so
> my theory doesn't look great.  (Though 9 is probably still in use out
> there somewhere...)

I discovered that the machine we saw the problem on was running
Solaris 9 at the time, but has been upgraded since.  So I think my
sigwait race theory may have been correct, but we can just put this
down to a historical quirk and forget about it, because Solaris 9 is
basically deceased ("extended support" ends in October 2014).  Sorry
for the noise.

Best regards,
Thomas Munro