Thread: pg_listener entries deleted under heavy NOTIFY load only on Windows

pg_listener entries deleted under heavy NOTIFY load only on Windows

From
Radu Ilie
Date:
On a Windows server under heavy load of NOTIFY events, entries in
pg_listener table for some events are deleted. It is like UNLISTEN was
called.

PostgreSQL version: 8.3.9.
Operating System: Windows XP.

PostgreSQL believes that if it fails to notify a listener (by signaling the
respective backend), then the backend doesn't exist anymore and so it should
get rid of the pg_listener entry. The relevant code is in
src/backend/commands/async.c, function Send_Notify:

if (kill(listenerPID, SIGUSR2) < 0)
{
	/*
	 * Get rid of pg_listener entry if it refers to a PID that no
	 * longer exists.  Presumably, that backend crashed without
	 * deleting its pg_listener entries. This code used to only
	 * delete the entry if errno==ESRCH, but as far as I can see
	 * we should just do it for any failure (certainly at least
	 * for EPERM too...)
	 */
	simple_heap_delete(lRel, &lTuple->t_self);
}

The problem is that under Windows, kill can fail even if the process is
still alive. PostgreSQL uses named pipes under Windows to send signals to
backends. The present implementation has a bug that causes a client to fail
to write data to the named pipe, even though the server process is alive.
This is because the server doesn't maintain the named pipe at all times. The
bug is present in file src/backend/port/win32/signal.c, function
pg_signal_thread.

The server code stays in a loop in which it continuously creates an instance
of the named pipe (via CreateNamedPipe) and waits for a client process to
connect (via ConnectNamedPipe). Once a client connects, the communication
with the client is handled in a new thread, with the thread procedure
pg_signal_dispatch_thread. This function is very simple: it reads one byte
from the named pipe, it writes it back and (very important) closes the
handle to the named pipe instance. The main loop creates another instance of
the named pipe and waits for another client to connect.

Now imagine that the server is under heavy load. There are dozens of
backends and threads running and the CPU usage is close to 100%. The
following succession of events is possible:

1. Server signal handling thread (in function pg_signal_thread) creates the
first, one and only instance of the named pipe via CreateNamedPipe.
2. Server code starts waiting for clients to connect with ConnectNamedPipe.
3. Client wishes to make a transaction on the named pipe and calls
CallNamedPipe (in file src/port/kill.c, function pgkill).
4. Server code returns from ConnectNamedPipe. It creates a new thread with
the thread procedure pg_signal_dispatch_thread.
5. The signal dispatch thread is scheduled for execution and it runs to
completion. As you can see, the last thing it does related to the named pipe
is to close the handle via CloseHandle (in function
pg_signal_dispatch_thread). This closes the last instance of the named pipe.
The named pipe is gone. There is no more named pipe. The signal handling
thread was not yet scheduled by the operating system for execution and thus
didn't have an opportunity to call CreateNamedPipe.
6. Another client (or the same one, it doesn't matter) tries to write to the
named pipe via CallNamedPipe. The call returns ERROR_FILE_NOT_FOUND, because
the named pipe is gone. The client believes the backend is gone and it
removes the entry from pg_listener.
7. The signal handling thread (in function pg_signal_thread) is finally
scheduled for execution and it calls CreateNamedPipe. We now have an
instance of the named pipe available.

So we end up with the server backend alive, the named pipe is there, but the
row is gone from pg_listener. This is easy to reproduce under Windows. I
used the scripts posted by Steve Marshall in a similar thread from
01/15/2009 and the problem appears within one minute all the time. For
testing I used a Windows XP machine with 2 cores and 2GB of RAM. The CPU
usage was over 70% during the trials.

The solution is to create a new instance of the named pipe before launching
the signal dispatch thread. This means changing the code in
src/backend/port/win32/signal.c to look like this:

@@ -250,6 +250,7 @@
 {
 	char		pipename[128];
 	HANDLE		pipe = pgwin32_initial_signal_pipe;
+	HANDLE		new_pipe = pgwin32_initial_signal_pipe;

 	snprintf(pipename, sizeof(pipename), "\\\\.\\pipe\\pgsignal_%u", GetCurrentProcessId());

@@ -275,6 +276,10 @@
 		fConnected = ConnectNamedPipe(pipe, NULL) ? TRUE : (GetLastError() == ERROR_PIPE_CONNECTED);
 		if (fConnected)
 		{
+			new_pipe = CreateNamedPipe(pipename, PIPE_ACCESS_DUPLEX,
+						   PIPE_TYPE_MESSAGE | PIPE_READMODE_MESSAGE | PIPE_WAIT,
+						   PIPE_UNLIMITED_INSTANCES, 16, 16, 1000, NULL);
+
 			hThread = CreateThread(NULL, 0,
 								   (LPTHREAD_START_ROUTINE) pg_signal_dispatch_thread,
 								   (LPVOID) pipe, 0, NULL);
@@ -288,8 +293,7 @@
 			/* Connection failed. Cleanup and try again */
 			CloseHandle(pipe);

-		/* Set up so we create a new pipe on next loop */
-		pipe = INVALID_HANDLE_VALUE;
+		pipe = new_pipe;
 	}
 	return 0;
 }

This will guarantee that we have an instance of the named pipe available at
any given moment. If we do this, we can also remove the 3 tries loop from
src/port/kill.c:

@@ -25,7 +25,6 @@
 	BYTE		sigData = sig;
 	BYTE		sigRet = 0;
 	DWORD		bytes;
-	int			pipe_tries;

 	/* we allow signal 0 here, but it will be ignored in pg_queue_signal */
 	if (sig >= PG_SIGNAL_COUNT || sig < 0)
@@ -41,14 +40,6 @@
 	}
 	snprintf(pipename, sizeof(pipename), "\\\\.\\pipe\\pgsignal_%u", pid);

-	/*
-	 * Writing data to the named pipe can fail for transient reasons.
-	 * Therefore, it is useful to retry if it fails.  The maximum number of
-	 * calls to make was empirically determined from a 90-hour notification
-	 * stress test.
-	 */
-	for (pipe_tries = 0; pipe_tries < 3; pipe_tries++)
-	{
 		if (CallNamedPipe(pipename, &sigData, 1, &sigRet, 1, &bytes, 1000))
 		{
 			if (bytes != 1 || sigRet != sig)
@@ -58,7 +49,6 @@
 			}
 			return 0;
 		}
-	}

 	if (GetLastError() == ERROR_FILE_NOT_FOUND)
 		errno = ESRCH;

As a note, the original code has a timeout of 1000 milliseconds specified in
the call to CallNamedPipe in kill.c. That timeout doesn't help for this bug,
because it is the timeout that the client will wait for an instance of the
named pipe to be available for communication. This means it helps in the
case when we have one client already using that instance to read/write to
the server and this second client will wait until the first one finishes. It
does not help if there is no instance at all. In that case the CallNamedPipe
returns immediately with ERROR_FILE_NOT_FOUND.

Thanks,
Radu
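[Editor's note: the seven-step race above can be sketched with a small
simulation. Python is used here for portability; the real code is C against
the Win32 named-pipe API, and every class and function name below is purely
illustrative, not from the PostgreSQL sources.]

```python
# Hypothetical model of the race: the server keeps a count of live
# named-pipe instances; the dispatch thread closes the only one before
# the server loop has re-created a replacement.
import threading

class PipeServer:
    """Models pg_signal_thread: instances = live named-pipe instances."""
    def __init__(self):
        self.instances = 0
        self.lock = threading.Lock()

    def create_instance(self):        # models CreateNamedPipe
        with self.lock:
            self.instances += 1

    def dispatch(self):               # models pg_signal_dispatch_thread
        # ... read one byte, echo it back ...
        with self.lock:
            self.instances -= 1       # CloseHandle on the last instance

def client_kill(server):              # models pgkill / CallNamedPipe
    with server.lock:
        if server.instances == 0:
            return "ERROR_FILE_NOT_FOUND"   # caller maps this to ESRCH
        return "OK"

server = PipeServer()
server.create_instance()    # step 1: the one and only instance
server.dispatch()           # step 5: dispatch thread closes it
print(client_kill(server))  # step 6: prints ERROR_FILE_NOT_FOUND
server.create_instance()    # step 7: too late, pg_listener row already gone
```

In this model the fix corresponds to calling create_instance() before
dispatch(), so the count can never drop to zero.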

Re: pg_listener entries deleted under heavy NOTIFY load only on Windows

From
Craig Ringer
Date:
On 23/01/2010 5:30 AM, Radu Ilie wrote:
> On a Windows server under heavy load of NOTIFY events, entries in
> pg_listener table for some events are deleted. It is like UNLISTEN was
> called.

This has been reported on -general, then -bugs, now -hackers, and seems 
to just get ignored. The poster has done a fair bit of research and has 
documented the issue, and fix, in a fair bit of depth.

I don't have the access rights to do anything about it. Is someone who 
does going to look at it, or will it remain a lurking bug?

(If this has since made it into cvs without any ML activity, then sorry 
- but I'm worried it'll get missed again and want to make sure it doesn't.)

--
Craig Ringer


Re: pg_listener entries deleted under heavy NOTIFY load only on Windows

From
Stefan Kaltenbrunner
Date:
Craig Ringer wrote:
> On 23/01/2010 5:30 AM, Radu Ilie wrote:
>> On a Windows server under heavy load of NOTIFY events, entries in
>> pg_listener table for some events are deleted. It is like UNLISTEN was
>> called.
> 
> This has been reported on -general, then -bugs, now -hackers, and seems 
> to just get ignored. The poster has done a fair bit of research and has 
> documented the issue, and fix, in a fair bit of depth.
> 
> I don't have the access rights to do anything about it. Is someone who 
> does going to look at it, or will it remain a lurking bug?
> 
> (If this has since made it into cvs without any ML activity, then sorry 
> - but I'm worried it'll get missed again and want to make sure it doesn't.)

Well, it has been on the Open Items list for a while now:

http://wiki.postgresql.org/wiki/PostgreSQL_8.5_Open_Items so it won't be
forgotten...


Stefan


Re: pg_listener entries deleted under heavy NOTIFY load only on Windows

From
Magnus Hagander
Date:
On Monday, January 25, 2010, Stefan Kaltenbrunner
<stefan@kaltenbrunner.cc> wrote:
> Craig Ringer wrote:
>
> On 23/01/2010 5:30 AM, Radu Ilie wrote:
>
> On a Windows server under heavy load of NOTIFY events, entries in
> pg_listener table for some events are deleted. It is like UNLISTEN was
> called.
>
>
> This has been reported on -general, then -bugs, now -hackers, and seems
> to just get ignored. The poster has done a fair bit of research and has
> documented the issue, and fix, in a fair bit of depth.
>
> I don't have the access rights to do anything about it. Is someone who
> does going to look at it, or will it remain a lurking bug?
>
> (If this has since made it into cvs without any ML activity, then sorry
> - but I'm worried it'll get missed again and want to make sure it doesn't.)
>
> well it is on the Open Items list for a while now:
>
> http://wiki.postgresql.org/wiki/PostgreSQL_8.5_Open_Items so it won't be
> forgotten...
>

It's on my list. I've just been unable to reproduce it yet, but I'm on
it and will definitely get to it before the next update release. And
since the OP had a patch that works in his environment and nobody else
has reported it, I didn't give it max priority.

I could've done a better job of feedback. I think I posted the info in
a different thread that only linked to the original bug report, so it
was probably missed because of that.




--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/


Re: pg_listener entries deleted under heavy NOTIFY load only on Windows

From
Magnus Hagander
Date:
On Fri, Jan 22, 2010 at 22:30, Radu Ilie <rilie@wsi.com> wrote:
> On a Windows server under heavy load of NOTIFY events, entries in
> pg_listener table for some events are deleted. It is like UNLISTEN was
> called.
>
> PostgreSQL version: 8.3.9.
> Operating System: Windows XP.
>
> PostgreSQL believes that if it fails to notify a listener (by signaling the
> respective backend), then the backend doesn't exist anymore and so it should
> get rid of the pg_listener entry. The relevant code is in
> src/backend/commands/async.c, function Send_Notify:
>
> if (kill(listenerPID, SIGUSR2) < 0)
> {
> /*
> * Get rid of pg_listener entry if it refers to a PID that no
> * longer exists.  Presumably, that backend crashed without
> * deleting its pg_listener entries. This code used to only
> * delete the entry if errno==ESRCH, but as far as I can see
> * we should just do it for any failure (certainly at least
> * for EPERM too...)
> */
> simple_heap_delete(lRel, &lTuple->t_self);
> }
>
> The problem is that under Windows, kill can fail even if the process is
> still alive. PostgreSQL uses named pipes under Windows to send signals to
> backends. The present implementation has a bug that causes a client to fail
> to write data to the named pipe, even though the server process is alive.
> This is because the server doesn't maintain the named pipe at all times. The
> bug is present in file src/backend/port/win32/signal.c, function
> pg_signal_thread.
>
> The server code stays in a loop in which it continuously creates an instance
> of the named pipe (via CreateNamedPipe) and waits for a client process to
> connect (via ConnectNamedPipe). Once a client connects, the communication
> with the client is handled in a new thread, with the thread procedure
> pg_signal_dispatch_thread. This function is very simple: it reads one byte
> from the named pipe, it writes it back and (very important) closes the
> handle to the named pipe instance. The main loop creates another instance of
> the named pipe and waits for another client to connect.
>
> Now imagine that the server is under heavy load. There are dozens of
> backends and threads running and the CPU usage is close to 100%. The
> following succession of events is possible:
>
> 1. Server signal handling thread (in function pg_signal_thread) creates the
> first, one and only instance of the named pipe via CreateNamedPipe.
> 2. Server code starts waiting for clients to connect with ConnectNamedPipe.
> 3. Client wishes to make a transaction on the named pipe and calls
> CallNamedPipe (in file src/port/kill.c, function pgkill).
> 4. Server code returns from ConnectNamedPipe. It creates a new thread with
> the thread procedure pg_signal_dispatch_thread.
> 5. The signal dispatch thread is scheduled for execution and it runs to
> completion. As you can see, the last thing it does related to the named pipe
> is to close the handle via CloseHandle (in function
> pg_signal_dispatch_thread). This closes the last instance of the named pipe.
> The named pipe is gone. There is no more named pipe. The signal handling
> thread was not yet scheduled by the operating system for execution and thus
> didn't have an opportunity to call CreateNamedPipe.
> 6. Another client (or the same one, it doesn't matter) tries to write to the
> named pipe via CallNamedPipe. The call returns ERROR_FILE_NOT_FOUND, because
> the named pipe is gone. The client believes the backend is gone and it
> removes the entry from pg_listener.
> 7. The signal handling thread (in function pg_signal_thread) is finally
> scheduled for execution and it calls CreateNamedPipe. We now have an
> instance of the named pipe available.
>
> So we end up with the server backend alive, the named pipe is there, but the
> row is gone from pg_listener. This is easy to reproduce under Windows. I
> used the scripts posted by Steve Marshall in a similar thread from
> 01/15/2009 and the problem appears within one minute all the time. For
> testing I used a Windows XP machine with 2 cores and 2GB of RAM. The CPU
> usage was over 70% during the trials.

Thanks for a *very* detailed analysis. I've finally managed to
reproduce this problem, and test your fix.

Interestingly enough, the race condition is copied from Microsoft's
example of how to write a multi-threaded pipe server. That doesn't
make it any less real of course :-)


> The solution is to create a new instance of the named pipe before launching
> the signal dispatch thread. This means changing the code in
> src/backend/port/win32/signal.c to look like this:

I have made some minor stylistic changes to this, such as some missing
error reporting, and added a lot of comments to explain why we do
this, and applied.

If you can, please verify that the tip of CVS for your version solves
the problem in your real world scenario, and not just my reproduction
test case.
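[Editor's note: the ordering imposed by the applied patch can be modeled in
the same spirit as the simulation earlier in the thread. Python here is only
an illustrative model of the C logic; the names are hypothetical.]

```python
# Model of the patched loop body: create the replacement instance
# *before* handing the connected one to the dispatch thread, so at
# least one named-pipe instance exists at every moment.
import threading

class PipeServer:
    def __init__(self):
        self.instances = 0
        self.lock = threading.Lock()

    def create_instance(self):        # models CreateNamedPipe
        with self.lock:
            self.instances += 1

    def close_instance(self):         # models CloseHandle in the dispatch thread
        with self.lock:
            self.instances -= 1

def serve_one_connection(server):
    # Patched ordering: make the next instance available first (the
    # new_pipe of the patch) ...
    server.create_instance()
    # ... and only then let the dispatch thread service and close the
    # old one. A client arriving in between still finds an instance.
    server.close_instance()

server = PipeServer()
server.create_instance()              # initial instance from loop entry
for _ in range(100):                  # many connections under load
    serve_one_connection(server)
    assert server.instances >= 1      # the count never drops to zero
```

With the original ordering (dispatch first, create afterwards) the same loop
would momentarily hit instances == 0, which is exactly the window the bug
report describes.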


> This will guarantee that we have an instance of the named pipe available at
> any given moment. If we do this, we can also remove the 3 tries loop from
> src/port/kill.c:

I have removed these on HEAD only, and I'm leaving them in the
backbranches just to be sure there's nothing else hiding around there.


> As a note, the original code has a timeout of 1000 milliseconds specified in
> the call to CallNamedPipe in kill.c. That timeout doesn't help for this bug,
> because it is the timeout that the client will wait for an instance of the
> named pipe to be available for communication. This means it helps in the
> case when we have one client already using that instance to read/write to
> the server and this second client will wait until the first one finishes. It
> does not help if there is no instance at all. In that case the CallNamedPipe
> returns immediately with ERROR_FILE_NOT_FOUND.

That's indeed true. I guess the fact that it worked with 3 was that it
pushed the likelihood of getting context-switched up far enough so the
race wouldn't show up :)
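[Editor's note: a small model (hypothetical Python, not the actual C) of why
the 3-tries loop in kill.c only narrowed the race window rather than closing
it: a retry succeeds only if the signal thread happens to get scheduled and
re-create the pipe between attempts.]

```python
# call_named_pipe models CallNamedPipe: it fails immediately with
# ERROR_FILE_NOT_FOUND when no instance exists at all (the 1000 ms
# timeout never comes into play in that case).
def call_named_pipe(pipe_exists):
    return "OK" if pipe_exists else "ERROR_FILE_NOT_FOUND"

def pgkill_with_retries(pipe_timeline, tries=3):
    """pipe_timeline[i] says whether a pipe instance exists at attempt i."""
    for i in range(tries):
        if call_named_pipe(pipe_timeline[i]) == "OK":
            return 0
    return -1   # caller sets errno = ESRCH; the pg_listener row is deleted

# Signal thread re-creates the pipe between attempts: the retry masks the bug.
assert pgkill_with_retries([False, True, True]) == 0
# Signal thread starved for the whole retry window: the bug still fires.
assert pgkill_with_retries([False, False, False]) == -1
```

This matches the observation above: three tries merely made it likely that a
context switch would rescue the caller, without guaranteeing it.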


--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/