Thread: Unsuccessful SIGINT

Unsuccessful SIGINT

From: Brian Wipf
I have a connection that I am unable to kill with a sigint.

ps auxww for the process in question:
postgres  3578  0.3  3.6 6526396 1213344 ?     S    Dec01   0:32 postgres: postgres ssprod 192.168.0.52(49333) SELECT

and gdb shows:
(gdb) bt
#0  0x00002ba62c18f085 in send () from /lib64/libc.so.6
#1  0x0000000000504765 in internal_flush ()
#2  0x0000000000504896 in internal_putbytes ()
#3  0x00000000005048fc in pq_putmessage ()
#4  0x0000000000505ea4 in pq_endmessage ()
#5  0x000000000043e37a in printtup ()
#6  0x00000000004e9349 in ExecutorRun ()
#7  0x0000000000567931 in PortalRunSelect ()
#8  0x00000000005685f0 in PortalRun ()
#9  0x0000000000565ea8 in PostgresMain ()
#10 0x0000000000540624 in ServerLoop ()
#11 0x000000000054131a in PostmasterMain ()
#12 0x000000000050676e in main ()

lsof on the client machine (192.168.0.52) shows no connections on
port 49333, so it doesn't appear to be a simple matter of killing the
client connection. If I have to, I can reboot the client machine, but
this seems like overkill and I'm not certain this will fix the
problem. Anything else I can try on the server or the client short of
restarting the database or rebooting the client?

Brian Wipf
<brian@clickspace.com>


Re: Unsuccessful SIGINT

From: Brian Wipf
Sorry, I forgot to mention this is on PostgreSQL 8.1.5. The server is
SUSE Linux 10.1, the client is OS X Server 10.4.8.

On 1-Dec-06, at 5:42 PM, Brian Wipf wrote:

> I have a connection that I am unable to kill with a sigint.



Re: Unsuccessful SIGINT - More Info

From: Brian Wipf
Based on the backend_start time in pg_stat_activity, I was able to
find the problem query in our logs. The query is a simple one, but
returns a lot of results for a report. This was the error in the logs:

org.postgresql.util.PSQLException: Ran out of memory retrieving query results.
        at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1291)
        at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:188)
        at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:452)
        at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:340)
        at org.postgresql.jdbc2.AbstractJdbc2Statement.executeQuery(AbstractJdbc2Statement.java:239)
    ...
java.lang.OutOfMemoryError

The application instance where this error occurred is no longer
running, but the server still shows the hung connection, which does
not respond to SIGINT.

On 1-Dec-06, at 5:54 PM, Brian Wipf wrote:

> Sorry, I forgot to mention this is on PostgreSQL 8.1.5. The server
> is SUSE Linux 10.1, the client is OS X Server 10.4.8.



Re: Unsuccessful SIGINT - More Info

From: Brian Wipf
I finally rebooted the client machine. It took a couple of minutes after
that, but the hung connection did go away on the server.

I found a similar cause to my problem in the archives:
http://archives.postgresql.org/pgsql-jdbc/2005-05/msg00044.php

To keep the PostgreSQL JDBC driver from fetching the entire result set
into memory at once, it is necessary to call Statement.setFetchSize().
From the archive: "Currently it only takes effect with autocommit
off and TYPE_FORWARD_ONLY resultsets"
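
A minimal sketch of what that looks like in client code, using only the
standard java.sql API. The class name StreamingQuery, the RowHandler
callback and the streamRows helper are made up for illustration and are
not part of the driver. It assumes the conditions quoted from the
archive (autocommit off, a TYPE_FORWARD_ONLY result set); whether the
driver really fetches in batches also depends on the driver version.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical helper, not part of the PostgreSQL JDBC driver.
final class StreamingQuery {

    /** Callback invoked once per row (illustrative interface). */
    interface RowHandler {
        void handle(ResultSet row) throws SQLException;
    }

    /**
     * Runs the query with a fetch size hint so the driver can pull rows
     * in batches instead of materializing the whole result set.
     * Returns the number of rows handed to the callback.
     */
    static long streamRows(Connection conn, String sql, int fetchSize,
                           RowHandler handler) throws SQLException {
        boolean previousAutoCommit = conn.getAutoCommit();
        conn.setAutoCommit(false); // the fetch size is ignored with autocommit on
        long count = 0;
        try (Statement st = conn.createStatement(
                ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {
            st.setFetchSize(fetchSize); // rows per round trip (a hint)
            try (ResultSet rs = st.executeQuery(sql)) {
                while (rs.next()) {
                    handler.handle(rs);
                    count++;
                }
            }
            conn.commit();
        } catch (SQLException e) {
            conn.rollback();
            throw e;
        } finally {
            conn.setAutoCommit(previousAutoCommit);
        }
        return count;
    }
}
```

A report would call something like
StreamingQuery.streamRows(conn, sql, 500, row -> writeLine(row)),
where writeLine is whatever the application does with one row, so that
only about one batch of rows is held in memory at a time.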

Now I know the cause at least. If anyone has an idea on how to kill a
similar hung connection without rebooting the server, I would
appreciate any suggestions.

Thanks,

Brian Wipf
<brian@clickspace.com>

On 1-Dec-06, at 6:30 PM, Brian Wipf wrote:

> Based on the backend_start time in pg_stat_activity, I was able to
> find the problem query in our logs. The query is a simple one, but
> returns a lot of results for a report. This was the error in the logs:



Re: Unsuccessful SIGINT - More Info

From: Martijn van Oosterhout
On Fri, Dec 01, 2006 at 08:26:53PM -0700, Brian Wipf wrote:
> Now I know the cause at least. If anyone has an idea on how to kill a
> similar hung connection without rebooting the server, I would
> appreciate any suggestions.

I'm unsure about why it wouldn't respond to a sigint, but did you
try stronger signals?

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
From each according to his ability. To each according to his ability to litigate.

Re: Unsuccessful SIGINT - More Info

From: Brian Wipf
On 2-Dec-06, at 6:27 AM, Martijn van Oosterhout wrote:
> On Fri, Dec 01, 2006 at 08:26:53PM -0700, Brian Wipf wrote:
>> Now I know the cause at least. If anyone has an idea on how to kill a
>> similar hung connection without rebooting the server, I would
>> appreciate any suggestions.
>
> I'm unsure about why it wouldn't respond to a sigint, but did you
> try stronger signals?


I tried a SIGHUP, SIGINT and SIGQUIT.

Do you believe a postgres process in the following state:
#0  0x00002ba62c18f085 in send () from /lib64/libc.so.6
#1  0x0000000000504765 in internal_flush ()
#2  0x0000000000504896 in internal_putbytes ()
#3  0x00000000005048fc in pq_putmessage ()
#4  0x0000000000505ea4 in pq_endmessage ()
...
is listening for some other signal? I, of course, didn't try a
SIGKILL. This is a production database.

Brian Wipf
<brian@clickspace.com>


Re: Unsuccessful SIGINT

From: "Albe Laurenz"
> lsof on the client machine (192.168.0.52) shows no connections on
> port 49333, so it doesn't appear to be a simple matter of killing the
> client connection. If I have to, I can reboot the client machine, but
> this seems like overkill and I'm not certain this will fix the
> problem. Anything else I can try on the server or the client short of
> restarting the database or rebooting the client?

Do I get it right that there is no process on the client machine
using port 49333?
Maybe you can reboot the client machine to make sure.

I'd wait for some time, because the send() might be stuck in kernel
space, and I guess it should timeout at some point. Then the process
will go away.

If the server process is still there after a couple of hours, hmm,
I don't know. Maybe resort to a kill -9. If that does not get rid
of the server process, it is stuck in kernel space for good and
probably nothing except a reboot will get rid of it.
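
To illustrate the blocked send() from the backtrace, here is a minimal
Java sketch that assumes nothing beyond the JDK. It is not PostgreSQL
code, and the names BlockedSendDemo and bytesBeforeStall are invented
for the example. One side of a loopback connection never reads, so the
other side's write blocks once the kernel socket buffers fill, much
like a backend sending rows to a client that has stopped consuming
them.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative only: demonstrates a write that blocks on a silent peer.
final class BlockedSendDemo {
    private static final int CHUNK = 8 * 1024;

    /**
     * Writes to a peer that never reads. Returns the number of bytes
     * accepted by completed write calls once the writer has stalled,
     * or -1 if no stall was observed within 20 polls.
     */
    static long bytesBeforeStall(long pollMillis)
            throws IOException, InterruptedException {
        AtomicLong written = new AtomicLong();
        try (ServerSocket listener =
                 new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket silentPeer =
                 new Socket(listener.getInetAddress(), listener.getLocalPort());
             Socket sender = listener.accept()) {
            Thread writer = new Thread(() -> {
                byte[] chunk = new byte[CHUNK];
                try {
                    OutputStream out = sender.getOutputStream();
                    while (true) {
                        out.write(chunk); // blocks once the buffers are full
                        written.addAndGet(CHUNK);
                    }
                } catch (IOException released) {
                    // Closing the socket releases the blocked write.
                }
            });
            writer.setDaemon(true);
            writer.start();

            // The writer counts as stalled when the counter stops moving.
            long previous = -1;
            for (int attempt = 0; attempt < 20; attempt++) {
                Thread.sleep(pollMillis);
                long now = written.get();
                if (now > 0 && now == previous) {
                    return now;
                }
                previous = now;
            }
            return -1;
        }
    }
}
```

Calling BlockedSendDemo.bytesBeforeStall(500) reports how much data was
accepted before the write stalled; the exact figure depends on the
operating system's buffer sizes. In the sketch the writer is released
only when its socket is closed, which is roughly the position the
backend is in while nothing closes or resets the connection.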

Yours,
Laurenz Albe

Re: Unsuccessful SIGINT

From: Brian Wipf
On 4-Dec-06, at 1:43 AM, Albe Laurenz wrote:
>> lsof on the client machine (192.168.0.52) shows no connections on
>> port 49333, so it doesn't appear to be a simple matter of killing the
>> client connection. If I have to, I can reboot the client machine, but
>> this seems like overkill and I'm not certain this will fix the
>> problem. Anything else I can try on the server or the client short of
>> restarting the database or rebooting the client?
>
> Do I get it right that there is no process on the client machine
> using port 49333?
> Maybe you can reboot the client machine to make sure.
>
> I'd wait for some time, because the send() might be stuck in kernel
> space, and I guess it should timeout at some point. Then the process
> will go away.

The Java process on the client machine that held the connection was
killed off and lsof no longer showed a process with a connection on
port 49333. I waited about 7 hours and the database server still
showed the hung connection from port 49333 of the client. I finally
rebooted the client computer, which fixed the problem. I suppose
something lower level than the application process was hanging on to
the connection somehow and lsof couldn't even detect it. The client
is a Mac OS X 10.4.8 box. It would have been nice if I could have
killed the process from the server side as well, but I'm sure there's
a good reason why you can't when it's in this state:
send () from /lib64/libc.so.6
in internal_flush ()
in internal_putbytes ()
in pq_putmessage ()
in pq_endmessage ()
in printtup ()
in ExecutorRun ()
in PortalRunSelect ()

> If the server process is still there after a couple of hours, hmm,
> I don't know. Maybe resort to a kill -9. If that does not get rid
> of the server process, it is stuck in kernel space for good and
> probably nothing except a reboot will get rid of it.

The last time I tried a kill -9 on a server process, the database
instantly restarted itself and had to perform some kind of crash
recovery. Is a kill -9 okay in some cases? I suppose a restart of the
database would have worked as well, but that was my last resort.