Thread: pg_cancel_backend() does not work with buzz queries

pg_cancel_backend() does not work with buzz queries

From

"Sergey Konoplev"

Date:

03 October 2007, 05:11:37

Hi all,

I often face with buzz queries (see below). I've looked through pg
manual and huge amount of forums and mail archives and found nothing.
The only solution  is to restart postgres server. Moreover I have to
terminate the process using HUP signal to stop the server.

transport=# select version();
                                       version
-------------------------------------------------------------------------------------
 PostgreSQL 8.2.4 on i686-pc-linux-gnu, compiled by GCC gcc (GCC)
3.3.3 (SuSE Linux)
(1 row)

transport=# select datname, usename, procpid, current_query::char(15),
waiting, query_start from pg_catalog.pg_stat_activity where procpid =
20530;
  datname  |     usename      | procpid |  current_query  | waiting |
        query_start
-----------+------------------+---------+-----------------+---------+-------------------------------
 transport | belostotskaya_la |   20530 | select * from c | f       |
2007-10-02 05:05:28.908687+04
(1 row)

transport=# select pg_catalog.pg_cancel_backend(20530);
 pg_cancel_backend
-------------------
 t
(1 row)

transport=# select datname, usename, procpid, current_query::char(15),
waiting, query_start from pg_catalog.pg_stat_activity where procpid =
20530;
  datname  |     usename      | procpid |  current_query  | waiting |
        query_start
-----------+------------------+---------+-----------------+---------+-------------------------------
 transport | belostotskaya_la |   20530 | select * from c | f       |
2007-10-02 05:05:28.908687+04
(1 row)

--
Regards,
Sergey Konoplev

Re: pg_cancel_backend() does not work with buzz queries

From

"Sergey Konoplev"

Date:

03 October 2007, 05:44:23

I'm sorry I mean not HUP but KILL

2007/10/3, Sergey Konoplev <gray.ru@gmail.com>:
> Hi all,
>
> I often face with buzz queries (see below). I've looked through pg
> manual and huge amount of forums and mail archives and found nothing.
> The only solution  is to restart postgres server. Moreover I have to
> terminate the process using HUP signal to stop the server.
>
> transport=# select version();
>                                       version
> -------------------------------------------------------------------------------------
>  PostgreSQL 8.2.4 on i686-pc-linux-gnu, compiled by GCC gcc (GCC)
> 3.3.3 (SuSE Linux)
> (1 row)
>
> transport=# select datname, usename, procpid, current_query::char(15),
> waiting, query_start from pg_catalog.pg_stat_activity where procpid =
> 20530;
>  datname  |     usename      | procpid |  current_query  | waiting |
>        query_start
> -----------+------------------+---------+-----------------+---------+-------------------------------
>  transport | belostotskaya_la |   20530 | select * from c | f       |
> 2007-10-02 05:05:28.908687+04
> (1 row)
>
> transport=# select pg_catalog.pg_cancel_backend(20530);
>  pg_cancel_backend
> -------------------
>  t
> (1 row)
>
> transport=# select datname, usename, procpid, current_query::char(15),
> waiting, query_start from pg_catalog.pg_stat_activity where procpid =
> 20530;
>  datname  |     usename      | procpid |  current_query  | waiting |
>        query_start
> -----------+------------------+---------+-----------------+---------+-------------------------------
>  transport | belostotskaya_la |   20530 | select * from c | f       |
> 2007-10-02 05:05:28.908687+04
> (1 row)
>
> --
> Regards,
> Sergey Konoplev
>


--
Regards,
Sergey Konoplev

Re: pg_cancel_backend() does not work with buzz queries

From

Richard Huxton

Date:

03 October 2007, 06:11:02

Sergey Konoplev wrote:
> I'm sorry I mean not HUP but KILL

Hmm...

>>  datname  |     usename      | procpid |  current_query  | waiting |
>>        query_start
>> -----------+------------------+---------+-----------------+---------+-------------------------------
>>  transport | belostotskaya_la |   20530 | select * from c | f       |
>> 2007-10-02 05:05:28.908687+04

>> transport=# select pg_catalog.pg_cancel_backend(20530);
>>  pg_cancel_backend
>> -------------------
>>  t

>>  datname  |     usename      | procpid |  current_query  | waiting |
>>        query_start
>> -----------+------------------+---------+-----------------+---------+-------------------------------
>>  transport | belostotskaya_la |   20530 | select * from c | f       |
>> 2007-10-02 05:05:28.908687+04

1. Is it always the same query?
2. Does the client still think it's connected?
3. Is that query using up CPU, or just idling?
4. Anything odd in pg_locks for the problem pid?

--
   Richard Huxton
   Archonet Ltd

Re: pg_cancel_backend() does not work with buzz queries

From

Richard Huxton

Date:

03 October 2007, 07:18:50

Don't forget to cc: the list.
Try not to top-post replies, it's easier to read if you reply below the
text you're replying to.

Sergey Konoplev wrote:
>> 1. Is it always the same query?
>> 2. Does the client still think it's connected?
>> 3. Is that query using up CPU, or just idling?
>> 4. Anything odd in pg_locks for the problem pid?

> 1. No it isn't. I have few functions (plpgsql, plpython) that cause
> such situations more often than another but they are called more often
> also.

OK, so there's no real pattern. That would suggest it's not a particular
query-plan that's got something wrong.

Do you always get this problem inside a function?

> 2. The client just waits for query and buzz.
> 3. They are using CPU in usual way and their pg_lock activity seems normal.

So the backend that appears "stuck" is still using CPU?

> 4. No I haven't noticed anything odd.

So - the symptoms are:
1. Client hangs, waiting for the result of a query
2. You notice this
3. You issue pg_cancel_backend() which sends a SIGINT which doesn't do
anything
4. You have to cancel the query from the command-line using "kill -9
<backend-pid>"

Are you happy that your hardware and drivers are OK? There aren't
problems with any other servers on this machine?

--
   Richard Huxton
   Archonet Ltd

Re: pg_cancel_backend() does not work with buzz queries

From

Magnus Hagander

Date:

03 October 2007, 07:57:30

On Wed, Oct 03, 2007 at 11:18:32AM +0100, Richard Huxton wrote:
> Don't forget to cc: the list.
> Try not to top-post replies, it's easier to read if you reply below the
> text you're replying to.
>
> Sergey Konoplev wrote:
> >>1. Is it always the same query?
> >>2. Does the client still think it's connected?
> >>3. Is that query using up CPU, or just idling?
> >>4. Anything odd in pg_locks for the problem pid?
>
> >1. No it isn't. I have few functions (plpgsql, plpython) that cause
> >such situations more often than another but they are called more often
> >also.
>
> OK, so there's no real pattern. That would suggest it's not a particular
> query-plan that's got something wrong.
>
> Do you always get this problem inside a function?

Does pl/python listen to SIGINT during execution of functions? If not,
that'd be an explanation - if it's stuck inside a pl/python function...

AFAIK, pl/pgsql does listen for SIGINT during execution, but I don't nkow
abuot plpython.

> 4. You have to cancel the query from the command-line using "kill -9
> <backend-pid>"

That's not cancel, that's taking a sledgehammer to your server :(

//Magnus

Re: pg_cancel_backend() does not work with buzz queries

From

"Sergey Konoplev"

Date:

03 October 2007, 08:45:11

> > Don't forget to cc: the list.
> > Try not to top-post replies, it's easier to read if you reply below the
> > text you're replying to.
> >
> > Sergey Konoplev wrote:
> > >>1. Is it always the same query?
> > >>2. Does the client still think it's connected?
> > >>3. Is that query using up CPU, or just idling?
> > >>4. Anything odd in pg_locks for the problem pid?
> >
> > >1. No it isn't. I have few functions (plpgsql, plpython) that cause
> > >such situations more often than another but they are called more often
> > >also.
> >
> > OK, so there's no real pattern. That would suggest it's not a particular
> > query-plan that's got something wrong.
> >
> > Do you always get this problem inside a function?
>
> Does pl/python listen to SIGINT during execution of functions? If not,
> that'd be an explanation - if it's stuck inside a pl/python function...
>
> AFAIK, pl/pgsql does listen for SIGINT during execution, but I don't nkow
> abuot plpython.

How can we find it out?

> > 4. You have to cancel the query from the command-line using "kill -9
> > <backend-pid>"
>
> That's not cancel, that's taking a sledgehammer to your server :(

Yes I know it but I have no choice :(

Re: pg_cancel_backend() does not work with buzz queries

From

Richard Huxton

Date:

03 October 2007, 08:47:39

Sergey Konoplev wrote:
>> Don't forget to cc: the list.
>> Try not to top-post replies, it's easier to read if you reply below the
>> text you're replying to.
>
> Thanx for your advice. I'm just absolutely worned out. Sorry.

Know that feeling - let's see if we can't sort this out.

>>>> 1. Is it always the same query?
>>>> 2. Does the client still think it's connected?
>>>> 3. Is that query using up CPU, or just idling?
>>>> 4. Anything odd in pg_locks for the problem pid?
>>> 1. No it isn't. I have few functions (plpgsql, plpython) that cause
>>> such situations more often than another but they are called more often
>>> also.
>> OK, so there's no real pattern. That would suggest it's not a particular
>> query-plan that's got something wrong.
>>
>> Do you always get this problem inside a function?
>
> As far as I remember I do.

Hmm - check Magnus' thoughts on pl/python. Can't comment on Python
myself. Are you sure it's not always the same few function(s) that cause
this problem?

>>> 2. The client just waits for query and buzz.
>>> 3. They are using CPU in usual way and their pg_lock activity seems normal.
>> So the backend that appears "stuck" is still using CPU?
>
> Yes but the metter is that this procedures usualy use CPU just a
> little so I can't find out if there is some oddity or not.

OK, so it's not that it's stuck in a loop wasting a lot of CPU

>> So - the symptoms are:
[snip]
> Exactly.

So - we need to solve two mysteries
1. Why are these functions not returning?
2. Why does SIGINT not interrupt them?

>> Are you happy that your hardware and drivers are OK? There aren't
>> problems with any other servers on this machine?
>
> Yes I'm quite happy. My hardware is: 2 double-core Xeon, 8Gb RAM,
> RAID5. What about other software... it's dedicated PG server so I have
> no problem with it.

Well, the places I'd look would be:
1. Hardware (you're happy that's fine, and it's not quite the problems
I'd expect)
2. Drivers (same as #1)
3. Client connectivity (but you say the client is fine)
4. External interactions (see below)
5. Bug in PG extension (pl/python)
6. Bug in PG core code

Do any of your functions interact with the outside world - fetch
webpages or similar? It could be they're waiting for that. If you're
using a library that could hang waiting for a response and also block
SIGINT at the same time that would explain everything.

--
   Richard Huxton
   Archonet Ltd

Re: pg_cancel_backend() does not work with buzz queries

From

Alvaro Herrera

Date:

03 October 2007, 09:12:53

Sergey Konoplev escribió:

> > AFAIK, pl/pgsql does listen for SIGINT during execution, but I don't nkow
> > abuot plpython.
>
> How can we find it out?

Let's see one of the functions to find out if anyone else can reproduce
the problem.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Re: pg_cancel_backend() does not work with buzz queries

From

Tom Lane

Date:

03 October 2007, 11:21:02

Magnus Hagander <magnus@hagander.net> writes:
> Does pl/python listen to SIGINT during execution of functions? If not,
> that'd be an explanation - if it's stuck inside a pl/python function...
> AFAIK, pl/pgsql does listen for SIGINT during execution, but I don't nkow
> abuot plpython.

It does not, unless the function contains SPI calls that'll return
control into PG code.

Short answer: don't write infinite loops in plpython.

            regards, tom lane

Re: pg_cancel_backend() does not work with buzz queries

From

Erik Jones

Date:

03 October 2007, 12:13:54

On Oct 3, 2007, at 6:47 AM, Richard Huxton wrote:

> Sergey Konoplev wrote:
>>> Don't forget to cc: the list.
>>> Try not to top-post replies, it's easier to read if you reply
>>> below the
>>> text you're replying to.
>> Thanx for your advice. I'm just absolutely worned out. Sorry.
>
> Know that feeling - let's see if we can't sort this out.
>
>>>>> 1. Is it always the same query?
>>>>> 2. Does the client still think it's connected?
>>>>> 3. Is that query using up CPU, or just idling?
>>>>> 4. Anything odd in pg_locks for the problem pid?
>>>> 1. No it isn't. I have few functions (plpgsql, plpython) that cause
>>>> such situations more often than another but they are called more
>>>> often
>>>> also.
>>> OK, so there's no real pattern. That would suggest it's not a
>>> particular
>>> query-plan that's got something wrong.
>>>
>>> Do you always get this problem inside a function?
>> As far as I remember I do.
>
> Hmm - check Magnus' thoughts on pl/python. Can't comment on Python
> myself. Are you sure it's not always the same few function(s) that
> cause this problem?
>
>>>> 2. The client just waits for query and buzz.
>>>> 3. They are using CPU in usual way and their pg_lock activity
>>>> seems normal.
>>> So the backend that appears "stuck" is still using CPU?
>> Yes but the metter is that this procedures usualy use CPU just a
>> little so I can't find out if there is some oddity or not.
>
> OK, so it's not that it's stuck in a loop wasting a lot of CPU

In order to get at least some idea of what these processes are (or,
are not) doing, run an strace (or your OS's equivalent) on the
process before killing it.  Let us know what you see there.

Erik Jones

Software Developer | Emma®
erik@myemma.com
800.595.4401 or 615.292.5888
615.292.0777 (fax)

Emma helps organizations everywhere communicate & market in style.
Visit us online at http://www.myemma.com

Re: pg_cancel_backend() does not work with buzz queries

From

"Sergey Konoplev"

Date:

17 October 2007, 08:36:43

Hello again,

Sorry for the deal with my answer it was realy hectic week so I
couldn't even check my mail.

2007/10/3, Richard Huxton <dev@archonet.com>:
> Sergey Konoplev wrote:
> >> Don't forget to cc: the list.
> >> Try not to top-post replies, it's easier to read if you reply below the
> >> text you're replying to.
> >
> > Thanx for your advice. I'm just absolutely worned out. Sorry.
>
> Know that feeling - let's see if we can't sort this out.
>
> >>>> 1. Is it always the same query?
> >>>> 2. Does the client still think it's connected?
> >>>> 3. Is that query using up CPU, or just idling?
> >>>> 4. Anything odd in pg_locks for the problem pid?
> >>> 1. No it isn't. I have few functions (plpgsql, plpython) that cause
> >>> such situations more often than another but they are called more often
> >>> also.
> >> OK, so there's no real pattern. That would suggest it's not a particular
> >> query-plan that's got something wrong.
> >>
> >> Do you always get this problem inside a function?
> >
> > As far as I remember I do.
>
> Hmm - check Magnus' thoughts on pl/python. Can't comment on Python
> myself. Are you sure it's not always the same few function(s) that cause
> this problem?

Yes I'm shure. I've noticed about 10 queries and procedure calls
buzzing at least.

>
> >>> 2. The client just waits for query and buzz.
> >>> 3. They are using CPU in usual way and their pg_lock activity seems normal.
> >> So the backend that appears "stuck" is still using CPU?
> >
> > Yes but the metter is that this procedures usualy use CPU just a
> > little so I can't find out if there is some oddity or not.
>
> OK, so it's not that it's stuck in a loop wasting a lot of CPU
>
> >> So - the symptoms are:
> [snip]
> > Exactly.
>
> So - we need to solve two mysteries
> 1. Why are these functions not returning?
> 2. Why does SIGINT not interrupt them?
>
> >> Are you happy that your hardware and drivers are OK? There aren't
> >> problems with any other servers on this machine?
> >
> > Yes I'm quite happy. My hardware is: 2 double-core Xeon, 8Gb RAM,
> > RAID5. What about other software... it's dedicated PG server so I have
> > no problem with it.
>
> Well, the places I'd look would be:
> 1. Hardware (you're happy that's fine, and it's not quite the problems
> I'd expect)
> 2. Drivers (same as #1)
> 3. Client connectivity (but you say the client is fine)
> 4. External interactions (see below)
> 5. Bug in PG extension (pl/python)

I think it's not only lp/python problem cos I saw pl/pgsql and simple
queries also. For example:
select * from  (
    select d.*, cto.full_name, cast(st.name || ', ' || r.name as
varchar) as cityname
    from
        drivers d
        join cars_trailers_owners cto on
            d.cars_trailers_owner_id = cto.id
        join settles st on
            d.settle_id = st.id
        join regions r on
            st.region_id = r.id
    order by sname, name, pname
) as sq

> 6. Bug in PG core code
>
> Do any of your functions interact with the outside world - fetch
> webpages or similar? It could be they're waiting for that. If you're
> using a library that could hang waiting for a response and also block
> SIGINT at the same time that would explain everything.

No, most of them don't.

--
Regards,
Sergey Konoplev

Re: pg_cancel_backend() does not work with buzz queries

From

"Sergey Konoplev"

Date:

17 October 2007, 08:38:16

2007/10/3, Alvaro Herrera <alvherre@commandprompt.com>:
> Sergey Konoplev escribió:
>
> > > AFAIK, pl/pgsql does listen for SIGINT during execution, but I don't nkow
> > > abuot plpython.
> >
> > How can we find it out?
>
> Let's see one of the functions to find out if anyone else can reproduce
> the problem.
>

It happens with simple queries also. For example:

select * from  (
    select d.*, cto.full_name, cast(st.name || ', ' || r.name as
varchar) as cityname
    from
        drivers d
        join cars_trailers_owners cto on
            d.cars_trailers_owner_id = cto.id
        join settles st on
            d.settle_id = st.id
        join regions r on
            st.region_id = r.id
    order by sname, name, pname
) as sq

--
Regards,
Sergey Konoplev

Re: pg_cancel_backend() does not work with buzz queries

From

"Sergey Konoplev"

Date:

17 October 2007, 08:46:34

2007/10/3, Erik Jones <erik@myemma.com>:
> On Oct 3, 2007, at 6:47 AM, Richard Huxton wrote:
>
> > Sergey Konoplev wrote:
> >>> Don't forget to cc: the list.
> >>> Try not to top-post replies, it's easier to read if you reply
> >>> below the
> >>> text you're replying to.
> >> Thanx for your advice. I'm just absolutely worned out. Sorry.
> >
> > Know that feeling - let's see if we can't sort this out.
> >
> >>>>> 1. Is it always the same query?
> >>>>> 2. Does the client still think it's connected?
> >>>>> 3. Is that query using up CPU, or just idling?
> >>>>> 4. Anything odd in pg_locks for the problem pid?
> >>>> 1. No it isn't. I have few functions (plpgsql, plpython) that cause
> >>>> such situations more often than another but they are called more
> >>>> often
> >>>> also.
> >>> OK, so there's no real pattern. That would suggest it's not a
> >>> particular
> >>> query-plan that's got something wrong.
> >>>
> >>> Do you always get this problem inside a function?
> >> As far as I remember I do.
> >
> > Hmm - check Magnus' thoughts on pl/python. Can't comment on Python
> > myself. Are you sure it's not always the same few function(s) that
> > cause this problem?
> >
> >>>> 2. The client just waits for query and buzz.
> >>>> 3. They are using CPU in usual way and their pg_lock activity
> >>>> seems normal.
> >>> So the backend that appears "stuck" is still using CPU?
> >> Yes but the metter is that this procedures usualy use CPU just a
> >> little so I can't find out if there is some oddity or not.
> >
> > OK, so it's not that it's stuck in a loop wasting a lot of CPU
>
> In order to get at least some idea of what these processes are (or,
> are not) doing, run an strace (or your OS's equivalent) on the
> process before killing it.  Let us know what you see there.
>

That is what I've got using strace with the buzzed process:

pgdb:~ # strace -dirfvx -p 19313
Process 19313 attached - interrupt to quit
 [wait(0x137f) = 19313]
pid 19313 stopped, [SIGSTOP]
 [wait(0x57f) = 19313]
pid 19313 stopped, [SIGTRAP]
     0.000000 [ffffe410] send(8,
"\x00\x01\x74\xff\xff\xff\xff\x00\x00\x00\x01\x66\xff\xff"..., 8192, 0

[Output stoped here and after half hour I interupted strace]

cleanup: looking at pid 19313
 <unfinished ...>
Process 19313 detached
pgdb:~ #

--
Regards,
Sergey Konoplev

Re: pg_cancel_backend() does not work with buzz queries

From

"Sergey Konoplev"

Date:

23 October 2007, 02:58:13

2007/10/17, Sergey Konoplev <gray.ru@gmail.com>:
> Hello again,
>
> Sorry for the deal with my answer it was realy hectic week so I
> couldn't even check my mail.
>
> 2007/10/3, Richard Huxton <dev@archonet.com>:
> > Sergey Konoplev wrote:
> > >> Don't forget to cc: the list.
> > >> Try not to top-post replies, it's easier to read if you reply below the
> > >> text you're replying to.
> > >
> > > Thanx for your advice. I'm just absolutely worned out. Sorry.
> >
> > Know that feeling - let's see if we can't sort this out.
> >
> > >>>> 1. Is it always the same query?
> > >>>> 2. Does the client still think it's connected?
> > >>>> 3. Is that query using up CPU, or just idling?
> > >>>> 4. Anything odd in pg_locks for the problem pid?
> > >>> 1. No it isn't. I have few functions (plpgsql, plpython) that cause
> > >>> such situations more often than another but they are called more often
> > >>> also.
> > >> OK, so there's no real pattern. That would suggest it's not a particular
> > >> query-plan that's got something wrong.
> > >>
> > >> Do you always get this problem inside a function?
> > >
> > > As far as I remember I do.
> >
> > Hmm - check Magnus' thoughts on pl/python. Can't comment on Python
> > myself. Are you sure it's not always the same few function(s) that cause
> > this problem?
>
> Yes I'm shure. I've noticed about 10 queries and procedure calls
> buzzing at least.
>
> >
> > >>> 2. The client just waits for query and buzz.
> > >>> 3. They are using CPU in usual way and their pg_lock activity seems normal.
> > >> So the backend that appears "stuck" is still using CPU?
> > >
> > > Yes but the metter is that this procedures usualy use CPU just a
> > > little so I can't find out if there is some oddity or not.
> >
> > OK, so it's not that it's stuck in a loop wasting a lot of CPU
> >
> > >> So - the symptoms are:
> > [snip]
> > > Exactly.
> >
> > So - we need to solve two mysteries
> > 1. Why are these functions not returning?
> > 2. Why does SIGINT not interrupt them?
> >
> > >> Are you happy that your hardware and drivers are OK? There aren't
> > >> problems with any other servers on this machine?
> > >
> > > Yes I'm quite happy. My hardware is: 2 double-core Xeon, 8Gb RAM,
> > > RAID5. What about other software... it's dedicated PG server so I have
> > > no problem with it.
> >
> > Well, the places I'd look would be:
> > 1. Hardware (you're happy that's fine, and it's not quite the problems
> > I'd expect)
> > 2. Drivers (same as #1)
> > 3. Client connectivity (but you say the client is fine)
> > 4. External interactions (see below)
> > 5. Bug in PG extension (pl/python)
>
> I think it's not only lp/python problem cos I saw pl/pgsql and simple
> queries also. For example:
> select * from  (
>    select d.*, cto.full_name, cast(st.name || ', ' || r.name as
> varchar) as cityname
>    from
>        drivers d
>        join cars_trailers_owners cto on
>            d.cars_trailers_owner_id = cto.id
>        join settles st on
>            d.settle_id = st.id
>        join regions r on
>            st.region_id = r.id
>    order by sname, name, pname
> ) as sq
>
> > 6. Bug in PG core code
> >
> > Do any of your functions interact with the outside world - fetch
> > webpages or similar? It could be they're waiting for that. If you're
> > using a library that could hang waiting for a response and also block
> > SIGINT at the same time that would explain everything.
>
> No, most of them don't.
>
> --
> Regards,
> Sergey Konoplev
>

I've managed to repeat the situation and found out where the problem is.

I run on of the heaviest queries on a client application (Delphi,
psqlodbc.8.01.0101) then without waiting for it (the query run time is
about 5 minutes) I interrupted the client using task manager. There
was backend process still running on my server:

pgdb:/base/PG-Data # ps -ef |awk '/postgres.*konoplev.*SELECT/'
postgres  8590  8073  2 15:46 ?        00:00:36 postgres: konoplev
transport localhost(35442) SELECT
root      8973  7642  0 16:10 pts/0    00:00:00 awk /postgres.*konoplev.*SELECT/
pgdb:/base/PG-Data #

I took a look at TCP state with netstat:

pgdb:/base/PG-Data # netstat -pna |grep 8590
tcp        1      0 127.0.0.1:5432          127.0.0.1:35442
CLOSE_WAIT  8590/postgres: kono
pgdb:/base/PG-Data #

Remote address is 127.0.0.1 because clients connect to postgres
through SSH-tunnel. Then I used strace to make shure that the process
is active and after few times of repeating it I noticed its hunging:

pgdb:/base/PG-Data # strace -dfirtvx -p 8590
Process 8590 attached - interrupt to quit
[wait(0x137f) = 8590]
pid 8590 stopped, [SIGSTOP]
[wait(0x57f) = 8590]
pid 8590 stopped, [SIGTRAP]
    0.000000 [ffffe410] send(11,
"\x30\x30\x37\x2d\x31\x30\x2d\x31\x39\x20\x32\x32\x3a\x31"...,
8192, 0

[Output stoped here and after few minutes I interrupted it by pressing Ctrl-C]

cleanup: looking at pid 8590
<unfinished ...>
Process 8590 detached
pgdb:/base/PG-Data #

Next I run gdb and took a look at callstack of the hung process:

pgdb:/base/PG-Data # gdb /opt/PostgreSQL/bin/postgres 8590
GNU gdb 6.1
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i586-suse-linux"...Using host libthread_db library
"/lib/tls/libthread_db.so.1".

Attaching to program: /opt/PostgreSQL/bin/postgres, process 8590
Reading symbols from /usr/lib/libssl.so.0.9.7...done.
Loaded symbols for /usr/lib/libssl.so.0.9.7
Reading symbols from /usr/lib/libcrypto.so.0.9.7...done.
Loaded symbols for /usr/lib/libcrypto.so.0.9.7
Reading symbols from /lib/libcrypt.so.1...done.
Loaded symbols for /lib/libcrypt.so.1
Reading symbols from /lib/libdl.so.2...done.
Loaded symbols for /lib/libdl.so.2
Reading symbols from /lib/tls/libm.so.6...done.
Loaded symbols for /lib/tls/libm.so.6
Reading symbols from /lib/tls/libc.so.6...done.
Loaded symbols for /lib/tls/libc.so.6
Reading symbols from /lib/ld-linux.so.2...done.
Loaded symbols for /lib/ld-linux.so.2
Reading symbols from /usr/lib/gconv/KOI8-R.so...done.
Loaded symbols for /usr/lib/gconv/KOI8-R.so
Reading symbols from /lib/libnss_files.so.2...done.
Loaded symbols for /lib/libnss_files.so.2
Reading symbols from /opt/PostgreSQL/lib/postgresql/plpgsql.so...done.
Loaded symbols for /opt/PostgreSQL/lib/postgresql/plpgsql.so
0xffffe410 in ?? ()
(gdb) bt
#0  0xffffe410 in ?? ()
#1  0xbfffd718 in ?? ()
#2  0x082f16c0 in sock_path ()
#3  0xbfffd6f0 in ?? ()
#4  0x40254781 in send () from /lib/tls/libc.so.6
#5  0x0814bdf2 in secure_write ()
#6  0x08151632 in internal_flush ()
#7  0x08151704 in internal_putbytes ()
#8  0x08151782 in pq_putmessage ()
#9  0x08152c6a in pq_endmessage ()
#10 0x0807e5fc in printtup ()
#11 0x08134417 in ExecutorRun ()
#12 0x081b50f4 in PortalRunSelect ()
#13 0x081b6271 in PortalRun ()
#14 0x081b1d96 in exec_simple_query ()
#15 0x081b3552 in PostgresMain ()
#16 0x0818e169 in ServerLoop ()
#17 0x0818ed10 in PostmasterMain ()
#18 0x08153841 in main ()
(gdb) quit
The program is running.  Quit anyway (and detach it)? (y or n) y
Detaching from program: /opt/PostgreSQL/bin/postgres, process 8590
pgdb:/base/PG-Data #

I checked the process's TCP stack state again and sent SIGINT to it
(the same as it would be pg_cancel_backend()):

pgdb:/base/PG-Data # netstat -pna |grep 8590
tcp        1      0 127.0.0.1:5432          127.0.0.1:35442
CLOSE_WAIT  8590/postgres: kono
pgdb:/base/PG-Data # kill -INT 8590
pgdb:/base/PG-Data # ps -ef |grep 8590
postgres  8590  8073  0 15:46 ?        00:00:36 postgres: konoplev
transport localhost(35442) SELECT
root      9869  7642  0 17:17 pts/0    00:00:00 grep 8590

So it was alive.

The only solution that allows me to finish it without using SIGQUIT or
SIGKILL was killing an oposit SSH process. I got the process's PID and
killed it using SIGHUP.

pgdb:/base/PG-Data # netstat -pna |grep 35442
tcp        1 131072 127.0.0.1:5432          127.0.0.1:35442
CLOSE_WAIT  8590/postgres: kono
tcp    65536      0 127.0.0.1:35442         127.0.0.1:5432
FIN_WAIT2   7944/sshd: dcsshcli
pgdb:/base/PG-Data # kill -HUP 7944
pgdb:/base/PG-Data # ps -ef |grep 8590
root      9951  7642  0 17:24 pts/0    00:00:00 grep 8590

So there was no hunging postgress backend.

I'm going to write a shell script for periodical checking hunging
processes with stack state CLOSE_WAIT and killing theirs oposit SSH
processes after some time but I would be pleased if somebody could
advise me better solution.

Thank you.

--
Regards,
Sergey Konoplev

Re: pg_cancel_backend() does not work with buzz queries

From

Martijn van Oosterhout

Date:

23 October 2007, 04:38:11

On Tue, Oct 23, 2007 at 09:56:26AM +0400, Sergey Konoplev wrote:
> I took a look at TCP state with netstat:
>
> pgdb:/base/PG-Data # netstat -pna |grep 8590
> tcp        1      0 127.0.0.1:5432          127.0.0.1:35442
> CLOSE_WAIT  8590/postgres: kono

CLOSE_WAIT means that the client (in this case SSH) has shutdown() its
end of the connection and evidently postgresql hasn't noticed. However,
SSH has not closed its socket entirely, because then the write would
fail.

Can you strace the SSH daemon, my bet is that it's also stuck on a
write(), to the original client. It would also be interesting to know
what the original client is doing, since it's obviously still alive.
Looks like somewhere along the chain a program called shutdown() but is
no longer reading incoming data...

Hope this helps,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.

Attachment

signature.asc

Re: pg_cancel_backend() does not work with buzz queries

From

"Sergey Konoplev"

Date:

23 October 2007, 08:53:13

2007/10/23, Martijn van Oosterhout <kleptog@svana.org>:
> On Tue, Oct 23, 2007 at 09:56:26AM +0400, Sergey Konoplev wrote:
> > I took a look at TCP state with netstat:
> >
> > pgdb:/base/PG-Data # netstat -pna |grep 8590
> > tcp        1      0 127.0.0.1:5432          127.0.0.1:35442
> > CLOSE_WAIT  8590/postgres: kono
>
> CLOSE_WAIT means that the client (in this case SSH) has shutdown() its
> end of the connection and evidently postgresql hasn't noticed. However,
> SSH has not closed its socket entirely, because then the write would
> fail.
>
> Can you strace the SSH daemon, my bet is that it's also stuck on a
> write(), to the original client. It would also be interesting to know
> what the original client is doing, since it's obviously still alive.
> Looks like somewhere along the chain a program called shutdown() but is
> no longer reading incoming data...
>
> Hope this helps,

I've done strace and noticed that this SHH daemon is repeating next output:

pid 10511 stopped, [SIGTRAP]
    10.485688 [ffffe410] rt_sigprocmask(SIG_BLOCK, [CHLD],
[wait(0x57f) = 10511]
pid 10511 stopped, [SIGTRAP]
[], 8) = 0
 [wait(0x57f) = 10511]
pid 10511 stopped, [SIGTRAP]
     0.000578 [ffffe410] rt_sigprocmask(SIG_SETMASK, [],  [wait(0x57f) = 10511]
pid 10511 stopped, [SIGTRAP]
NULL, 8) = 0
 [wait(0x57f) = 10511]
pid 10511 stopped, [SIGTRAP]
     0.000468 [ffffe410] read(4,  [wait(0x57f) = 10511]
pid 10511 stopped, [SIGTRAP]
"\x0e\xd7\x62\xdb\xc8\x97\xbb\xbc\x52\xe6\xe1\xab\x6a\xcc"..., 16384) = 32
 [wait(0x57f) = 10511]
pid 10511 stopped, [SIGTRAP]
     0.000533 [ffffe410] select(25, [4 5 19 21 22], [22], NULL, NULL
[wait(0x57f) = 10511]
pid 10511 stopped, [SIGTRAP]
) = 1 (out [22])
 [wait(0x57f) = 10511]
pid 10511 stopped, [SIGTRAP]
     0.000705 [ffffe410] rt_sigprocmask(SIG_BLOCK, [CHLD],
[wait(0x57f) = 10511]
pid 10511 stopped, [SIGTRAP]
[], 8) = 0
 [wait(0x57f) = 10511]
pid 10511 stopped, [SIGTRAP]
     0.000481 [ffffe410] rt_sigprocmask(SIG_SETMASK, [],  [wait(0x57f) = 10511]
pid 10511 stopped, [SIGTRAP]
NULL, 8) = 0
 [wait(0x57f) = 10511]
pid 10511 stopped, [SIGTRAP]
     0.000447 [ffffe410] write(22,
"\x51\x00\x00\x00\x11\x53\x65\x6c\x65\x63\x74\x20\x27\x4f"..., 18
[wait(0x57f) = 10511]
pid 10511 stopped, [SIGTRAP]
) = 18
 [wait(0x57f) = 10511]
pid 10511 stopped, [SIGTRAP]
     0.000876 [ffffe410] select(25, [4 5 19 21 22], [], NULL, NULL
[wait(0x57f) = 10511]
pid 10511 stopped, [SIGTRAP]
) = 1 (in [22])
 [wait(0x57f) = 10511]

Original client was interrupted by me before the backend has hung and
I've found neither its process (in task manager) nor corresponding
netstat output line.

By the way, I've also noticed that the SSH daemon's (which is in
FIN_WAIT2) timer is "off" and took a look at tcp_fin_timeout...

pgdb:~ # netstat -pnao |grep 37465
tcp        1 131072 127.0.0.1:5432          127.0.0.1:37465
CLOSE_WAIT  24855/postgres: kon unkn-4 (41.30/0/0)
tcp    73728      0 127.0.0.1:37465         127.0.0.1:5432
FIN_WAIT2   10511/sshd: dcsshcl off (0.00/0/0)
pgdb:~ # cat /proc/sys/net/ipv4/tcp_fin_timeout
60

...which is 60. I wonder are there another variables which affect on
FIN_WAIT2 timeout?

--
Regards,
Sergey Konoplev