Thread: BUG #6342: libpq blocks forever in "poll" function

BUG #6342: libpq blocks forever in "poll" function

From
andreagrassi@sogeasoft.com
Date:
The following bug has been logged on the website:

Bug reference:      6342
Logged by:          Andrea Grassi
Email address:      andreagrassi@sogeasoft.com
PostgreSQL version: 8.4.8
Operating system:   SUSE SLES 10 SP4 64 BIT
Description:=20=20=20=20=20=20=20=20

Hi,=20
I have a big and strange problem. Sometimes, libpq remains blocked in =E2=
=80=9Cpoll=E2=80=9D
function even if the server has already answered to the query. If I attach
to the process using kdbg I found this stack:

__kernel_vsyscall()
poll()                          from /lib/libc.so.6
pqSocketCheck()  from /home/pg/pgsql/lib-32/libpq.so.5
pqWaitTimed()      from /home/pg/pgsql/lib-32/libpq.so.5
pqWait()                  from /home/pg/pgsql/lib-32/libpq.so.5
PQgetResult()       from /home/pg/pgsql/lib-32/libpq.so.5
PQexecFinish()     from /home/pg/pgsql/lib-32/libpq.so.5
=E2=80=A6


To simplify the context and to reproduce the bug, I wrote a test program
(that I attach below) that uses only libpq interface (no other strange
libraries) to read my database at localhost.=20
It loop on a table of 64000 rows and for each row it reads another table.
Generally it take 1 minute to work. I put this program in a loop, so once it
finishes, it restarts.=20
Usually it works fine but sometimes (without any rule) it blocks. It blocks
always (with the stack above) executing PQexec function (=E2=80=9CCLOSE CUR=
SOR xx=E2=80=9D
or =E2=80=9CFETCH ALL IN xx=E2=80=9D).
If I press =E2=80=9Ccontinue=E2=80=9D on kdbg after attaching the process, =
the programs
continue its execution and exit with success.
Here the specifics of the platform (a SLES 10 SP4 64-bit WITHOUT any
VMWARE)

Server
HP DL 580 G7
4 CPU INTEL XEON X7550
64 GB RAM
8 HD 600GB SAS DP 6G 2,5=E2=80=9D RAID 1 e RAID5

S.O.=20
SUSE SLES 10 SP4 64 BIT

Kernel=20
Linux linuxspanesi 2.6.16.60-0.85.1-smp #1 SMP Thu Mar 17 11:45:06 UTC 2011
x86_64 x86_64 x86_64 GNU/Linux

Server Postgres=20
8.4.8 - 64-bit=20

Libpq
8.4.8 =E2=80=93 32-bit=20

I try to recompile libpq in=20
-    debug mode
-    on a 64-bit machine with =E2=80=93m32 option
-    on a 32-bit machine=20
-    setting HAVE_POLL to false at line 1053 in fe-misc.c to force to execute
the other branch of =E2=80=9C#ifdef/else=E2=80=9D using the function =E2=80=
=9Cselect()=E2=80=9D instead of
=E2=80=9Cpoll()=E2=80=9D
but none fixes the bug. I had the same stack as above, except for the last
case in which I had =E2=80=9C___newselect_nocancel()=E2=80=9D instead of =
=E2=80=9Cpoll()=E2=80=9D.

If I check the state of the connection using the =E2=80=9Cnetstat=E2=80=9D =
command I get
this output:

tcp         24      0    127.0.0.1:49007        127.0.0.1:5432=20=20=20=20=
=20=20=20
ESTABLISHED        17415/pq_example.e

where the second field (recv-Q) is always blocked to a non-zero value.
It seems as the server has already answered but the libpq or poll function
don=E2=80=99t realize it.=20
Consider that the machine is very good and very fast.
It seems that the answer of the server arrives before the libpq starts
waiting for it (calling poll). Could be ?=20

I try to install a VMware this the same version of Linux and same version of
the kernel on a machine much less powerful: my program works fine and never
blocks.

Here below the code of the example program:

/*
 * testlibpq.c
 *
 *      Test the C version of libpq, the PostgreSQL frontend library.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "libpq-fe.h"

static void
exit_nicely(PGconn *conn)
{
    PQfinish(conn);
    exit(1);
}

int
main(int argc, char **argv)
{
    const char *conninfo;
    PGconn     *conn;
    PGresult   *res;
    int         i,
                j;
    /*
     * If the user supplies a parameter on the command line, use it as the
     * conninfo string; otherwise default to setting dbname=3Dpostgres and
using
     * environment variables or defaults for all other connection
parameters.
     */

    /* Make a connection to the database */
#ifdef CASE1
       conn =3D PQsetdbLogin( getenv("SQLSERVER"),             // pghost
                            0,                               // pgport
                            0,                               // pgoptions
                            0,                               // pgtty
                            "OSA",                           // dbName
                            0,                               // login
                            0                                // pwd
                           );
#else
      conn =3D PQconnectdb("dbname =3D OSA");
#endif

    /* Check to see that the backend connection was successfully made */
    if (PQstatus(conn) !=3D CONNECTION_OK)
    {
        fprintf(stderr, "Connection to database failed: %s",
                PQerrorMessage(conn));
        exit_nicely(conn);
    }

    res =3D PQexec (conn, "SET datestyle=3D'ISO'");
    switch (PQresultStatus (res))
     {
      case PGRES_BAD_RESPONSE:
      case PGRES_NONFATAL_ERROR:
      case PGRES_FATAL_ERROR:
         fprintf(stderr, "SET DATESTYLE command failed: %s",
PQresultErrorMessage(res));
         break;
     }
    PQclear(res);


    /*
     * Our test case here involves using a cursor, for which we must be
inside
     * a transaction block.  We could do the whole thing with a single
     * PQexec() of "select * from pg_database", but that's too trivial to
make
     * a good example.
     */

    /* Start a transaction block */
    res =3D PQexec(conn, "BEGIN");
    if (PQresultStatus(res) !=3D PGRES_COMMAND_OK)
    {
        fprintf(stderr, "BEGIN command failed: %s", PQerrorMessage(conn));
        PQclear(res);
        exit_nicely(conn);
    }

    /*
     * Should PQclear PGresult whenever it is no longer needed to avoid
memory
     * leaks
     */
    PQclear(res);

    /*
     * Fetch rows from pg_database, the system catalog of databases
     */
    res =3D PQexec(conn, "DECLARE articoli CURSOR FOR select cdart from
base_a_artico ORDER BY cdart");
    if (PQresultStatus(res) !=3D PGRES_COMMAND_OK)
    {
        fprintf(stderr, "DECLARE CURSOR failed: %s", PQerrorMessage(conn));
        PQclear(res);
        exit_nicely(conn);
    }
    PQclear(res);

    res =3D PQexec(conn, "FETCH ALL in articoli");
    if (PQresultStatus(res) !=3D PGRES_TUPLES_OK)
    {
        fprintf(stderr, "FETCH ALL failed: %s", PQerrorMessage(conn));
        PQclear(res);
        exit_nicely(conn);
    }

    /* next, print out the rows */
    for (i =3D 0; i < PQntuples(res); i++)
    {
        read_rigpia(conn, PQgetvalue(res, i, 0));
    }

    PQclear(res);

    /* close the portal ... we don't bother to check for errors ... */
    res =3D PQexec(conn, "CLOSE articoli");
    PQclear(res);

    /* end the transaction */
    res =3D PQexec(conn, "END");
    PQclear(res);

    /* close the connection to the database and cleanup */
    PQfinish(conn);

    return 0;
}

int read_rigpia(PGconn* conn, char* cdart)
{
    PGresult   *res; char sql[1024]; int i;
    char* dtfab;
    char* sum;

    memset(sql,0,sizeof(sql));
    sprintf(sql,"DECLARE rigpia CURSOR FOR select dtfab,sum(qtfan-qtpro)
from adp_d_rigpia where flsta=3D'' and cdart=3D'%s' and qtfan>qtpro and cdd=
pu
not in ('04','05','06','07','08','09',
'91','92','93','94','95','96','97','98','A0','B8','C2','LF','SC') group by
dtfab", cdart);

    res =3D PQexec(conn, sql);=20
    if (PQresultStatus(res) !=3D PGRES_COMMAND_OK)
    {
        fprintf(stderr, "DECLARE CURSOR rigpia failed: %s *** %s",
PQerrorMessage(conn),sql);
        PQclear(res);
        return 0;=20
    }

    PQclear(res);
    res =3D PQexec(conn, "FETCH ALL in rigpia");
    if (PQresultStatus(res) !=3D PGRES_TUPLES_OK)
    {
        fprintf(stderr, "FETCH ALL failed in rigpia: %s",
PQerrorMessage(conn));
        PQclear(res);
        return 0;
    }

    /* next, print out the rows */
    for (i =3D 0; i < PQntuples(res); i++)
    {
        dtfab =3D PQgetvalue(res, i, 0);
        sum   =3D PQgetvalue(res, i, 1);
    }

    PQclear(res); res =3D PQexec(conn, "CLOSE rigpia"); PQclear(res);
}

Regards,=20
Andrea=20

Re: BUG #6342: libpq blocks forever in "poll" function

From
Craig Ringer
Date:
On 12/16/2011 03:45 PM, andreagrassi@sogeasoft.com wrote:
> The following bug has been logged on the website:
>
> Bug reference:      6342
> Logged by:          Andrea Grassi
> Email address:      andreagrassi@sogeasoft.com
> PostgreSQL version: 8.4.8
> Operating system:   SUSE SLES 10 SP4 64 BIT
> Description:
>
> Hi,
> I have a big and strange problem. Sometimes, libpq remains blocked in “poll”
> function even if the server has already answered to the query. If I attach
> to the process using kdbg I found this stack:
>
> __kernel_vsyscall()
> poll()                          from /lib/libc.so.6
Given all you described (thanks for all the detail!) then if client and
server are on different hosts this could easily be a NIC driver or HW
issue. It'd be interesting to see the output of:

ps -C testprogramname -o wchan:80=

--
Craig Ringer

Re: BUG #6342: libpq blocks forever in "poll" function

From
Craig Ringer
Date:
On 12/16/2011 03:45 PM, andreagrassi@sogeasoft.com wrote:
> The following bug has been logged on the website:
>
> Bug reference:      6342
> Logged by:          Andrea Grassi
> Email address:      andreagrassi@sogeasoft.com
> PostgreSQL version: 8.4.8
> Operating system:   SUSE SLES 10 SP4 64 BIT
> Description:
>
> Hi,
> I have a big and strange problem. Sometimes, libpq remains blocked in “poll”
> function even if the server has already answered to the query. If I attach
> to the process using kdbg I found this stack:
>
> __kernel_vsyscall()
> poll()                          from /lib/libc.so.6
>
Actually, even better, can you show the full kernel stack of your test
program when it's blocked? cat /proc/$pid/stack .

R: BUG #6342: libpq blocks forever in "poll" function

From
"Andrea Grassi"
Date:
The client program and the postgres server are on the same host, client con=
nects to 127.0.0.1.
In the meantime, my original program blocks (not my example but very probab=
ly the reasons are the same).=20

I typed "ps -C testprogramname -o wchan:80=3D" and the output was only a si=
ngle dash ( "-" ).
I searched for the complete stack in /proc/$pid/stack (where $pid) was the =
pid of my process but this file doesn't exists !! Why ?=20

I'm waiting for my example to block in order to send you what you request m=
e about the test program (even if the output almost surely will be the same=
).

Regards,=20
Andrea Grassi

-----Messaggio originale-----
Da: Craig Ringer [mailto:ringerc@ringerc.id.au]=20
Inviato: venerd=C3=AC 16 dicembre 2011 9.24
A: andreagrassi@sogeasoft.com
Cc: pgsql-bugs@postgresql.org
Oggetto: Re: [BUGS] BUG #6342: libpq blocks forever in "poll" function

On 12/16/2011 03:45 PM, andreagrassi@sogeasoft.com wrote:
> The following bug has been logged on the website:
>
> Bug reference:      6342
> Logged by:          Andrea Grassi
> Email address:      andreagrassi@sogeasoft.com
> PostgreSQL version: 8.4.8
> Operating system:   SUSE SLES 10 SP4 64 BIT
> Description:
>
> Hi,
> I have a big and strange problem. Sometimes, libpq remains blocked in =E2=
=80=9Cpoll=E2=80=9D
> function even if the server has already answered to the query. If I attach
> to the process using kdbg I found this stack:
>
> __kernel_vsyscall()
> poll()                          from /lib/libc.so.6
Given all you described (thanks for all the detail!) then if client and=20
server are on different hosts this could easily be a NIC driver or HW=20
issue. It'd be interesting to see the output of:

ps -C testprogramname -o wchan:80=3D

--
Craig Ringer

Re: R: BUG #6342: libpq blocks forever in "poll" function

From
Craig Ringer
Date:
On 16/12/2011 10:10 PM, Andrea Grassi wrote:
> The client program and the postgres server are on the same host, client connects to 127.0.0.1.
> In the meantime, my original program blocks (not my example but very probably the reasons are the same).
>
> I typed "ps -C testprogramname -o wchan:80=" and the output was only a single dash ( "-" ).
That means it's not waiting in a kernel call right now. Was the program
in the hung state you've observed at the time you ran the command? Its
output would only be interesting when it's hung.
> I searched for the complete stack in /proc/$pid/stack (where $pid) was the pid of my process but this file doesn't
exists!! Why ? 
Old kernel, maybe? You're running on some kind of enterprise-y distro,
so who knows how ancient half the stuff in there is.

--
Craig Ringer

R: R: BUG #6342: libpq blocks forever in "poll" function

From
"Andrea Grassi"
Date:
1. Yes, obviously I launched that command when the program was blocked.

2. The system release is "SUSE SLES 10 SP4 64 BIT" with kernel 2.6.16.60.=
=20
Maybe it's not the last release but it can justify the blocks ? In other sy=
stems (VMWARE) the same system work fine.
What you suspect ? libpq ? linux release ? hardware ?=20

If you need other informations let me know.=20=20

Regards, Andrea


-----Messaggio originale-----
Da: Craig Ringer [mailto:ringerc@ringerc.id.au]=20
Inviato: sabato 17 dicembre 2011 7.19
A: Andrea Grassi
Cc: pgsql-bugs@postgresql.org
Oggetto: Re: R: [BUGS] BUG #6342: libpq blocks forever in "poll" function

On 16/12/2011 10:10 PM, Andrea Grassi wrote:
> The client program and the postgres server are on the same host, client c=
onnects to 127.0.0.1.
> In the meantime, my original program blocks (not my example but very prob=
ably the reasons are the same).
>
> I typed "ps -C testprogramname -o wchan:80=3D" and the output was only a =
single dash ( "-" ).
That means it's not waiting in a kernel call right now. Was the program=20
in the hung state you've observed at the time you ran the command? Its=20
output would only be interesting when it's hung.
> I searched for the complete stack in /proc/$pid/stack (where $pid) was th=
e pid of my process but this file doesn't exists !! Why ?
Old kernel, maybe? You're running on some kind of enterprise-y distro,=20
so who knows how ancient half the stuff in there is.

--
Craig Ringer

R: R: BUG #6342: libpq blocks forever in "poll" function

From
"Andrea Grassi"
Date:
Hi, Craig
Now my process is blocked and I have the case in my hands.
Do you have something to ask me in order to have more details ?

Regards, Andrea

-----Messaggio originale-----
Da: Craig Ringer [mailto:ringerc@ringerc.id.au]=20
Inviato: sabato 17 dicembre 2011 7.19
A: Andrea Grassi
Cc: pgsql-bugs@postgresql.org
Oggetto: Re: R: [BUGS] BUG #6342: libpq blocks forever in "poll" function

On 16/12/2011 10:10 PM, Andrea Grassi wrote:
> The client program and the postgres server are on the same host, client c=
onnects to 127.0.0.1.
> In the meantime, my original program blocks (not my example but very prob=
ably the reasons are the same).
>
> I typed "ps -C testprogramname -o wchan:80=3D" and the output was only a =
single dash ( "-" ).
That means it's not waiting in a kernel call right now. Was the program=20
in the hung state you've observed at the time you ran the command? Its=20
output would only be interesting when it's hung.
> I searched for the complete stack in /proc/$pid/stack (where $pid) was th=
e pid of my process but this file doesn't exists !! Why ?
Old kernel, maybe? You're running on some kind of enterprise-y distro,=20
so who knows how ancient half the stuff in there is.

--
Craig Ringer

R: R: BUG #6342: libpq blocks forever in "poll" function

From
"Andrea Grassi"
Date:
Sorry if I insist, but now I have the case at hand (my test program is now =
blocked), so I can check and verify all what you want.
I would like to know if it can be a libpq bug or if you think the fault is =
due to a system bug or to a machine issue and in this case I would be grate=
ful if you could give me a hint on what could be.

Regards, Andrea

-----Messaggio originale-----
Da: Craig Ringer [mailto:ringerc@ringerc.id.au]=20
Inviato: sabato 17 dicembre 2011 7.19
A: Andrea Grassi
Cc: pgsql-bugs@postgresql.org
Oggetto: Re: R: [BUGS] BUG #6342: libpq blocks forever in "poll" function

On 16/12/2011 10:10 PM, Andrea Grassi wrote:
> The client program and the postgres server are on the same host, client c=
onnects to 127.0.0.1.
> In the meantime, my original program blocks (not my example but very prob=
ably the reasons are the same).
>
> I typed "ps -C testprogramname -o wchan:80=3D" and the output was only a =
single dash ( "-" ).
That means it's not waiting in a kernel call right now. Was the program=20
in the hung state you've observed at the time you ran the command? Its=20
output would only be interesting when it's hung.
> I searched for the complete stack in /proc/$pid/stack (where $pid) was th=
e pid of my process but this file doesn't exists !! Why ?
Old kernel, maybe? You're running on some kind of enterprise-y distro,=20
so who knows how ancient half the stuff in there is.

--
Craig Ringer

Re: R: R: BUG #6342: libpq blocks forever in "poll" function

From
Harry Rossignol
Date:
I also have a client suffering an occasional 'application hang' running
Suse 11.2 and postgressql 8.4 on an 8 core box which is not reproducable
in a VMWare test environment.
Access to postgres is libpq 127.0.0.1 as well. Unfortunately the client
must restart ASAP and I have not produced a 'test case'.


On 12/20/2011 1:01 AM, Andrea Grassi wrote:
> Sorry if I insist, but now I have the case at hand (my test program is now blocked), so I can check and verify all
whatyou want. 
> I would like to know if it can be a libpq bug or if you think the fault is due to a system bug or to a machine issue
andin this case I would be grateful if you could give me a hint on what could be. 
>
> Regards, Andrea
>
> -----Messaggio originale-----
> Da: Craig Ringer [mailto:ringerc@ringerc.id.au]
> Inviato: sabato 17 dicembre 2011 7.19
> A: Andrea Grassi
> Cc: pgsql-bugs@postgresql.org
> Oggetto: Re: R: [BUGS] BUG #6342: libpq blocks forever in "poll" function
>
> On 16/12/2011 10:10 PM, Andrea Grassi wrote:
>> The client program and the postgres server are on the same host, client connects to 127.0.0.1.
>> In the meantime, my original program blocks (not my example but very probably the reasons are the same).
>>
>> I typed "ps -C testprogramname -o wchan:80=" and the output was only a single dash ( "-" ).
> That means it's not waiting in a kernel call right now. Was the program
> in the hung state you've observed at the time you ran the command? Its
> output would only be interesting when it's hung.
>> I searched for the complete stack in /proc/$pid/stack (where $pid) was the pid of my process but this file doesn't
exists!! Why ? 
> Old kernel, maybe? You're running on some kind of enterprise-y distro,
> so who knows how ancient half the stuff in there is.
>
> --
> Craig Ringer
>
>

Re: R: R: BUG #6342: libpq blocks forever in "poll" function

From
Alvaro Herrera
Date:
Excerpts from Andrea Grassi's message of mar dic 20 06:01:55 -0300 2011:
> Sorry if I insist, but now I have the case at hand (my test program is no=
w blocked), so I can check and verify all what you want.
> I would like to know if it can be a libpq bug or if you think the fault i=
s due to a system bug or to a machine issue and in this case I would be gra=
teful if you could give me a hint on what could be.

Please attach GDB to the stuck process (gdb -p `pidof testprogram`) and
grab a backtrace (bt full).

--=20
=C3=81lvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

R: R: R: BUG #6342: libpq blocks forever in "poll" function

From
"Andrea Grassi"
Date:
This is the output of "bt full" command in gdb of my test program.=20
In this case the libpqs was not compiled in debug-mode.=20

(gdb) bt full
#0  0xffffe410 in __kernel_vsyscall ()
No symbol table info available.
#1  0xf76539cb in poll () from /lib/libc.so.6
No symbol table info available.
#2  0xf770d39a in pqSocketCheck () from /home/pg/pgsql/lib-32/libpq.so.5
No symbol table info available.
#3  0xf770d49d in pqWaitTimed () from /home/pg/pgsql/lib-32/libpq.so.5
No symbol table info available.
#4  0xf770d513 in pqWait () from /home/pg/pgsql/lib-32/libpq.so.5
No symbol table info available.
#5  0xf770c6d6 in PQgetResult () from /home/pg/pgsql/lib-32/libpq.so.5
No symbol table info available.
#6  0xf770c89c in PQexecFinish () from /home/pg/pgsql/lib-32/libpq.so.5
No symbol table info available.
#7  0x08048c3f in read_rigpia ()
No symbol table info available.
#8  0x08048ae9 in main ()

Here below I add the complete stack of my business application (blocked als=
o it) if it can be useful.
In this case, the libpq was compiled in debug-mode and we can see the value=
 parameter of function and the lines numbers of code.
Consider that the stack until PQexecFinish is the same as my test program.

(gdb) bt full
#0  0xffffe410 in __kernel_vsyscall ()
No symbol table info available.
#1  0xf6cdb9cb in poll () from /lib/libc.so.6
No symbol table info available.
#2  0xf766a39a in pqSocketPoll (conn=3D0x90e0838, forRead=3D1, forWrite=3D0=
, end_time=3D-1) at fe-misc.c:1082
No locals.
#3  pqSocketCheck (conn=3D0x90e0838, forRead=3D1, forWrite=3D0, end_time=3D=
-1)
    at fe-misc.c:1024
        result =3D -1
#4  0xf766a49d in pqWaitTimed (forRead=3D1, forWrite=3D0, conn=3D0x90e0838,=
=20
        finish_time=3D-1) at fe-misc.c:956
        result =3D <value optimized out>
#5  0xf766a513 in pqWait (forRead=3D1, forWrite=3D0, conn=3D0x90e0838)
    at fe-misc.c:939
No locals.
#6  0xf76696d6 in PQgetResult (conn=3D0x90e0838) at fe-exec.c:1554
    flushResult =3D 1
    res =3D 0x0
#7  0xf766989c in PQexecFinish (conn=3D0x90e0838) at fe-exec.c:1807
    result =3D 0x23
    lastResult =3D 0x0
#8  0xf767c3ec in pos_fetch (cur_dta=3D0x9485c80) at possup.c:930
   cmd =3D "FETCH 100 IN cur038_00063", '\000' <repeats 23 times>, "=C3=9Eh=
\031\b\230\021&\bl_R\t(\t=C3=9C=C3=BFm\216\027\b=C3=B8=C2=B8\016\t", '\000'=
 <repeats 12 times>, "=C3=B8=C2=B8\016\t\000\000\000\000\020\000\000\000\23=
0\021&\bl_R\t=C3=90=C2=B8\016\tX\t=C3=9C=C3=BF/\221\027\bl_R\t=C3=90=C2=B8\=
016\t\001\000\000"
   res =3D 0x0
   bind =3D 0x0
   buf =3D 0x0
   colinfo =3D 0x0
   colnum =3D 136712600
   len =3D 156393324
   type =3D 1
   row =3D -2356856
   null =3D 135921165
#9  0xf767b147 in dm_possup (request=3D35) at possup.c:216
   retcode =3D 135268645
   l =3D 156393324
   eliminata =3D 0
#10 0x081076f3 in dm_call_fnc ()
   No symbol table info available.
#11 0x080fda3d in dm_do_a_fetch ()
   No symbol table info available.
#12 0x080fd913 in dm_fetch ()
   No symbol table info available.
#13 0x08102974 in dm_execute ()
   No symbol table info available.
#14 0x080f96de in execute_cursor ()
   No symbol table info available.
#15 0x080f8556 in dm_do_dbms ()
   No symbol table info available.
#16 0x080ff22f in dm_call ()
   No symbol table info available.
#17 0x080f7edd in dm_dbms ()
   No symbol table info available.
#18 0xf76a655e in dm_dbms_drv (
    command=3D0xffdc0fa0 "with cursor cur038_00063 execute ")
    at r_sqlutifunc.c:1090
    No locals.
#19 0xf76ba4f6 in fetchCursorDb (curName=3D0xffdc1050 "cur038_00063")
    at sqlPanth.c:895
    buffer =3D "with cursor cur038_00063 execute ", '\000' <repeats 35 time=
s>, "=C3=A0*=C3=96=C3=B6\000\000\000\000\000\000\000\000=C3=B4/=C3=96=C3=B6=
=C3=87=C2=A5l=C3=B7X=C2=A2=C3=9C=C3=BF\030\020=C3=9C=C3=BF=C2=BE\220=C3=86=
=C3=B6P\020=C3=9C=C3=BF\r=C2=B6l=C3=B7(\020=C3=9C=C3=BF(\020=C3=9C=C3=BF\00=
4N<\b=C3=98\020=C3=9C=C3=BF+ k=C3=B7"
    app =3D 0x0
    retcode =3D 0
    command =3D 0xf76cb6e4 "execute"
    using_app =3D '\000' <repeats 1023 times>
#20 0xf76ba03d in fetchCursor (f0_file=3D38, curName=3D0xf76cbb40 "")
    at sqlPanth.c:759
    cursor =3D "cur038_00063", '\000' <repeats 115 times>
    app =3D 0x0
    retcode =3D 0
#21 0xf76bc020 in sqlRead (f0_file=3D38, w_dat=3D0xffdcc23c "", mode=3D7)
    at sqlRead.c:109
    msg =3D "\000\000\000\000\000\000\000\000\\^=C3=8D=C3=B6\000\000\000\00=
0\035\000\000\000\000\000\000\b\000\000\000\000^\002=C3=9D=C3=BF\000\000\00=
0\000ph\021\th\221=C3=9C=C3=BF\022=C3=ADj=C3=B7\224.o=C3=B7=C3=A0*=C3=96=C3=
=B6=C3=BF=C3=BF=C3=BF=C3=BF=C2=AB=C2=B7\005\b"
    s_where =3D "     \"cdart\" =3D '50110725'  ", '\000' <repeats 32740 ti=
mes>
    tslock =3D 0
    tpOrd =3D 68
    id_rec =3D 0
    Failed =3D 0
    failed_lock =3D 0
    old_w_dat =3D 0x0
    init_col =3D 0
    ret =3D 0
#22 0xf7699159 in ISREAD (f0_file=3D38, w_dat=3D0xffdcc23c "", mode=3D7)
    at r_dbswsql7.c:75
    ret =3D -2321976
    environ_save =3D '\000' <repeats 31 times>
#23 0x0807f6a4 in cal_prodat ()
    No symbol table info available.
#24 0x08057650 in read_mrp () at /home/uwrk/pgsai/WRKUNX/g_mrprun.c:465
    i =3D 0
    idx =3D 0
    dub =3D 4.8873862481069038e-313
    dub1 =3D -1.209991882770505e+266
    RFPO =3D {id =3D -153734240, cdart =3D "\027\000\000\000=C2=A8=C3=88=C3=
=9C=C3=BF=C3=A9=C3=AF=C3=84=C3=B6=C3=A0=C3=88=C3=9C=C3=BF\000",=20
    descr =3D "\000\000\n\000\000\000\000\000\000\000H=C3=89=C3=9C=C3=BF$\\=
\a\b=C3=A0=C3=88=C3=9C=C3=BF\004:9\b\031\000\000", bkini =3D 0, bkfin =3D 0=
, stato =3D 85 'U',=20
    cdpeg =3D '\000' <repeats 17 times>,=20
    grpeg =3D "\000\000\000\000\000\000\000\000\000\065\000\000\000\000\000=
\000\000\000\000", tscon =3D 0 '\000', fillc =3D "\000\000\000", qtfan =3D =
0,=20
    qtpro =3D 0, bkpeg =3D 0, lnuti =3D 0, anpia =3D "\000",=20
    nupia =3D "\000\000\000\000\000", nurig =3D "\000\000\000\000\000",=20
    nupid =3D "\000\000\000\000",=20
    fill =3D '\000' <repeats 45 times>, "\b=C3=8A=C3=9C=C3=BF=C3=8F=C2=A3\0=
05\bU\000\000\000=C2=B0\000\071\br=C3=89=C3=9C=C3=BFph\021\t\034=C3=9D2\b\0=
00\000\000\000=C2=B8=C3=89=C3=9C=C3=BF\215=C3=97,=C3=B7\220=C3=89PEZZI", '\=
000' <repeats 16 times>}
#25 0x08056f93 in cal_net () at /home/uwrk/pgsai/WRKUNX/g_mrprun.c:340
    lotlt =3D 0
    qta_app =3D 2.1220281700514382e-314
    qta_incr =3D 8.2383361536408976e-315
    n_volte =3D 121
    it =3D 136
    tsmax =3D 0
#26 0x08056f3c in mrp_run () at /home/uwrk/pgsai/WRKUNX/g_mrprun.c:317
    counter =3D 23036
#27 0x08056cf9 in conf_dat () at /home/uwrk/pgsai/WRKUNX/g_mrprun.c:250
    No locals.
#28 0x08056d56 in r_keybot (bm=3D36) at /home/uwrk/pgsai/WRKUNX/g_mrprun.c:=
265
    No locals.
#29 0x08056ccb in s_entry (nsc=3D0x9116870 "g_mrprun.jam", con=3D128)
    at /home/uwrk/pgsai/WRKUNX/g_mrprun.c:240
    result =3D 137551132
    stat_master =3D {st_dev =3D 651884900619439004, __pad1 =3D 62940,=20
    st_ino =3D 1, st_mode =3D 4141232116, st_nlink =3D 24, st_uid =3D 1,=20
    st_gid =3D 4292660168, st_rdev =3D 583229779520846990, __pad2 =3D 0,=20
    st_size =3D -2307184, st_blksize =3D 136712600, st_blocks =3D 152135792=
,=20
    st_atim =3D {tv_sec =3D 151778824, tv_nsec =3D -2307160}, st_mtim =3D {
    tv_sec =3D 134928144, tv_nsec =3D 136512150}, st_ctim =3D {
    tv_sec =3D 135726702, tv_nsec =3D 137550944}, __unused4 =3D 136712600,=
=20
    __unused5 =3D 152135792}
    stat_artico =3D {st_dev =3D 0, __pad1 =3D 4272, st_ino =3D 152140972,=
=20
    st_mode =3D 1, st_nlink =3D 151778780, st_uid =3D 136712600,=20
    st_gid =3D 4141232116, st_rdev =3D 128984812658, __pad2 =3D 12,
    st_size =3D 0, st_blksize =3D -153735180, st_blocks =3D -2307224,=20
    st_atim =3D {tv_sec =3D -154863084, tv_nsec =3D 151778768}, st_mtim =3D=
 {
    tv_sec =3D -2307184, tv_nsec =3D 12}, st_ctim =3D {tv_sec =3D 136250915=
,=20
    tv_nsec =3D 40}, __unused4 =3D 1, __unused5 =3D 151778768}
    stat_disbas =3D {st_dev =3D 135667194, __pad1 =3D 51960,=20
    st_ino =3D 4292659912, st_mode =3D 136712600, st_nlink =3D 0, st_uid =
=3D 3,=20
    st_gid =3D 4292659928, st_rdev =3D 5514874160500, __pad2 =3D 48,=20
    st_size =3D -153730728, st_blksize =3D 136152200, st_blocks =3D 1367126=
00,=20
    st_atim =3D {tv_sec =3D 152153576, tv_nsec =3D -2307304}, st_mtim =3D {
    tv_sec =3D 136147953, tv_nsec =3D 151914368}, st_ctim =3D {
    tv_sec =3D 151916720, tv_nsec =3D 152151728}, __unused4 =3D 4140104046,=
=20
    __unused5 =3D 0}
#30 0x08170570 in sm_call_fnc ()
    No symbol table info available.
#31 0x081a7ddf in sm_fmfunc ()
    No symbol table info available.
#32 0x081a7a20 in sm_frmtail ()
    No symbol table info available.
#33 0x081a7772 in sm_frmfile ()
    No symbol table info available.
#34 0x081a19ca in sm_r_window ()
    No symbol table info available.
#35 0x081a18c8 in sm_r_there ()
    No symbol table info available.
#36 0x08190cd4 in sm_jform ()
    No symbol table info available.
#37 0x08190135 in sm_jexec_top ()
    No symbol table info available.
#38 0x08190015 in sm_jtop ()
    No symbol table info available.
#39 0x0807d11b in start_up ()
    No symbol table info available.
#40 0x08056a71 in common_main (argc=3D2, argv=3D0xffdcd444)
    at /home/pgsai/include/p_mainfi.h:11
    No locals.
#41 0x08056998 in main (argc=3D2, argv=3D0xffdcd444)
    at /home/pgsai/include/p_mainnm.h:17
    No locals.

I am available for further requests.=20

Regards, Andrea=20


-----Messaggio originale-----
Da: Alvaro Herrera [mailto:alvherre@commandprompt.com]=20
Inviato: marted=C3=AC 20 dicembre 2011 16.43
A: Andrea Grassi
Cc: 'Craig Ringer'; Tom Lane; Pg Bugs
Oggetto: Re: R: R: [BUGS] BUG #6342: libpq blocks forever in "poll" function


Excerpts from Andrea Grassi's message of mar dic 20 06:01:55 -0300 2011:
> Sorry if I insist, but now I have the case at hand (my test program is no=
w blocked), so I can check and verify all what you want.
> I would like to know if it can be a libpq bug or if you think the fault i=
s due to a system bug or to a machine issue and in this case I would be gra=
teful if you could give me a hint on what could be.

Please attach GDB to the stuck process (gdb -p `pidof testprogram`) and
grab a backtrace (bt full).

--=20
=C3=81lvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

R: R: R: BUG #6342: libpq blocks forever in "poll" function

From
"Andrea Grassi"
Date:
You wrote:

> I also have a client suffering an occasional 'application hang' running S=
use 11.2 and postgressql 8.4=20
> on an 8 core box which is not reproducable in a VMWare test environment.=
=20
> Access to postgres is libpq 127.0.0.1 as well.=20
> Unfortunately the client must restart ASAP and I have not produced a 'tes=
t case'.

But you examined the stack ? It's similar to mine ?

#0  0xffffe410 in __kernel_vsyscall ()
No symbol table info available.
#1  0xf76539cb in poll () from /lib/libc.so..
#2  0xf770d39a in pqSocketCheck () from /home/pg/pgsql/lib-32/libpq.so.5=20
#3  0xf770d49d in pqWaitTimed () from /home/pg/pgsql/lib-32/libpq.so.5=20
#4  0xf770d513 in pqWait () from /home/pg/pgsql/lib-32/libpq.so.5=20
#5  0xf770c6d6 in PQgetResult () from /home/pg/pgsql/lib-32/libpq.so.5=20
#6  0xf770c89c in PQexecFinish () from /home/pg/pgsql/lib-32/libpq.so.5=20

Can you specify the details of hardware and platform of your machine to und=
erstand if it can have something in common with the mine and so to understa=
nd the reason/origin of the bug?=20
Thanks.

Andrea

Re: R: R: R: BUG #6342: libpq blocks forever in "poll" function

From
Tom Lane
Date:
"Andrea Grassi" <andreagrassi@sogeasoft.com> writes:
> #0  0xffffe410 in __kernel_vsyscall ()
> No symbol table info available.
> #1  0xf76539cb in poll () from /lib/libc.so..
> #2  0xf770d39a in pqSocketCheck () from /home/pg/pgsql/lib-32/libpq.so.5
> #3  0xf770d49d in pqWaitTimed () from /home/pg/pgsql/lib-32/libpq.so.5
> #4  0xf770d513 in pqWait () from /home/pg/pgsql/lib-32/libpq.so.5
> #5  0xf770c6d6 in PQgetResult () from /home/pg/pgsql/lib-32/libpq.so.5
> #6  0xf770c89c in PQexecFinish () from /home/pg/pgsql/lib-32/libpq.so.5

What about a stack trace from the connected server process?  libpq
clearly thinks it's waiting for a message from the server, but I wonder
what the server thinks.  Also, what connection status does netstat
show on each side?

            regards, tom lane

R: R: R: R: BUG #6342: libpq blocks forever in "poll" function

From
"Andrea Grassi"
Date:
This is the server side stack kept by gdb:

(gdb) bt full
#0  0x00002b6488588ae5 in recv () from /lib64/libc.so.6
No symbol table info available.
#1  0x0000000000550cd2 in secure_read ()
No symbol table info available.
#2  0x00000000005563a4 in pq_recvbuf ()
No symbol table info available.
#3  0x00000000005567a7 in pq_getbyte ()
No symbol table info available.
#4  0x00000000005d33e6 in PostgresMain ()
No symbol table info available.
#5  0x00000000005a9708 in ServerLoop ()
No symbol table info available.
#6  0x00000000005aa2b7 in PostmasterMain ()
No symbol table info available.
#7  0x00000000005580be in main ()
No symbol table info available.

The netstat command on client and server connection has this output:
The first line should be the server, the second the client.

Proto Recv-Q Send-Q Local Address           Foreign Address         State
PID/Program name
tcp        0      0 127.0.0.1:5432          127.0.0.1:53129
ESTABLISHED -
tcp       48      0 127.0.0.1:53129         127.0.0.1:5432
ESTABLISHED 29802/g_mrprun.e


Regards, Andrea

-----Messaggio originale-----
Da: Tom Lane [mailto:tgl@sss.pgh.pa.us]=20
Inviato: marted=EC 20 dicembre 2011 17.38
A: Andrea Grassi
Cc: harrywr2@comcast.net; 'Craig Ringer'; 'Pg Bugs'; 'Alvaro Herrera'
Oggetto: Re: R: R: R: [BUGS] BUG #6342: libpq blocks forever in "poll"
function=20

"Andrea Grassi" <andreagrassi@sogeasoft.com> writes:
> #0  0xffffe410 in __kernel_vsyscall ()
> No symbol table info available.
> #1  0xf76539cb in poll () from /lib/libc.so..
> #2  0xf770d39a in pqSocketCheck () from /home/pg/pgsql/lib-32/libpq.so.5=
=20
> #3  0xf770d49d in pqWaitTimed () from /home/pg/pgsql/lib-32/libpq.so.5=20
> #4  0xf770d513 in pqWait () from /home/pg/pgsql/lib-32/libpq.so.5=20
> #5  0xf770c6d6 in PQgetResult () from /home/pg/pgsql/lib-32/libpq.so.5=20
> #6  0xf770c89c in PQexecFinish () from /home/pg/pgsql/lib-32/libpq.so.5=
=20

What about a stack trace from the connected server process?  libpq
clearly thinks it's waiting for a message from the server, but I wonder
what the server thinks.  Also, what connection status does netstat
show on each side?

            regards, tom lane

Re: R: R: R: R: BUG #6342: libpq blocks forever in "poll" function

From
Tom Lane
Date:
"Andrea Grassi" <andreagrassi@sogeasoft.com> writes:
> This is the server side stack kept by gdb:
> [ server is waiting to receive something from client ]

> The netstat command on client and server connection has this output:
> The first line should be the server, the second the client.

> Proto Recv-Q Send-Q Local Address           Foreign Address         State
> PID/Program name
> tcp        0      0 127.0.0.1:5432          127.0.0.1:53129
> ESTABLISHED -
> tcp       48      0 127.0.0.1:53129         127.0.0.1:5432
> ESTABLISHED 29802/g_mrprun.e

Hrm.  What's with the 48 bytes in the client's receive queue?  Surely
the kernel should be reporting that the socket is read-ready, if it's
got some data.  I think you've found an obscure kernel bug ---- somehow
it's failing to wake the poll() caller.

            regards, tom lane

Re: R: R: R: R: BUG #6342: libpq blocks forever in "poll" function

From
Craig Ringer
Date:
On 21/12/2011 1:42 AM, Tom Lane wrote:
> Hrm.  What's with the 48 bytes in the client's receive queue?  Surely
> the kernel should be reporting that the socket is read-ready, if it's
> got some data.  I think you've found an obscure kernel bug ---- somehow
> it's failing to wake the poll() caller.
>
I've been leaning that way too; that's why I was asking him for
/proc/$pid/stack and `wchan -C programname -o wchan:80=` output - to get
some idea of what function in the kernel it's sitting in.

Unfortunately the OP is on some enterprise distro that doesn't have
/proc/$pid/stack . wchan info would still be useful. I wonder how old
their kernel is? The bug could've already been fixed. /proc/pid/stack
has been around since 2008 so it must be pretty elderly.

OP: You can also get a kernel stack for a process by enabling the magic
SysRQ key (see Google) then using Alt-SysRq-T . This requires a physical
keyboard directly connected to the server. It emits the stack
information via dmesg. See:

http://en.wikipedia.org/wiki/Magic_SysRq_key

There's a "sysrqd" that apparently lets you use these features remotely,
but I've never tried it.

--
Craig Ringer

Re: R: R: BUG #6342: libpq blocks forever in "poll" function

From
Craig Ringer
Date:
On 19/12/2011 11:14 PM, Andrea Grassi wrote:
> Hi, Craig
> Now my process is blocked and I have the case in my hands.
> Do you have something to ask me in order to have more details ?
>
As I tend to agree with Tom re this being a kernel issue, try (as root):

# Enable stack dumps etc via sysrq
echo 8 > /proc/sys/kernel/sysrq
# Trigger kernel stack dump of all processes via sysrq mechanism
echo t > /proc/sysrq-trigger

... then search the kernel log files to find the kernel stack dump
associated with your test program.

If you're not on the latest kernel for your OS, you should update it.

--
Craig Ringer

R: R: R: R: R: BUG #6342: libpq blocks forever in "poll" function

From
"Andrea Grassi"
Date:
Then I meet my colleague who is the systems engineer that takes care of the
machine and I explain your hints (suggested by Craig Ringer) about how
detect and log kernel issues.
If it can be useful, the content of file /proc/$pid/wchan in the moment of
block is "_stext".

In the meantime, to be sure that it could not been a libpq bug, I ask you
one thing.
In internet I searched for detailed specifications of poll/select system
functions but I didn't understand one thing, that is which one of the 2
statement is true:
1) poll/select wait only for FUTURE modifications of ready-read state of
sockets
2) poll/select check if there is something to read at the moment of the call
and otherwise wait for FUTURE modifications of ready-read state
=20
Because if it was true the first statement, it could be that the answer of
the server arrives between the request and the call of poll (this time is
surely very short but however strictly greater than 0 and in this interval
the server answer could arrive).=20
Theoretical sequence:=20
1) Client request to server=20
2) Server answer to client
3) client wait calling poll
In this case client and server go in a sort of deadlock because server and
client wait each other for the other and could be a libpq bug.

What do you think about ? This scenario could be possible or the true
statement is the second ?

Regard, Andrea



-----Messaggio originale-----
Da: Craig Ringer [mailto:ringerc@ringerc.id.au]=20
Inviato: mercoled=EC 21 dicembre 2011 0.56
A: Tom Lane
Cc: Andrea Grassi; harrywr2@comcast.net; 'Pg Bugs'; 'Alvaro Herrera'
Oggetto: Re: R: R: R: R: [BUGS] BUG #6342: libpq blocks forever in "poll"
function

On 21/12/2011 1:42 AM, Tom Lane wrote:
> Hrm.  What's with the 48 bytes in the client's receive queue?  Surely
> the kernel should be reporting that the socket is read-ready, if it's
> got some data.  I think you've found an obscure kernel bug ---- somehow
> it's failing to wake the poll() caller.
>
I've been leaning that way too; that's why I was asking him for=20
/proc/$pid/stack and `wchan -C programname -o wchan:80=3D` output - to get=
=20
some idea of what function in the kernel it's sitting in.

Unfortunately the OP is on some enterprise distro that doesn't have=20
/proc/$pid/stack . wchan info would still be useful. I wonder how old=20
their kernel is? The bug could've already been fixed. /proc/pid/stack=20
has been around since 2008 so it must be pretty elderly.

OP: You can also get a kernel stack for a process by enabling the magic=20
SysRQ key (see Google) then using Alt-SysRq-T . This requires a physical=20
keyboard directly connected to the server. It emits the stack=20
information via dmesg. See:

http://en.wikipedia.org/wiki/Magic_SysRq_key

There's a "sysrqd" that apparently lets you use these features remotely,=20
but I've never tried it.

--
Craig Ringer

Re: R: R: R: R: R: BUG #6342: libpq blocks forever in "poll" function

From
Tom Lane
Date:
"Andrea Grassi" <andreagrassi@sogeasoft.com> writes:
> In internet I searched for detailed specifications of poll/select system
> functions but I didn't understand one thing, that is which one of the 2
> statement is true:
> 1) poll/select wait only for FUTURE modifications of ready-read state of
> sockets
> 2) poll/select check if there is something to read at the moment of the call
> and otherwise wait for FUTURE modifications of ready-read state

#1 is nonsense.  If poll worked like that, it would be impossible for
anyone to use it without suffering from race conditions.  But if you
don't see where exactly the poll() specification says so, I observe
that it says first that poll sets the bit(s) if the requested condition
is true, and second that *if* none of the events have occurred yet,
poll should wait.  See for instance
http://pubs.opengroup.org/onlinepubs/007908799/xsh/poll.html

            regards, tom lane

R: R: R: R: R: R: BUG #6342: libpq blocks forever in "poll" function

From
"Andrea Grassi"
Date:
To try to solve this problem my colleague (the systems engineer that takes
care of the machine) did:
1. Disable drivers HP : NOT SOLVED
2. Disable Hyperthreading: NOT SOLVED
3. Reduced the phisical CPU (enabled on boot) from 32  to 16: THIS SOLVED
THE PROBLEM.

Now 2 weeks passed without blocking and the problem seems temporary solved.
We have made an accurate test on the hardware but it's seems to be ok.

It's seems to be a kernel bug, so I posted the problem to Novell support.
Thanks for the help. Regards=20

Andrea Grassi
=20
-----Messaggio originale-----
Da: Tom Lane [mailto:tgl@sss.pgh.pa.us]=20
Inviato: mercoled=EC 21 dicembre 2011 17.01
A: Andrea Grassi
Cc: 'Craig Ringer'; harrywr2@comcast.net; 'Pg Bugs'; 'Alvaro Herrera'
Oggetto: Re: R: R: R: R: R: [BUGS] BUG #6342: libpq blocks forever in "poll"
function=20

"Andrea Grassi" <andreagrassi@sogeasoft.com> writes:
> In internet I searched for detailed specifications of poll/select system
> functions but I didn't understand one thing, that is which one of the 2
> statement is true:
> 1) poll/select wait only for FUTURE modifications of ready-read state of
> sockets
> 2) poll/select check if there is something to read at the moment of the
call
> and otherwise wait for FUTURE modifications of ready-read state

#1 is nonsense.  If poll worked like that, it would be impossible for
anyone to use it without suffering from race conditions.  But if you
don't see where exactly the poll() specification says so, I observe
that it says first that poll sets the bit(s) if the requested condition
is true, and second that *if* none of the events have occurred yet,
poll should wait.  See for instance
http://pubs.opengroup.org/onlinepubs/007908799/xsh/poll.html

            regards, tom lane