Thread: BUG #5804: Connection aborted after many queries.

BUG #5804: Connection aborted after many queries.

From

"Paul J. Davis"

Date:

29 December 2010, 11:30:13

The following bug has been logged online:

Bug reference:      5804
Logged by:          Paul J. Davis
Email address:      paul.joseph.davis@gmail.com
PostgreSQL version: 9.0.2
Operating system:   OS X 10.6.5, Ubuntu 10.04
Description:        Connection aborted after many queries.
Details:

After running many queries (millions) a connection will report an error that
the server has unexpectedly closed the connection. I first noticed this
through psycopg2, but I've been able to reproduce it with a small C program
using only libpq which I've included below. I compiled this against a libpq
built by Homebrew (after upgrading the formula to use a 9.0.2 tarball) on OS
X 10.6.5. The server was installed from 9.0.2 package available from
https://launchpad.net/~pitti/+archive/postgresql

My next step is to try building libpq with --enable-cassert to see if that
triggers anything client side. Let me know if there's something else I
should be doing to debug this.

This test has been bailing between 2.6 and 2.7M queries:


#include <stdio.h>
#include <stdlib.h>
#include "libpq-fe.h"

static void
fail(PGconn* conn, PGresult* res)
{
    if(res != NULL) PQclear(res);
    PQfinish(conn);
    exit(1);
}

static void
check(PGconn* conn, PGresult* res, const char* fmt)
{
    ExecStatusType status = PQresultStatus(res);

    if(status != PGRES_COMMAND_OK && status != PGRES_TUPLES_OK)
    {
        fprintf(stderr, fmt, PQerrorMessage(conn));
        fail(conn, res);
    }
}

void
run_query(PGconn* conn, PGresult* res)
{
    int nFields, i, j;

    res = PQexec(conn, "DECLARE myportal CURSOR FOR select 1");
    check(conn, res, "DECLARE CURSOR failed: %s");
    PQclear(res);

    res = PQexec(conn, "FETCH ALL in myportal");
    check(conn, res, "FETCH ALL failed: %s");

    nFields = PQnfields(res);
    for(i = 0; i < PQntuples(res); i++)
    {
        for(j = 0; j < nFields; j++)
        {
            PQgetvalue(res, i, j);
        }
    }

    PQclear(res);

    res = PQexec(conn, "CLOSE myportal");
    check(conn, res, "CLOSE failed: %s");
    PQclear(res);
}

int
main(int argc, char **argv)
{
    PGconn* conn;
    PGresult* res;
    int i;

    if(argc != 2)
    {
        fprintf(stderr, "usage: %s DSN\n", argv[0]);
        exit(1);
    }

    conn = PQconnectdb(argv[1]);

    if(PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "Connection failed: %s", PQerrorMessage(conn));
        fail(conn, NULL);
    }

    res = PQexec(conn, "BEGIN");
    check(conn, res, "BEGIN failed: %s");
    PQclear(res);

    for(i = 0; i < 10000000; i++)
    {
        if((i+1) % 100000 == 0)
        {
            fprintf(stderr, "I: %d\n", i);
        }
        run_query(conn, res);
    }

    res = PQexec(conn, "END");
    check(conn, res, "END failed: %s");
    PQclear(res);

    PQfinish(conn);

    return 0;
}

Re: BUG #5804: Connection aborted after many queries.

From

Paul Davis

Date:

29 December 2010, 11:55:36

On Wed, Dec 29, 2010 at 10:30 AM, Paul J. Davis
<paul.joseph.davis@gmail.com> wrote:
>
> The following bug has been logged online:
>
> Bug reference: =A0 =A0 =A05804
> Logged by: =A0 =A0 =A0 =A0 =A0Paul J. Davis
> Email address: =A0 =A0 =A0paul.joseph.davis@gmail.com
> PostgreSQL version: 9.0.2
> Operating system: =A0 OS X 10.6.5, Ubuntu 10.04
> Description: =A0 =A0 =A0 =A0Connection aborted after many queries.
> Details:
>
> After running many queries (millions) a connection will report an error t=
hat
> the server has unexpectedly closed the connection. I first noticed this
> through psycopg2, but I've been able to reproduce it with a small C progr=
am
> using only libpq which I've included below. I compiled this against a lib=
pq
> built by Homebrew (after upgrading the formula to use a 9.0.2 tarball) on=
 OS
> X 10.6.5. The server was installed from 9.0.2 package available from
> https://launchpad.net/~pitti/+archive/postgresql
>
> My next step is to try building libpq with --enable-cassert to see if that
> triggers anything client side. Let me know if there's something else I
> should be doing to debug this.
>
> This test has been bailing between 2.6 and 2.7M queries:
>
>
> #include <stdio.h>
> #include <stdlib.h>
> #include "libpq-fe.h"
>
> static void
> fail(PGconn* conn, PGresult* res)
> {
> =A0 =A0if(res !=3D NULL) PQclear(res);
> =A0 =A0PQfinish(conn);
> =A0 =A0exit(1);
> }
>
> static void
> check(PGconn* conn, PGresult* res, const char* fmt)
> {
> =A0 =A0ExecStatusType status =3D PQresultStatus(res);
>
> =A0 =A0if(status !=3D PGRES_COMMAND_OK && status !=3D PGRES_TUPLES_OK)
> =A0 =A0{
> =A0 =A0 =A0 =A0fprintf(stderr, fmt, PQerrorMessage(conn));
> =A0 =A0 =A0 =A0fail(conn, res);
> =A0 =A0}
> }
>
> void
> run_query(PGconn* conn, PGresult* res)
> {
> =A0 =A0int nFields, i, j;
>
> =A0 =A0res =3D PQexec(conn, "DECLARE myportal CURSOR FOR select 1");
> =A0 =A0check(conn, res, "DECLARE CURSOR failed: %s");
> =A0 =A0PQclear(res);
>
> =A0 =A0res =3D PQexec(conn, "FETCH ALL in myportal");
> =A0 =A0check(conn, res, "FETCH ALL failed: %s");
>
> =A0 =A0nFields =3D PQnfields(res);
> =A0 =A0for(i =3D 0; i < PQntuples(res); i++)
> =A0 =A0{
> =A0 =A0 =A0 =A0for(j =3D 0; j < nFields; j++)
> =A0 =A0 =A0 =A0{
> =A0 =A0 =A0 =A0 =A0 =A0PQgetvalue(res, i, j);
> =A0 =A0 =A0 =A0}
> =A0 =A0}
>
> =A0 =A0PQclear(res);
>
> =A0 =A0res =3D PQexec(conn, "CLOSE myportal");
> =A0 =A0check(conn, res, "CLOSE failed: %s");
> =A0 =A0PQclear(res);
> }
>
> int
> main(int argc, char **argv)
> {
> =A0 =A0PGconn* conn;
> =A0 =A0PGresult* res;
> =A0 =A0int i;
>
> =A0 =A0if(argc !=3D 2)
> =A0 =A0{
> =A0 =A0 =A0 =A0fprintf(stderr, "usage: %s DSN\n", argv[0]);
> =A0 =A0 =A0 =A0exit(1);
> =A0 =A0}
>
> =A0 =A0conn =3D PQconnectdb(argv[1]);
>
> =A0 =A0if(PQstatus(conn) !=3D CONNECTION_OK)
> =A0 =A0{
> =A0 =A0 =A0 =A0fprintf(stderr, "Connection failed: %s", PQerrorMessage(co=
nn));
> =A0 =A0 =A0 =A0fail(conn, NULL);
> =A0 =A0}
>
> =A0 =A0res =3D PQexec(conn, "BEGIN");
> =A0 =A0check(conn, res, "BEGIN failed: %s");
> =A0 =A0PQclear(res);
>
> =A0 =A0for(i =3D 0; i < 10000000; i++)
> =A0 =A0{
> =A0 =A0 =A0 =A0if((i+1) % 100000 =3D=3D 0)
> =A0 =A0 =A0 =A0{
> =A0 =A0 =A0 =A0 =A0 =A0fprintf(stderr, "I: %d\n", i);
> =A0 =A0 =A0 =A0}
> =A0 =A0 =A0 =A0run_query(conn, res);
> =A0 =A0}
>
> =A0 =A0res =3D PQexec(conn, "END");
> =A0 =A0check(conn, res, "END failed: %s");
> =A0 =A0PQclear(res);
>
> =A0 =A0PQfinish(conn);
>
> =A0 =A0return 0;
> }
>
> --
> Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-bugs
>

I should've mentioned the various version combinations I tried this with.

Originally the client was 8.2 ish on OS X 10.5.something (it was an
old MacBook I don't have anymore) against the 8.2 server package in
Ubuntu 9.04. The python scripts where I noticed this issue would run
fine against that combination. After upgrading my MacBook to a Mac
Pro, I ended up installing Postgres 9.0.1 on the client (and building
psycopg2 against that) which is when I started getting errors. The
original error in the 9.0.1 client against the older server was that
libpq would get stuck on a poll() call down when trying to fetch
tuples or execute a command.

After a bit of narrowing down what was the cause I ended up trying to
upgrade the server to see if it was just a weird interplay between
9.0.1 and the older server. After upgrading to Ubuntu 10.04 and
installing Postgres 8.4 (from apt) the error turned into the current
manifestation in that libpq would give an error saying that the server
had unexpectedly closed the connection (instead of blocking on the
poll() call).

At some point I upgraded my client install to 9.0.2 and started a
server locally. Running the test program against a local database
failed to trigger the bug. I then tried to downgrade my local client
to 8.4 and tested that against the 8.4 install on Ubuntu which showed
the bug. And finally I upgraded both the server and the client to
9.0.2 and I can trigger the bug.

Thanks,
Paul Davis

Re: BUG #5804: Connection aborted after many queries.

From

"Kevin Grittner"

Date:

29 December 2010, 11:58:52

"Paul J. Davis" <paul.joseph.davis@gmail.com> wrote:

> After running many queries (millions) a connection will report an
> error that the server has unexpectedly closed the connection.

What message are you getting?  (Copy/paste is a good thing.)

What do you see in the server log at the time of failure?

Are you using a connection pool?

-Kevin

Re: BUG #5804: Connection aborted after many queries.

From

Paul Davis

Date:

29 December 2010, 12:06:56

On Wed, Dec 29, 2010 at 10:58 AM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
> "Paul J. Davis" <paul.joseph.davis@gmail.com> wrote:
>
>> After running many queries (millions) a connection will report an
>> error that the server has unexpectedly closed the connection.
>
> What message are you getting? =A0(Copy/paste is a good thing.)
>
> What do you see in the server log at the time of failure?
>
> Are you using a connection pool?
>
> -Kevin
>

Doh, that was on my list of things to add but managed to forget.

The error message reported by the client:

DECLARE CURSOR failed: server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.

And this intriguing error in the server logs from around that time:

2010-12-28 18:40:02 EST LOG:  SSL renegotiation failure
2010-12-28 18:40:02 EST LOG:  SSL failed to send renegotiation request
2010-12-28 18:40:02 EST LOG:  SSL renegotiation failure
2010-12-28 18:40:02 EST LOG:  SSL error: unsafe legacy renegotiation disabl=
ed
2010-12-28 18:40:02 EST LOG:  could not send data to client:
Connection reset by peer
2010-12-28 18:40:02 EST LOG:  SSL error: unsafe legacy renegotiation disabl=
ed
2010-12-28 18:40:02 EST LOG:  could not receive data from client:
Connection reset by peer
2010-12-28 18:40:02 EST LOG:  unexpected EOF on client connection

Googling, I see something that suggests turning off SSL renegotiation
which I'll try next.

Re: BUG #5804: Connection aborted after many queries.

From

Tom Lane

Date:

29 December 2010, 12:28:05

Paul Davis <paul.joseph.davis@gmail.com> writes:
> And this intriguing error in the server logs from around that time:

> 2010-12-28 18:40:02 EST LOG:  SSL renegotiation failure
> 2010-12-28 18:40:02 EST LOG:  SSL failed to send renegotiation request
> 2010-12-28 18:40:02 EST LOG:  SSL renegotiation failure
> 2010-12-28 18:40:02 EST LOG:  SSL error: unsafe legacy renegotiation disabled
> 2010-12-28 18:40:02 EST LOG:  could not send data to client:
> Connection reset by peer
> 2010-12-28 18:40:02 EST LOG:  SSL error: unsafe legacy renegotiation disabled
> 2010-12-28 18:40:02 EST LOG:  could not receive data from client:
> Connection reset by peer
> 2010-12-28 18:40:02 EST LOG:  unexpected EOF on client connection

> Googling, I see something that suggests turning off SSL renegotiation
> which I'll try next.

In all cases, you were testing a client against a server on a different
machine, right?  This looks to me like you've got two different openssl
libraries, one of which has a bogus partial fix for the recent SSL
renegotiation security issue.  I'm not sure what the state of play is
in Apple's shipping version of openssl --- you might have to get an
up-to-date source distribution and compile it yourself to have non-bogus
renegotiation behavior.  Or you could just disable renegotiation on the
PG server.

            regards, tom lane

Re: BUG #5804: Connection aborted after many queries.

From

Paul Davis

Date:

29 December 2010, 13:05:42

On Wed, Dec 29, 2010 at 11:27 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Paul Davis <paul.joseph.davis@gmail.com> writes:
>> And this intriguing error in the server logs from around that time:
>
>> 2010-12-28 18:40:02 EST LOG: =A0SSL renegotiation failure
>> 2010-12-28 18:40:02 EST LOG: =A0SSL failed to send renegotiation request
>> 2010-12-28 18:40:02 EST LOG: =A0SSL renegotiation failure
>> 2010-12-28 18:40:02 EST LOG: =A0SSL error: unsafe legacy renegotiation d=
isabled
>> 2010-12-28 18:40:02 EST LOG: =A0could not send data to client:
>> Connection reset by peer
>> 2010-12-28 18:40:02 EST LOG: =A0SSL error: unsafe legacy renegotiation d=
isabled
>> 2010-12-28 18:40:02 EST LOG: =A0could not receive data from client:
>> Connection reset by peer
>> 2010-12-28 18:40:02 EST LOG: =A0unexpected EOF on client connection
>
>> Googling, I see something that suggests turning off SSL renegotiation
>> which I'll try next.
>
> In all cases, you were testing a client against a server on a different
> machine, right? =A0This looks to me like you've got two different openssl
> libraries, one of which has a bogus partial fix for the recent SSL
> renegotiation security issue. =A0I'm not sure what the state of play is
> in Apple's shipping version of openssl --- you might have to get an
> up-to-date source distribution and compile it yourself to have non-bogus
> renegotiation behavior. =A0Or you could just disable renegotiation on the
> PG server.
>
> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0regards, tom lane
>

Yeah, all failures were between separate machines with various
versions of OpenSSL that I never thought to keep track of. After more
Googling I've found that OS X "fixed" the renegotiation issue by
disabling it in a security fix [1].
For the time being I'll just disable it server side as traffic isn't
ever routed across a public network.

Thanks for the help.

Paul Davis


[1] http://support.apple.com/kb/HT4004

Re: BUG #5804: Connection aborted after many queries.

From

Tom Lane

Date:

29 December 2010, 13:34:30

Paul Davis <paul.joseph.davis@gmail.com> writes:
> On Wed, Dec 29, 2010 at 11:27 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> In all cases, you were testing a client against a server on a different
>> machine, right?

> Yeah, all failures were between separate machines with various
> versions of OpenSSL that I never thought to keep track of. After more
> Googling I've found that OS X "fixed" the renegotiation issue by
> disabling it in a security fix [1].

Yeah, I can reproduce the failure pretty quickly on my Mac laptop,
if I use SSL and reduce the ssl_renegotiation_limit setting to
100MB or so.  The server's log looks a bit different from what you
showed:

LOG:  SSL renegotiation failure
LOG:  SSL renegotiation failure
LOG:  SSL error: internal error
LOG:  could not send data to client: Connection reset by peer
LOG:  SSL error: internal error
LOG:  could not receive data from client: Connection reset by peer
LOG:  unexpected EOF on client connection

which confirms my thought that you were working with two different
openssl libraries, one of which was willing to do renegotiation
and the other not.

This machine is fully up-to-date, so Apple still hasn't fixed the
renegotiation problem beyond the brain-dead "fix" of aborting the
connection :-(.  Get on the stick, Cupertino ... everybody else
had this fixed six months ago.

            regards, tom lane