Thread: BUG: possible busy loop when connection is closed while trying to establish SSL connection

We were bitten by the following bug a few times, when our server tried
to reestablish connections under bad network conditions:


if connection is closed while trying to get response to SSL setup packet
(i.e. conn->status is CONNECTION_SSL_STARTUP), we get a busy loop, as
line 1035 in 8.0.0.beta2:
   if (pqWaitTimed(1, 0, conn, finish_time)  {

tells that there is data to read (returns 0) while actually it is error 
(POLLERR & POLLHUP) and not POLLIN returned from  poll() and 
   nread = recv(conn->sock, &SSLok, 1, 0);

on line 1449 sets nread to 0, as there is in fact nothing to read from a
closed socket
after that the check on line 1462:
   if (nread == 0)       /* caller failed to wait for data */       return PGRES_POLLING_READING;

resumes the busy loop

in all other places recv()'ing is done throught pqsecure_read, which
checks for such condition.


I'm not familiar enough with code to suggest a proper fix (and I
quickfix our app by disabling SSL (PGSSLMODE=disable) - the default was
prefer ;), but this may come back to haunt others

--------------
Hannu




Hannu Krosing <hannu@tm.ee> writes:
> We were bitten by the following bug a few times, when our server tried
> to reestablish connections under bad network conditions:
>
> if connection is closed while trying to get response to SSL setup packet
> (i.e. conn->status is CONNECTION_SSL_STARTUP), we get a busy loop, as
> line 1035 in 8.0.0.beta2:
>
>     if (pqWaitTimed(1, 0, conn, finish_time)  {
>
> tells that there is data to read (returns 0) while actually it is error 
> (POLLERR & POLLHUP) and not POLLIN returned from  poll() and 

This is intentional: the idea is that we should go ahead and do the read
(or write), which will detect the error condition on the socket.  poll()
in itself doesn't give enough information to determine what the error
condition is, so it's not appropriate to fail here.

> after that the check on line 1462:
>
>     if (nread == 0)
>         /* caller failed to wait for data */
>         return PGRES_POLLING_READING;
>
> resumes the busy loop

This seems to me to be the bug.  pqReadData jumps through hoops to
determine whether a zero-length read means EOF or not, and I think we
need to expend some effort to determine that here too.

One possibility is to forget the direct call to recv() and use
pqReadData --- since conn->ssl isn't set yet, and we aren't expecting
the server to send more than one byte, this should in theory be safe.

Comments?
        regards, tom lane


Re: BUG: possible busy loop when connection is closed

From
Hannu Krosing
Date:
On N, 2004-09-23 at 06:41, Tom Lane wrote:
> Hannu Krosing <hannu@tm.ee> writes:
> > We were bitten by the following bug a few times, when our server tried
> > to reestablish connections under bad network conditions:
> >
> > if connection is closed while trying to get response to SSL setup packet
> > (i.e. conn->status is CONNECTION_SSL_STARTUP), we get a busy loop, as
> > line 1035 in 8.0.0.beta2:
> >
> >     if (pqWaitTimed(1, 0, conn, finish_time)  {
> >
> > tells that there is data to read (returns 0) while actually it is error 
> > (POLLERR & POLLHUP) and not POLLIN returned from  poll() and 

at least on linux it does, we got the following trace:
poll([{fd=11, events=POLLIN|POLLERR, revents=POLLIN|POLLERR|POLLHUP}],
1, -1) = 1 
recv(11, "", 1, 0)   = 0 
poll([{fd=11, events=POLLIN|POLLERR, revents=POLLIN|POLLERR|POLLHUP}],
1, -1) = 1 
recv(11, "", 1, 0)   = 0 
poll([{fd=11, events=POLLIN|POLLERR, revents=POLLIN|POLLERR|POLLHUP}],
1, -1) = 1 
recv(11, "", 1, 0)   = 0 
which seems to say that poll came back on POLLHUP, and as there is just
one fd, it must mean that this one fd is closed. But this may be
non-portable

> This is intentional: the idea is that we should go ahead and do the read
> (or write), which will detect the error condition on the socket.  poll()
> in itself doesn't give enough information to determine what the error
> condition is, so it's not appropriate to fail here.
> 
> > after that the check on line 1462:
> >
> >     if (nread == 0)
> >         /* caller failed to wait for data */
> >         return PGRES_POLLING_READING;
> >
> > resumes the busy loop
> 
> This seems to me to be the bug.  pqReadData jumps through hoops to
> determine whether a zero-length read means EOF or not, and I think we
> need to expend some effort to determine that here too.
> 
> One possibility is to forget the direct call to recv() and use
> pqReadData --- since conn->ssl isn't set yet, and we aren't expecting
> the server to send more than one byte, this should in theory be safe.

I was scared by the comment before recv(...,1,0) which said we must be
careful not to read more than 1 byte

Is it impossible to not accidentally get more than one and screw up SSL
handshake ?

-------------
Hannu



Hannu Krosing <hannu@tm.ee> writes:
>> One possibility is to forget the direct call to recv() and use
>> pqReadData --- since conn->ssl isn't set yet, and we aren't expecting
>> the server to send more than one byte, this should in theory be safe.

> I was scared by the comment before recv(...,1,0) which said we must be
> careful not to read more than 1 byte

When I wrote that, I was trying to assume as little as possible about
the SSL protocol.  The only way there could be a problem is if the
server is first to send during the SSL negotiation handshake; which
seems odd but not impossible.  Anyone know for sure?
        regards, tom lane


Re: BUG: possible busy loop when connection is closed

From
Fabien COELHO
Date:
Dear Tom,

> When I wrote that, I was trying to assume as little as possible about
> the SSL protocol.  The only way there could be a problem is if the
> server is first to send during the SSL negotiation handshake; which
> seems odd but not impossible.  Anyone know for sure?

As for the RFC, the client is the first to speak. In the server handcheck
response (server hello), the server must tell what version of the protocol
is to be used (best that it knows of, ceiled by the client version),
whether it accepts the previous session of the client and what algorithms
are chosen among those suggested by the client.

ISTM that this cannot be sent before the client hello message is
received... or the server does not really implement SSL.

Now if you connect to some other server with some other protocol, that is
another issue... Also, I do not know how the postgresql protocol interacts
with SSL... I guess the server waits for the first packet to decided
whether it is a SSL connection or a non-SSL connection?

My 0.02 EUR.

Have a nice day,

-- 
Fabien Coelho - coelho@cri.ensmp.fr


Fabien COELHO <coelho@cri.ensmp.fr> writes:
> Now if you connect to some other server with some other protocol, that is
> another issue...

But the code in question is only for SSL connection to PG, so that's
a red herring I think.

> Also, I do not know how the postgresql protocol interacts
> with SSL... I guess the server waits for the first packet to decided
> whether it is a SSL connection or a non-SSL connection?

It'sClient sends SSL request message
Server sends back 1 byte indicating if it can do SSL or not
<< SSL startup protocol happens here >>
Client sends normal PG connection request message
Authentication exchange proceeds as usual

If the server were first to send during the SSL startup protocol, then
it's possible that more than one byte would be waiting for the client
when it reads the server response.  We'd have no easy way to "push back"
those bytes and allow them to be re-read by the client-side SSL library.
So that was the danger I was concerned about.

If the client is first to send during the SSL startup protocol, then
there's no problem: there can only be one byte waiting for us.
        regards, tom lane


Re: BUG: possible busy loop when connection is closed

From
Fabien COELHO
Date:
> If the client is first to send during the SSL startup protocol, then
> there's no problem: there can only be one byte waiting for us.

So it looks good from the SSL perspective.

-- 
Fabien Coelho - coelho@cri.ensmp.fr