Thread: Issue with retry_count in socket.c and fix for the same

Issue with retry_count in socket.c and fix for the same

From
"Itnal, Prakash (NSN - IN/Bangalore)"
Date:

I am not sure whether last mail delivered correctly! Resending again.

 

--

Regards,

Prakash I.

 

From: pgsql-odbc-owner@postgresql.org [mailto:pgsql-odbc-owner@postgresql.org] On Behalf Of ext Itnal, Prakash (NSN - IN/Bangalore)
Sent: Thursday, May 15, 2014 10:20 AM
To: pgsql-odbc@postgresql.org
Subject: [ODBC] Issue with retry_count in socket.c and fix for the same
Importance: High

 

Hi,

 

Recently we encountered a case where clients connected to postgres server through psqlODBC driver got strucked (hanged) when server unexpectedly shutdown (hard power off or LAN cable removed). The clients were hanged forever. I analyzed the code and found a potential issue with retry count logic in function SOCK_wait_for_ready() (in socket.c).

 

The no_timeout variable is initialized to TRUE and is not modified if retry_count is positive number. That means if retry_count is passed then it will treat it as “no timeout”. If I understood it correctly it should work other way. If retry_count is positive number then socket should wait with timeout and not indefinitely.

 

I have changed the default initialization of no_timeout to FALSE. Attached is the patch. Kindly review the same and let me know comments.

 

Copied the function below for quick reference: socket.c: SOCK_wait_for_ready()

 

528 static int SOCK_wait_for_ready(SocketClass *sock, BOOL output, int retry_count)

529 {

530     int ret, gerrno;

531 #ifdef  HAVE_POLL

532     struct pollfd   fds;

533 #else

534     fd_set  fds, except_fds;

535     struct  timeval tm;

536 #endif /* HAVE_POLL */

537     BOOL    no_timeout = TRUE;

538

539     if (0 == retry_count)

540         no_timeout = FALSE;

541     else if (0  > retry_count)

542         no_timeout = TRUE;

543 #ifdef  USE_SSL

544     else if (sock->ssl == NULL)

545         no_timeout = TRUE;

546 #endif /* USE_SSL */

547     do {

548 #ifdef  HAVE_POLL

549         fds.fd = sock->socket;

550         fds.events = output ? POLLOUT : POLLIN;

551         fds.revents = 0;

552         ret = poll(&fds, 1, no_timeout ? -1 : retry_count * 1000);

553 mylog("!!!  poll ret=%d revents=%x\n", ret, fds.revents);

554 #else

555         FD_ZERO(&fds);

556         FD_ZERO(&except_fds);

557         FD_SET(sock->socket, &fds);

558         FD_SET(sock->socket, &except_fds);

559         if (!no_timeout)

560         {

561             tm.tv_sec = retry_count;

562             tm.tv_usec = 0;

563         }

564         ret = select((int) sock->socket + 1, output ? NULL : &fds, output ? &fds : NULL, &except_fds, no_timeout ? NULL : &tm);

565 #endif /* HAVE_POLL */

566         gerrno = SOCK_ERRNO;

567     } while (ret < 0 && EINTR == gerrno);

568     if (retry_count < 0)

569         retry_count *= -1;

570     if (0 == ret && retry_count > MAX_RETRY_COUNT)

571     {

572         ret = -1;

573         SOCK_set_error(sock, output ? SOCKET_WRITE_TIMEOUT : SOCKET_READ_TIMEOUT, "SOCK_wait_for_ready timeout");

574     }

575     return ret;

576 }

 

--

Regards,

Prakash I.

 

 

 

Attachment

Re: Issue with retry_count in socket.c and fix for the same

From
Heikki Linnakangas
Date:
On 05/15/2014 11:10 AM, Itnal, Prakash (NSN - IN/Bangalore) wrote:
> Recently we encountered a case where clients connected to postgres server through psqlODBC driver got strucked
(hanged)when server unexpectedly shutdown (hard power off or LAN cable removed). The clients were hanged forever. I
analyzedthe code and found a potential issue with retry count logic in function SOCK_wait_for_ready() (in socket.c). 
>
> The no_timeout variable is initialized to TRUE and is not modified if retry_count is positive number. That means if
retry_countis passed then it will treat it as "no timeout". If I understood it correctly it should work other way. If
retry_countis positive number then socket should wait with timeout and not indefinitely. 
>
> I have changed the default initialization of no_timeout to FALSE. Attached is the patch. Kindly review the same and
letme know comments. 
>
> Copied the function below for quick reference: socket.c: SOCK_wait_for_ready()
>
> 528 static int SOCK_wait_for_ready(SocketClass *sock, BOOL output, int retry_count)
> 529 {
> 530     int ret, gerrno;
> 531 #ifdef  HAVE_POLL
> 532     struct pollfd   fds;
> 533 #else
> 534     fd_set  fds, except_fds;
> 535     struct  timeval tm;
> 536 #endif /* HAVE_POLL */
> 537     BOOL    no_timeout = TRUE;
> 538
> 539     if (0 == retry_count)
> 540         no_timeout = FALSE;
> 541     else if (0  > retry_count)
> 542         no_timeout = TRUE;
> 543 #ifdef  USE_SSL
> 544     else if (sock->ssl == NULL)
> 545         no_timeout = TRUE;
> 546 #endif /* USE_SSL */
> 547     do {
> 548 #ifdef  HAVE_POLL
> 549         fds.fd = sock->socket;
> 550         fds.events = output ? POLLOUT : POLLIN;
> 551         fds.revents = 0;
> 552         ret = poll(&fds, 1, no_timeout ? -1 : retry_count * 1000);
> 553 mylog("!!!  poll ret=%d revents=%x\n", ret, fds.revents);
> 554 #else
> 555         FD_ZERO(&fds);
> 556         FD_ZERO(&except_fds);
> 557         FD_SET(sock->socket, &fds);
> 558         FD_SET(sock->socket, &except_fds);
> 559         if (!no_timeout)
> 560         {
> 561             tm.tv_sec = retry_count;
> 562             tm.tv_usec = 0;
> 563         }
> 564         ret = select((int) sock->socket + 1, output ? NULL : &fds, output ? &fds : NULL, &except_fds, no_timeout
?NULL : &tm); 
> 565 #endif /* HAVE_POLL */
> 566         gerrno = SOCK_ERRNO;
> 567     } while (ret < 0 && EINTR == gerrno);
> 568     if (retry_count < 0)
> 569         retry_count *= -1;
> 570     if (0 == ret && retry_count > MAX_RETRY_COUNT)
> 571     {
> 572         ret = -1;
> 573         SOCK_set_error(sock, output ? SOCKET_WRITE_TIMEOUT : SOCKET_READ_TIMEOUT, "SOCK_wait_for_ready timeout");
> 574     }
> 575     return ret;
> 576 }

Ugh. Yes, it's clearly bogus as it is. no_timeout is only ever set to
FALSE, when retry_count is 0. And when retry_count is 0, the timeout is
also set to 0. So the whole timeout and retry_count thing is a very
complicated way of saying that when retry_count==0, just check if there
is any outstanding data to read / room for writing without blocking,
otherwise wait forever.

I'm not so sure it's supposed to behave like you described either,
though. For example, if we normally a timeout by default, why would we
not want a timeout when SSL is enabled? And the "0 > retry_count" check
is never going to be true (unless retry_count wraps around)

I wonder why we need any timeout or retry counters, ever. Frankly, I'm
inclined to just rip all that out. A timeout would be a useful feature
for many applications, to detect a broken connection more quickly, but
the way it's currently implemented is pretty much useless for that. For
starters, it would have to be configurable, because many applications
would not want to have a timeout at all. A built-in timeout of 60
seconds or so, which I think you'd get with your patch, would be too
short for OLAP kind of queries that can run for hours before returning
any results.

- Heikki


Re: Issue with retry_count in socket.c and fix for the same

From
"Itnal, Prakash (NSN - IN/Bangalore)"
Date:
I agree, it looks so complex. I will try to reorganize the retry logic and make it simple. Once done I will share the
patch.

Meantime I am new to psql-odbc driver code, so not sure how to make this timeout as configurable parameter from
odbc.inifile? As per the keywords (http://psqlodbc.projects.pgfoundry.org/docs/config-opt.html), I don't find any
existingparameter which can be used. Looks like we need to add a new parameter something like "ConnectionTimeout"! 

--
Regards,
Prakash I.

-----Original Message-----
From: ext Heikki Linnakangas [mailto:hlinnakangas@vmware.com]
Sent: Thursday, May 15, 2014 4:55 PM
To: Itnal, Prakash (NSN - IN/Bangalore); pgsql-odbc@postgresql.org
Subject: Re: [ODBC] Issue with retry_count in socket.c and fix for the same

On 05/15/2014 11:10 AM, Itnal, Prakash (NSN - IN/Bangalore) wrote:
> Recently we encountered a case where clients connected to postgres server through psqlODBC driver got strucked
(hanged)when server unexpectedly shutdown (hard power off or LAN cable removed). The clients were hanged forever. I
analyzedthe code and found a potential issue with retry count logic in function SOCK_wait_for_ready() (in socket.c). 
>
> The no_timeout variable is initialized to TRUE and is not modified if retry_count is positive number. That means if
retry_countis passed then it will treat it as "no timeout". If I understood it correctly it should work other way. If
retry_countis positive number then socket should wait with timeout and not indefinitely. 
>
> I have changed the default initialization of no_timeout to FALSE. Attached is the patch. Kindly review the same and
letme know comments. 
>
> Copied the function below for quick reference: socket.c: SOCK_wait_for_ready()
>
> 528 static int SOCK_wait_for_ready(SocketClass *sock, BOOL output, int retry_count)
> 529 {
> 530     int ret, gerrno;
> 531 #ifdef  HAVE_POLL
> 532     struct pollfd   fds;
> 533 #else
> 534     fd_set  fds, except_fds;
> 535     struct  timeval tm;
> 536 #endif /* HAVE_POLL */
> 537     BOOL    no_timeout = TRUE;
> 538
> 539     if (0 == retry_count)
> 540         no_timeout = FALSE;
> 541     else if (0  > retry_count)
> 542         no_timeout = TRUE;
> 543 #ifdef  USE_SSL
> 544     else if (sock->ssl == NULL)
> 545         no_timeout = TRUE;
> 546 #endif /* USE_SSL */
> 547     do {
> 548 #ifdef  HAVE_POLL
> 549         fds.fd = sock->socket;
> 550         fds.events = output ? POLLOUT : POLLIN;
> 551         fds.revents = 0;
> 552         ret = poll(&fds, 1, no_timeout ? -1 : retry_count * 1000);
> 553 mylog("!!!  poll ret=%d revents=%x\n", ret, fds.revents);
> 554 #else
> 555         FD_ZERO(&fds);
> 556         FD_ZERO(&except_fds);
> 557         FD_SET(sock->socket, &fds);
> 558         FD_SET(sock->socket, &except_fds);
> 559         if (!no_timeout)
> 560         {
> 561             tm.tv_sec = retry_count;
> 562             tm.tv_usec = 0;
> 563         }
> 564         ret = select((int) sock->socket + 1, output ? NULL : &fds, output ? &fds : NULL, &except_fds, no_timeout
?NULL : &tm); 
> 565 #endif /* HAVE_POLL */
> 566         gerrno = SOCK_ERRNO;
> 567     } while (ret < 0 && EINTR == gerrno);
> 568     if (retry_count < 0)
> 569         retry_count *= -1;
> 570     if (0 == ret && retry_count > MAX_RETRY_COUNT)
> 571     {
> 572         ret = -1;
> 573         SOCK_set_error(sock, output ? SOCKET_WRITE_TIMEOUT : SOCKET_READ_TIMEOUT, "SOCK_wait_for_ready timeout");
> 574     }
> 575     return ret;
> 576 }

Ugh. Yes, it's clearly bogus as it is. no_timeout is only ever set to
FALSE, when retry_count is 0. And when retry_count is 0, the timeout is
also set to 0. So the whole timeout and retry_count thing is a very
complicated way of saying that when retry_count==0, just check if there
is any outstanding data to read / room for writing without blocking,
otherwise wait forever.

I'm not so sure it's supposed to behave like you described either,
though. For example, if we normally a timeout by default, why would we
not want a timeout when SSL is enabled? And the "0 > retry_count" check
is never going to be true (unless retry_count wraps around)

I wonder why we need any timeout or retry counters, ever. Frankly, I'm
inclined to just rip all that out. A timeout would be a useful feature
for many applications, to detect a broken connection more quickly, but
the way it's currently implemented is pretty much useless for that. For
starters, it would have to be configurable, because many applications
would not want to have a timeout at all. A built-in timeout of 60
seconds or so, which I think you'd get with your patch, would be too
short for OLAP kind of queries that can run for hours before returning
any results.

- Heikki


Re: Issue with retry_count in socket.c and fix for the same

From
Malcolm MacLeod
Date:
On Thu, 2014-05-15 at 14:25 +0300, Heikki Linnakangas wrote:
> On 05/15/2014 11:10 AM, Itnal, Prakash (NSN - IN/Bangalore) wrote:
> > Recently we encountered a case where clients connected to postgres server through psqlODBC driver got strucked
(hanged)when server unexpectedly shutdown (hard power off or LAN cable removed). The clients were hanged forever. I
analyzedthe code and found a potential issue with retry count logic in function SOCK_wait_for_ready() (in socket.c). 
> >
> > The no_timeout variable is initialized to TRUE and is not modified if retry_count is positive number. That means if
retry_countis passed then it will treat it as "no timeout". If I understood it correctly it should work other way. If
retry_countis positive number then socket should wait with timeout and not indefinitely. 
> >
> > I have changed the default initialization of no_timeout to FALSE. Attached is the patch. Kindly review the same and
letme know comments. 
> >
> > Copied the function below for quick reference: socket.c: SOCK_wait_for_ready()
> >
> > 528 static int SOCK_wait_for_ready(SocketClass *sock, BOOL output, int retry_count)
> > 529 {
> > 530     int ret, gerrno;
> > 531 #ifdef  HAVE_POLL
> > 532     struct pollfd   fds;
> > 533 #else
> > 534     fd_set  fds, except_fds;
> > 535     struct  timeval tm;
> > 536 #endif /* HAVE_POLL */
> > 537     BOOL    no_timeout = TRUE;
> > 538
> > 539     if (0 == retry_count)
> > 540         no_timeout = FALSE;
> > 541     else if (0  > retry_count)
> > 542         no_timeout = TRUE;
> > 543 #ifdef  USE_SSL
> > 544     else if (sock->ssl == NULL)
> > 545         no_timeout = TRUE;
> > 546 #endif /* USE_SSL */
> > 547     do {
> > 548 #ifdef  HAVE_POLL
> > 549         fds.fd = sock->socket;
> > 550         fds.events = output ? POLLOUT : POLLIN;
> > 551         fds.revents = 0;
> > 552         ret = poll(&fds, 1, no_timeout ? -1 : retry_count * 1000);
> > 553 mylog("!!!  poll ret=%d revents=%x\n", ret, fds.revents);
> > 554 #else
> > 555         FD_ZERO(&fds);
> > 556         FD_ZERO(&except_fds);
> > 557         FD_SET(sock->socket, &fds);
> > 558         FD_SET(sock->socket, &except_fds);
> > 559         if (!no_timeout)
> > 560         {
> > 561             tm.tv_sec = retry_count;
> > 562             tm.tv_usec = 0;
> > 563         }
> > 564         ret = select((int) sock->socket + 1, output ? NULL : &fds, output ? &fds : NULL, &except_fds,
no_timeout? NULL : &tm); 
> > 565 #endif /* HAVE_POLL */
> > 566         gerrno = SOCK_ERRNO;
> > 567     } while (ret < 0 && EINTR == gerrno);
> > 568     if (retry_count < 0)
> > 569         retry_count *= -1;
> > 570     if (0 == ret && retry_count > MAX_RETRY_COUNT)
> > 571     {
> > 572         ret = -1;
> > 573         SOCK_set_error(sock, output ? SOCKET_WRITE_TIMEOUT : SOCKET_READ_TIMEOUT, "SOCK_wait_for_ready
timeout");
> > 574     }
> > 575     return ret;
> > 576 }
>
> Ugh. Yes, it's clearly bogus as it is. no_timeout is only ever set to
> FALSE, when retry_count is 0. And when retry_count is 0, the timeout is
> also set to 0. So the whole timeout and retry_count thing is a very
> complicated way of saying that when retry_count==0, just check if there
> is any outstanding data to read / room for writing without blocking,
> otherwise wait forever.
>
> I'm not so sure it's supposed to behave like you described either,
> though. For example, if we normally a timeout by default, why would we
> not want a timeout when SSL is enabled? And the "0 > retry_count" check
> is never going to be true (unless retry_count wraps around)
>
> I wonder why we need any timeout or retry counters, ever. Frankly, I'm
> inclined to just rip all that out. A timeout would be a useful feature
> for many applications, to detect a broken connection more quickly, but
> the way it's currently implemented is pretty much useless for that. For
> starters, it would have to be configurable, because many applications
> would not want to have a timeout at all. A built-in timeout of 60
> seconds or so, which I think you'd get with your patch, would be too
> short for OLAP kind of queries that can run for hours before returning
> any results.

Just wanted to confirm that I have tracked this down as the cause for a
freeze/crash in our software as well.
I can get it to block forever on the select call just by
disconnecting/reconnecting network on a client a few times while it is
querying lots of data from the server.
I've tested the original patch and it does fix the issue, would be great
to see a final fix for this.

Thanks,
Malcolm MacLeod




Re: Issue with retry_count in socket.c and fix for the same

From
Heikki Linnakangas
Date:
On 05/15/2014 02:25 PM, Heikki Linnakangas wrote:
> I wonder why we need any timeout or retry counters, ever. Frankly, I'm
> inclined to just rip all that out. A timeout would be a useful feature
> for many applications, to detect a broken connection more quickly, but
> the way it's currently implemented is pretty much useless for that. For
> starters, it would have to be configurable, because many applications
> would not want to have a timeout at all. A built-in timeout of 60
> seconds or so, which I think you'd get with your patch, would be too
> short for OLAP kind of queries that can run for hours before returning
> any results.

Ok, I removed the broken logic, for the sake of cleaning up the code. It
still won't give the kind of a timeout that you hoped for, but at least
it's less bogus now.

- Heikki


Re: Issue with retry_count in socket.c and fix for the same

From
Heikki Linnakangas
Date:
On 05/26/2014 07:42 PM, Malcolm MacLeod wrote:
> Just wanted to confirm that I have tracked this down as the cause for a
> freeze/crash in our software as well.
> I can get it to block forever on the select call just by
> disconnecting/reconnecting network on a client a few times while it is
> querying lots of data from the server.
> I've tested the original patch and it does fix the issue, would be great
> to see a final fix for this.

Hmm, the driver enables TCP keepalives by default, which should
eventually notice a broken connection. It can take a long time, though;
I believe the default on most systems is two hours. Unfortunately there
are no options in psqlODBC to change the defaults, but e.g. on Linux you
can change the system-wide default with the tcp_keepalive_time sysctl.

I think the missing feature here is to add psqlODBC configuration
variables similar to libpq's keepalives_idle, keepalives_interval and
keepalives_count. I'm not going to work on that myself, but patches are
welcome.

- Heikki