Re: Issue with retry_count in socket.c and fix for the same - Mailing list pgsql-odbc

From Malcolm MacLeod
Subject Re: Issue with retry_count in socket.c and fix for the same
Date
Msg-id 1401122546.14277.8.camel@watchmen.homenetwork
Whole thread Raw
In response to Re: Issue with retry_count in socket.c and fix for the same  (Heikki Linnakangas <hlinnakangas@vmware.com>)
Responses Re: Issue with retry_count in socket.c and fix for the same
List pgsql-odbc
On Thu, 2014-05-15 at 14:25 +0300, Heikki Linnakangas wrote:
> On 05/15/2014 11:10 AM, Itnal, Prakash (NSN - IN/Bangalore) wrote:
> > Recently we encountered a case where clients connected to postgres server through psqlODBC driver got strucked
(hanged)when server unexpectedly shutdown (hard power off or LAN cable removed). The clients were hanged forever. I
analyzedthe code and found a potential issue with retry count logic in function SOCK_wait_for_ready() (in socket.c). 
> >
> > The no_timeout variable is initialized to TRUE and is not modified if retry_count is positive number. That means if
retry_countis passed then it will treat it as "no timeout". If I understood it correctly it should work other way. If
retry_countis positive number then socket should wait with timeout and not indefinitely. 
> >
> > I have changed the default initialization of no_timeout to FALSE. Attached is the patch. Kindly review the same and
letme know comments. 
> >
> > Copied the function below for quick reference: socket.c: SOCK_wait_for_ready()
> >
> > 528 static int SOCK_wait_for_ready(SocketClass *sock, BOOL output, int retry_count)
> > 529 {
> > 530     int ret, gerrno;
> > 531 #ifdef  HAVE_POLL
> > 532     struct pollfd   fds;
> > 533 #else
> > 534     fd_set  fds, except_fds;
> > 535     struct  timeval tm;
> > 536 #endif /* HAVE_POLL */
> > 537     BOOL    no_timeout = TRUE;
> > 538
> > 539     if (0 == retry_count)
> > 540         no_timeout = FALSE;
> > 541     else if (0  > retry_count)
> > 542         no_timeout = TRUE;
> > 543 #ifdef  USE_SSL
> > 544     else if (sock->ssl == NULL)
> > 545         no_timeout = TRUE;
> > 546 #endif /* USE_SSL */
> > 547     do {
> > 548 #ifdef  HAVE_POLL
> > 549         fds.fd = sock->socket;
> > 550         fds.events = output ? POLLOUT : POLLIN;
> > 551         fds.revents = 0;
> > 552         ret = poll(&fds, 1, no_timeout ? -1 : retry_count * 1000);
> > 553 mylog("!!!  poll ret=%d revents=%x\n", ret, fds.revents);
> > 554 #else
> > 555         FD_ZERO(&fds);
> > 556         FD_ZERO(&except_fds);
> > 557         FD_SET(sock->socket, &fds);
> > 558         FD_SET(sock->socket, &except_fds);
> > 559         if (!no_timeout)
> > 560         {
> > 561             tm.tv_sec = retry_count;
> > 562             tm.tv_usec = 0;
> > 563         }
> > 564         ret = select((int) sock->socket + 1, output ? NULL : &fds, output ? &fds : NULL, &except_fds,
no_timeout? NULL : &tm); 
> > 565 #endif /* HAVE_POLL */
> > 566         gerrno = SOCK_ERRNO;
> > 567     } while (ret < 0 && EINTR == gerrno);
> > 568     if (retry_count < 0)
> > 569         retry_count *= -1;
> > 570     if (0 == ret && retry_count > MAX_RETRY_COUNT)
> > 571     {
> > 572         ret = -1;
> > 573         SOCK_set_error(sock, output ? SOCKET_WRITE_TIMEOUT : SOCKET_READ_TIMEOUT, "SOCK_wait_for_ready
timeout");
> > 574     }
> > 575     return ret;
> > 576 }
>
> Ugh. Yes, it's clearly bogus as it is. no_timeout is only ever set to
> FALSE, when retry_count is 0. And when retry_count is 0, the timeout is
> also set to 0. So the whole timeout and retry_count thing is a very
> complicated way of saying that when retry_count==0, just check if there
> is any outstanding data to read / room for writing without blocking,
> otherwise wait forever.
>
> I'm not so sure it's supposed to behave like you described either,
> though. For example, if we normally a timeout by default, why would we
> not want a timeout when SSL is enabled? And the "0 > retry_count" check
> is never going to be true (unless retry_count wraps around)
>
> I wonder why we need any timeout or retry counters, ever. Frankly, I'm
> inclined to just rip all that out. A timeout would be a useful feature
> for many applications, to detect a broken connection more quickly, but
> the way it's currently implemented is pretty much useless for that. For
> starters, it would have to be configurable, because many applications
> would not want to have a timeout at all. A built-in timeout of 60
> seconds or so, which I think you'd get with your patch, would be too
> short for OLAP kind of queries that can run for hours before returning
> any results.

Just wanted to confirm that I have tracked this down as the cause for a
freeze/crash in our software as well.
I can get it to block forever on the select call just by
disconnecting/reconnecting network on a client a few times while it is
querying lots of data from the server.
I've tested the original patch and it does fix the issue, would be great
to see a final fix for this.

Thanks,
Malcolm MacLeod




pgsql-odbc by date:

Previous
From: Michael Paquier
Date:
Subject: Re: Typo fixes in Solution.pm, part of MSVC scripts
Next
From: Heikki Linnakangas
Date:
Subject: Re: [ANNOUNCE] psqlODBC 09.03.0300 Released