Thread: Issue with retry_count in socket.c and fix for the same
Issue with retry_count in socket.c and fix for the same
I am not sure whether last mail delivered correctly! Resending again.
--
Regards,
Prakash I.
From: pgsql-odbc-owner@postgresql.org [mailto:pgsql-odbc-owner@postgresql.org] On Behalf Of ext Itnal, Prakash (NSN - IN/Bangalore)
Sent: Thursday, May 15, 2014 10:20 AM
To: pgsql-odbc@postgresql.org
Subject: [ODBC] Issue with retry_count in socket.c and fix for the same
Importance: High
Hi,
Recently we encountered a case where clients connected to postgres server through psqlODBC driver got strucked (hanged) when server unexpectedly shutdown (hard power off or LAN cable removed). The clients were hanged forever. I analyzed the code and found a potential issue with retry count logic in function SOCK_wait_for_ready() (in socket.c).
The no_timeout variable is initialized to TRUE and is not modified if retry_count is positive number. That means if retry_count is passed then it will treat it as “no timeout”. If I understood it correctly it should work other way. If retry_count is positive number then socket should wait with timeout and not indefinitely.
I have changed the default initialization of no_timeout to FALSE. Attached is the patch. Kindly review the same and let me know comments.
Copied the function below for quick reference: socket.c: SOCK_wait_for_ready()
528 static int SOCK_wait_for_ready(SocketClass *sock, BOOL output, int retry_count)
529 {
530 int ret, gerrno;
531 #ifdef HAVE_POLL
532 struct pollfd fds;
533 #else
534 fd_set fds, except_fds;
535 struct timeval tm;
536 #endif /* HAVE_POLL */
537 BOOL no_timeout = TRUE;
538
539 if (0 == retry_count)
540 no_timeout = FALSE;
541 else if (0 > retry_count)
542 no_timeout = TRUE;
543 #ifdef USE_SSL
544 else if (sock->ssl == NULL)
545 no_timeout = TRUE;
546 #endif /* USE_SSL */
547 do {
548 #ifdef HAVE_POLL
549 fds.fd = sock->socket;
550 fds.events = output ? POLLOUT : POLLIN;
551 fds.revents = 0;
552 ret = poll(&fds, 1, no_timeout ? -1 : retry_count * 1000);
553 mylog("!!! poll ret=%d revents=%x\n", ret, fds.revents);
554 #else
555 FD_ZERO(&fds);
556 FD_ZERO(&except_fds);
557 FD_SET(sock->socket, &fds);
558 FD_SET(sock->socket, &except_fds);
559 if (!no_timeout)
560 {
561 tm.tv_sec = retry_count;
562 tm.tv_usec = 0;
563 }
564 ret = select((int) sock->socket + 1, output ? NULL : &fds, output ? &fds : NULL, &except_fds, no_timeout ? NULL : &tm);
565 #endif /* HAVE_POLL */
566 gerrno = SOCK_ERRNO;
567 } while (ret < 0 && EINTR == gerrno);
568 if (retry_count < 0)
569 retry_count *= -1;
570 if (0 == ret && retry_count > MAX_RETRY_COUNT)
571 {
572 ret = -1;
573 SOCK_set_error(sock, output ? SOCKET_WRITE_TIMEOUT : SOCKET_READ_TIMEOUT, "SOCK_wait_for_ready timeout");
574 }
575 return ret;
576 }
--
Regards,
Prakash I.
Attachment
On 05/15/2014 11:10 AM, Itnal, Prakash (NSN - IN/Bangalore) wrote: > Recently we encountered a case where clients connected to postgres server through psqlODBC driver got strucked (hanged)when server unexpectedly shutdown (hard power off or LAN cable removed). The clients were hanged forever. I analyzedthe code and found a potential issue with retry count logic in function SOCK_wait_for_ready() (in socket.c). > > The no_timeout variable is initialized to TRUE and is not modified if retry_count is positive number. That means if retry_countis passed then it will treat it as "no timeout". If I understood it correctly it should work other way. If retry_countis positive number then socket should wait with timeout and not indefinitely. > > I have changed the default initialization of no_timeout to FALSE. Attached is the patch. Kindly review the same and letme know comments. > > Copied the function below for quick reference: socket.c: SOCK_wait_for_ready() > > 528 static int SOCK_wait_for_ready(SocketClass *sock, BOOL output, int retry_count) > 529 { > 530 int ret, gerrno; > 531 #ifdef HAVE_POLL > 532 struct pollfd fds; > 533 #else > 534 fd_set fds, except_fds; > 535 struct timeval tm; > 536 #endif /* HAVE_POLL */ > 537 BOOL no_timeout = TRUE; > 538 > 539 if (0 == retry_count) > 540 no_timeout = FALSE; > 541 else if (0 > retry_count) > 542 no_timeout = TRUE; > 543 #ifdef USE_SSL > 544 else if (sock->ssl == NULL) > 545 no_timeout = TRUE; > 546 #endif /* USE_SSL */ > 547 do { > 548 #ifdef HAVE_POLL > 549 fds.fd = sock->socket; > 550 fds.events = output ? POLLOUT : POLLIN; > 551 fds.revents = 0; > 552 ret = poll(&fds, 1, no_timeout ? -1 : retry_count * 1000); > 553 mylog("!!! poll ret=%d revents=%x\n", ret, fds.revents); > 554 #else > 555 FD_ZERO(&fds); > 556 FD_ZERO(&except_fds); > 557 FD_SET(sock->socket, &fds); > 558 FD_SET(sock->socket, &except_fds); > 559 if (!no_timeout) > 560 { > 561 tm.tv_sec = retry_count; > 562 tm.tv_usec = 0; > 563 } > 564 ret = select((int) sock->socket + 1, output ? NULL : &fds, output ? &fds : NULL, &except_fds, no_timeout ?NULL : &tm); > 565 #endif /* HAVE_POLL */ > 566 gerrno = SOCK_ERRNO; > 567 } while (ret < 0 && EINTR == gerrno); > 568 if (retry_count < 0) > 569 retry_count *= -1; > 570 if (0 == ret && retry_count > MAX_RETRY_COUNT) > 571 { > 572 ret = -1; > 573 SOCK_set_error(sock, output ? SOCKET_WRITE_TIMEOUT : SOCKET_READ_TIMEOUT, "SOCK_wait_for_ready timeout"); > 574 } > 575 return ret; > 576 } Ugh. Yes, it's clearly bogus as it is. no_timeout is only ever set to FALSE, when retry_count is 0. And when retry_count is 0, the timeout is also set to 0. So the whole timeout and retry_count thing is a very complicated way of saying that when retry_count==0, just check if there is any outstanding data to read / room for writing without blocking, otherwise wait forever. I'm not so sure it's supposed to behave like you described either, though. For example, if we normally a timeout by default, why would we not want a timeout when SSL is enabled? And the "0 > retry_count" check is never going to be true (unless retry_count wraps around) I wonder why we need any timeout or retry counters, ever. Frankly, I'm inclined to just rip all that out. A timeout would be a useful feature for many applications, to detect a broken connection more quickly, but the way it's currently implemented is pretty much useless for that. For starters, it would have to be configurable, because many applications would not want to have a timeout at all. A built-in timeout of 60 seconds or so, which I think you'd get with your patch, would be too short for OLAP kind of queries that can run for hours before returning any results. - Heikki
Re: Issue with retry_count in socket.c and fix for the same
I agree, it looks so complex. I will try to reorganize the retry logic and make it simple. Once done I will share the patch. Meantime I am new to psql-odbc driver code, so not sure how to make this timeout as configurable parameter from odbc.inifile? As per the keywords (http://psqlodbc.projects.pgfoundry.org/docs/config-opt.html), I don't find any existingparameter which can be used. Looks like we need to add a new parameter something like "ConnectionTimeout"! -- Regards, Prakash I. -----Original Message----- From: ext Heikki Linnakangas [mailto:hlinnakangas@vmware.com] Sent: Thursday, May 15, 2014 4:55 PM To: Itnal, Prakash (NSN - IN/Bangalore); pgsql-odbc@postgresql.org Subject: Re: [ODBC] Issue with retry_count in socket.c and fix for the same On 05/15/2014 11:10 AM, Itnal, Prakash (NSN - IN/Bangalore) wrote: > Recently we encountered a case where clients connected to postgres server through psqlODBC driver got strucked (hanged)when server unexpectedly shutdown (hard power off or LAN cable removed). The clients were hanged forever. I analyzedthe code and found a potential issue with retry count logic in function SOCK_wait_for_ready() (in socket.c). > > The no_timeout variable is initialized to TRUE and is not modified if retry_count is positive number. That means if retry_countis passed then it will treat it as "no timeout". If I understood it correctly it should work other way. If retry_countis positive number then socket should wait with timeout and not indefinitely. > > I have changed the default initialization of no_timeout to FALSE. Attached is the patch. Kindly review the same and letme know comments. > > Copied the function below for quick reference: socket.c: SOCK_wait_for_ready() > > 528 static int SOCK_wait_for_ready(SocketClass *sock, BOOL output, int retry_count) > 529 { > 530 int ret, gerrno; > 531 #ifdef HAVE_POLL > 532 struct pollfd fds; > 533 #else > 534 fd_set fds, except_fds; > 535 struct timeval tm; > 536 #endif /* HAVE_POLL */ > 537 BOOL no_timeout = TRUE; > 538 > 539 if (0 == retry_count) > 540 no_timeout = FALSE; > 541 else if (0 > retry_count) > 542 no_timeout = TRUE; > 543 #ifdef USE_SSL > 544 else if (sock->ssl == NULL) > 545 no_timeout = TRUE; > 546 #endif /* USE_SSL */ > 547 do { > 548 #ifdef HAVE_POLL > 549 fds.fd = sock->socket; > 550 fds.events = output ? POLLOUT : POLLIN; > 551 fds.revents = 0; > 552 ret = poll(&fds, 1, no_timeout ? -1 : retry_count * 1000); > 553 mylog("!!! poll ret=%d revents=%x\n", ret, fds.revents); > 554 #else > 555 FD_ZERO(&fds); > 556 FD_ZERO(&except_fds); > 557 FD_SET(sock->socket, &fds); > 558 FD_SET(sock->socket, &except_fds); > 559 if (!no_timeout) > 560 { > 561 tm.tv_sec = retry_count; > 562 tm.tv_usec = 0; > 563 } > 564 ret = select((int) sock->socket + 1, output ? NULL : &fds, output ? &fds : NULL, &except_fds, no_timeout ?NULL : &tm); > 565 #endif /* HAVE_POLL */ > 566 gerrno = SOCK_ERRNO; > 567 } while (ret < 0 && EINTR == gerrno); > 568 if (retry_count < 0) > 569 retry_count *= -1; > 570 if (0 == ret && retry_count > MAX_RETRY_COUNT) > 571 { > 572 ret = -1; > 573 SOCK_set_error(sock, output ? SOCKET_WRITE_TIMEOUT : SOCKET_READ_TIMEOUT, "SOCK_wait_for_ready timeout"); > 574 } > 575 return ret; > 576 } Ugh. Yes, it's clearly bogus as it is. no_timeout is only ever set to FALSE, when retry_count is 0. And when retry_count is 0, the timeout is also set to 0. So the whole timeout and retry_count thing is a very complicated way of saying that when retry_count==0, just check if there is any outstanding data to read / room for writing without blocking, otherwise wait forever. I'm not so sure it's supposed to behave like you described either, though. For example, if we normally a timeout by default, why would we not want a timeout when SSL is enabled? And the "0 > retry_count" check is never going to be true (unless retry_count wraps around) I wonder why we need any timeout or retry counters, ever. Frankly, I'm inclined to just rip all that out. A timeout would be a useful feature for many applications, to detect a broken connection more quickly, but the way it's currently implemented is pretty much useless for that. For starters, it would have to be configurable, because many applications would not want to have a timeout at all. A built-in timeout of 60 seconds or so, which I think you'd get with your patch, would be too short for OLAP kind of queries that can run for hours before returning any results. - Heikki
On Thu, 2014-05-15 at 14:25 +0300, Heikki Linnakangas wrote: > On 05/15/2014 11:10 AM, Itnal, Prakash (NSN - IN/Bangalore) wrote: > > Recently we encountered a case where clients connected to postgres server through psqlODBC driver got strucked (hanged)when server unexpectedly shutdown (hard power off or LAN cable removed). The clients were hanged forever. I analyzedthe code and found a potential issue with retry count logic in function SOCK_wait_for_ready() (in socket.c). > > > > The no_timeout variable is initialized to TRUE and is not modified if retry_count is positive number. That means if retry_countis passed then it will treat it as "no timeout". If I understood it correctly it should work other way. If retry_countis positive number then socket should wait with timeout and not indefinitely. > > > > I have changed the default initialization of no_timeout to FALSE. Attached is the patch. Kindly review the same and letme know comments. > > > > Copied the function below for quick reference: socket.c: SOCK_wait_for_ready() > > > > 528 static int SOCK_wait_for_ready(SocketClass *sock, BOOL output, int retry_count) > > 529 { > > 530 int ret, gerrno; > > 531 #ifdef HAVE_POLL > > 532 struct pollfd fds; > > 533 #else > > 534 fd_set fds, except_fds; > > 535 struct timeval tm; > > 536 #endif /* HAVE_POLL */ > > 537 BOOL no_timeout = TRUE; > > 538 > > 539 if (0 == retry_count) > > 540 no_timeout = FALSE; > > 541 else if (0 > retry_count) > > 542 no_timeout = TRUE; > > 543 #ifdef USE_SSL > > 544 else if (sock->ssl == NULL) > > 545 no_timeout = TRUE; > > 546 #endif /* USE_SSL */ > > 547 do { > > 548 #ifdef HAVE_POLL > > 549 fds.fd = sock->socket; > > 550 fds.events = output ? POLLOUT : POLLIN; > > 551 fds.revents = 0; > > 552 ret = poll(&fds, 1, no_timeout ? -1 : retry_count * 1000); > > 553 mylog("!!! poll ret=%d revents=%x\n", ret, fds.revents); > > 554 #else > > 555 FD_ZERO(&fds); > > 556 FD_ZERO(&except_fds); > > 557 FD_SET(sock->socket, &fds); > > 558 FD_SET(sock->socket, &except_fds); > > 559 if (!no_timeout) > > 560 { > > 561 tm.tv_sec = retry_count; > > 562 tm.tv_usec = 0; > > 563 } > > 564 ret = select((int) sock->socket + 1, output ? NULL : &fds, output ? &fds : NULL, &except_fds, no_timeout? NULL : &tm); > > 565 #endif /* HAVE_POLL */ > > 566 gerrno = SOCK_ERRNO; > > 567 } while (ret < 0 && EINTR == gerrno); > > 568 if (retry_count < 0) > > 569 retry_count *= -1; > > 570 if (0 == ret && retry_count > MAX_RETRY_COUNT) > > 571 { > > 572 ret = -1; > > 573 SOCK_set_error(sock, output ? SOCKET_WRITE_TIMEOUT : SOCKET_READ_TIMEOUT, "SOCK_wait_for_ready timeout"); > > 574 } > > 575 return ret; > > 576 } > > Ugh. Yes, it's clearly bogus as it is. no_timeout is only ever set to > FALSE, when retry_count is 0. And when retry_count is 0, the timeout is > also set to 0. So the whole timeout and retry_count thing is a very > complicated way of saying that when retry_count==0, just check if there > is any outstanding data to read / room for writing without blocking, > otherwise wait forever. > > I'm not so sure it's supposed to behave like you described either, > though. For example, if we normally a timeout by default, why would we > not want a timeout when SSL is enabled? And the "0 > retry_count" check > is never going to be true (unless retry_count wraps around) > > I wonder why we need any timeout or retry counters, ever. Frankly, I'm > inclined to just rip all that out. A timeout would be a useful feature > for many applications, to detect a broken connection more quickly, but > the way it's currently implemented is pretty much useless for that. For > starters, it would have to be configurable, because many applications > would not want to have a timeout at all. A built-in timeout of 60 > seconds or so, which I think you'd get with your patch, would be too > short for OLAP kind of queries that can run for hours before returning > any results. Just wanted to confirm that I have tracked this down as the cause for a freeze/crash in our software as well. I can get it to block forever on the select call just by disconnecting/reconnecting network on a client a few times while it is querying lots of data from the server. I've tested the original patch and it does fix the issue, would be great to see a final fix for this. Thanks, Malcolm MacLeod
On 05/15/2014 02:25 PM, Heikki Linnakangas wrote: > I wonder why we need any timeout or retry counters, ever. Frankly, I'm > inclined to just rip all that out. A timeout would be a useful feature > for many applications, to detect a broken connection more quickly, but > the way it's currently implemented is pretty much useless for that. For > starters, it would have to be configurable, because many applications > would not want to have a timeout at all. A built-in timeout of 60 > seconds or so, which I think you'd get with your patch, would be too > short for OLAP kind of queries that can run for hours before returning > any results. Ok, I removed the broken logic, for the sake of cleaning up the code. It still won't give the kind of a timeout that you hoped for, but at least it's less bogus now. - Heikki
On 05/26/2014 07:42 PM, Malcolm MacLeod wrote: > Just wanted to confirm that I have tracked this down as the cause for a > freeze/crash in our software as well. > I can get it to block forever on the select call just by > disconnecting/reconnecting network on a client a few times while it is > querying lots of data from the server. > I've tested the original patch and it does fix the issue, would be great > to see a final fix for this. Hmm, the driver enables TCP keepalives by default, which should eventually notice a broken connection. It can take a long time, though; I believe the default on most systems is two hours. Unfortunately there are no options in psqlODBC to change the defaults, but e.g. on Linux you can change the system-wide default with the tcp_keepalive_time sysctl. I think the missing feature here is to add psqlODBC configuration variables similar to libpq's keepalives_idle, keepalives_interval and keepalives_count. I'm not going to work on that myself, but patches are welcome. - Heikki