Re: Issue with retry_count in socket.c and fix for the same - Mailing list pgsql-odbc
From | Malcolm MacLeod |
---|---|
Subject | Re: Issue with retry_count in socket.c and fix for the same |
Date | |
Msg-id | 1401122546.14277.8.camel@watchmen.homenetwork Whole thread Raw |
In response to | Re: Issue with retry_count in socket.c and fix for the same (Heikki Linnakangas <hlinnakangas@vmware.com>) |
Responses |
Re: Issue with retry_count in socket.c and fix for the same
|
List | pgsql-odbc |
On Thu, 2014-05-15 at 14:25 +0300, Heikki Linnakangas wrote: > On 05/15/2014 11:10 AM, Itnal, Prakash (NSN - IN/Bangalore) wrote: > > Recently we encountered a case where clients connected to postgres server through psqlODBC driver got strucked (hanged)when server unexpectedly shutdown (hard power off or LAN cable removed). The clients were hanged forever. I analyzedthe code and found a potential issue with retry count logic in function SOCK_wait_for_ready() (in socket.c). > > > > The no_timeout variable is initialized to TRUE and is not modified if retry_count is positive number. That means if retry_countis passed then it will treat it as "no timeout". If I understood it correctly it should work other way. If retry_countis positive number then socket should wait with timeout and not indefinitely. > > > > I have changed the default initialization of no_timeout to FALSE. Attached is the patch. Kindly review the same and letme know comments. > > > > Copied the function below for quick reference: socket.c: SOCK_wait_for_ready() > > > > 528 static int SOCK_wait_for_ready(SocketClass *sock, BOOL output, int retry_count) > > 529 { > > 530 int ret, gerrno; > > 531 #ifdef HAVE_POLL > > 532 struct pollfd fds; > > 533 #else > > 534 fd_set fds, except_fds; > > 535 struct timeval tm; > > 536 #endif /* HAVE_POLL */ > > 537 BOOL no_timeout = TRUE; > > 538 > > 539 if (0 == retry_count) > > 540 no_timeout = FALSE; > > 541 else if (0 > retry_count) > > 542 no_timeout = TRUE; > > 543 #ifdef USE_SSL > > 544 else if (sock->ssl == NULL) > > 545 no_timeout = TRUE; > > 546 #endif /* USE_SSL */ > > 547 do { > > 548 #ifdef HAVE_POLL > > 549 fds.fd = sock->socket; > > 550 fds.events = output ? POLLOUT : POLLIN; > > 551 fds.revents = 0; > > 552 ret = poll(&fds, 1, no_timeout ? -1 : retry_count * 1000); > > 553 mylog("!!! poll ret=%d revents=%x\n", ret, fds.revents); > > 554 #else > > 555 FD_ZERO(&fds); > > 556 FD_ZERO(&except_fds); > > 557 FD_SET(sock->socket, &fds); > > 558 FD_SET(sock->socket, &except_fds); > > 559 if (!no_timeout) > > 560 { > > 561 tm.tv_sec = retry_count; > > 562 tm.tv_usec = 0; > > 563 } > > 564 ret = select((int) sock->socket + 1, output ? NULL : &fds, output ? &fds : NULL, &except_fds, no_timeout? NULL : &tm); > > 565 #endif /* HAVE_POLL */ > > 566 gerrno = SOCK_ERRNO; > > 567 } while (ret < 0 && EINTR == gerrno); > > 568 if (retry_count < 0) > > 569 retry_count *= -1; > > 570 if (0 == ret && retry_count > MAX_RETRY_COUNT) > > 571 { > > 572 ret = -1; > > 573 SOCK_set_error(sock, output ? SOCKET_WRITE_TIMEOUT : SOCKET_READ_TIMEOUT, "SOCK_wait_for_ready timeout"); > > 574 } > > 575 return ret; > > 576 } > > Ugh. Yes, it's clearly bogus as it is. no_timeout is only ever set to > FALSE, when retry_count is 0. And when retry_count is 0, the timeout is > also set to 0. So the whole timeout and retry_count thing is a very > complicated way of saying that when retry_count==0, just check if there > is any outstanding data to read / room for writing without blocking, > otherwise wait forever. > > I'm not so sure it's supposed to behave like you described either, > though. For example, if we normally a timeout by default, why would we > not want a timeout when SSL is enabled? And the "0 > retry_count" check > is never going to be true (unless retry_count wraps around) > > I wonder why we need any timeout or retry counters, ever. Frankly, I'm > inclined to just rip all that out. A timeout would be a useful feature > for many applications, to detect a broken connection more quickly, but > the way it's currently implemented is pretty much useless for that. For > starters, it would have to be configurable, because many applications > would not want to have a timeout at all. A built-in timeout of 60 > seconds or so, which I think you'd get with your patch, would be too > short for OLAP kind of queries that can run for hours before returning > any results. Just wanted to confirm that I have tracked this down as the cause for a freeze/crash in our software as well. I can get it to block forever on the select call just by disconnecting/reconnecting network on a client a few times while it is querying lots of data from the server. I've tested the original patch and it does fix the issue, would be great to see a final fix for this. Thanks, Malcolm MacLeod
pgsql-odbc by date: