Thread: CC_send_query_append crash
Hello, While stress testing our software to handle network reconnects/disconnects I have run into a crash in your ODBC driver. The crash seems to occur because CC_send_query_append crash takes a local copy of the pointer 'self->sock' at the top of the function, 'self' is then passed around to various functions (some of which have the side effect of setting self->sock to NULL (and deleting) if there is a lost connection) and then the local copy of the pointer (which is now dangling) is dereferenced lower down in the function. Essentially if there is a disconnect while CC_send_query_append is running there is a risk of crash. I hope that you can fix this soon. Regards, Malcolm MacLeod
Attachment
On Wed, Apr 30, 2014 at 1:44 AM, Malcolm MacLeod <malcolm.macleod@tshwanedje.com> wrote: > The crash seems to occur because CC_send_query_append crash takes a > local copy of the pointer 'self->sock' at the top of the function, > 'self' is then passed around to various functions (some of which have > the side effect of setting self->sock to NULL (and deleting) if there is > a lost connection) and then the local copy of the pointer (which is now > dangling) is dereferenced lower down in the function. > Essentially if there is a disconnect while CC_send_query_append is > running there is a risk of crash. Looking at the code, I am seeing that the problem is related to CC_on_abort where conn->sock is set to NULL when the connection is considered as dead. And I am indeed seeing two code paths (when sending the 'C' message there is an ABORT check and in cleanup section) that could use this NULL socket afterwards. Your patch is perhaps a bit too much. So I am proposing the attached patch instead. Let me know if this fixes your issue as well. Regards, -- Michael
Attachment
> <malcolm.macleod@tshwanedje.com> wrote: > > The crash seems to occur because CC_send_query_append crash takes a > > local copy of the pointer 'self->sock' at the top of the function, > > 'self' is then passed around to various functions (some of which have > > the side effect of setting self->sock to NULL (and deleting) if there is > > a lost connection) and then the local copy of the pointer (which is now > > dangling) is dereferenced lower down in the function. > > Essentially if there is a disconnect while CC_send_query_append is > > running there is a risk of crash. > Looking at the code, I am seeing that the problem is related to > CC_on_abort where conn->sock is set to NULL when the connection is > considered as dead. And I am indeed seeing two code paths (when > sending the 'C' message there is an ABORT check and in cleanup > section) that could use this NULL socket afterwards. Your patch is > perhaps a bit too much. So I am proposing the attached patch instead. > Let me know if this fixes your issue as well. Thanks for the fast response! Your proposed patch would also fix the issue, so I have no problem with it being used instead. I guess from my side I just don't personally understand the point of keeping the local pointer copy at all (it just seems like an invitation for this sort of thing to occur) - so it made more sense to me to remove it entirely to prevent future occurrences of similar issues - although I suppose also the less code disturbed the better. I am not overly familiar with the code so can't say what is best. Thanks, Malcolm
On 05/01/2014 02:47 PM, Malcolm MacLeod wrote: > >> <malcolm.macleod@tshwanedje.com> wrote: >>> The crash seems to occur because CC_send_query_append crash takes a >>> local copy of the pointer 'self->sock' at the top of the function, >>> 'self' is then passed around to various functions (some of which have >>> the side effect of setting self->sock to NULL (and deleting) if there is >>> a lost connection) and then the local copy of the pointer (which is now >>> dangling) is dereferenced lower down in the function. >>> Essentially if there is a disconnect while CC_send_query_append is >>> running there is a risk of crash. >> Looking at the code, I am seeing that the problem is related to >> CC_on_abort where conn->sock is set to NULL when the connection is >> considered as dead. And I am indeed seeing two code paths (when >> sending the 'C' message there is an ABORT check and in cleanup >> section) that could use this NULL socket afterwards. Your patch is >> perhaps a bit too much. So I am proposing the attached patch instead. >> Let me know if this fixes your issue as well. > > Thanks for the fast response! > Your proposed patch would also fix the issue, so I have no problem with > it being used instead. > > I guess from my side I just don't personally understand the point of > keeping the local pointer copy at all (it just seems like an invitation > for this sort of thing to occur) - so it made more sense to me to remove > it entirely to prevent future occurrences of similar issues - although I > suppose also the less code disturbed the better. I am not overly > familiar with the code so can't say what is best. I agree that keeping the local pointer copy is an accident waiting to happen. Michael's more surgical approach would make sense if we had to carefully track the places where self->sock might become NULL. But we don't, as all the SOCK_* macros and functions contain NULL checks. I think the extra null check that both patch versions added to the cleanup code path is also unnecessary and wrong. SOCK_get_errcode() returns SOCKET_CLOSED error if the argument is null, which will cause the connection to be marked as dead, which is the right thing to do. However, when I removed that check, the "diagnostics" regression test started to fail. What happens is that when the socket is closed while processing results in CC_send_query_append, the incomplete QResult object is discarded, even though it contained a useful error message. So we lose some information that way. But that in turn is pretty simple to fix: if we don't set ReadyToReturn in the cleanup routine, then the partial QResult object is returned again. Looking closer at the loop in CC_send_query_append(), I think it's basically wrong to set ReadyToReturn when an error occurs. ReadyToReturn indicates that the whole result set has been processed, and we're ready to send a new query. In many of the error conditions where we currently set ReadyToReturn, we're clearly not ready to process the next query. For example, in the 'T' case, if QR_Constructor() fails because of out-of-memory, we set ReadyToReturn=true and just bail out of the loop. That's wrong - we must process the rest of the result set before returning. Anyway, I committed your original patch with minor changes, which fixes your original problem (commit 22b151e). But the error handling in that function could use some more work... - Heikki
On Mon, 2014-05-05 at 11:07 +0300, Heikki Linnakangas wrote: > On 05/01/2014 02:47 PM, Malcolm MacLeod wrote: > > > >> <malcolm.macleod@tshwanedje.com> wrote: > >>> The crash seems to occur because CC_send_query_append crash takes a > >>> local copy of the pointer 'self->sock' at the top of the function, > >>> 'self' is then passed around to various functions (some of which have > >>> the side effect of setting self->sock to NULL (and deleting) if there is > >>> a lost connection) and then the local copy of the pointer (which is now > >>> dangling) is dereferenced lower down in the function. > >>> Essentially if there is a disconnect while CC_send_query_append is > >>> running there is a risk of crash. > >> Looking at the code, I am seeing that the problem is related to > >> CC_on_abort where conn->sock is set to NULL when the connection is > >> considered as dead. And I am indeed seeing two code paths (when > >> sending the 'C' message there is an ABORT check and in cleanup > >> section) that could use this NULL socket afterwards. Your patch is > >> perhaps a bit too much. So I am proposing the attached patch instead. > >> Let me know if this fixes your issue as well. > > > > Thanks for the fast response! > > Your proposed patch would also fix the issue, so I have no problem with > > it being used instead. > > > > I guess from my side I just don't personally understand the point of > > keeping the local pointer copy at all (it just seems like an invitation > > for this sort of thing to occur) - so it made more sense to me to remove > > it entirely to prevent future occurrences of similar issues - although I > > suppose also the less code disturbed the better. I am not overly > > familiar with the code so can't say what is best. > > I agree that keeping the local pointer copy is an accident waiting to > happen. Michael's more surgical approach would make sense if we had to > carefully track the places where self->sock might become NULL. But we > don't, as all the SOCK_* macros and functions contain NULL checks. > > I think the extra null check that both patch versions added to the > cleanup code path is also unnecessary and wrong. SOCK_get_errcode() > returns SOCKET_CLOSED error if the argument is null, which will cause > the connection to be marked as dead, which is the right thing to do. > However, when I removed that check, the "diagnostics" regression test > started to fail. What happens is that when the socket is closed while > processing results in CC_send_query_append, the incomplete QResult > object is discarded, even though it contained a useful error message. So > we lose some information that way. But that in turn is pretty simple to > fix: if we don't set ReadyToReturn in the cleanup routine, then the > partial QResult object is returned again. > > Looking closer at the loop in CC_send_query_append(), I think it's > basically wrong to set ReadyToReturn when an error occurs. ReadyToReturn > indicates that the whole result set has been processed, and we're ready > to send a new query. In many of the error conditions where we currently > set ReadyToReturn, we're clearly not ready to process the next query. > For example, in the 'T' case, if QR_Constructor() fails because of > out-of-memory, we set ReadyToReturn=true and just bail out of the loop. > That's wrong - we must process the rest of the result set before returning. > > Anyway, I committed your original patch with minor changes, which fixes > your original problem (commit 22b151e). But the error handling in that > function could use some more work... Thanks! Will get a copy of the latest code and use it for testing. Unfortunately I don't have the time to get familiar with the code myself or I would try help a bit more. I will throw our stress tests at the latest code and see if anything else turns up though. Any idea on when the next official psqlodbc build is likely to take place? Would be nice to know for planning purposes so we can get our clients onto a fixed version. - Malcolm MacLeod
On 05/07/2014 05:37 PM, Malcolm MacLeod wrote: > Any idea on when the next official psqlodbc build is likely to take > place? Would be nice to know for planning purposes so we can get our > clients onto a fixed version. CC'ing Hiroshi Saito. He has usually prepared the releases, I don't know if he has any imminent plans. There have been a few bug fixes since last release, including this one, so it would be nice to get a release out. Then again, none of the bugs fixed have been very serious, so there's no big hurry. - Heikki
Heikki Linnakangas wrote: > On 05/07/2014 05:37 PM, Malcolm MacLeod wrote: > >Any idea on when the next official psqlodbc build is likely to take > >place? Would be nice to know for planning purposes so we can get our > >clients onto a fixed version. > > CC'ing Hiroshi Saito. He has usually prepared the releases, I don't > know if he has any imminent plans. > > There have been a few bug fixes since last release, including this > one, so it would be nice to get a release out. Then again, none of > the bugs fixed have been very serious, so there's no big hurry. Can I direct your attention towards http://www.postgresql.org/message-id/20140318183906.GW6899@eldon.alvh.no-ip.org ? I'm not in a position to produce a tested patch, but it should be fairly trivial in nature. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services