Thread: Fix for OpenSSL error queue bug
Attached patch fixes an issue reported by William Felipe Welter about a year ago [1]. It is loosely based on his original patch.

As Heikki goes into on that thread, the appropriate action seems to be to constantly reset the error queue, and to make sure that we ourselves clear the queue consistently. (Note that we might not have consistently called ERR_get_error() in the event of an OOM within SSLerrmessage(), for example). I have not changed backend code in the patch, though; I felt that we had enough control of the general situation there for it to be unnecessary to lock everything down.

The interface that OpenSSL exposes for all of this is very poorly thought out. It's not exactly clear how a client of OpenSSL can be a "good citizen" in handling the error queue. Correctly using the library is only ever described in terms of a very exact thing that must happen or must not happen. There is no overarching concept of how things fit together so that each OpenSSL client doesn't clobber the other. It's all rather impractical.

Clearly, this patch needs careful review.

[1] http://www.postgresql.org/message-id/flat/20150224030956.2529.83279@wrigleys.postgresql.org#20150224030956.2529.83279@wrigleys.postgresql.org

--
Peter Geoghegan
Attachment
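[For readers following along: the approach described above amounts to a pattern like the sketch below. This is illustrative only, assuming nothing beyond OpenSSL's documented ERR_* API; pq_ssl_read is a hypothetical name, not the attached patch.]

    #include <openssl/ssl.h>
    #include <openssl/err.h>

    /*
     * Reset the thread's error queue before each SSL I/O call, and
     * unconditionally pop it afterwards, so that an early exit elsewhere
     * (e.g. an OOM before the error-message code runs) cannot leave our
     * entries behind for other OpenSSL users in the same process.
     */
    static int
    pq_ssl_read(SSL *ssl, void *buf, int len, unsigned long *ecode)
    {
        int     n;

        ERR_clear_error();          /* queue is empty before the call */
        n = SSL_read(ssl, buf, len);
        *ecode = ERR_get_error();   /* always pop; never leave our entry */
        return n;
    }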
On 2/5/16 5:04 AM, Peter Geoghegan wrote:
> As Heikki goes into on that thread, the appropriate action seems to be
> to constantly reset the error queue, and to make sure that we
> ourselves clear the queue consistently. (Note that we might not have
> consistently called ERR_get_error() in the event of an OOM within
> SSLerrmessage(), for example). I have not changed backend code in the
> patch, though; I felt that we had enough control of the general
> situation there for it to be unnecessary to lock everything down.

I think clearing the error after a call is not necessary. The API clearly requires that you should clear the error queue before a call, so clearing it afterwards does not accomplish anything, except maybe make broken code work sometimes, for a while. Also, there is nothing that says that an error produces exactly one entry in the error queue; it could be multiple. Or that errors couldn't arise at random times between the reset and whatever happens next.

I think this is analogous to clearing errno before a C library call. You could clear it afterwards as well, to be nice to the next guy, but the next guy should really take care of that themselves, and we can't rely on what happens in between anyway.

The places that you identified for change look correct as far as libpq goes. I do think that the backend should be updated in the same way, because it's a) correct, b) easy enough, and c) there could well be interactions with postgres_fdw, plproxy, plperl, or who knows what.
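[The errno analogy maps onto OpenSSL roughly as in this sketch; ssl_write_checked is a hypothetical name, and only the documented SSL_write/SSL_get_error/ERR_* calls are assumed.]

    #include <openssl/ssl.h>
    #include <openssl/err.h>

    /*
     * Clear the queue immediately before our own call -- like "errno = 0"
     * before a libc call -- and read any error immediately afterwards for
     * our own report, trusting every other OpenSSL user in the process to
     * do the same.  No cleanup-after for the next guy's benefit.
     */
    static unsigned long
    ssl_write_checked(SSL *ssl, const void *buf, int len)
    {
        int     n;

        ERR_clear_error();              /* like clearing errno beforehand */
        n = SSL_write(ssl, buf, len);
        if (SSL_get_error(ssl, n) == SSL_ERROR_SSL)
            return ERR_get_error();     /* like reading errno right after */
        return 0;                       /* no library error to report */
    }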
On Thu, Mar 3, 2016 at 7:15 PM, Peter Eisentraut <peter_e@gmx.net> wrote:
> On 2/5/16 5:04 AM, Peter Geoghegan wrote:
>> As Heikki goes into on that thread, the appropriate action seems to be
>> to constantly reset the error queue, and to make sure that we
>> ourselves clear the queue consistently. (Note that we might not have
>> consistently called ERR_get_error() in the event of an OOM within
>> SSLerrmessage(), for example). I have not changed backend code in the
>> patch, though; I felt that we had enough control of the general
>> situation there for it to be unnecessary to lock everything down.
>
> I think clearing the error after a call is not necessary. The API
> clearly requires that you should clear the error queue before a call, so
> clearing it afterwards does not accomplish anything, except maybe make
> broken code work sometimes, for a while. Also, there is nothing that
> says that an error produces exactly one entry in the error queue; it
> could be multiple. Or that errors couldn't arise at random times
> between the reset and whatever happens next.
>
> I think this is analogous to clearing errno before a C library call.
> You could clear it afterwards as well, to be nice to the next guy, but
> the next guy should really take care of that themselves, and we can't
> rely on what happens in between anyway.
>
> The places that you identified for change look correct as far as libpq
> goes. I do think that the backend should be updated in the same way,
> because it's a) correct, b) easy enough, and c) there could well be
> interactions with postgres_fdw, plproxy, plperl, or who knows what.

So what's the next step here? Peter G, are you planning to update the patch based on this review from Peter E? If not, Peter E, do you want to update the patch and commit? If neither, I'm going to mark this Returned with Feedback in the CF and move on, which seems a bit of a shame since this appears to be a bona fide bug, but if nobody's willing to work on it, it ain't gettin' fixed.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Mar 10, 2016 at 2:50 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> So what's the next step here? Peter G, are you planning to update the
> patch based on this review from Peter E? If not, Peter E, do you want
> to update the patch and commit? If neither, I'm going to mark this
> Returned with Feedback in the CF and move on, which seems a bit of a
> shame since this appears to be a bona fide bug, but if nobody's
> willing to work on it, it ain't gettin' fixed.

Getting to it very soon. Just really busy right this moment.

--
Peter Geoghegan
On Thu, Mar 10, 2016 at 3:09 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Getting to it very soon. Just really busy right this moment.

That said, I agree with Peter's remarks about doing this frontend and backend. So, while I'm not sure, I think we're in agreement on all issues. I would have no problem with Peter E following through with final steps + commit as Robert outlined, if that works for him.

--
Peter Geoghegan
On 3/10/16 6:10 PM, Peter Geoghegan wrote:
> On Thu, Mar 10, 2016 at 3:09 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> Getting to it very soon. Just really busy right this moment.
>
> That said, I agree with Peter's remarks about doing this frontend and
> backend. So, while I'm not sure, I think we're in agreement on all
> issues. I would have no problem with Peter E following through with
> final steps + commit as Robert outlined, if that works for him.

My proposal is the attached patch.
Attachment
Looked at your proposed patch. Will respond to your original mail on the matter.

On Thu, Mar 3, 2016 at 4:15 PM, Peter Eisentraut <peter_e@gmx.net> wrote:
> I think clearing the error after a call is not necessary. The API
> clearly requires that you should clear the error queue before a call, so
> clearing it afterwards does not accomplish anything, except maybe make
> broken code work sometimes, for a while.

Uh, if it's so clear, then why haven't we been doing it all along? The API doesn't require you to take *any* specific practical measure (for example, the specific practical measure of resetting the queue before calling an I/O function). It simply says "this exact thing cannot be allowed to happen; the consequences are undefined", with nothing in the way of guidance on what that means in the real world. It's a shockingly bad API, but that's the reality.

Part of the problem is that various scripting language OpenSSL wrappers are only very thin wrappers. They effectively pass the buck on to PHP and Ruby devs. If we cannot get it right, what chance have they? I've personally experienced a big uptick in complaints about this recently. I think there are 3 separate groups within Heroku that regularly ask me how this patch is doing.

> Also, there is nothing that
> says that an error produces exactly one entry in the error queue; it
> could be multiple. Or that errors couldn't arise at random times
> between the reset and whatever happens next.

I think that it's kind of implied, since calling ERR_get_error() pops the stack. But even if that isn't so, it might be worth preventing bad things from happening to client applications only sometimes.

> I think this is analogous to clearing errno before a C library call.
> You could clear it afterwards as well, to be nice to the next guy, but
> the next guy should really take care of that themselves, and we can't
> rely on what happens in between anyway.

It sounds like you're saying "well, we cannot be expected to bend over backwards to make broken code work". But that broken code includes every single version of libpq + OpenSSL currently distributed. Seems like a very high standard. I'm not saying that that means we definitely should clear the error queue reliably ourselves, but doesn't it give you pause?

Heikki seemed to think that clearing our own queue was important when he looked at this a year ago:

http://www.postgresql.org/message-id/54EDD30D.5050107@vmware.com

Again, not conclusive, but I would like to hear a rationale for why you think it's okay to not consistently clear our own queue for the benefit of others. Is this informed by a concern about some specific downside to taking that extra precaution?

Thanks

--
Peter Geoghegan
On 3/10/16 9:38 PM, Peter Geoghegan wrote:
> Looked at your proposed patch. Will respond to your original mail on the matter.
>
> On Thu, Mar 3, 2016 at 4:15 PM, Peter Eisentraut <peter_e@gmx.net> wrote:
>> I think clearing the error after a call is not necessary. The API
>> clearly requires that you should clear the error queue before a call, so
>> clearing it afterwards does not accomplish anything, except maybe make
>> broken code work sometimes, for a while.
>
> Uh, if it's so clear, then why haven't we been doing it all along?

The issue only happens if two interleaving trains of execution, one of which is libpq, use OpenSSL. Not many applications do that. And you also need to provoke the errors in a certain order. And even then, in some cases you might just see a false positive error, rather than a crash. So it's an edge case.

> Part of the problem is that various scripting language OpenSSL
> wrappers are only very thin wrappers. They effectively pass the buck
> on to PHP and Ruby devs. If we cannot get it right, what chance have
> they? I've personally experienced a big uptick in complaints about
> this recently. I think there are 3 separate groups within Heroku that
> regularly ask me how this patch is doing.

I think they have been getting away with it for so long for the same reasons.

Arguably, if everyone followed "my" approach, this should be very easy to fix everywhere. Instead, reading through the PHP bug report, they are coming up with a fairly complex solution for clearing the error queue afterwards so as to not leave "landmines" for the next guy. But their code will (AFAICT) still be wrong because they are not clearing the error *before* the API calls where it is required per documentation. So "everyone" (sample of 2) is scrambling to clean up for the next guy instead of doing the straightforward fix of following the API documentation and cleaning up before their own calls. I also see the clean-up-afterwards approach in the Python ssl module.

I fear there is de facto a second API specification that requires you to clean up errors after yourself and gives an implicit guarantee that the error queue is empty whenever you want to make any SSL calls. I don't think this actually works in all cases, but maybe if everyone else is convinced of that (in plain violation of the published OpenSSL documentation, AFAICT) we need to get on board with that for interoperability.

>> Also, there is nothing that
>> says that an error produces exactly one entry in the error queue; it
>> could be multiple. Or that errors couldn't arise at random times
>> between the reset and whatever happens next.
>
> I think that it's kind of implied, since calling ERR_get_error() pops
> the stack. But even if that isn't so, it might be worth preventing bad
> things from happening to client applications only sometimes.

The lore on the internet suggests that multiple errors could definitely happen. So popping one error afterwards isn't going to fix it, it just moves the edge case around. At least what we should do is clear the entire queue afterwards instead of just the first error.

> doesn't it give you pause? Heikki seemed to think that clearing our
> own queue was important when he looked at this a year ago:
>
> http://www.postgresql.org/message-id/54EDD30D.5050107@vmware.com

I think that message suggests that we should clear the queue before each call, not after.
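[Clearing the entire queue afterwards amounts to something like the sketch below; the helper name is hypothetical, and ERR_clear_error() achieves the same thing in a single call.]

    #include <openssl/err.h>

    /* Pop every pending entry off this thread's error queue. */
    static void
    drain_ssl_error_queue(void)
    {
        while (ERR_get_error() != 0)
            ;               /* discard entries until the queue is empty */
    }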
On Thu, Mar 10, 2016 at 7:22 PM, Peter Eisentraut <peter_e@gmx.net> wrote:
> Arguably, if everyone followed "my" approach, this should be very easy
> to fix everywhere.

I don't think that there is any clear indication that the OpenSSL people would share that view. Or my view. Or anything that's sensible or practical or actionable.

> Instead, reading through the PHP bug report, they
> are coming up with a fairly complex solution for clearing the error
> queue afterwards so as to not leave "landmines" for the next guy. But
> their code will (AFAICT) still be wrong because they are not clearing
> the error *before* the API calls where it is required per documentation.
> So "everyone" (sample of 2) is scrambling to clean up for the next guy
> instead of doing the straightforward fix of following the API
> documentation and cleaning up before their own calls.

It will be less wrong, though. PostgreSQL is the project that doesn't trust a C90 standard library function to not blithely write past the bounds of a passed buffer, all because of a bug in certain versions of Solaris based systems that was several years old at the time. See commit be8b06c364. My view that that wasn't really worth worrying about was clearly the minority view when this was discussed (a minority of 1, and a majority of 4 or 5, IIRC). I think that this case vastly exceeds that standard for worrying about other people's broken code; in this case, we ourselves made the same mistake for years and years.

> I also see the clean-up-afterwards approach in the Python ssl module. I
> fear there is de facto a second API specification that requires you to
> clean up errors after yourself and gives an implicit guarantee that the
> error queue is empty whenever you want to make any SSL calls. I don't
> think this actually works in all cases, but maybe if everyone else is
> convinced of that (in plain violation of the published OpenSSL
> documentation, AFAICT) we need to get on board with that for
> interoperability.

I didn't know that Python's ssl module did that. That seems to lend support to my view, which is that we should similarly clear the thread's queue lest anyone else be affected. Yes, this approach is fairly scatter-gun, but frankly that's just the situation we find ourselves in.

>>> Also, there is nothing that
>>> says that an error produces exactly one entry in the error queue; it
>>> could be multiple. Or that errors couldn't arise at random times
>>> between the reset and whatever happens next.
>>
>> I think that it's kind of implied, since calling ERR_get_error() pops
>> the stack. But even if that isn't so, it might be worth preventing bad
>> things from happening to client applications only sometimes.
>
> The lore on the internet suggests that multiple errors could definitely
> happen. So popping one error afterwards isn't going to fix it, it just
> moves the edge case around.

Are you sure, specifically, that an I/O function is known to add more than one error to the per-thread queue? Obviously there can be more than one error in the queue. But I haven't seen any specific indication, either in the docs or in the lore, that more than one error can be added by a single call to an I/O function such as SSL_read(). Perhaps you can share where you encountered the lore.

>> doesn't it give you pause? Heikki seemed to think that clearing our
>> own queue was important when he looked at this a year ago:
>>
>> http://www.postgresql.org/message-id/54EDD30D.5050107@vmware.com
>
> I think that message suggests that we should clear the queue before each
> call, not after.

Uh, it very clearly *is* Heikki's view that we should clear the queue *afterwards*. Certainly, I think Heikki also wanted us to clear the queue before, so we aren't screwed, "just to be sure", as he puts it -- but nobody disputes that that's necessary anyway. That it might not be *sufficient* to just call ERR_get_error() is the new information in the bug report.

Heikki said:

"""
<pedantic>
The OpenSSL manual doesn't directly require you to call ERR_clear_error() before every SSL_* call. It just requires that you ensure that the error queue is empty. Libpq ensures that by always clearing the queue *after* an error happens, in SSLerrmessage().
</pedantic>
"""

The problem with this, as Heikki goes on to say, is that we might not get to that point in SSLerrmessage(); we may not be able to pop the queue/call ERR_get_error(), more or less by accident (e.g. I noticed an OOM could do that). That's why I proposed to fix that by calling ERR_get_error() early and unambiguously. If we must rely on that happening, it should not be from such a long distance (i.e. from within SSLerrmessage(), which is kind of far removed from the original I/O function calls).

Thanks

--
Peter Geoghegan
On Thu, Mar 10, 2016 at 8:16 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Thu, Mar 10, 2016 at 7:22 PM, Peter Eisentraut <peter_e@gmx.net> wrote:
>> Arguably, if everyone followed "my" approach, this should be very easy
>> to fix everywhere.
>
> I don't think that there is any clear indication that the OpenSSL
> people would share that view. Or my view. Or anything that's sensible
> or practical or actionable.

Ideally, we'd be able to get this into the upcoming minor release. This bug has caused Heroku some serious problems.

--
Peter Geoghegan
Peter Geoghegan <pg@heroku.com> writes:
> On Thu, Mar 10, 2016 at 8:16 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> On Thu, Mar 10, 2016 at 7:22 PM, Peter Eisentraut <peter_e@gmx.net> wrote:
>>> Arguably, if everyone followed "my" approach, this should be very easy
>>> to fix everywhere.

>> I don't think that there is any clear indication that the OpenSSL
>> people would share that view. Or my view. Or anything that's sensible
>> or practical or actionable.

> Ideally, we'd be able to get this into the upcoming minor release.

Agreed, we need to deal with this one way or the other. My proposal is:

1. In HEAD, do it as Peter E. suggests, ie clear error queue before calls.

2. In back branches, clear error queue before *and* after calls. This will waste a few nanoseconds but will avoid any risk of breaking existing third-party code.

I think it's reasonable to expect extensions to update to the newer behavior in a major release, but we're taking risks if we expect that to happen in minor releases.

In any case, we need a patch that touches the backend-side code as well.

			regards, tom lane
On Mon, Mar 14, 2016 at 3:06 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Agreed, we need to deal with this one way or the other. My proposal
> is:
>
> 1. In HEAD, do it as Peter E. suggests, ie clear error queue before calls.
>
> 2. In back branches, clear error queue before *and* after calls. This
> will waste a few nanoseconds but will avoid any risk of breaking
> existing third-party code.
>
> I think it's reasonable to expect extensions to update to the newer
> behavior in a major release, but we're taking risks if we expect
> that to happen in minor releases.

I am concerned that users will never be able to get this right, since I think it requires every Ruby or PHP app using some thin OpenSSL wrapper to clear the per-thread queue. It's a big mess, but it's our mess to some degree.

I wonder if it would be just as good if we ensured that ERR_get_error() was definitely called in any circumstance where it looked like we might have an error: ERR_get_error() would be *reliably* called, as in my patch, but unlike my patch only when SSL_get_error() indicated a problem (not all the time).

Heikki believed that clearing the error queue by calling ERR_clear_error() before calling an I/O function like SSL_read() was necessary, as we all do; no controversy there. But Heikki also believed, even prior to hearing about this bug, that it was important and necessary for ERR_get_error() to be reached when there might be an error added to the queue following a Postgres/libpq call to an I/O function like SSL_read() followed by SSL_get_error() indicating trouble. He thought, as I do, that it would be a good idea to not rely on that happening from a distance (i.e. not relying on reaching SSLerrmessage()). Peter E. seems to believe that there is absolutely no reason to rely on ERR_get_error() getting called at all, and that the existing SSLerrmessage() only exists for the purposes of producing a human-readable error message.

Anyway, thinking about it some more, perhaps the best solution is to do the ERR_get_error() call iff SSL_get_error() seems unhappy, perhaps even placing the two into a utility function. That's probably almost the same as the existing behavior, as far as clearing up the queue after we may have added to it goes. I don't know if that's any less safe than my patch's pessimistic approach. It seems like it might be just as safe. Under this compromise, I think we'd probably clear the error queue after SSL_get_error() returned a value that is not SSL_ERROR_NONE, though (including SSL_ERROR_WANT_READ, etc). What do you think about that?

> In any case, we need a patch that touches the backend-side code as well.

I agree that the backend-side code should be covered. I quickly changed my mind about that, and am happy to produce a revision along those lines.

--
Peter Geoghegan
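[The utility function floated here might look something like this sketch; the name is hypothetical, and only the documented SSL_get_error()/ERR_get_error() calls are assumed. It keeps the pop adjacent to the I/O call rather than "at a distance" in SSLerrmessage().]

    #include <openssl/ssl.h>
    #include <openssl/err.h>

    /*
     * Report the SSL_get_error() verdict for an I/O result, and pop the
     * thread's error queue for anything other than SSL_ERROR_NONE --
     * including SSL_ERROR_WANT_READ and friends -- so that the common
     * success path pays nothing.
     */
    static int
    pq_ssl_check(SSL *ssl, int ret, unsigned long *ecode)
    {
        int     err = SSL_get_error(ssl, ret);

        *ecode = (err == SSL_ERROR_NONE) ? 0UL : ERR_get_error();
        return err;
    }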
Peter Geoghegan <pg@heroku.com> writes:
> On Mon, Mar 14, 2016 at 3:06 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Agreed, we need to deal with this one way or the other. My proposal
>> is:
>>
>> 1. In HEAD, do it as Peter E. suggests, ie clear error queue before calls.
>>
>> 2. In back branches, clear error queue before *and* after calls. This
>> will waste a few nanoseconds but will avoid any risk of breaking
>> existing third-party code.

> I am concerned that users will never be able to get this right, since
> I think it requires every Ruby or PHP app using some thin OpenSSL
> wrapper to clear the per-thread queue. It's a big mess, but it's our
> mess to some degree.

So your proposal is basically to do #2 in all branches? I won't fight it, if it doesn't bloat the code much. The overhead should surely be trivial compared to network communication costs, and I'm afraid you might be right about the risk of latent bugs.

			regards, tom lane
On Mon, Mar 14, 2016 at 4:05 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> So your proposal is basically to do #2 in all branches? I won't fight it,
> if it doesn't bloat the code much. The overhead should surely be trivial
> compared to network communication costs, and I'm afraid you might be right
> about the risk of latent bugs.

Yes, with one small difference: I wouldn't be calling ERR_get_error() in the common case where SSL_get_error() returns SSL_ERROR_NONE, on the theory that skipping that case represents no risk. I'm making a concession to Peter E's view that calling ERR_get_error() more will add useless cycles.

--
Peter Geoghegan
On Mon, Mar 14, 2016 at 4:11 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Yes, with one small difference: I wouldn't be calling ERR_get_error()
> in the common case where SSL_get_error() returns SSL_ERROR_NONE, on
> the theory that skipping that case represents no risk. I'm making a
> concession to Peter E's view that calling ERR_get_error() more will
> add useless cycles.

The attached patch is what I have in mind.

I can produce a back-patchable variant of this if you and Peter E. think this approach is okay.

--
Peter Geoghegan
Attachment
On Mon, Mar 14, 2016 at 6:44 PM, Peter Geoghegan <pg@heroku.com> wrote:
> I can produce a back-patchable variant of this if you and Peter E.
> think this approach is okay.

Where are we on this?

--
Peter Geoghegan
Peter Geoghegan <pg@heroku.com> writes:
> On Mon, Mar 14, 2016 at 6:44 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> I can produce a back-patchable variant of this if you and Peter E.
>> think this approach is okay.

> Where are we on this?

I'm generally okay with the approach used in
http://www.postgresql.org/message-id/CAM3SWZS0iV1HQ2fqNxvmTZm4+hPLAfH=7LeC4znmFX8az=ARMg@mail.gmail.com
though I have not reviewed that patch in detail.

One minor thought is that maybe it'd be better to do this:

     errno = 0;
+    ERR_clear_error();

in the other order, so as not to assume that ERR_clear_error doesn't set errno. On the other hand, if it does, things are probably hopelessly broken anyway; so I'm not sure there is any case where this helps.

			regards, tom lane
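[Spelled out, the suggested ordering is simply the following sketch; the wrapper function is hypothetical.]

    #include <errno.h>
    #include <openssl/ssl.h>
    #include <openssl/err.h>

    static int
    ssl_read_prepared(SSL *ssl, void *ptr, int len)
    {
        ERR_clear_error();  /* first, so it cannot possibly clobber errno */
        errno = 0;          /* then clear errno, right before the I/O call */
        return SSL_read(ssl, ptr, len);
    }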
On Wed, Mar 23, 2016 at 9:18 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> in the other order, so as not to assume that ERR_clear_error doesn't
> set errno. On the other hand, if it does, things are probably hopelessly
> broken anyway; so I'm not sure there is any case where this helps.

I'm fine with that.

Will this make it into the next point release? I was rather hoping it would.

--
Peter Geoghegan
Peter Geoghegan <pg@heroku.com> writes:
> Will this make it into the next point release? I was rather hoping it would.

I dunno. I certainly haven't reviewed it carefully enough to commit it. Perhaps Peter has, but time grows short ...

			regards, tom lane
On Sun, Mar 27, 2016 at 9:01 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Peter Geoghegan <pg@heroku.com> writes:
>> Will this make it into the next point release? I was rather hoping it would.
>
> I dunno. I certainly haven't reviewed it carefully enough to commit it.
> Perhaps Peter has, but time grows short ...

I have looked at this patch. Do we need to worry as well about SSL_shutdown in the disconnection code path? I believe that we don't care much if an error happens at this point, but we surely should consume any error generated, because the SSL context is kept after destroy_ssl_system and another connection attempt may be done using the same SSL context, no?

--
Michael
On 03/14/2016 09:44 PM, Peter Geoghegan wrote:
> On Mon, Mar 14, 2016 at 4:11 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> Yes, with one small difference: I wouldn't be calling ERR_get_error()
>> in the common case where SSL_get_error() returns SSL_ERROR_NONE, on
>> the theory that skipping that case represents no risk. I'm making a
>> concession to Peter E's view that calling ERR_get_error() more will
>> add useless cycles.
>
> The attached patch is what I have in mind.
>
> I can produce a back-patchable variant of this if you and Peter E.
> think this approach is okay.

I think this patch is OK under the premises that we have established.

I wish we could avoid the huge, repeated comment blocks. Perhaps we could put them at the top of the files once?

Also, why do you write 0UL instead of just 0?
On 04/07/2016 03:47 AM, Michael Paquier wrote:
> I have looked at this patch. Do we need to worry as well about
> SSL_shutdown in the disconnection code path? I believe that we don't
> care much if an error happens at this point, but we surely should
> consume any error generated, because the SSL context is kept after
> destroy_ssl_system and another connection attempt may be done using
> the same SSL context, no?

But we are the only user of our SSL context, and we clear the error before every call we make (with this patch). The cleanup afterwards matters only if someone else is also using SSL in the same process, and they won't use our SSL context.
On Thu, Apr 7, 2016 at 6:08 PM, Peter Eisentraut <peter_e@gmx.net> wrote:
> I wish we could avoid the huge, repeated comment blocks. Perhaps we could
> put them at the top of the files once?

I'm fine with that. Do you want to take care of that, or should I?

> Also, why do you write 0UL instead of just 0?

Because the variable is declared as a raw unsigned long, IIRC. It hardly matters.

--
Peter Geoghegan
On Fri, Apr 8, 2016 at 10:23 AM, Peter Eisentraut <peter_e@gmx.net> wrote:
> On 04/07/2016 03:47 AM, Michael Paquier wrote:
>> I have looked at this patch. Do we need to worry as well about
>> SSL_shutdown in the disconnection code path? I believe that we don't
>> care much if an error happens at this point, but we surely should
>> consume any error generated, because the SSL context is kept after
>> destroy_ssl_system and another connection attempt may be done using
>> the same SSL context, no?
>
> But we are the only user of our SSL context, and we clear the error before
> every call we make (with this patch). The cleanup afterwards matters only
> if someone else is also using SSL in the same process, and they won't use
> our SSL context.

Argh, yes. Because SSL_free() is directly used. I should read those docs more carefully.

--
Michael
On 04/07/2016 09:36 PM, Peter Geoghegan wrote:
> On Thu, Apr 7, 2016 at 6:08 PM, Peter Eisentraut <peter_e@gmx.net> wrote:
>> I wish we could avoid the huge, repeated comment blocks. Perhaps we could
>> put them at the top of the files once?
>
> I'm fine with that. Do you want to take care of that, or should I?

I have committed this so that the comments are only in the first instance in each file. I think that should give enough information to someone who is curious about the details of the error handling.

Also, I have adjusted this so that we check for n < 0 after SSL_read and SSL_write, as you had in the frontend code but not in the backend code.
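[Roughly, the adjusted check has this shape -- an illustrative sketch under the premises above, not the committed code, and with a hypothetical wrapper name.]

    #include <errno.h>
    #include <openssl/ssl.h>
    #include <openssl/err.h>

    /*
     * Treat n < 0 as a failure even when SSL_get_error() claims
     * SSL_ERROR_NONE, and pop the thread's queue in every failure case.
     */
    static int
    be_ssl_write(SSL *ssl, const void *ptr, int len, unsigned long *ecode)
    {
        int     n;
        int     err;

        ERR_clear_error();      /* still cleared before, as agreed upthread */
        errno = 0;
        n = SSL_write(ssl, ptr, len);
        err = SSL_get_error(ssl, n);
        *ecode = (n < 0 || err != SSL_ERROR_NONE) ? ERR_get_error() : 0UL;
        return n;
    }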
On Fri, Apr 8, 2016 at 11:12 AM, Peter Eisentraut <peter_e@gmx.net> wrote:
> I have committed this so that the comments are only in the first instance in
> each file. I think that should give enough information to someone who is
> curious about the details of the error handling.
>
> Also, I have adjusted this so that we check for n < 0 after SSL_read and
> SSL_write, as you had in the frontend code but not in the backend code.

That seems reasonable. I'm glad we finally got this done. Thanks.

--
Peter Geoghegan
On Fri, Apr 8, 2016 at 11:20 AM, Peter Geoghegan <pg@heroku.com> wrote:
> That seems reasonable. I'm glad we finally got this done. Thanks.

Are we going to backpatch this?

--
Peter Geoghegan
Peter Geoghegan <pg@heroku.com> writes:
> On Fri, Apr 8, 2016 at 11:20 AM, Peter Geoghegan <pg@heroku.com> wrote:
>> That seems reasonable. I'm glad we finally got this done. Thanks.

> Are we going to backpatch this?

Seems like a reasonable thing to do, but somebody would have to do the legwork to produce back-branch patches.

			regards, tom lane
On Fri, Apr 8, 2016 at 11:42 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Seems like a reasonable thing to do, but somebody would have to do the
> legwork to produce back-branch patches.

I'll do so soon. I was waiting on Peter E to take me up on the offer.

--
Peter Geoghegan
On Fri, Apr 8, 2016 at 11:43 AM, Peter Geoghegan <pg@heroku.com> wrote:
> I'll do so soon. I was waiting on Peter E to take me up on the offer.

Attached is a series of patches for each supported release branch. As discussed, I would like to target every released version of Postgres with this bugfix. This is causing users real pain at the moment.

Only the 9.5 backpatch was a simple, conflict-free "git cherry-pick". Most of the effort here involved producing a clean 9.4 patch. This was largely mechanical, if a little tricky. In release branches for releases that preceded 9.4, there were a few further merge conflicts as I worked backwards through the branches, but those were trivial.

I'm not sure if project policy around backpatching (that commit messages and so on should match exactly) has anything to say about the case where backpatching follows several weeks after commit to the master branch. In the absence of any clear direction on that, I've created commits that look like what Peter E might have pushed in early April, had he decided to do that backpatch leg-work up front.

--
Peter Geoghegan
Attachment
On Tue, Apr 26, 2016 at 9:37 AM, Peter Geoghegan <pg@heroku.com> wrote:
> Only the 9.5 backpatch was a simple, conflict-free "git cherry-pick".
> Most of the effort here involved producing a clean 9.4 patch. This was
> largely mechanical, if a little tricky. In release branches for
> releases that preceded 9.4, there were a few further merge conflicts
> as I worked backwards through the branches, but those were trivial.

Looking again at this thread, the general agreement was to clear the error stack before calling any SSL routine. Those patches are doing so, and they look in good shape to me.

Note: there is SSL_do_handshake() in the back-branches for SSL renegotiation, but we don't need to bother about clearing the error queue there, as any error occurring in those cases just stops the session, and we've never bothered calling ERR_get_error there to get more details about the errors.

> I'm not sure if project policy around backpatching (that commit
> messages and so on should match exactly) has anything to say about the
> case where backpatching follows several weeks after commit to the
> master branch. In the absence of any clear direction on that, I've
> created commits that look like what Peter E might have pushed in early
> April, had he decided to do that backpatch leg-work up front.

It seems to me that we definitely want to get this stuff backpatched at the end. So +1 for this move.

--
Michael
On Mon, Apr 25, 2016 at 6:44 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
>> I'm not sure if project policy around backpatching (that commit
>> messages and so on should match exactly) has anything to say about the
>> case where backpatching follows several weeks after commit to the
>> master branch. In the absence of any clear direction on that, I've
>> created commits that look like what Peter E might have pushed in early
>> April, had he decided to do that backpatch leg-work up front.
>
> It seems to me that we definitely want to get this stuff backpatched
> at the end. So +1 for this move.

Right. This issue has a long history of causing users significant (though often intermittent) problems. As an example, consider this problem report from a co-worker of mine that dates back to 2012:

https://bitbucket.org/ged/ruby-pg/issues/142/async_exec-over-ssl-connection-can-fail-on

There are numerous problem reports like this that are easily found using Google. I think that there are probably a variety of unpleasant interactions and symptoms. Crashes are one rarer symptom, seen in certain scenarios only (crashes are not described in the link above).

--
Peter Geoghegan
Peter Geoghegan wrote:
> I'm not sure if project policy around backpatching (that commit
> messages and so on should match exactly) has anything to say about the
> case where backpatching follows several weeks after commit to the
> master branch.

There is no value in providing exact message matches when the backpatch occurs even after one other commit in the master branch. The principal reason for the requirement is so that src/tools/git_changelog is able to merge the commits as a single entry, and that's not going to happen when you have one or more other commits in the master branch anyway.

I find it more productive to mention, in the backpatched commit message, what the commit ID in the master branch is. That way it can more easily be searched for later on.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Peter Geoghegan wrote:
>> I'm not sure if project policy around backpatching (that commit
>> messages and so on should match exactly) has anything to say about the
>> case where backpatching follows several weeks after commit to the
>> master branch.

> There is no value in providing exact message matches when the backpatch
> occurs even after one other commit in the master branch.

Actually, git_changelog can merge identically-messaged commits despite intervening commits. It's set up to not merge commits more than 24 hours apart, though. We could loosen that requirement but I'm afraid it would make things confusing to merge far-apart commits.

			regards, tom lane
On 04/27/2016 11:04 PM, Tom Lane wrote:
>> There is no value in providing exact message matches when the backpatch
>> occurs even after one other commit in the master branch.
>
> Actually, git_changelog can merge identically-messaged commits despite
> intervening commits. It's set up to not merge commits more than 24 hours
> apart, though. We could loosen that requirement but I'm afraid it would
> make things confusing to merge far-apart commits.

ISTM that git_changelog should be looking at the AuthorDate instead of the CommitDate. Then it would work correctly for backpatches done using cherry-pick.
Peter Eisentraut <peter_e@gmx.net> writes:
> On 04/27/2016 11:04 PM, Tom Lane wrote:
>> Actually, git_changelog can merge identically-messaged commits despite
>> intervening commits. It's set up to not merge commits more than 24 hours
>> apart, though. We could loosen that requirement but I'm afraid it would
>> make things confusing to merge far-apart commits.

> ISTM that git_changelog should be looking at the AuthorDate instead of
> the CommitDate. Then it would work correctly for backpatches done using
> cherry-pick.

Meh. That would also make it noticeably *less* accurate for some other scenarios. It actually did look at AuthorDate to start with, and we changed it because we didn't like the results; cf 7299778a9.

Also, IME, backpatching with cherry-pick fails often enough that designing one's process around it is just asking for pain. Certainly, many fewer than half of my own back-patches manage to use cherry-pick.

			regards, tom lane
On 4/25/16 8:37 PM, Peter Geoghegan wrote:
> Attached is a series of patches for each supported release branch. As
> discussed, I would like to target every released version of Postgres
> with this bugfix. This is causing users real pain at the moment.

All committed.

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, May 6, 2016 at 9:12 PM, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:
> All committed.

Thanks!

This should no longer be referenced in the 9.6 release notes. It should just appear in the next batch of point releases. Tom has an sgml comment in the draft 9.6 release notes to that effect.

--
Peter Geoghegan
Peter Geoghegan <pg@heroku.com> writes:
> On Fri, May 6, 2016 at 9:12 PM, Peter Eisentraut
> <peter.eisentraut@2ndquadrant.com> wrote:
>> All committed.

> Thanks!

> This should no longer be referenced in the 9.6 release notes. It
> should just appear in the next batch of point releases. Tom has an
> sgml comment in the draft 9.6 release notes to that effect.

Yeah, I was just on that ...

			regards, tom lane