Thread: [PATCH] Log details for client certificate failures
Hello,

(I'm cleaning up some old git branches and found this. It was helpful when I was trying to debug failures between an NSS client and an OpenSSL server, and it seems general enough to help for more complicated OpenSSL-only setups as well.)

Currently, debugging client cert verification failures is mostly limited to looking at the TLS alert code on the client side. For simple deployments, usually it's enough to see "sslv3 alert certificate revoked" and know exactly what needs to be fixed, but if you add any more complexity (multiple CA layers, misconfigured CA certificates, etc.), trying to debug what happened based on the TLS alert alone can be an exercise in frustration.

Luckily, the server has more information about exactly what failed in the chain, and we already have the requisite callback implemented as a stub, so I've filled it out with error handling and added a COMMERROR log so that a DBA can debug client failures more easily. It ends up looking like

    LOG: connection received: host=localhost port=44120
    LOG: client certificate verification failed at depth 1: unable to get local issuer certificate
    DETAIL: failed certificate's subject: /CN=Test CA for PostgreSQL SSL regression test client certs
    LOG: could not accept SSL connection: certificate verify failed

It might be even nicer to make this available to the client, but I think the server log is an appropriate place for this information -- an admin might not want to advertise exactly why a client certificate has failed verification (other than what's already available via the TLS alert, that is), and I think more complicated failures (with intermediate CAs, etc.) are going to need administrator intervention anyway. So having to check the logs doesn't seem like a big hurdle.

One question/concern -- the Subject that's printed to the logs could be pretty big (OpenSSL limits the incoming certificate chain to 100K, by default), which introduces an avenue for intentional log spamming. Is there an existing convention for limiting the length of log output used for debugging? Maybe I should just hardcode a smaller limit and truncate anything past that? Or we could just log the Common Name, which should be limited to 64 bytes...

I'll add this to the July commitfest.

Thanks,
--Jacob
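For reference, the shape of the change is roughly the following (a simplified sketch only -- the real patch is attached, and the buffer size, helper usage, and exact message wording here are illustrative):

    /* Sketch: buffer size, wording, and error handling are illustrative. */
    static int
    verify_cb(int ok, X509_STORE_CTX *ctx)
    {
        if (!ok)
        {
            int         errcode = X509_STORE_CTX_get_error(ctx);
            int         depth = X509_STORE_CTX_get_error_depth(ctx);
            X509       *cert = X509_STORE_CTX_get_current_cert(ctx);
            char        subject[256] = "(unknown)";

            if (cert)
                X509_NAME_oneline(X509_get_subject_name(cert),
                                  subject, sizeof(subject));

            ereport(COMMERROR,
                    (errmsg("client certificate verification failed at depth %d: %s",
                            depth, X509_verify_cert_error_string(errcode)),
                     errdetail("failed certificate's subject: %s", subject)));
        }

        return ok;
    }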
Attachment
On 03.05.22 19:04, Jacob Champion wrote:
> One question/concern -- the Subject that's printed to the logs could be
> pretty big (OpenSSL limits the incoming certificate chain to 100K, by
> default), which introduces an avenue for intentional log spamming. Is
> there an existing convention for limiting the length of log output used
> for debugging? Maybe I should just hardcode a smaller limit and
> truncate anything past that? Or we could just log the Common Name,
> which should be limited to 64 bytes...

The information in pg_stat_ssl is limited to NAMEDATALEN (see struct PgBackendSSLStatus).

It might make sense to align what your patch prints to identify certificates with what is shown in that view.
On Tue, 2022-05-03 at 21:06 +0200, Peter Eisentraut wrote:
> The information in pg_stat_ssl is limited to NAMEDATALEN (see struct
> PgBackendSSLStatus).
>
> It might make sense to align what your patch prints to identify
> certificates with what is shown in that view.

Sure, a max length should be easy enough to do. Is there a reason to limit to NAMEDATALEN specifically? I was under the impression that we would rather not have had that limitation in the stats framework, if we could have avoided it. (In particular I think NAMEDATALEN will cut off the longest possible Common Name by just five bytes.)

Thanks,
--Jacob
On 04.05.22 01:05, Jacob Champion wrote:
> On Tue, 2022-05-03 at 21:06 +0200, Peter Eisentraut wrote:
>> The information in pg_stat_ssl is limited to NAMEDATALEN (see struct
>> PgBackendSSLStatus).
>>
>> It might make sense to align what your patch prints to identify
>> certificates with what is shown in that view.
>
> Sure, a max length should be easy enough to do. Is there a reason to
> limit to NAMEDATALEN specifically? I was under the impression that we
> would rather not have had that limitation in the stats framework, if we
> could have avoided it. (In particular I think NAMEDATALEN will cut off
> the longest possible Common Name by just five bytes.)

Just saying that cutting it off appears to be acceptable. A bit more than 63 bytes should be okay for the log.

In terms of aligning what is printed, I meant that pg_stat_ssl uses the issuer plus serial number to identify the certificate unambiguously.
On Wed, 2022-05-04 at 15:53 +0200, Peter Eisentraut wrote:
> Just saying that cutting it off appears to be acceptable. A bit more
> than 63 bytes should be okay for the log.

Gotcha.

> In terms of aligning what is printed, I meant that pg_stat_ssl uses the
> issuer plus serial number to identify the certificate unambiguously.

Oh, that's a great idea. I'll do that too.

Thanks!
--Jacob
On Thu, 2022-05-05 at 15:12 +0000, Jacob Champion wrote:
> On Wed, 2022-05-04 at 15:53 +0200, Peter Eisentraut wrote:
> > In terms of aligning what is printed, I meant that pg_stat_ssl uses the
> > issuer plus serial number to identify the certificate unambiguously.
>
> Oh, that's a great idea. I'll do that too.

v2 limits the maximum subject length and adds the serial number to the logs.

Thanks!
--Jacob
Attachment
On 13.05.22 00:36, Jacob Champion wrote:
> On Thu, 2022-05-05 at 15:12 +0000, Jacob Champion wrote:
>> On Wed, 2022-05-04 at 15:53 +0200, Peter Eisentraut wrote:
>>> In terms of aligning what is printed, I meant that pg_stat_ssl uses the
>>> issuer plus serial number to identify the certificate unambiguously.
>>
>> Oh, that's a great idea. I'll do that too.
>
> v2 limits the maximum subject length and adds the serial number to the
> logs.

I wrote that pg_stat_ssl uses the *issuer* plus serial number to identify a certificate. What your patch shows is the subject and the serial number, which isn't the same thing. Let's get that sorted out one way or the other.

Another point, your patch produces

    LOG: connection received: host=localhost port=44120
    LOG: client certificate verification failed at depth 1: ...
    DETAIL: failed certificate had subject ...
    LOG: could not accept SSL connection: certificate verify failed

I guess what we really would like is

    LOG: connection received: host=localhost port=44120
    LOG: could not accept SSL connection: certificate verify failed
    DETAIL: client certificate verification failed at depth 1: ...
            failed certificate had subject ...

But I suppose that would be very cumbersome to produce with the callback structure provided by OpenSSL? I'm not saying the proposed way is unacceptable, but maybe it's worth being explicit about this tradeoff.
On 30 Jun 2022, at 10:43, Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote:

> I wrote that pg_stat_ssl uses the *issuer* plus serial number to identify
> a certificate. What your patch shows is the subject and the serial number,
> which isn't the same thing. Let's get that sorted out one way or the other.
Quick observation on this one, the string format of an issuer and serial number is defined as a “Certificate Exact Assertion” in RFC 4523.
I added this to httpd a while back:
SSL_CLIENT_CERT_RFC4523_CEA
It would be good to interoperate.
Regards,
Graham
—
On Thu, Jun 30, 2022 at 2:43 AM Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote:
>
> On 13.05.22 00:36, Jacob Champion wrote:
> > v2 limits the maximum subject length and adds the serial number to the
> > logs.
>
> I wrote that pg_stat_ssl uses the *issuer* plus serial number to
> identify a certificate. What your patch shows is the subject and the
> serial number, which isn't the same thing. Let's get that sorted out
> one way or the other.

Sorry for the misunderstanding! v3 adds the Issuer to the logs as well. I wanted to clarify that this "issuer" has not actually been verified, but all I could come up with was "purported issuer" which doesn't read well to me. "Claimed issuer"? "Alleged issuer"? Thoughts?

> Another point, your patch produces
>
>     LOG: connection received: host=localhost port=44120
>     LOG: client certificate verification failed at depth 1: ...
>     DETAIL: failed certificate had subject ...
>     LOG: could not accept SSL connection: certificate verify failed
>
> I guess what we really would like is
>
>     LOG: connection received: host=localhost port=44120
>     LOG: could not accept SSL connection: certificate verify failed
>     DETAIL: client certificate verification failed at depth 1: ...
>             failed certificate had subject ...
>
> But I suppose that would be very cumbersome to produce with the callback
> structure provided by OpenSSL?

I was about to say "yes, very cumbersome", but I actually think we might be able to do that without bubbling the error up through multiple callback layers, using SSL_set_ex_data() and friends. I'll take a closer look.

Thanks!
--Jacob
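(For context, the SSL_set_ex_data() mechanism mentioned above works roughly like this -- a sketch only; the index management, memory lifetime, and message wording are illustrative, not the actual patch:)

    static int  ssl_ex_index = -1;  /* obtained once via SSL_get_ex_new_index() */

    static int
    verify_cb(int ok, X509_STORE_CTX *ctx)
    {
        if (!ok)
        {
            /* Recover the SSL handle this verification run belongs to. */
            SSL        *ssl = X509_STORE_CTX_get_ex_data(ctx,
                                    SSL_get_ex_data_X509_STORE_CTX_idx());
            char       *detail = psprintf("client certificate verification failed at depth %d",
                                          X509_STORE_CTX_get_error_depth(ctx));

            /* Stash the detail on this connection's SSL object. */
            SSL_set_ex_data(ssl, ssl_ex_index, detail);
        }
        return ok;
    }

be_tls_open_server() could then retrieve the string with SSL_get_ex_data(port->ssl, ssl_ex_index) and attach it to the final error report as an errdetail().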
Attachment
On Thu, Jun 30, 2022 at 2:54 AM Graham Leggett <minfrin@sharp.fm> wrote:
>
> I added this to httpd a while back:
>
> SSL_CLIENT_CERT_RFC4523_CEA
>
> It would be good to interoperate.

What kind of interoperation did you have in mind? Are there existing tools that want to scrape this information for observability?

I think the CEA syntax might not be a good fit for this particular patch: first, we haven't actually verified the certificate, so no one should be using it to assert certificate equality (and I'm truncating the Issuer anyway, to avoid letting someone flood the logs). Second, this is designed to be human-readable rather than machine-readable.

Thanks,
--Jacob
On Fri, Jul 1, 2022 at 1:51 PM Jacob Champion <jchampion@timescale.com> wrote:
> Sorry for the misunderstanding! v3 adds the Issuer to the logs as well.

Resending v3; I messed up the certificate diff with my gitconfig.

--Jacob
Attachment
On 05.07.22 18:34, Jacob Champion wrote:
> On Fri, Jul 1, 2022 at 1:51 PM Jacob Champion <jchampion@timescale.com> wrote:
>> Sorry for the misunderstanding! v3 adds the Issuer to the logs as well.
>
> Resending v3; I messed up the certificate diff with my gitconfig.

This patch looks pretty good to me. Some minor details:

I looked into how you decode the serial number. I have found some code elsewhere that passed the result of X509_get_serialNumber() directly to ASN1_INTEGER_set(). But I guess a serial number of maximum length 20 octets wouldn't fit into a 32-bit long. (There is ASN1_INTEGER_set_int64(), but that requires OpenSSL 1.1.0.) Does that match your understanding?

For the detail string, I think we could do something like:

    DETAIL: Failed certificate data (unverified): subject '%s', serial number %s, issuer '%s'
On Thu, Jul 7, 2022 at 2:50 AM Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote:
> I looked into how you decode the serial number. I have found some code
> elsewhere that passed the result of X509_get_serialNumber() directly to
> ASN1_INTEGER_set(). But I guess a serial number of maximum length 20
> octets wouldn't fit into a 32-bit long. (There is
> ASN1_INTEGER_set_int64(), but that requires OpenSSL 1.1.0.) Does that
> match your understanding?

Yep. And the bit lengths of the serial numbers used in the test suite are in the low 60s already. Many people will just randomize their serial numbers, so I think BN_bn2dec() is the way to go.

> For the detail string, I think we could do something like:
>
>     DETAIL: Failed certificate data (unverified): subject '%s', serial
>     number %s, issuer '%s'

Done that way in v4.

I also added an optional 0002 that bubbles the error info up to the final ereport(ERROR), using errdetail() and errhint(). I can squash it into 0001 if you like it, or drop it if you don't. (This approach could be adapted to the client, too.)

Thanks!
--Jacob
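For reference, the decoding itself is only a few lines (a sketch; error handling is abbreviated and the variable names are illustrative):

    ASN1_INTEGER *serial_asn1 = X509_get_serialNumber(cert);
    BIGNUM       *serial_bn = ASN1_INTEGER_to_BN(serial_asn1, NULL);
    char         *serial_str = NULL;

    if (serial_bn != NULL)
    {
        /* Arbitrary-precision decimal string; free with OPENSSL_free(). */
        serial_str = BN_bn2dec(serial_bn);
        BN_free(serial_bn);
    }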
Attachment
On 01 Jul 2022, at 22:59, Jacob Champion <jchampion@timescale.com> wrote:

>> I added this to httpd a while back:
>>
>> SSL_CLIENT_CERT_RFC4523_CEA
>>
>> It would be good to interoperate.
>
> What kind of interoperation did you have in mind? Are there existing
> tools that want to scrape this information for observability?

This is for human troubleshooting.

> I think the CEA syntax might not be a good fit for this particular
> patch: first, we haven't actually verified the certificate, so no one
> should be using it to assert certificate equality (and I'm truncating
> the Issuer anyway, to avoid letting someone flood the logs). Second,
> this is designed to be human-readable rather than machine-readable.

This is what a CEA looks like:

    { serialNumber 400410167207191393705333222102472642510002355884, issuer rdnSequence:"CN=Foo UK G1,O=Foo,C=UK" }

Whitespace and escaping are important above. When troubleshooting, you want a string like the above that you can cut and paste and search for in other systems and logfiles.

The verification status of the cert isn't an issue at this point: you have a system in front of you where it doesn't work when it should, and you need to know exactly what's connecting, not what you think you're connecting to, and you need precise data.

Please don't invent another format, or try to truncate the data. That is a huge headache when troubleshooting.

Regards,
Graham
—
On 08.07.22 20:39, Jacob Champion wrote:
> I also added an optional 0002 that bubbles the error info up to the
> final ereport(ERROR), using errdetail() and errhint(). I can squash it
> into 0001 if you like it, or drop it if you don't. (This approach
> could be adapted to the client, too.)

I squashed those two together. I also adjusted the error message a bit more for project style. (We can put both lines into detail.)

I had to read up on this "ex_data" API. Interesting. But I'm wondering a bit about how the life cycle of these objects is managed. What happens if the allocated error string is deallocated before its containing object? Or vice versa? How do we ensure we don't accidentally reuse the error string when the code runs again? (I guess currently it can't?) Maybe we should avoid this and just put the errdetail itself into a static variable that we unset once the message is printed?
Attachment
On Sat, Jul 9, 2022 at 6:49 AM Graham Leggett <minfrin@sharp.fm> wrote:
> Please don't invent another format, or try and truncate the data. This is a huge headache when troubleshooting.

I hear you, and I agree that correlating these things across machines is something we should be making easier. I'm just not convinced that the particular format you've proposed, with a new set of rules for quoting and escaping, needs to be part of this patch. (And I think there are good reasons to truncate unverified cert data, so there'd have to be clear benefits to offset the risk of opening it up.)

Searching Google for "issuer rdnSequence" comes up with mostly false positives related to LDAP filtering and certificate dumps, and the true positives seem to be mail threads that you've participated in. Do many LDAP servers log certificate failures in this format by default? (For that matter, does httpd?) The discussion at the time you added this to httpd [1] seemed to be making the point that this was a niche format, suited mostly for interaction with LDAP filters -- and Kaspar additionally pointed out that it's not a canonical format, so all of our implementations would have to have an ad hoc agreement to choose exactly one encoding.

If you're using randomized serial numbers, you should be able to grep for those by themselves and successfully match many different formats, no? To me, that seems good enough for a first patch, considering we don't currently log any of this information.

--Jacob

[1] https://lists.apache.org/thread/1665qc4mod7ppp58qk3bqc2l3wtl3lkn
On Mon, Jul 11, 2022 at 6:09 AM Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote:
> I squashed those two together. I also adjusted the error message a bit
> more for project style. (We can put both lines into detail.)

Oh, okay. Log parsers don't have any issues with that?

> I had to read up on this "ex_data" API. Interesting. But I'm wondering
> a bit about how the life cycle of these objects is managed. What
> happens if the allocated error string is deallocated before its
> containing object? Or vice versa?

Yeah, I'm currently leaning heavily on the lack of any memory context switches here. And I end up leaking out a pointer to the stale stack of be_tls_open_server(), which is gross -- it works since there are no other clients, but that could probably come back to bite us.

The ex_data API exposes optional callbacks for new/dup/free (I'm currently setting those to NULL), so we can run custom code whenever the SSL* is destroyed. If you'd rather the data have the same lifetime of the SSL* object, we can switch to malloc/strdup/free (or even OPENSSL_strdup() in later versions). But since we don't have any use for the ex_data outside of this function, maybe we should just clear it before we return, rather than carrying it around.

> How do we ensure we don't
> accidentally reuse the error string when the code runs again? (I guess
> currently it can't?)

The ex_data is associated with the SSL*, not the global SSL_CTX*, so that shouldn't be an issue. A new SSL* gets created at the start of be_tls_open_server().

> Maybe we should avoid this and just put the
> errdetail itself into a static variable that we unset once the message
> is printed?

If you're worried about the lifetime of the palloc'd data being too short, does switching to a static variable help in that case?

--Jacob
On 13.07.22 01:06, Jacob Champion wrote:
>> I had to read up on this "ex_data" API. Interesting. But I'm wondering
>> a bit about how the life cycle of these objects is managed. What
>> happens if the allocated error string is deallocated before its
>> containing object? Or vice versa?
>
> Yeah, I'm currently leaning heavily on the lack of any memory context
> switches here. And I end up leaking out a pointer to the stale stack
> of be_tls_open_server(), which is gross -- it works since there are no
> other clients, but that could probably come back to bite us.
>
> The ex_data API exposes optional callbacks for new/dup/free (I'm
> currently setting those to NULL), so we can run custom code whenever
> the SSL* is destroyed. If you'd rather the data have the same lifetime
> of the SSL* object, we can switch to malloc/strdup/free (or even
> OPENSSL_strdup() in later versions). But since we don't have any use
> for the ex_data outside of this function, maybe we should just clear
> it before we return, rather than carrying it around.

Concretely, I was thinking like the attached top-up patch.

The other way can surely be made to work somehow, but this seems much simpler and with fewer questions about the details.
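(Roughly, instead of ex_data, the callback writes to a static pointer that be_tls_open_server() consumes and then resets -- a sketch only; the variable name, helper usage, and wording here are illustrative, not the attached patch:)

    static char *cert_errdetail = NULL;     /* set by verify_cb() via psprintf() */

    /* In be_tls_open_server(), when reporting the SSL_accept() failure: */
    ereport(COMMERROR,
            (errmsg("could not accept SSL connection: %s", SSLerrmessage(ecode)),
             cert_errdetail ? errdetail_internal("%s", cert_errdetail) : 0));
    cert_errdetail = NULL;      /* don't carry it over to the next attempt */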
Attachment
On Thu, Jul 14, 2022 at 1:12 PM Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote:
> Concretely, I was thinking like the attached top-up patch.
>
> The other way can surely be made to work somehow, but this seems much
> simpler and with fewer questions about the details.

Ah, seeing it side-by-side helps. That's much easier, I agree.

Thanks,
--Jacob
On 14.07.22 23:09, Jacob Champion wrote:
> On Thu, Jul 14, 2022 at 1:12 PM Peter Eisentraut
> <peter.eisentraut@enterprisedb.com> wrote:
>> Concretely, I was thinking like the attached top-up patch.
>>
>> The other way can surely be made to work somehow, but this seems much
>> simpler and with fewer questions about the details.
>
> Ah, seeing it side-by-side helps. That's much easier, I agree.

Committed like that.
On 7/15/22 09:34, Peter Eisentraut wrote:
> Committed like that.

Thanks for all the reviews!

--Jacob
Hi,

On 2022-07-15 09:46:40 -0700, Jacob Champion wrote:
> On 7/15/22 09:34, Peter Eisentraut wrote:
> > Committed like that.
>
> Thanks for all the reviews!

This might have been discussed somewhere, but I'm worried about emitting unescaped data from pre-auth clients. What guarantees that the subject / issuer name only contain printable ASCII chars? Printing terminal control chars or such would not be great, nor would splitting a string at a multibyte boundary.

Greetings,

Andres Freund
On 7/15/22 12:11, Andres Freund wrote:
> This might have been discussed somewhere, but I'm worried about emitting
> unescaped data from pre-auth clients. What guarantees that subject / issuer
> name only contain printable ascii-chars? Printing terminal control chars or
> such would not be great, nor would splitting a string at a multi-boundary.

Hm. The last time I asked about that, Magnus pointed out that we reflect port->user_name as-is [1], so I kind of stopped worrying about it. Is this more dangerous than that? (And do we want to fix it now, regardless?)

What guarantees are we supposed to be making for log encoding?

Thanks,
--Jacob

[1] https://www.postgresql.org/message-id/CABUevExVHryTasKmtJW5RtU-dBesYj4bV7ggpeVMfiPCHCvLNA%40mail.gmail.com
Hi,

On 2022-07-15 13:20:59 -0700, Jacob Champion wrote:
> On 7/15/22 12:11, Andres Freund wrote:
> > This might have been discussed somewhere, but I'm worried about emitting
> > unescaped data from pre-auth clients. What guarantees that subject / issuer
> > name only contain printable ascii-chars? Printing terminal control chars or
> > such would not be great, nor would splitting a string at a multi-boundary.
>
> Hm. The last time I asked about that, Magnus pointed out that we reflect
> port->user_name as-is [1], so I kind of stopped worrying about it.

I think we need to fix a number of these. But no, I don't think we should just add more because we've not been careful in a bunch of other places.

> Is this more dangerous than that?

Hard to say.

> (And do we want to fix it now, regardless?)

Yes.

> What guarantees are we supposed to be making for log encoding?

I don't know, but I don't think not caring at all is a good option. Particularly for unauthenticated data I'd say that escaping everything but printable ascii chars is a sensible approach.

Greetings,

Andres Freund
On 7/15/22 13:35, Andres Freund wrote:
>> (And do we want to fix it now, regardless?)
>
> Yes.

Cool. I can get on board with that.

>> What guarantees are we supposed to be making for log encoding?
>
> I don't know, but I don't think not caring at all is a good
> option. Particularly for unauthenticated data I'd say that escaping everything
> but printable ascii chars is a sensible approach.

It'll also be painful for anyone whose infrastructure isn't in a Latin character set... Maybe that's worth the tradeoff for a v1.

Is there an acceptable approach that could centralize it, so we fix it once and are done? E.g. a log_encoding GUC and either conversion or escaping in send_message_to_server_log()?

--Jacob
Hi,

On 2022-07-15 14:01:53 -0700, Jacob Champion wrote:
> On 7/15/22 13:35, Andres Freund wrote:
> >> (And do we want to fix it now, regardless?)
> >
> > Yes.
>
> Cool. I can get on board with that.
>
> >> What guarantees are we supposed to be making for log encoding?
> >
> > I don't know, but I don't think not caring at all is a good
> > option. Particularly for unauthenticated data I'd say that escaping everything
> > but printable ascii chars is a sensible approach.
>
> It'll also be painful for anyone whose infrastructure isn't in a Latin
> character set... Maybe that's worth the tradeoff for a v1.

I don't think it's a huge issue, or really avoidable, pre-authentication. Don't we require all server-side encodings to be supersets of ascii?

We already have pg_clean_ascii() and use it for application_name, fwiw.

> Is there an acceptable approach that could centralize it, so we fix it
> once and are done? E.g. a log_encoding GUC and either conversion or
> escaping in send_message_to_server_log()?

Introducing escaping to ascii for all log messages seems like it'd be incredibly invasive, and would remove a lot of worthwhile information. Nor does it really address the whole scope - consider e.g. the truncation in this patch, which can't be done correctly by the time send_message_to_server_log() is reached - just chopping in the middle of a multi-byte string would have made the string invalidly encoded. And we can't perform encoding conversion from client data until we've gone further into the authentication process, I think.

Always escaping ANSI escape codes (or rather the non-printable ascii range) is more convincing. Then we'd just need to make sure that client controlled data is properly encoded before handing it over to other parts of the system.

I can see a point in a log_encoding GUC at some point, but it seems a bit separate from the discussion here.

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> We already have pg_clean_ascii() and use it for application_name, fwiw.

Yeah, we should just use that. If anyone wants to upgrade the situation for non-ASCII data later, fixing it for all of these cases at once would be appropriate.

			regards, tom lane
On 7/15/22 14:19, Andres Freund wrote:
> On 2022-07-15 14:01:53 -0700, Jacob Champion wrote:
>> On 7/15/22 13:35, Andres Freund wrote:
>>> I don't know, but I don't think not caring at all is a good
>>> option. Particularly for unauthenticated data I'd say that escaping everything
>>> but printable ascii chars is a sensible approach.
>>
>> It'll also be painful for anyone whose infrastructure isn't in a Latin
>> character set... Maybe that's worth the tradeoff for a v1.
>
> I don't think it's a huge issue, or really avoidable, pre-authentication.
> Don't we require all server-side encodings to be supersets of ascii?

Well, I was going to say that for this feature, where the goal is to debug a failed certificate chain, having to manually unescape the logged certificate names if your infrastructure already handled UTF-8 natively would be a real pain.

But your later point about truncation makes that moot; I forgot that my patch can truncate in the middle of a UTF-8 sequence, so you're probably dealing with replacement glyphs anyway. I don't really have a leg to stand on there.

> We already have pg_clean_ascii() and use it for application_name, fwiw.

That seems much worse than escaping for this particular patch; if your cert's Common Name is in (non-ASCII) UTF-8 then all you'll see is "CN=?????????" in the log lines that were supposed to be helping you root-cause. Escaping would be much more helpful in this case.

>> Is there an acceptable approach that could centralize it, so we fix it
>> once and are done? E.g. a log_encoding GUC and either conversion or
>> escaping in send_message_to_server_log()?
>
> Introducing escaping to ascii for all log messages seems like it'd be
> incredibly invasive, and would remove a lot of worthwhile information. Nor
> does it really address the whole scope - consider e.g. the truncation in this
> patch, that can't be done correctly by the time send_message_to_server_log()
> is reached - just chopping in the middle of a multi-byte string would have
> made the string invalidly encoded. And we can't perform encoding conversion
> from client data until we've gone further into the authentication process, I
> think.
>
> Always escaping ANSI escape codes (or rather the non-printable ascii range) is
> more convincing. Then we'd just need to make sure that client controlled data
> is properly encoded before handing it over to other parts of the system.
>
> I can see a point in a log_encoding GUC at some point, but it seems a bit
> separate from the discussion here.

Fair enough.

--Jacob
Hi,

On 2022-07-15 14:51:38 -0700, Jacob Champion wrote:
> > We already have pg_clean_ascii() and use it for application_name, fwiw.
>
> That seems much worse than escaping for this particular patch; if your
> cert's Common Name is in (non-ASCII) UTF-8 then all you'll see is
> "CN=?????????" in the log lines that were supposed to be helping you
> root-cause. Escaping would be much more helpful in this case.

I'm doubtful that's all that common. But either way, I suggest a separate patch to deal with that...

Greetings,

Andres Freund
On Fri, Jul 15, 2022 at 4:45 PM Andres Freund <andres@anarazel.de> wrote:
> On 2022-07-15 14:51:38 -0700, Jacob Champion wrote:
> > That seems much worse than escaping for this particular patch; if your
> > cert's Common Name is in (non-ASCII) UTF-8 then all you'll see is
> > "CN=?????????" in the log lines that were supposed to be helping you
> > root-cause. Escaping would be much more helpful in this case.
>
> I'm doubtful that's all that common.

Probably not, but the more systems that support it without weird usability bugs, the more common it will hopefully become.

> But either way, I suggest a separate patch to deal with that...

Proposed fix attached, which uses \x-escaping for bytes outside of printable ASCII.

Thanks,
--Jacob
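For reference, the escaping idea itself is straightforward -- a sketch only (the helper name here is made up, and the attached patch differs in details like allocation and where it hooks in):

    /* Sketch: copy src into dst, replacing non-printable ASCII with \xNN. */
    static void
    escape_ascii(char *dst, size_t dstlen, const char *src)
    {
        size_t      i = 0;

        for (const char *p = src; *p != '\0' && i + 5 < dstlen; p++)
        {
            if (*p < 32 || *p > 126)
            {
                /* Each escaped byte expands to four characters: \xNN */
                snprintf(dst + i, dstlen - i, "\\x%02x", (unsigned char) *p);
                i += 4;
            }
            else
                dst[i++] = *p;
        }
        dst[i] = '\0';
    }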
Attachment
Hi,

On 2022-07-19 09:07:31 -0700, Jacob Champion wrote:
> On Fri, Jul 15, 2022 at 4:45 PM Andres Freund <andres@anarazel.de> wrote:
> > On 2022-07-15 14:51:38 -0700, Jacob Champion wrote:
> > > That seems much worse than escaping for this particular patch; if your
> > > cert's Common Name is in (non-ASCII) UTF-8 then all you'll see is
> > > "CN=?????????" in the log lines that were supposed to be helping you
> > > root-cause. Escaping would be much more helpful in this case.
> >
> > I'm doubtful that's all that common.
>
> Probably not, but the more systems that support it without weird
> usability bugs, the more common it will hopefully become.
>
> > But either way, I suggest a separate patch to deal with that...
>
> Proposed fix attached, which uses \x-escaping for bytes outside of
> printable ASCII.

I don't think this should be open coded in the ssl part of the code. IMO this should replace the existing ascii escape function instead. I strongly oppose open coding this functionality in prepare_cert_name().

Greetings,

Andres Freund
[resending to list]

On 7/19/22 09:14, Andres Freund wrote:
> IMO this should replace the existing ascii escape function instead.

That will affect the existing behavior of application_name and cluster_name; is that acceptable?

--Jacob
Jacob Champion <jchampion@timescale.com> writes:
> On 7/19/22 09:14, Andres Freund wrote:
>> IMO this should replace the existing ascii escape function instead.

> That will affect the existing behavior of application_name and
> cluster_name; is that acceptable?

I think Andres' point is exactly that these should all act alike.

Having said that, I struggle to see why we are panicking about badly encoded log data from this source while blithely ignoring the problems posed by non-ASCII role names, database names, and tablespace names.

			regards, tom lane
Hi,

On 2022-07-19 12:39:43 -0400, Tom Lane wrote:
> Having said that, I struggle to see why we are panicking about badly
> encoded log data from this source while blithely ignoring the problems
> posed by non-ASCII role names, database names, and tablespace names.

I think we should fix these as well. I'm not as concerned about post-auth encoding issues (i.e. tablespace name) as about pre-auth data (role name, database name) - obviously being allowed to log in already is a pretty good filter...

Greetings,

Andres Freund
On Tue, Jul 19, 2022 at 10:09 AM Andres Freund <andres@anarazel.de> wrote:
> On 2022-07-19 12:39:43 -0400, Tom Lane wrote:
> > Having said that, I struggle to see why we are panicking about badly
> > encoded log data from this source while blithely ignoring the problems
> > posed by non-ASCII role names, database names, and tablespace names.
>
> I think we should fix these as well. I'm not as concerned about post-auth
> encoding issues (i.e. tablespace name) as about pre-auth data (role name,
> database name) - obviously being allowed to log in already is a pretty good
> filter...

v2 adds escaping to pg_clean_ascii(). My original attempt used StringInfo allocation, but that didn't play well with guc_malloc(), so I switched to a two-pass API where the caller allocates. Let me know if I'm missing something obvious; this way is more verbose than I'd like...

Thanks,
--Jacob
Attachment
Hi,

On 2022-07-19 15:08:38 -0700, Jacob Champion wrote:
> v2 adds escaping to pg_clean_ascii(). My original attempt used
> StringInfo allocation, but that didn't play well with guc_malloc(), so
> I switched to a two-pass API where the caller allocates. Let me know
> if I'm missing something obvious; this way is more verbose than I'd
> like...

Hm, that's pretty awkward. Perhaps we can have a better API for everything but guc.c?

Or alternatively, perhaps we can just make pg_clean_ascii() return NULL if allocation failed and then guc_strdup() the result in guc.c?

If we end up needing a two phase approach, why use the same function for both phases? That seems quite awkward.

Greetings,

Andres Freund
On Tue, Jul 19, 2022 at 3:38 PM Andres Freund <andres@anarazel.de> wrote:
> Or alternatively, perhaps we can just make pg_clean_ascii() return NULL
> if allocation failed and then guc_strdup() the result in guc.c?

The guc_strdup() approach really reduces the amount of code, so that's what I did in v3. I'm not following why we need to return NULL on failure, though -- both palloc() and guc_malloc() ERROR on failure, so is it okay to keep those semantics the same?

> If we end up needing a two phase approach, why use the same function for
> both phases? That seems quite awkward.

Mostly so the byte counting always agrees between the two phases, no matter how the implementation evolves. But it's hopefully moot now.

--Jacob
Attachment
Jacob Champion <jchampion@timescale.com> writes:
> The guc_strdup() approach really reduces the amount of code, so that's
> what I did in v3. I'm not following why we need to return NULL on
> failure, though -- both palloc() and guc_malloc() ERROR on failure, so
> is it okay to keep those semantics the same?

guc_malloc's behavior varies depending on elevel. It's *not* equivalent to palloc.

			regards, tom lane
On Wed, Jul 20, 2022 at 3:15 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> guc_malloc's behavior varies depending on elevel. It's *not*
> equivalent to palloc.

Right, sorry -- a better way for me to ask the question: I'm currently hardcoding an elevel of ERROR on the new guc_strdup()s, because that seems to be a common case for the check hooks. If that's okay, is there any reason not to use palloc() semantics for pg_clean_ascii()? (And if it's not okay, why?)

--Jacob
Jacob Champion <jchampion@timescale.com> writes:
> I'm currently hardcoding an elevel of ERROR on the new guc_strdup()s,
> because that seems to be a common case for the check hooks.

Really? That's almost certainly NOT okay. As an example, if you have a problem with a new value loaded from postgresql.conf during SIGHUP processing, throwing ERROR will cause the postmaster to exit.

I wouldn't be too surprised if there are isolated cases where people didn't understand what they were doing and wrote that, but that needs to be fixed not emulated.

			regards, tom lane
On Wed, Jul 20, 2022 at 3:42 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Jacob Champion <jchampion@timescale.com> writes:
> > I'm currently hardcoding an elevel of ERROR on the new guc_strdup()s,
> > because that seems to be a common case for the check hooks.
>
> Really? That's almost certainly NOT okay. As an example, if you
> have a problem with a new value loaded from postgresql.conf during
> SIGHUP processing, throwing ERROR will cause the postmaster to exit.

v4 attempts to fix this by letting the check hooks pass MCXT_ALLOC_NO_OOM to pg_clean_ascii(). (It's ignored in the frontend, which just mallocs.)

> I wouldn't be too surprised if there are isolated cases where people
> didn't understand what they were doing and wrote that, but that
> needs to be fixed not emulated.

I might be missing something, but in guc.c at least it appears to be the rule and not the exception.

Thanks,
--Jacob
Attachment
On Thu, Jul 21, 2022 at 4:29 PM Jacob Champion <jchampion@timescale.com> wrote:
> v4 attempts to fix this by letting the check hooks pass
> MCXT_ALLOC_NO_OOM to pg_clean_ascii(). (It's ignored in the frontend,
> which just mallocs.)

Ping -- should I add an open item somewhere so this isn't lost?

--Jacob
On Thu, Jul 28, 2022 at 9:19 AM Jacob Champion <jchampion@timescale.com> wrote:
> On Thu, Jul 21, 2022 at 4:29 PM Jacob Champion <jchampion@timescale.com> wrote:
> > v4 attempts to fix this by letting the check hooks pass
> > MCXT_ALLOC_NO_OOM to pg_clean_ascii(). (It's ignored in the frontend,
> > which just mallocs.)
>
> Ping -- should I add an open item somewhere so this isn't lost?

Trying again. Peter, is this approach acceptable? Should I try something else?

Thanks,
--Jacob
On 09.09.22 00:32, Jacob Champion wrote:
> On Thu, Jul 28, 2022 at 9:19 AM Jacob Champion <jchampion@timescale.com> wrote:
>> On Thu, Jul 21, 2022 at 4:29 PM Jacob Champion <jchampion@timescale.com> wrote:
>>> v4 attempts to fix this by letting the check hooks pass
>>> MCXT_ALLOC_NO_OOM to pg_clean_ascii(). (It's ignored in the frontend,
>>> which just mallocs.)
>>
>> Ping -- should I add an open item somewhere so this isn't lost?
>
> Trying again. Peter, is this approach acceptable? Should I try something else?

This looks fine to me. Committed.
On Tue, Sep 13, 2022 at 7:11 AM Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote:
> This looks fine to me. Committed.

Thanks!

--Jacob
Hi,

On Tue, Sep 13, 2022 at 11:11 PM Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote:
>
> On 09.09.22 00:32, Jacob Champion wrote:
> > On Thu, Jul 28, 2022 at 9:19 AM Jacob Champion <jchampion@timescale.com> wrote:
> >> On Thu, Jul 21, 2022 at 4:29 PM Jacob Champion <jchampion@timescale.com> wrote:
> >>> v4 attempts to fix this by letting the check hooks pass
> >>> MCXT_ALLOC_NO_OOM to pg_clean_ascii(). (It's ignored in the frontend,
> >>> which just mallocs.)
> >>
> >> Ping -- should I add an open item somewhere so this isn't lost?
> >
> > Trying again. Peter, is this approach acceptable? Should I try something else?
>
> This looks fine to me. Committed.

While looking at the recent changes for check_cluster_name() I found this thread. Regarding the following change made by commit 45b1a67a0fc, there is possibly a small memory leak:

     static bool
     check_cluster_name(char **newval, void **extra, GucSource source)
     {
    +    char       *clean;
    +
         /* Only allow clean ASCII chars in the cluster name */
    -    pg_clean_ascii(*newval);
    +    clean = pg_clean_ascii(*newval, MCXT_ALLOC_NO_OOM);
    +    if (!clean)
    +        return false;
    +
    +    clean = guc_strdup(WARNING, clean);
    +    if (!clean)
    +        return false;
    +    *newval = clean;
         return true;
     }

pg_clean_ascii() does palloc_extended() to allocate memory in Postmaster context for the new characters, and clean is then replaced with the new memory allocated by guc_strdup(). No one references the memory allocated by pg_clean_ascii() and it lasts for the postmaster lifetime. Valgrind memcheck also shows:

    1 bytes in 1 blocks are definitely lost in loss record 4 of 70
       at 0xCD2A16: palloc_extended (mcxt.c:1239)
       by 0xD09437: pg_clean_ascii (string.c:99)
       by 0x7A5CF3: check_cluster_name (variable.c:1061)
       by 0xCAF7CD: call_string_check_hook (guc.c:6365)
       by 0xCAA724: InitializeOneGUCOption (guc.c:1439)
       by 0xCAA0ED: InitializeGUCOptions (guc.c:1268)
       by 0x99B245: PostmasterMain (postmaster.c:691)
       by 0x858896: main (main.c:197)

I think we can fix it by the attached patch, but I'd like to discuss whether it's worth fixing.

Regards,

--
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
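The fix is essentially to release the palloc'd intermediate once it has been copied into GUC-owned memory, roughly like this (a sketch only; the attached patch is the authoritative version):

    clean = pg_clean_ascii(*newval, MCXT_ALLOC_NO_OOM);
    if (!clean)
        return false;

    /* Copy into GUC-owned memory, then free the palloc'd intermediate. */
    *newval = guc_strdup(WARNING, clean);
    pfree(clean);
    if (!*newval)
        return false;

    return true;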
Attachment
On Tue, Sep 27, 2022 at 1:51 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> I think we can fix it by the attached patch but I'd like to discuss
> whether it's worth fixing it.

Whoops. So every time it's changed, we leak a little postmaster memory? Your patch looks good to me and I see no reason not to fix it.

Thanks,
--Jacob
On Wed, Sep 28, 2022 at 4:44 AM Jacob Champion <jchampion@timescale.com> wrote:
>
> On Tue, Sep 27, 2022 at 1:51 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > I think we can fix it by the attached patch but I'd like to discuss
> > whether it's worth fixing it.
>
> Whoops. So every time it's changed, we leak a little postmaster memory?

No. Since cluster_name is PGC_POSTMASTER, we leak a little postmaster memory only once, at startup. application_name is PGC_USERSET, but since we normally allocate its memory in PortalMemoryContext we can eventually free it.

Since check_cluster_name and check_application_name are similar, I changed both for consistency.

Regards,

--
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Tue, Sep 27, 2022 at 6:14 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> No. Since cluster_name is PGC_POSTMASTER, we leak a little postmaster
> memory only once when starting up. application_name is PGC_USERSET but
> since we normally allocate memory in PortalMemoryContext we eventually
> can free it.

Oh, I see; thank you for the correction. And even if someone put an application_name into their postgresql.conf, and then changed it a bunch of times, we'd free the leaked memory from the config_cxt that's created in ProcessConfigFile().

Is there a reason we don't provide a similar temporary context during InitializeGUCOptions()? Naively it seems like that would suppress any future one-time leaks, and maybe cut down on some Valgrind noise. Then again, maybe there's just not that much demand for pallocs during GUC hooks.

Thanks,
--Jacob
On Thu, Sep 29, 2022 at 1:43 AM Jacob Champion <jchampion@timescale.com> wrote:
>
> On Tue, Sep 27, 2022 at 6:14 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > No. Since cluster_name is PGC_POSTMASTER, we leak a little postmaster
> > memory only once when starting up. application_name is PGC_USERSET but
> > since we normally allocate memory in PortalMemoryContext we eventually
> > can free it.
>
> Oh, I see; thank you for the correction. And even if someone put an
> application_name into their postgresql.conf, and then changed it a
> bunch of times, we'd free the leaked memory from the config_cxt that's
> created in ProcessConfigFile().

Right.

> Is there a reason we don't provide a similar temporary context during
> InitializeGUCOptions()? Naively it seems like that would suppress any
> future one-time leaks, and maybe cut down on some Valgrind noise. Then
> again, maybe there's just not that much demand for pallocs during GUC
> hooks.

While this seems a future-proof idea, I wonder if it might be overkill, since we don't need to worry about accumulation of leaked memory in this case. Given that check_cluster_name is the only case where we found a small memory leak, I think it's adequate to just fix it.

Fixing this issue suppresses valgrind's complaint, but since the boot value of cluster_name is "" the memory leak we can avoid is only 1 byte.

Regards,

--
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On 29.09.22 06:52, Masahiko Sawada wrote:
> While this seems a future-proof idea, I wonder if it might be overkill
> since we don't need to worry about accumulation of leaked memory in
> this case. Given that only check_cluter_name is the case where we
> found a small memory leak, I think it's adequate to fix it.
>
> Fixing this issue suppresses the valgrind's complaint but since the
> boot value of cluster_name is "" the memory leak we can avoid is only
> 1 byte.

I have committed this. I think it's better to keep the code locally robust and not to have to rely on complex analysis of how GUC memory management works.
On Sat, Oct 1, 2022 at 7:53 PM Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote:
>
> On 29.09.22 06:52, Masahiko Sawada wrote:
> > While this seems a future-proof idea, I wonder if it might be overkill
> > since we don't need to worry about accumulation of leaked memory in
> > this case. Given that only check_cluter_name is the case where we
> > found a small memory leak, I think it's adequate to fix it.
> >
> > Fixing this issue suppresses the valgrind's complaint but since the
> > boot value of cluster_name is "" the memory leak we can avoid is only
> > 1 byte.
>
> I have committed this. I think it's better to keep the code locally
> robust and not to have to rely on complex analysis of how GUC memory
> management works.

Thanks! Agreed.

Regards,

--
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Hi,
I think that commit 9d58bf left a tiny oversight.
guc_strdup uses strdup, so the result must be cleaned up with free, not pfree.
IMO, a single free call could be used too.
regards,
Ranier Vilela
Attachment
On Wed, Oct 5, 2022 at 09:19, Ranier Vilela <ranier.vf@gmail.com> wrote:
> Hi,
> I think that commit 9d58bf left a tiny oversight.
> guc_strdup uses strdup, so the result must be cleaned up with free, not pfree.
> IMO, a single free call could be used too.
Sorry, my fault.
Please ignore this.
regards,
Ranier Vilela