Thread: Patch proposal: New hooks in the connection path
Hi hackers,
While commit 960869da08 added some information about connections that have been successfully authenticated, there is no metrics for connections that have not (or did not reached the authentication stage).
Adding metrics about failed connections attempts could also help, for example with proper sampling, to:
- detect spikes in failed login attempts
- check if there is a correlation between spikes in successful and failed connection attempts
While the number of successful connections could also already been tracked with the ClientAuthentication_hook (and also the ones that failed the authentication) we are missing metrics about:
- why the connection failed (could be bad password, bad database, bad user, missing CONNECT privilege...)
- number of times the authentication stage has not been reached
- why the authentication stage has not been reached (bad startup packets, timeout while processing startup packet,...)
Those missing metrics (in addition to the ones that can be already gathered) could provide value for:
- security investigations
- anomalies detections
- tracking application misconfigurations
In an attempt to be able to provide those metrics, please find attached a patch proposal to add new hooks in the connection path, that would be fired if:
- there is a bad startup packet
- there is a timeout while processing the startup packet
- user does not have CONNECT privilege
- database does not exist
For safety those hooks request the use of a const Port parameter, so that they could be used only for reporting purpose (for example, we are working on an extension to record detailed login metrics counters).
Another option could be to add those metrics in the engine itself (instead of providing new hooks to get them), but the new hooks option gives more flexibility on how to render and exploit them (there is a lot of information in the Port Struct that one could be interested with).
I’m adding this patch proposal to the commitfest.
Looking forward to your feedback,
Regards,
Bertrand
Attachment
On Thu, Jun 30, 2022 at 1:31 PM Drouvot, Bertrand <bdrouvot@amazon.com> wrote: > > Hi hackers, > > While commit 960869da08 added some information about connections that have been successfully authenticated, there is nometrics for connections that have not (or did not reached the authentication stage). > > Adding metrics about failed connections attempts could also help, for example with proper sampling, to: > > detect spikes in failed login attempts > check if there is a correlation between spikes in successful and failed connection attempts > > While the number of successful connections could also already been tracked with the ClientAuthentication_hook (and alsothe ones that failed the authentication) we are missing metrics about: > > why the connection failed (could be bad password, bad database, bad user, missing CONNECT privilege...) > number of times the authentication stage has not been reached > why the authentication stage has not been reached (bad startup packets, timeout while processing startup packet,...) > > Those missing metrics (in addition to the ones that can be already gathered) could provide value for: > > security investigations > anomalies detections > tracking application misconfigurations > > In an attempt to be able to provide those metrics, please find attached a patch proposal to add new hooks in the connectionpath, that would be fired if: > > there is a bad startup packet > there is a timeout while processing the startup packet > user does not have CONNECT privilege > database does not exist > > For safety those hooks request the use of a const Port parameter, so that they could be used only for reporting purpose(for example, we are working on an extension to record detailed login metrics counters). > > Another option could be to add those metrics in the engine itself (instead of providing new hooks to get them), but thenew hooks option gives more flexibility on how to render and exploit them (there is a lot of information in the Port Structthat one could be interested with). > > I’m adding this patch proposal to the commitfest. > Looking forward to your feedback, +1 for the idea. I've seen numerous cases where the login metrics (especially failed connections) are handy in analyzing stuff. And I'm okay with the hook approach than the postgres emitting the necessary metrics. However, I'm personally not okay with having multiple hooks as proposed in the v1 patch. Can we think of having a single hook or enhancing the existing ClientAuthentication_hook where we pass a PURPOSE parameter (CONN_SUCCESS, CONN_FAILURE, CONN_FOO, CONN_BAR ....) tp the hook? With this approach, we don't need to spread out the postgres code with many hooks and the hook implementers will look at the PURPOSE parameter and deal with it accordingly. On the security aspect, we must ensure we don't leak any sensitive information such as password or SSH key to the new hook - if PGPORT has this information, maybe we need to mask that structure a bit before handing it off to the hook. Regards, Bharath Rupireddy.
Hi, On 6/30/22 11:23 AM, Bharath Rupireddy wrote: > On Thu, Jun 30, 2022 at 1:31 PM Drouvot, Bertrand <bdrouvot@amazon.com> wrote: >> Hi hackers, >> >> While commit 960869da08 added some information about connections that have been successfully authenticated, there is nometrics for connections that have not (or did not reached the authentication stage). >> >> Adding metrics about failed connections attempts could also help, for example with proper sampling, to: >> >> detect spikes in failed login attempts >> check if there is a correlation between spikes in successful and failed connection attempts >> >> While the number of successful connections could also already been tracked with the ClientAuthentication_hook (and alsothe ones that failed the authentication) we are missing metrics about: >> >> why the connection failed (could be bad password, bad database, bad user, missing CONNECT privilege...) >> number of times the authentication stage has not been reached >> why the authentication stage has not been reached (bad startup packets, timeout while processing startup packet,...) >> >> Those missing metrics (in addition to the ones that can be already gathered) could provide value for: >> >> security investigations >> anomalies detections >> tracking application misconfigurations >> >> In an attempt to be able to provide those metrics, please find attached a patch proposal to add new hooks in the connectionpath, that would be fired if: >> >> there is a bad startup packet >> there is a timeout while processing the startup packet >> user does not have CONNECT privilege >> database does not exist >> >> For safety those hooks request the use of a const Port parameter, so that they could be used only for reporting purpose(for example, we are working on an extension to record detailed login metrics counters). >> >> Another option could be to add those metrics in the engine itself (instead of providing new hooks to get them), but thenew hooks option gives more flexibility on how to render and exploit them (there is a lot of information in the Port Structthat one could be interested with). >> >> I’m adding this patch proposal to the commitfest. >> Looking forward to your feedback, > +1 for the idea. I've seen numerous cases where the login metrics > (especially failed connections) are handy in analyzing stuff. And I'm > okay with the hook approach than the postgres emitting the necessary > metrics. Thanks for looking at it! > However, I'm personally not okay with having multiple hooks > as proposed in the v1 patch. I agree that it would be great to reduce the number of proposed hooks. But, > Can we think of having a single hook The proposed hooks are triggered during errors (means that the connection attempt break) and: - In the connection paths that will not reach the ClientAuthentication_hook at all: those are the ones related to the bad startup packet and timeout while processing the startup packet. or - After the ClientAuthentication_hook is fired: those are the bad db oid, bad db name and bad perm ones. So, It does look like having only one hook would require refactoring in the connection path and I'm not sure if this is worth it. > or > enhancing the existing ClientAuthentication_hook where we pass a > PURPOSE parameter (CONN_SUCCESS, CONN_FAILURE, CONN_FOO, CONN_BAR > ....) tp the hook? I think one could already "predict" the bad db and bad perm errors within the current ClientAuthentication_hook. But in case of multiple "possible" errors (within the same connection attempt) how could we know for sure the one that will be actually reported? That's why i think the best way is to put new hooks as close as possible to the place where the related errors are reported. What do you think? > With this approach, we don't need to spread out the > postgres code with many hooks and the hook implementers will look at > the PURPOSE parameter and deal with it accordingly. > > On the security aspect, we must ensure we don't leak any sensitive > information such as password or SSH key to the new hook - if PGPORT > has this information, maybe we need to mask that structure a bit > before handing it off to the hook. Yeah good point, I'll have a closer look if there is any information that could be present in the Port struct before the ClientAuthentication_hook is fired (means during the ones that are related to the startup packet) and that we would want to mask. Regards, Bertrand
On Fri, Jul 01, 2022 at 09:48:40AM +0200, Drouvot, Bertrand wrote: >> However, I'm personally not okay with having multiple hooks >> as proposed in the v1 patch. > > I agree that it would be great to reduce the number of proposed hooks. > > But, > >> Can we think of having a single hook > > The proposed hooks are triggered during errors (means that the connection > attempt break) and: > > - In the connection paths that will not reach the ClientAuthentication_hook > at all: those are the ones related to the bad startup packet and timeout > while processing the startup packet. > > or > > - After the ClientAuthentication_hook is fired: those are the bad db oid, > bad db name and bad perm ones. > > So, It does look like having only one hook would require refactoring in the > connection path and I'm not sure if this is worth it. > >> or >> enhancing the existing ClientAuthentication_hook where we pass a >> PURPOSE parameter (CONN_SUCCESS, CONN_FAILURE, CONN_FOO, CONN_BAR >> ....) tp the hook? > > I think one could already "predict" the bad db and bad perm errors within > the current ClientAuthentication_hook. > > But in case of multiple "possible" errors (within the same connection > attempt) how could we know for sure the one that will be actually reported? > That's why i think the best way is to put new hooks as close as possible to > the place where the related errors are reported. > > What do you think? Could we model this after fmgr_hook? The first argument in that hook indicates where it is being called from. This doesn't alleviate the need for several calls to the hook in the authentication logic, but extension authors would only need to define one hook. That being said, I don't see why this information couldn't be provided in a system view. IMO it is generically useful. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
That being said, I don't see why this information couldn't be provided in a
system view. IMO it is generically useful.
Hi, On 7/2/22 1:00 AM, Nathan Bossart wrote: > Could we model this after fmgr_hook? The first argument in that hook > indicates where it is being called from. This doesn't alleviate the need > for several calls to the hook in the authentication logic, but extension > authors would only need to define one hook. I like the idea and indeed fmgr.h looks a good place to model it. Attached a new patch version doing so. Thanks -- Bertrand Drouvot Amazon Web Services: https://aws.amazon.com
Attachment
Hi,
On Fri, Jul 1, 2022 at 5:00 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
That being said, I don't see why this information couldn't be provided in a
system view. IMO it is generically useful.+1 for a system view with appropriate permissions, in addition to the hooks.That would make the information easily accessible to a number or monitoring systems besides the admin.
Agree about that.
I'll start another thread and propose a dedicated patch for the "internal counters" and how to expose them.
Thanks for the feedback,
-- Bertrand Drouvot Amazon Web Services: https://aws.amazon.com
Hi,
On 7/2/22 1:00 AM, Nathan Bossart wrote:
> Could we model this after fmgr_hook? The first argument in that hook
> indicates where it is being called from. This doesn't alleviate the need
> for several calls to the hook in the authentication logic, but extension
> authors would only need to define one hook.
I like the idea and indeed fmgr.h looks a good place to model it.
Attached a new patch version doing so.
Thanks
--
Bertrand Drouvot
Amazon Web Services: https://aws.amazon.com
+ FCET_BSP, /* bad startup packet */
On Mon, Jul 4, 2022 at 6:29 PM Drouvot, Bertrand <bdrouvot@amazon.com> wrote: > > On 7/2/22 2:49 AM, Roberto Mello wrote: > > On Fri, Jul 1, 2022 at 5:00 PM Nathan Bossart <nathandbossart@gmail.com> wrote: >> >> That being said, I don't see why this information couldn't be provided in a >> system view. IMO it is generically useful. > > +1 for a system view with appropriate permissions, in addition to the hooks. > > That would make the information easily accessible to a number or monitoring systems besides the admin. > > Agree about that. Are we going to have it as a part of shared memory stats? Or a separate shared memory for connection stats exposing these via a function and a view can be built on this function like pg_get_replication_slots and pg_replication_slots? > I'll start another thread and propose a dedicated patch for the "internal counters" and how to expose them. IMHO, let's have the discussion here in this thread and the patch can be 0002. Regards, Bharath Rupireddy.
Hi,
Hi,+ FCET_SPT, /* startup packet timeout */
+ FCET_BSP, /* bad startup packet */Looking at existing enum type, such as FmgrHookEventType, the part after underscore is a word.I think it would be good to follow existing practice and make the enums more readable.
Fair point.
Attached a new version which makes the enums more readable.
Thanks for the feedback
Regards
-- Bertrand Drouvot Amazon Web Services: https://aws.amazon.com
Attachment
On Mon, Jul 4, 2022 at 6:23 PM Drouvot, Bertrand <bdrouvot@amazon.com> wrote: > > Hi, > > On 7/2/22 1:00 AM, Nathan Bossart wrote: > > Could we model this after fmgr_hook? The first argument in that hook > > indicates where it is being called from. This doesn't alleviate the need > > for several calls to the hook in the authentication logic, but extension > > authors would only need to define one hook. > > I like the idea and indeed fmgr.h looks a good place to model it. > > Attached a new patch version doing so. Thanks for the patch. Can we think of enhancing ClientAuthentication_hook_type itself i.e. make it a generic hook for all sorts of authentication metrics, info etc. with the type parameter embedded to it instead of new hook FailedConnection_hook?We can either add a new parameter for the "event" (the existing ClientAuthentication_hook_type implementers will have problems), or embed/multiplex the "event" info to existing Port structure or status variable (macro or enum) (existing implementers will not have compatibility problems). IMO, this looks cleaner going forward. On the v2 patch: 1. Why do we need to place the hook and structures in fmgr.h? Why not in auth.h? 2. Timeout Handler is a signal handler, called as part of SIGALRM signal handler, most of the times, signal handlers ought to be doing small things, now that we are handing off the control to hook, which can do any long running work (writing to a remote storage, file, aggregate etc.), I don't think it's the right thing to do here. static void StartupPacketTimeoutHandler(void) { + if (FailedConnection_hook) + (*FailedConnection_hook) (FCET_STARTUP_PACKET_TIMEOUT, MyProcPort); + ereport(COMMERROR, + (errcode(ERRCODE_PROTOCOL_VIOLATION), + errmsg("timeout while processing startup packet"))); 3. On "not letting these hooks (ClientAuthentication_hook_type or FailedConnection_hook_type) expose sensitive info via Port structure" - it seems like the Port structure has sensitive info like HbaLine, host address, username, etc. but that's what it is so I think we are okay with the structure as-is. Regards, Bharath Rupireddy.
On 6/30/22 5:23 AM, Bharath Rupireddy wrote: > <snip> > On the security aspect, we must ensure we don't leak any sensitive > information such as password or SSH key to the new hook - if PGPORT > has this information, maybe we need to mask that structure a bit > before handing it off to the hook. Can you elaborate more on why you see this as necessary? Extensions run in-process and have no real memory access limits, "masking", which really means copying data to another struct, is just extra work and overhead with no actual security gain, IMO.
On 7/5/22 03:37, Bharath Rupireddy wrote: > On Mon, Jul 4, 2022 at 6:23 PM Drouvot, Bertrand <bdrouvot@amazon.com> wrote: >> On 7/2/22 1:00 AM, Nathan Bossart wrote: >> > Could we model this after fmgr_hook? The first argument in that hook >> > indicates where it is being called from. This doesn't alleviate the need >> > for several calls to the hook in the authentication logic, but extension >> > authors would only need to define one hook. >> >> I like the idea and indeed fmgr.h looks a good place to model it. >> >> Attached a new patch version doing so. I was thinking along the same lines, so +1 for the general approach > Thanks for the patch. Can we think of enhancing > ClientAuthentication_hook_type itself i.e. make it a generic hook for > all sorts of authentication metrics, info etc. with the type parameter > embedded to it instead of new hook FailedConnection_hook?We can either > add a new parameter for the "event" (the existing > ClientAuthentication_hook_type implementers will have problems), or > embed/multiplex the "event" info to existing Port structure or status > variable (macro or enum) (existing implementers will not have > compatibility problems). IMO, this looks cleaner going forward. Not sure I like this though -- I'll have to think about that > On the v2 patch: > > 1. Why do we need to place the hook and structures in fmgr.h? Why not in auth.h? agreed -- it does not belong in fmgr.h > 2. Timeout Handler is a signal handler, called as part of SIGALRM > signal handler, most of the times, signal handlers ought to be doing > small things, now that we are handing off the control to hook, which > can do any long running work (writing to a remote storage, file, > aggregate etc.), I don't think it's the right thing to do here. > static void > StartupPacketTimeoutHandler(void) > { > + if (FailedConnection_hook) > + (*FailedConnection_hook) (FCET_STARTUP_PACKET_TIMEOUT, MyProcPort); > + ereport(COMMERROR, > + (errcode(ERRCODE_PROTOCOL_VIOLATION), > + errmsg("timeout while processing startup packet"))); Why add the ereport()? But more to Bharath's point, perhaps this is a case that is better served by incrementing a stat counter and not exposed as a hook? Also, a teeny nit: 8<-------------- + if (status != STATUS_OK) { + if (FailedConnection_hook) 8<-------------- does not follow usual practice and probably should be: 8<-------------- + if (status != STATUS_OK) + { + if (FailedConnection_hook) 8<-------------- -- Joe Conway RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Hi, On 7/6/22 12:11 AM, Joe Conway wrote: > > On 7/5/22 03:37, Bharath Rupireddy wrote: >> On Mon, Jul 4, 2022 at 6:23 PM Drouvot, Bertrand >> <bdrouvot@amazon.com> wrote: >>> On 7/2/22 1:00 AM, Nathan Bossart wrote: >>> > Could we model this after fmgr_hook? The first argument in that hook >>> > indicates where it is being called from. This doesn't alleviate >>> the need >>> > for several calls to the hook in the authentication logic, but >>> extension >>> > authors would only need to define one hook. >>> >>> I like the idea and indeed fmgr.h looks a good place to model it. >>> >>> Attached a new patch version doing so. > > I was thinking along the same lines, so +1 for the general approach Thanks for the review! > >> Thanks for the patch. Can we think of enhancing >> ClientAuthentication_hook_type itself i.e. make it a generic hook for >> all sorts of authentication metrics, info etc. with the type parameter >> embedded to it instead of new hook FailedConnection_hook?We can either >> add a new parameter for the "event" (the existing >> ClientAuthentication_hook_type implementers will have problems), or >> embed/multiplex the "event" info to existing Port structure or status >> variable (macro or enum) (existing implementers will not have >> compatibility problems). IMO, this looks cleaner going forward. > > Not sure I like this though -- I'll have to think about that Not sure about this one neither. The "enhanced" ClientAuthentication_hook will have to be fired at the same places as the new FailedConnection_hook is, but i think those places are not necessary linked to real authentication per say (making the name confusing). > >> On the v2 patch: >> >> 1. Why do we need to place the hook and structures in fmgr.h? Why not >> in auth.h? > > agreed -- it does not belong in fmgr.h Moved to auth.h. > >> 2. Timeout Handler is a signal handler, called as part of SIGALRM >> signal handler, most of the times, signal handlers ought to be doing >> small things, now that we are handing off the control to hook, which >> can do any long running work (writing to a remote storage, file, >> aggregate etc.), I don't think it's the right thing to do here. >> static void >> StartupPacketTimeoutHandler(void) >> { >> + if (FailedConnection_hook) >> + (*FailedConnection_hook) (FCET_STARTUP_PACKET_TIMEOUT, MyProcPort); >> + ereport(COMMERROR, >> + (errcode(ERRCODE_PROTOCOL_VIOLATION), >> + errmsg("timeout while processing startup packet"))); > > Why add the ereport()? removed it. > > But more to Bharath's point, perhaps this is a case that is better > served by incrementing a stat counter and not exposed as a hook? I think that the advantage of the hook is that it gives the extension author the ability/flexibility to aggregate the counter based on information available in the Port Struct (say the client addr for example) at this stage. What about to aggregate the stat counter based on the client addr? (Not sure if there is more useful information (than the client addr) at this stage though) That said, i agree that having a hook in a time out handler might not be the right thing to do (even if at the end that would be to the extension author responsibility to do "small things" in it), so it has been removed in the new attached version. > > Also, a teeny nit: > 8<-------------- > + if (status != STATUS_OK) { > + if (FailedConnection_hook) > 8<-------------- > > does not follow usual practice and probably should be: > > 8<-------------- > + if (status != STATUS_OK) > + { > + if (FailedConnection_hook) > 8<-------------- > > Thanks!, fixed. -- Bertrand Drouvot Amazon Web Services: https://aws.amazon.com
Attachment
Hi, On 7/5/22 9:07 AM, Bharath Rupireddy wrote: > On Mon, Jul 4, 2022 at 6:29 PM Drouvot, Bertrand <bdrouvot@amazon.com> wrote: >> On 7/2/22 2:49 AM, Roberto Mello wrote: >> >> On Fri, Jul 1, 2022 at 5:00 PM Nathan Bossart <nathandbossart@gmail.com> wrote: >>> That being said, I don't see why this information couldn't be provided in a >>> system view. IMO it is generically useful. >> +1 for a system view with appropriate permissions, in addition to the hooks. >> >> That would make the information easily accessible to a number or monitoring systems besides the admin. >> >> Agree about that. > Are we going to have it as a part of shared memory stats? Or a > separate shared memory for connection stats exposing these via a > function and a view can be built on this function like > pg_get_replication_slots and pg_replication_slots? Thanks for looking at it. I don't have any proposal yet, I'll have to look at it. -- Bertrand Drouvot Amazon Web Services: https://aws.amazon.com
On 7/6/22 04:13, Drouvot, Bertrand wrote: > On 7/6/22 12:11 AM, Joe Conway wrote: >> On 7/5/22 03:37, Bharath Rupireddy wrote: >>> 2. Timeout Handler is a signal handler, called as part of SIGALRM >>> signal handler, most of the times, signal handlers ought to be doing >>> small things, now that we are handing off the control to hook, which >>> can do any long running work (writing to a remote storage, file, >>> aggregate etc.), I don't think it's the right thing to do here. >>> static void >>> StartupPacketTimeoutHandler(void) >>> { >>> + if (FailedConnection_hook) >>> + (*FailedConnection_hook) (FCET_STARTUP_PACKET_TIMEOUT, MyProcPort); >> But more to Bharath's point, perhaps this is a case that is better >> served by incrementing a stat counter and not exposed as a hook? > > I think that the advantage of the hook is that it gives the extension > author the ability/flexibility to aggregate the counter based on > information available in the Port Struct (say the client addr for > example) at this stage. > > What about to aggregate the stat counter based on the client addr? (Not > sure if there is more useful information (than the client addr) at this > stage though) > > That said, i agree that having a hook in a time out handler might not be > the right thing to do (even if at the end that would be to the extension > author responsibility to do "small things" in it), so it has been > removed in the new attached version. It isn't clear to me if having a hook in the timeout handler is a nonstarter -- perhaps a comment with suitable warning for prospective extension authors is enough? Anyone else want to weigh in on this issue specifically? -- Joe Conway RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Joe Conway <mail@joeconway.com> writes: > It isn't clear to me if having a hook in the timeout handler is a > nonstarter -- perhaps a comment with suitable warning for prospective > extension authors is enough? Anyone else want to weigh in on this issue > specifically? It doesn't seem like a great place for a hook, because the list of stuff you could safely do there would be mighty short, possibly the empty set. Write to shared memory? Not too safe. Write to a file? Even less. Write to local memory? Pointless, because we're about to _exit(1). Pretty much anything I can think of that you'd want to do is something we've already decided the core code can't safely do, and putting it in a hook won't make it safer. If someone wants to argue for this hook, I'd like to see a credible example of a *safe* use-case, keeping in mind the points raised in the comments in BackendInitialize and process_startup_packet_die. regards, tom lane
On Fri, Jul 8, 2022 at 1:40 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Joe Conway <mail@joeconway.com> writes: > > It isn't clear to me if having a hook in the timeout handler is a > > nonstarter -- perhaps a comment with suitable warning for prospective > > extension authors is enough? Anyone else want to weigh in on this issue > > specifically? > > It doesn't seem like a great place for a hook, because the list of stuff > you could safely do there would be mighty short, possibly the empty set. > Write to shared memory? Not too safe. Write to a file? Even less. > Write to local memory? Pointless, because we're about to _exit(1). > Pretty much anything I can think of that you'd want to do is something > we've already decided the core code can't safely do, and putting it > in a hook won't make it safer. I agree with this. But, all of the areas that v2-0003 touched for connectivity failures, they typically are emitting ereport(FATAL,/ereport(COMMERROR, (in ProcessStartupPacket) and we have emit_log_hook already being exposed and the implementers can, literally, do anything the hook. Looking at v2-0003 patch and emit_log_hook, how about we filter out for those connectivity errors either based on error codes and if they aren't unique, perhaps passing special flags to ereport API indicating that it's a connectivity error and in the emit_log_hook we can look for those connectivity error codes or flags to collect the stats about the failure connections (with MyProcPort being present in emit_log_hook)? This way, we don't need a new hook. Thoughts? Regards, Bharath Rupireddy.
Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> writes: > On Fri, Jul 8, 2022 at 1:40 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> It doesn't seem like a great place for a hook, because the list of stuff >> you could safely do there would be mighty short, possibly the empty set. > I agree with this. But, all of the areas that v2-0003 touched for > connectivity failures, they typically are emitting > ereport(FATAL,/ereport(COMMERROR, (in ProcessStartupPacket) and we > have emit_log_hook already being exposed and the implementers can, > literally, do anything the hook. This is utterly off-point, because those calls are not inside signal handlers. regards, tom lane
Hi, On 7/7/22 10:10 PM, Tom Lane wrote: > Joe Conway <mail@joeconway.com> writes: >> It isn't clear to me if having a hook in the timeout handler is a >> nonstarter -- perhaps a comment with suitable warning for prospective >> extension authors is enough? Anyone else want to weigh in on this issue >> specifically? > It doesn't seem like a great place for a hook, because the list of stuff > you could safely do there would be mighty short, possibly the empty set. > Write to shared memory? Not too safe. Write to a file? Even less. > Write to local memory? Pointless, because we're about to _exit(1). > Pretty much anything I can think of that you'd want to do is something > we've already decided the core code can't safely do, and putting it > in a hook won't make it safer. > > If someone wants to argue for this hook, I'd like to see a credible > example of a *safe* use-case, keeping in mind the points raised in > the comments in BackendInitialize and process_startup_packet_die. The use case would be to increment a counter in shared memory (or most probably within an hash table) to reflect the number of times a startup packet timeout occurred. Reading the comments in/related to BackendInitialize() I understand that's definitely not safe to write in shared memory for the EXEC_BACKEND case, but wouldn't it be safe for the non EXEC_BACKEND case? BTW, it makes me realize that the hook being fired in the bad startup packet case: /* * Stop here if it was bad or a cancel packet. ProcessStartupPacket * already did any appropriate error reporting. */ if (status != STATUS_OK) + { + if (FailedConnection_hook) + (*FailedConnection_hook) (FCET_BAD_STARTUP_PACKET, port); proc_exit(0); + } is not safe for the EXEC_BACKEND case. Regards, -- Bertrand Drouvot Amazon Web Services: https://aws.amazon.com
Hi, On 7/11/22 8:18 AM, Drouvot, Bertrand wrote: > Hi, > > On 7/7/22 10:10 PM, Tom Lane wrote: >> Joe Conway <mail@joeconway.com> writes: >>> It isn't clear to me if having a hook in the timeout handler is a >>> nonstarter -- perhaps a comment with suitable warning for prospective >>> extension authors is enough? Anyone else want to weigh in on this issue >>> specifically? >> It doesn't seem like a great place for a hook, because the list of stuff >> you could safely do there would be mighty short, possibly the empty set. >> Write to shared memory? Not too safe. Write to a file? Even less. >> Write to local memory? Pointless, because we're about to _exit(1). >> Pretty much anything I can think of that you'd want to do is something >> we've already decided the core code can't safely do, and putting it >> in a hook won't make it safer. >> >> If someone wants to argue for this hook, I'd like to see a credible >> example of a *safe* use-case, keeping in mind the points raised in >> the comments in BackendInitialize and process_startup_packet_die. > > The use case would be to increment a counter in shared memory (or most > probably within an hash table) to reflect the number of times a > startup packet timeout occurred. > > Reading the comments in/related to BackendInitialize() I understand > that's definitely not safe to write in shared memory for the > EXEC_BACKEND case, but wouldn't it be safe for the non EXEC_BACKEND case? > > BTW, it makes me realize that the hook being fired in the bad startup > packet case: > > /* > * Stop here if it was bad or a cancel packet. > ProcessStartupPacket > * already did any appropriate error reporting. > */ > if (status != STATUS_OK) > + { > + if (FailedConnection_hook) > + (*FailedConnection_hook) > (FCET_BAD_STARTUP_PACKET, port); > proc_exit(0); > + } > > is not safe for the EXEC_BACKEND case. > What about the idea to trigger the hook for the STARTUP PACKET TIMEOUT and BAD STARTUP PACKET only for the non EXEC_BACKEND/Windows cases? I'm tempted to think it's better to have some cases where one could benefit from the hook as opposed to none. Thoughts? Regards, -- Bertrand Drouvot Amazon Web Services: https://aws.amazon.com
On Fri, Jul 8, 2022 at 5:54 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > Looking at v2-0003 patch and emit_log_hook, how about we filter out > for those connectivity errors either based on error codes and if they > aren't unique, perhaps passing special flags to ereport API indicating > that it's a connectivity error and in the emit_log_hook we can look > for those connectivity error codes or flags to collect the stats about > the failure connections (with MyProcPort being present in > emit_log_hook)? This way, we don't need a new hook. Thoughts? Bertrand and Other Hackers, above comment may have been lost in the wild - any thoughts on it? Regards, Bharath Rupireddy.
Hi Bharath, On 7/14/22 11:43 AM, Bharath Rupireddy wrote: > On Fri, Jul 8, 2022 at 5:54 PM Bharath Rupireddy > <bharath.rupireddyforpostgres@gmail.com> wrote: >> Looking at v2-0003 patch and emit_log_hook, how about we filter out >> for those connectivity errors either based on error codes and if they >> aren't unique, perhaps passing special flags to ereport API indicating >> that it's a connectivity error and in the emit_log_hook we can look >> for those connectivity error codes or flags to collect the stats about >> the failure connections (with MyProcPort being present in >> emit_log_hook)? This way, we don't need a new hook. Thoughts? > Bertrand and Other Hackers, above comment may have been lost in the > wild - any thoughts on it? Thanks for your feedback! I can see 2 issues with that approach: - We’ll not be able to track the “startup timeout case” (well, we may not be able to track it anyway depending of what next to [1] will be) as it does not emit any log messages. - We’ll depend of the log_min_messages value (means edata->output_to_server needs to be true for the emit_log_hook to be triggered). [1]: https://www.postgresql.org/message-id/a1558d12-c1c4-0fe5-f8a5-2b6c2294e55f%40amazon.com Regards, -- Bertrand Drouvot Amazon Web Services: https://aws.amazon.com
Hi, On 7/12/22 2:58 PM, Drouvot, Bertrand wrote: > Hi, > > On 7/11/22 8:18 AM, Drouvot, Bertrand wrote: >> Hi, >> >> The use case would be to increment a counter in shared memory (or >> most probably within an hash table) to reflect the number of times a >> startup packet timeout occurred. >> >> Reading the comments in/related to BackendInitialize() I understand >> that's definitely not safe to write in shared memory for the >> EXEC_BACKEND case, but wouldn't it be safe for the non EXEC_BACKEND >> case? >> >> BTW, it makes me realize that the hook being fired in the bad startup >> packet case: >> >> /* >> * Stop here if it was bad or a cancel packet. >> ProcessStartupPacket >> * already did any appropriate error reporting. >> */ >> if (status != STATUS_OK) >> + { >> + if (FailedConnection_hook) >> + (*FailedConnection_hook) >> (FCET_BAD_STARTUP_PACKET, port); >> proc_exit(0); >> + } >> >> is not safe for the EXEC_BACKEND case. >> > What about the idea to trigger the hook for the STARTUP PACKET TIMEOUT > and BAD STARTUP PACKET only for the non EXEC_BACKEND/Windows cases? > > I'm tempted to think it's better to have some cases where one could > benefit from the hook as opposed to none. > > Thoughts? Please find attached v2-0004-connection_hooks.patch as an attempt of doing so. Thanks -- Bertrand Drouvot Amazon Web Services: https://aws.amazon.com
Attachment
On Tue, Aug 2, 2022 at 6:55 PM Drouvot, Bertrand <bdrouvot@amazon.com> wrote: > > Hi Bharath, > > On 7/14/22 11:43 AM, Bharath Rupireddy wrote: > > On Fri, Jul 8, 2022 at 5:54 PM Bharath Rupireddy > > <bharath.rupireddyforpostgres@gmail.com> wrote: > >> Looking at v2-0003 patch and emit_log_hook, how about we filter out > >> for those connectivity errors either based on error codes and if they > >> aren't unique, perhaps passing special flags to ereport API indicating > >> that it's a connectivity error and in the emit_log_hook we can look > >> for those connectivity error codes or flags to collect the stats about > >> the failure connections (with MyProcPort being present in > >> emit_log_hook)? This way, we don't need a new hook. Thoughts? > > Bertrand and Other Hackers, above comment may have been lost in the > > wild - any thoughts on it? > > Thanks for your feedback! > > I can see 2 issues with that approach: > > - We’ll not be able to track the “startup timeout case” (well, we may > not be able to track it anyway depending of what next to [1] will be) as > it does not emit any log messages. > > [1]: > https://www.postgresql.org/message-id/a1558d12-c1c4-0fe5-f8a5-2b6c2294e55f%40amazon.com Yes, we wanted to be very quick in StartupPacketTimeoutHandler because it is a timeout signal handler after all. > - We’ll depend of the log_min_messages value (means > edata->output_to_server needs to be true for the emit_log_hook to be > triggered). Hm, we can just say that 'log_min_message setting will enable/disable the feature'. I agree with your first point of not having an error in StartupPacketTimeoutHandler hence I don't think using emit log hook for the connection failure stats helps. -- Bharath Rupireddy RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/
On Mon, Aug 8, 2022 at 3:51 AM Drouvot, Bertrand <bdrouvot@amazon.com> wrote: > Please find attached v2-0004-connection_hooks.patch /* * Stop here if it was bad or a cancel packet. ProcessStartupPacket * already did any appropriate error reporting. */ if (status != STATUS_OK) + { +#ifndef EXEC_BACKEND + if (FailedConnection_hook) + (*FailedConnection_hook) (FCET_BAD_STARTUP_PACKET, port); +#endif proc_exit(0); + } Per the comment above the if condition, the `status != OK` may represent a cancel packet, as well. Clearly, a cancel packet is not the same as a _bad_ packet. So I think here you need to differentiate between a cancel packet and a genuinely bad packet; I don't see anything good coming good out of us, or the hook-developer, lumping those 2 cases together. I think we can reduce the number of places the hook is called, if we call the hook from proc_exit(), and all the other places we simply set a global variable to signify the reason for the failure. The case of _exit(1) from the signal-handler cannot use such a mechanism, but I think all the other cases of interest can simply register one of the FCET_* value, and the hook call from proc_exit() can pass that value to the hook. If we can convinces ourselves that we can use proc_exit(1) in StartupPacketTimeoutHandler(), instead of calling _exit(1), I think we cal eliminate replace all call sites for this hook with set-global-variable variant. > ... > * This should be the only function to call exit(). > * -cim 2/6/90 >... > proc_exit(int code) The comment on proc_exit() claims that should be the only place calling exit(), except the add-on/extension hooks. So there must be a strong reason why the signal-handler uses _exit() to bypass all callbacks. Best regards, Gurjeet http://Gurje.et
(reposting the same review, with many grammatical fixes) On Mon, Aug 8, 2022 at 3:51 AM Drouvot, Bertrand <bdrouvot@amazon.com> wrote: > Please find attached v2-0004-connection_hooks.patch /* * Stop here if it was bad or a cancel packet. ProcessStartupPacket * already did any appropriate error reporting. */ if (status != STATUS_OK) + { +#ifndef EXEC_BACKEND + if (FailedConnection_hook) + (*FailedConnection_hook) (FCET_BAD_STARTUP_PACKET, port); +#endif proc_exit(0); + } Per the comment above the if condition, the `status != OK` may represent a cancel packet, as well. Clearly, a cancel packet is not the same as a _bad_ packet. So I think here you need to differentiate between a cancel packet and a genuinely bad packet; I don't see anything good coming out of us, or the hook-developer, lumping those 2 cases together. I think we can reduce the number of places the hook is called, if we call the hook from proc_exit(), and at all the other places we simply set a global variable to signify the reason for the failure. The case of _exit(1) from the signal-handler cannot use such a mechanism, but I think all the other cases of interest can simply register one of the FCET_* values, and let the call from proc_exit() pass that value to the hook. If we can convince ourselves that we can use proc_exit(1) in StartupPacketTimeoutHandler(), instead of calling _exit(1), I think we cal replace all call sites for this hook with the set-global-variable variant. > ... > * This should be the only function to call exit(). > * -cim 2/6/90 >... > proc_exit(int code) The comment on proc_exit() claims that it should be the only place calling exit(), except that the add-on/extension hooks may ignore this. So there must be a strong reason why the signal-handler uses _exit() to bypass all callbacks. Best regards, Gurjeet http://Gurje.et
Hi, On 8/14/22 7:52 AM, Gurjeet Singh wrote: > On Mon, Aug 8, 2022 at 3:51 AM Drouvot, Bertrand <bdrouvot@amazon.com> wrote: >> Please find attached v2-0004-connection_hooks.patch > /* > * Stop here if it was bad or a cancel packet. ProcessStartupPacket > * already did any appropriate error reporting. > */ > if (status != STATUS_OK) > + { > +#ifndef EXEC_BACKEND > + if (FailedConnection_hook) > + (*FailedConnection_hook) (FCET_BAD_STARTUP_PACKET, port); > +#endif > proc_exit(0); > + } > > Per the comment above the if condition, the `status != OK` may > represent a cancel packet, as well. Clearly, a cancel packet is not > the same as a _bad_ packet. So I think here you need to differentiate > between a cancel packet and a genuinely bad packet; I don't see > anything good coming out of us, or the hook-developer, lumping > those 2 cases together. Thanks for the feedback! Yeah, good point. I agree that it would be better to make the distinction. > I think we can reduce the number of places the hook is called, if we > call the hook from proc_exit(), and at all the other places we simply set > a global variable to signify the reason for the failure. The case of > _exit(1) from the signal-handler cannot use such a mechanism, but I > think all the other cases of interest can simply register one of the > FCET_* values, and let the call from proc_exit() pass that value > to the hook. That looks like a good idea to me. I'm tempted to rewrite the patch that way (and addressing the first comment in the same time). Curious to hear about others hackers thoughts too. > If we can convince ourselves that we can use proc_exit(1) in > StartupPacketTimeoutHandler(), instead of calling _exit(1), I think we > cal replace all call sites for this hook with the > set-global-variable variant. I can see why this is not safe in the EXEC_BACKEND case, but it's not clear to me why it would not be safe in the non EXEC_BACKEND case. >> ... >> * This should be the only function to call exit(). >> * -cim 2/6/90 >> ... >> proc_exit(int code) > The comment on proc_exit() claims that it should be the only place > calling exit(), except that the add-on/extension hooks may ignore this. > So there must be a strong reason why the signal-handler uses _exit() > to bypass all callbacks. yeah. Regards, -- Bertrand Drouvot Amazon Web Services: https://aws.amazon.com
On Tue, Aug 16, 2022 at 1:31 PM Drouvot, Bertrand <bdrouvot@amazon.com> wrote: > > On 8/14/22 7:52 AM, Gurjeet Singh wrote: > > On Mon, Aug 8, 2022 at 3:51 AM Drouvot, Bertrand <bdrouvot@amazon.com> wrote: > > > I think we can reduce the number of places the hook is called, if we > > call the hook from proc_exit(), and at all the other places we simply set > > a global variable to signify the reason for the failure. The case of > > _exit(1) from the signal-handler cannot use such a mechanism, but I > > think all the other cases of interest can simply register one of the > > FCET_* values, and let the call from proc_exit() pass that value > > to the hook. > > That looks like a good idea to me. I'm tempted to rewrite the patch that > way (and addressing the first comment in the same time). > > Curious to hear about others hackers thoughts too. IMO, calling the hook from proc_exit() is not a good design as proc_exit() is a generic code called from many places in the source code, even the simple code of kind if(call_failed_conn_hook) { falied_conn_hook(params);} can come in the way of many exit code paths which is undesirable, and the likelihood of introducing new bugs may increase. -- Bharath Rupireddy RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/
Hi, On 8/16/22 10:10 AM, Bharath Rupireddy wrote: > On Tue, Aug 16, 2022 at 1:31 PM Drouvot, Bertrand <bdrouvot@amazon.com> wrote: >> On 8/14/22 7:52 AM, Gurjeet Singh wrote: >>> On Mon, Aug 8, 2022 at 3:51 AM Drouvot, Bertrand <bdrouvot@amazon.com> wrote: >>> I think we can reduce the number of places the hook is called, if we >>> call the hook from proc_exit(), and at all the other places we simply set >>> a global variable to signify the reason for the failure. The case of >>> _exit(1) from the signal-handler cannot use such a mechanism, but I >>> think all the other cases of interest can simply register one of the >>> FCET_* values, and let the call from proc_exit() pass that value >>> to the hook. >> That looks like a good idea to me. I'm tempted to rewrite the patch that >> way (and addressing the first comment in the same time). >> >> Curious to hear about others hackers thoughts too. > IMO, calling the hook from proc_exit() is not a good design as > proc_exit() is a generic code called from many places in the source > code, even the simple code of kind if(call_failed_conn_hook) { > falied_conn_hook(params);} can come in the way of many exit code paths > which is undesirable, and the likelihood of introducing new bugs may > increase. Thanks for the feedback. What do you think about calling the hook only if the new global variable is not equal to its default value (which would mean don't trigger the hook)? Regards, -- Bertrand Drouvot Amazon Web Services: https://aws.amazon.com
On Tue, Aug 16, 2022 at 1:55 PM Drouvot, Bertrand <bdrouvot@amazon.com> wrote: > > Hi, > > On 8/16/22 10:10 AM, Bharath Rupireddy wrote: > > On Tue, Aug 16, 2022 at 1:31 PM Drouvot, Bertrand <bdrouvot@amazon.com> wrote: > >> On 8/14/22 7:52 AM, Gurjeet Singh wrote: > >>> On Mon, Aug 8, 2022 at 3:51 AM Drouvot, Bertrand <bdrouvot@amazon.com> wrote: > >>> I think we can reduce the number of places the hook is called, if we > >>> call the hook from proc_exit(), and at all the other places we simply set > >>> a global variable to signify the reason for the failure. The case of > >>> _exit(1) from the signal-handler cannot use such a mechanism, but I > >>> think all the other cases of interest can simply register one of the > >>> FCET_* values, and let the call from proc_exit() pass that value > >>> to the hook. > >> That looks like a good idea to me. I'm tempted to rewrite the patch that > >> way (and addressing the first comment in the same time). > >> > >> Curious to hear about others hackers thoughts too. > > IMO, calling the hook from proc_exit() is not a good design as > > proc_exit() is a generic code called from many places in the source > > code, even the simple code of kind if(call_failed_conn_hook) { > > falied_conn_hook(params);} can come in the way of many exit code paths > > which is undesirable, and the likelihood of introducing new bugs may > > increase. > > Thanks for the feedback. > > What do you think about calling the hook only if the new global variable > is not equal to its default value (which would mean don't trigger the > hook)? IMO, that's not a good design as explained above. Why should the failed connection hook related code get hit for each and every proc_exit() call? Here, the code duplication i.e. the number of places the failed connection hook gets called mustn't be the reason to move that code to proc_exit(). -- Bharath Rupireddy RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/
On Tue, Aug 16, 2022 at 3:16 AM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > On Tue, Aug 16, 2022 at 1:55 PM Drouvot, Bertrand <bdrouvot@amazon.com> wrote: > > > > Hi, > > > > On 8/16/22 10:10 AM, Bharath Rupireddy wrote: > > > On Tue, Aug 16, 2022 at 1:31 PM Drouvot, Bertrand <bdrouvot@amazon.com> wrote: > > >> On 8/14/22 7:52 AM, Gurjeet Singh wrote: > > >>> On Mon, Aug 8, 2022 at 3:51 AM Drouvot, Bertrand <bdrouvot@amazon.com> wrote: > > >>> I think we can reduce the number of places the hook is called, if we > > >>> call the hook from proc_exit(), and at all the other places we simply set > > >>> a global variable to signify the reason for the failure. The case of > > >>> _exit(1) from the signal-handler cannot use such a mechanism, but I > > >>> think all the other cases of interest can simply register one of the > > >>> FCET_* values, and let the call from proc_exit() pass that value > > >>> to the hook. > > >> That looks like a good idea to me. I'm tempted to rewrite the patch that > > >> way (and addressing the first comment in the same time). > > >> > > >> Curious to hear about others hackers thoughts too. I agree that we need feedback from long-timers here, on the decision of whether to use proc_exit() for this purpose. > > > IMO, calling the hook from proc_exit() is not a good design as > > > proc_exit() is a generic code called from many places in the source > > > code, even the simple code of kind if(call_failed_conn_hook) { > > > falied_conn_hook(params);} can come in the way of many exit code paths > > > which is undesirable, and the likelihood of introducing new bugs may > > > increase. > > > > Thanks for the feedback. > > > > What do you think about calling the hook only if the new global variable > > is not equal to its default value (which would mean don't trigger the > > hook)? > > IMO, that's not a good design as explained above. Why should the > failed connection hook related code get hit for each and every > proc_exit() call? Here, the code duplication i.e. the number of places > the failed connection hook gets called mustn't be the reason to move > that code to proc_exit(). I agree, it doesn't feel _clean_, having to maintain a global variable, pass it to hook at exit, etc. But the alternative feels less cleaner. This hook needs to be called when the process has decided to exit, so it makes sense to place this call in stack above proc_exit(), whose sole job is to let the process die gracefully, and take care of things on the way out. There are quite a few places in core that leverage proc_exit()'s facilities (by registering on_proc_exit callbacks), so an extension/hook doing so wouldn't be out of the ordinary; (apparently contrib/sepgsql has already set the precedent on an extension using the on_proc_exit callback). Admittedly, in this case the core will be managing and passing it the additional global variable needed to record the connection failure reason, FCET_*. If we agree that proc_exit() is a good place to place this call, then this hook can be converted into a on_proc_exit callback. If the global variable is exported, then the extension(s) can access it in the callback to ascertain why the process is exiting, and proc_exit() won't have to know anything special about the extension, or hook, or the global variable. The on_proc_exit callback method wouldn't work for the _exit() called in StartupPacketTimeoutHandler(), so that will need to be handled separately. Best regards, Gurjeet http://Gurje.et
This looks like it was a good discussion -- last summer. But it doesn't seem to be a patch under active development now. It sounds like there were some design constraints that still need some new ideas to solve and a new patch will be needed to address them. Should this be marked Returned With Feedback? -- Gregory Stark As Commitfest Manager
Hi, On 4/4/23 12:08 AM, Gregory Stark (as CFM) wrote: > This looks like it was a good discussion -- last summer. But it > doesn't seem to be a patch under active development now. > > It sounds like there were some design constraints that still need some > new ideas to solve and a new patch will be needed to address them. > > Should this be marked Returned With Feedback? > I just marked it as Returned With Feedback. I may re-open it later on to resume the discussion or share new ideas though. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com