Thread: Add last failed connection error message to pg_stat_wal_receiver
Hi, In production environments WAL receiver connection attempts to primary may fail for many reasons (primary down, network is broken, authentication tokens changes, primary_conn_info modifications, socket errors and so on.). Although we emit the error message to server logs, isn't it useful to show the last connection error message via pg_stat_wal_receiver or pg_stat_get_wal_receiver? This will be super helpful in production environments to analyse what the WAL receiver issues as accessing and sifting through server logs can be quite cumbersome for the end users. Thoughts? Attached patch can only display the last_conn_error only after the WAL receiver is up, but it will be good to let pg_stat_wal_receiver emit last_conn_error even before that. Imagine WAL receiver is continuously failing on the standby, if we let pg_stat_wal_receiver report last_conn_error, all other columns will show NULL. I can change this way, if others are okay with it. Regards, Bharath Rupireddy.
Attachment
The following review has been posted through the commitfest application: make installcheck-world: tested, passed Implements feature: tested, passed Spec compliant: tested, passed Documentation: tested, passed Hello The patch can be applied to PG master branch without problem and it passed regression and tap tests. I manually tested thisfeature too and the last conn error is correctly shown in the pg_stat_get_wal_receiver output, which does exactly asdescribed. I think this feature is nice to have to troubleshoot replication issues on the standby side. thank you Cary Huang ---------------- Highgo Software Canada
On Sat, Jul 23, 2022 at 2:29 AM Cary Huang <cary.huang@highgo.ca> wrote: > > The following review has been posted through the commitfest application: > make installcheck-world: tested, passed > Implements feature: tested, passed > Spec compliant: tested, passed > Documentation: tested, passed > > Hello > > The patch can be applied to PG master branch without problem and it passed regression and tap tests. I manually testedthis feature too and the last conn error is correctly shown in the pg_stat_get_wal_receiver output, which does exactlyas described. I think this feature is nice to have to troubleshoot replication issues on the standby side. Thanks a lot Cary for reviewing. It will be great if you can add yourself as a reviewer and set the status accordingly in the CF entry here - https://commitfest.postgresql.org/38/3666/. Regards, Bharath Rupireddy.
On Mon, Jul 25, 2022 at 12:19:40PM +0530, Bharath Rupireddy wrote: > Thanks a lot Cary for reviewing. It will be great if you can add > yourself as a reviewer and set the status accordingly in the CF entry > here - https://commitfest.postgresql.org/38/3666/. Hmm. This stands for the connection error, but there are other things that could cause a failure down the road, like an incorrect system ID or a TLI-related report, so that seems a bit limited to me? -- Michael
Attachment
On Mon, Jul 25, 2022 at 2:40 PM Michael Paquier <michael@paquier.xyz> wrote: > > On Mon, Jul 25, 2022 at 12:19:40PM +0530, Bharath Rupireddy wrote: > > Thanks a lot Cary for reviewing. It will be great if you can add > > yourself as a reviewer and set the status accordingly in the CF entry > > here - https://commitfest.postgresql.org/38/3666/. > > Hmm. This stands for the connection error, but there are other things > that could cause a failure down the road, like an incorrect system > ID or a TLI-related report, so that seems a bit limited to me? Good point. The walreceiver can exit for any reason. We can either 1) store for all the error messages or 2) think of using sigsetjmp but that only catches the ERROR kinds, leaving FATAL and PANIC messages. The option (1) is simple but there are problems - we may miss storing future error messages, good commenting and reviewing may help here and all the error messages now need to be stored in string, which is complex. The option (2) seems reasonable but we will miss FATAL and PANIC messages (we have many ERRORs, 2 FATALs, 3 PANICs). Maybe a combination of option (1) for FATALs and PANICs, and option (2) for ERRORs helps. Thoughts? -- Bharath Rupireddy RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/
On Thu, Aug 04, 2022 at 03:27:11PM +0530, Bharath Rupireddy wrote: > Good point. The walreceiver can exit for any reason. We can either 1) > store for all the error messages or 2) think of using sigsetjmp but > that only catches the ERROR kinds, leaving FATAL and PANIC messages. > The option (1) is simple but there are problems - we may miss storing > future error messages, good commenting and reviewing may help here and > all the error messages now need to be stored in string, which is > complex. The option (2) seems reasonable but we will miss FATAL and > PANIC messages (we have many ERRORs, 2 FATALs, 3 PANICs). Maybe a > combination of option (1) for FATALs and PANICs, and option (2) for > ERRORs helps. > > Thoughts? PANIC is not something you'd care about as the system would go down as and shared memory would be reset (right?) even if restart_on_crash is enabled. Perhaps it would help here to use something like a macro to catch and save the error, in a style similar to what's in hba.c for example, which is the closest example I can think of, even if on ERROR we don't really care about the error string anyway as there is nothing to report back to the SQL views used for the HBA/ident files. FATAL may prove to be tricky though, because I'd expect the error to be saved in shared memory in this case. This is particularly critical as this takes the WAL receiver process down, actually. Anyway, outside the potential scope of the proposal, there are more things that I find strange with the code: - Why isn't the string reset when the WAL receiver is starting up? That surely is not OK to keep a past state not referring to what actually happens with a receiver currently running. - pg_stat_wal_receiver (system view) reports no rows if pid is NULL, which would be the state stored in shared memory after a connection. This means that one would never be able to see last_conn_error except when calling directly the SQL function pg_stat_get_wal_receiver(). One could say that we should report a row for this view all the time, but this creates a compatibility breakage: existing application assuming something like (one row <=> WAL receiver running) could break. -- Michael
Attachment
On Thu, Aug 18, 2022 at 9:01 AM Michael Paquier <michael@paquier.xyz> wrote: > > PANIC is not something you'd care about as the system would go down as > and shared memory would be reset (right?) even if restart_on_crash is > enabled. Perhaps it would help here to use something like a macro to > catch and save the error, in a style similar to what's in hba.c for > example, which is the closest example I can think of, even if on ERROR > we don't really care about the error string anyway as there is nothing > to report back to the SQL views used for the HBA/ident files. > > FATAL may prove to be tricky though, because I'd expect the error to > be saved in shared memory in this case. This is particularly critical > as this takes the WAL receiver process down, actually. Hm, we can use error callbacks or pg try/catch blocks to save the error message into walreceiver shared memory. > Anyway, outside the potential scope of the proposal, there are more > things that I find strange with the code: > - Why isn't the string reset when the WAL receiver is starting up? > That surely is not OK to keep a past state not referring to what > actually happens with a receiver currently running. I agree that it's not a good way to show some past failure state when things are fine currently. Would naming the column name as last_connectivity_error or something better and describing it in the docs clearly help here? Otherwise, we can have another simple function that just returns the last connection failure of walreceiver and if required PID. > - pg_stat_wal_receiver (system view) reports no rows if pid is NULL, > which would be the state stored in shared memory after a connection. > This means that one would never be able to see last_conn_error except > when calling directly the SQL function pg_stat_get_wal_receiver(). > > One could say that we should report a row for this view all the time, > but this creates a compatibility breakage: existing application > assuming something like (one row <=> WAL receiver running) could > break. -1. We can think of having a separate infrastructure for reporting all backend or process specific errors similar to pg_stat_activity and pg_stat_get_activity, but that needs some shared memory and all of that - IMO, it's an overkill. I'm fine to withdraw this thread, if none of the above thoughts is sensible enough to pursue further. Thoughts? -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com