Thread: Add last failed connection error message to pg_stat_wal_receiver

Add last failed connection error message to pg_stat_wal_receiver

From
Bharath Rupireddy
Date:
Hi,

In production environments WAL receiver connection attempts to primary
may fail for many reasons (primary down, network is broken,
authentication tokens changes, primary_conn_info modifications, socket
errors and so on.). Although we emit the error message to server logs,
isn't it useful to show the last connection error message via
pg_stat_wal_receiver or pg_stat_get_wal_receiver? This will be super
helpful in production environments to analyse what the WAL receiver
issues as accessing and sifting through server logs can be quite
cumbersome for the end users.

Thoughts?

Attached patch can only display the last_conn_error only after the WAL
receiver is up, but it will be good to let pg_stat_wal_receiver emit
last_conn_error even before that. Imagine WAL receiver is continuously
failing on the standby, if we let pg_stat_wal_receiver report
last_conn_error, all other columns will show NULL. I can change this
way, if others are okay with it.

Regards,
Bharath Rupireddy.

Attachment

Re: Add last failed connection error message to pg_stat_wal_receiver

From
Cary Huang
Date:
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:       tested, passed
Spec compliant:           tested, passed
Documentation:            tested, passed

Hello

The patch can be applied to PG master branch without problem and it passed regression and tap tests. I manually tested
thisfeature too and the last conn error is correctly shown in the pg_stat_get_wal_receiver output, which does exactly
asdescribed. I think this feature is nice to have to troubleshoot replication issues on the standby side.
 

thank you

Cary Huang
----------------
Highgo Software Canada

Re: Add last failed connection error message to pg_stat_wal_receiver

From
Bharath Rupireddy
Date:
On Sat, Jul 23, 2022 at 2:29 AM Cary Huang <cary.huang@highgo.ca> wrote:
>
> The following review has been posted through the commitfest application:
> make installcheck-world:  tested, passed
> Implements feature:       tested, passed
> Spec compliant:           tested, passed
> Documentation:            tested, passed
>
> Hello
>
> The patch can be applied to PG master branch without problem and it passed regression and tap tests. I manually
testedthis feature too and the last conn error is correctly shown in the pg_stat_get_wal_receiver output, which does
exactlyas described. I think this feature is nice to have to troubleshoot replication issues on the standby side. 

Thanks a lot Cary for reviewing. It will be great if you can add
yourself as a reviewer and set the status accordingly in the CF entry
here - https://commitfest.postgresql.org/38/3666/.

Regards,
Bharath Rupireddy.



Re: Add last failed connection error message to pg_stat_wal_receiver

From
Michael Paquier
Date:
On Mon, Jul 25, 2022 at 12:19:40PM +0530, Bharath Rupireddy wrote:
> Thanks a lot Cary for reviewing. It will be great if you can add
> yourself as a reviewer and set the status accordingly in the CF entry
> here - https://commitfest.postgresql.org/38/3666/.

Hmm.  This stands for the connection error, but there are other things
that could cause a failure down the road, like an incorrect system
ID or a TLI-related report, so that seems a bit limited to me?
--
Michael

Attachment

Re: Add last failed connection error message to pg_stat_wal_receiver

From
Bharath Rupireddy
Date:
On Mon, Jul 25, 2022 at 2:40 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Mon, Jul 25, 2022 at 12:19:40PM +0530, Bharath Rupireddy wrote:
> > Thanks a lot Cary for reviewing. It will be great if you can add
> > yourself as a reviewer and set the status accordingly in the CF entry
> > here - https://commitfest.postgresql.org/38/3666/.
>
> Hmm.  This stands for the connection error, but there are other things
> that could cause a failure down the road, like an incorrect system
> ID or a TLI-related report, so that seems a bit limited to me?

Good point. The walreceiver can exit for any reason. We can either 1)
store for all the error messages or 2) think of using sigsetjmp but
that only catches the ERROR kinds, leaving FATAL and PANIC messages.
The option (1) is simple but there are problems - we may miss storing
future error messages, good commenting and reviewing may help here and
all the error messages now need to be stored in string, which is
complex. The option (2) seems reasonable but we will miss FATAL and
PANIC messages (we have many ERRORs, 2 FATALs, 3 PANICs). Maybe a
combination of option (1) for FATALs and PANICs, and option (2) for
ERRORs helps.

Thoughts?

--
Bharath Rupireddy
RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/



Re: Add last failed connection error message to pg_stat_wal_receiver

From
Michael Paquier
Date:
On Thu, Aug 04, 2022 at 03:27:11PM +0530, Bharath Rupireddy wrote:
> Good point. The walreceiver can exit for any reason. We can either 1)
> store for all the error messages or 2) think of using sigsetjmp but
> that only catches the ERROR kinds, leaving FATAL and PANIC messages.
> The option (1) is simple but there are problems - we may miss storing
> future error messages, good commenting and reviewing may help here and
> all the error messages now need to be stored in string, which is
> complex. The option (2) seems reasonable but we will miss FATAL and
> PANIC messages (we have many ERRORs, 2 FATALs, 3 PANICs). Maybe a
> combination of option (1) for FATALs and PANICs, and option (2) for
> ERRORs helps.
>
> Thoughts?

PANIC is not something you'd care about as the system would go down as
and shared memory would be reset (right?) even if restart_on_crash is
enabled.  Perhaps it would help here to use something like a macro to
catch and save the error, in a style similar to what's in hba.c for
example, which is the closest example I can think of, even if on ERROR
we don't really care about the error string anyway as there is nothing
to report back to the SQL views used for the HBA/ident files.

FATAL may prove to be tricky though, because I'd expect the error to
be saved in shared memory in this case.  This is particularly critical
as this takes the WAL receiver process down, actually.

Anyway, outside the potential scope of the proposal, there are more
things that I find strange with the code:
- Why isn't the string reset when the WAL receiver is starting up?
That surely is not OK to keep a past state not referring to what
actually happens with a receiver currently running.
- pg_stat_wal_receiver (system view) reports no rows if pid is NULL,
which would be the state stored in shared memory after a connection.
This means that one would never be able to see last_conn_error except
when calling directly the SQL function pg_stat_get_wal_receiver().

One could say that we should report a row for this view all the time,
but this creates a compatibility breakage: existing application
assuming something like (one row <=> WAL receiver running) could
break.
--
Michael

Attachment

Re: Add last failed connection error message to pg_stat_wal_receiver

From
Bharath Rupireddy
Date:
On Thu, Aug 18, 2022 at 9:01 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> PANIC is not something you'd care about as the system would go down as
> and shared memory would be reset (right?) even if restart_on_crash is
> enabled.  Perhaps it would help here to use something like a macro to
> catch and save the error, in a style similar to what's in hba.c for
> example, which is the closest example I can think of, even if on ERROR
> we don't really care about the error string anyway as there is nothing
> to report back to the SQL views used for the HBA/ident files.
>
> FATAL may prove to be tricky though, because I'd expect the error to
> be saved in shared memory in this case.  This is particularly critical
> as this takes the WAL receiver process down, actually.

Hm, we can use error callbacks or pg try/catch blocks to save the
error message into walreceiver shared memory.

> Anyway, outside the potential scope of the proposal, there are more
> things that I find strange with the code:
> - Why isn't the string reset when the WAL receiver is starting up?
> That surely is not OK to keep a past state not referring to what
> actually happens with a receiver currently running.

I agree that it's not a good way to show some past failure state when
things are fine currently. Would naming the column name as
last_connectivity_error or something better and describing it in the
docs clearly help here?

Otherwise, we can have another simple function that just returns the
last connection failure of walreceiver and if required PID.

> - pg_stat_wal_receiver (system view) reports no rows if pid is NULL,
> which would be the state stored in shared memory after a connection.
> This means that one would never be able to see last_conn_error except
> when calling directly the SQL function pg_stat_get_wal_receiver().
>
> One could say that we should report a row for this view all the time,
> but this creates a compatibility breakage: existing application
> assuming something like (one row <=> WAL receiver running) could
> break.

-1.

We can think of having a separate infrastructure for reporting all
backend or process specific errors similar to pg_stat_activity and
pg_stat_get_activity, but that needs some shared memory and all of
that - IMO, it's an overkill.

I'm fine to withdraw this thread, if none of the above thoughts is
sensible enough to pursue further.

Thoughts?

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com