Re: Add Information during standby recovery conflicts - Mailing list pgsql-hackers

From Masahiko Sawada
Subject Re: Add Information during standby recovery conflicts
Date
Msg-id CA+fd4k6oBCoF+4sAgWqsC5+B5+EWQfT=FyPg9zVo4wh+zTXhUg@mail.gmail.com
Whole thread Raw
In response to Add Information during standby recovery conflicts  ("Drouvot, Bertrand" <bdrouvot@amazon.com>)
Responses Re: Add Information during standby recovery conflicts
List pgsql-hackers
On Thu, 18 Jun 2020 at 16:28, Drouvot, Bertrand <bdrouvot@amazon.com> wrote:
>
> Hi hackers,
>
> I've attached a patch to add information during standby recovery conflicts.
> The motive behind is to be able to get some information:
>
> On the apply side
> On the blocker(s) side
>
> Motivation:
>
> When a standby recovery conflict occurs it could be useful to get more information to be able to dive deep on the
rootcause and find a way to avoid/mitigate new occurrences. 

I think this is a good feature. Like log_lock_waits, it will help the
users to investigate recovery conflict issues.

>
> Adding this information would make the investigations easier, it could help answering questions like:
>
> On which LSN was the WAL apply blocked?
> What was the purpose of the bocked WAL record?
> On which relation (if any) was the blocked WAL record related to?
> What was the blocker(s) doing?
> When did the blocker(s) started their queries (if any)?
> What was the blocker(s) waiting for? on which wait event?
>
> Technical context and proposal:
>
> There is 2 points in this patch:
>
> Add the information about the blocked WAL record. This is done in standby.c (ResolveRecoveryConflictWithVirtualXIDs,
ResolveRecoveryConflictWithDatabase,StandbyTimeoutHandler) 

I think we already have the information about the WAL record being
applied in errcontext.

I wonder if we can show the recovery conflict information in the main
LOG message, the blocker information in errdetail, and use errcontext
with regard to WAL record information. For example:

LOG:  process 500 waiting for recovery conflict on snapshot
DETAIL:  conflicting transition id: 100, 200, 300
CONTEXT:  WAL redo at 0/3001970 for Heap2/CLEAN: remxid 506

> Outcome Example:
>
> 2020-06-15 06:48:23.774 UTC [6971] LOG:  wal record apply is blocked by 2 connection(s), reason: User query might
haveneeded to see row versions that must be removed. 
> 2020-06-15 06:48:23.774 UTC [6971] LOG:  blocked wal record rmgr: Heap2, lsn: 0/038E2678, received at: 2020-06-15
06:48:23.774098+00,desc: CLEAN, relation: rel 1663/13586/16652 fork main blk 0 
> 2020-06-15 06:48:54.773 UTC [7088] LOG:  about to interrupt pid: 7088, backend_type: client backend, state: active,
wait_event_type:Timeout, wait_event: PgSleep, query_start: 2020-06-15 06:48:14.87672+00 

I'm concerned that the above information on the process who is about
to be interrupted is very detailed but I'm not sure it will be helpful
for the users. If the blocker is waiting on something lock, the
information should be logged by log_lock_waits. Also the canceled
backend will emit the ERROR log with the message "canceling statement
due to conflict with recovery”, and its pid can be logged at the log
prefix. The user can know who has been canceled by recovery conflicts
and the query.

Probably we need to consider having a time threshold (or boolean to
turn on/off) to emit this information to the server logs. Otherwise,
we will end up writing every conflict information, making the log
dirty needlessly.

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



pgsql-hackers by date:

Previous
From: Bharath Rupireddy
Date:
Subject: Re: [PATCH] postgres_fdw connection caching - cause remote sessionslinger till the local session exit
Next
From: Fujii Masao
Date:
Subject: Re: track_planning causing performance regression