Re: Add Information during standby recovery conflicts - Mailing list pgsql-hackers
From | Masahiko Sawada |
---|---|
Subject | Re: Add Information during standby recovery conflicts |
Date | |
Msg-id | CA+fd4k6oBCoF+4sAgWqsC5+B5+EWQfT=FyPg9zVo4wh+zTXhUg@mail.gmail.com Whole thread Raw |
In response to | Add Information during standby recovery conflicts ("Drouvot, Bertrand" <bdrouvot@amazon.com>) |
Responses |
Re: Add Information during standby recovery conflicts
|
List | pgsql-hackers |
On Thu, 18 Jun 2020 at 16:28, Drouvot, Bertrand <bdrouvot@amazon.com> wrote: > > Hi hackers, > > I've attached a patch to add information during standby recovery conflicts. > The motive behind is to be able to get some information: > > On the apply side > On the blocker(s) side > > Motivation: > > When a standby recovery conflict occurs it could be useful to get more information to be able to dive deep on the rootcause and find a way to avoid/mitigate new occurrences. I think this is a good feature. Like log_lock_waits, it will help the users to investigate recovery conflict issues. > > Adding this information would make the investigations easier, it could help answering questions like: > > On which LSN was the WAL apply blocked? > What was the purpose of the bocked WAL record? > On which relation (if any) was the blocked WAL record related to? > What was the blocker(s) doing? > When did the blocker(s) started their queries (if any)? > What was the blocker(s) waiting for? on which wait event? > > Technical context and proposal: > > There is 2 points in this patch: > > Add the information about the blocked WAL record. This is done in standby.c (ResolveRecoveryConflictWithVirtualXIDs, ResolveRecoveryConflictWithDatabase,StandbyTimeoutHandler) I think we already have the information about the WAL record being applied in errcontext. I wonder if we can show the recovery conflict information in the main LOG message, the blocker information in errdetail, and use errcontext with regard to WAL record information. For example: LOG: process 500 waiting for recovery conflict on snapshot DETAIL: conflicting transition id: 100, 200, 300 CONTEXT: WAL redo at 0/3001970 for Heap2/CLEAN: remxid 506 > Outcome Example: > > 2020-06-15 06:48:23.774 UTC [6971] LOG: wal record apply is blocked by 2 connection(s), reason: User query might haveneeded to see row versions that must be removed. > 2020-06-15 06:48:23.774 UTC [6971] LOG: blocked wal record rmgr: Heap2, lsn: 0/038E2678, received at: 2020-06-15 06:48:23.774098+00,desc: CLEAN, relation: rel 1663/13586/16652 fork main blk 0 > 2020-06-15 06:48:54.773 UTC [7088] LOG: about to interrupt pid: 7088, backend_type: client backend, state: active, wait_event_type:Timeout, wait_event: PgSleep, query_start: 2020-06-15 06:48:14.87672+00 I'm concerned that the above information on the process who is about to be interrupted is very detailed but I'm not sure it will be helpful for the users. If the blocker is waiting on something lock, the information should be logged by log_lock_waits. Also the canceled backend will emit the ERROR log with the message "canceling statement due to conflict with recovery”, and its pid can be logged at the log prefix. The user can know who has been canceled by recovery conflicts and the query. Probably we need to consider having a time threshold (or boolean to turn on/off) to emit this information to the server logs. Otherwise, we will end up writing every conflict information, making the log dirty needlessly. Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
pgsql-hackers by date: