Re: Add Information during standby recovery conflicts - Mailing list pgsql-hackers
From | Drouvot, Bertrand |
---|---|
Subject | Re: Add Information during standby recovery conflicts |
Date | |
Msg-id | daeecc9d-567a-cad8-298a-6f12ed7cb665@amazon.com Whole thread Raw |
In response to | Re: Add Information during standby recovery conflicts (Masahiko Sawada <masahiko.sawada@2ndquadrant.com>) |
Responses |
Re: Add Information during standby recovery conflicts
|
List | pgsql-hackers |
Hey, On 6/29/20 10:55 AM, Masahiko Sawada wrote: > CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you canconfirm the sender and know the content is safe. > > > > On Thu, 18 Jun 2020 at 16:28, Drouvot, Bertrand <bdrouvot@amazon.com> wrote: >> Hi hackers, >> >> I've attached a patch to add information during standby recovery conflicts. >> The motive behind is to be able to get some information: >> >> On the apply side >> On the blocker(s) side >> >> Motivation: >> >> When a standby recovery conflict occurs it could be useful to get more information to be able to dive deep on the rootcause and find a way to avoid/mitigate new occurrences. > I think this is a good feature. Like log_lock_waits, it will help the > users to investigate recovery conflict issues. exactly, thanks for looking at it. > >> Adding this information would make the investigations easier, it could help answering questions like: >> >> On which LSN was the WAL apply blocked? >> What was the purpose of the bocked WAL record? >> On which relation (if any) was the blocked WAL record related to? >> What was the blocker(s) doing? >> When did the blocker(s) started their queries (if any)? >> What was the blocker(s) waiting for? on which wait event? >> >> Technical context and proposal: >> >> There is 2 points in this patch: >> >> Add the information about the blocked WAL record. This is done in standby.c (ResolveRecoveryConflictWithVirtualXIDs, ResolveRecoveryConflictWithDatabase,StandbyTimeoutHandler) > I think we already have the information about the WAL record being > applied in errcontext. right, but it won’t be displayed in case log_error_verbosity is set to terse. Or did you had in mind to try to avoid using the new “current_replayed_xlog” in xlog.c? > > I wonder if we can show the recovery conflict information in the main > LOG message, the blocker information in errdetail, and use errcontext > with regard to WAL record information. For example: > > LOG: process 500 waiting for recovery conflict on snapshot > DETAIL: conflicting transition id: 100, 200, 300 > CONTEXT: WAL redo at 0/3001970 for Heap2/CLEAN: remxid 506 Not sure at all if it would be possible to put all the information at the same time. For example in case of shared buffer pin conflict, the blocker is currently known “only” during the RecoveryConflictInterrupt call (so still not known yet when we can “already” report the blocked LSN information). It might also happen that the blocker(s) will never get interrupted (was not blocking anymore before max_standby_streaming_delay has been reached): then it would not be possible to display all the information here (aka when it is interrupted) while we still want to be warned that the WAL replay has been blocked. > >> Outcome Example: >> >> 2020-06-15 06:48:23.774 UTC [6971] LOG: wal record apply is blocked by 2 connection(s), reason: User query might haveneeded to see row versions that must be removed. >> 2020-06-15 06:48:23.774 UTC [6971] LOG: blocked wal record rmgr: Heap2, lsn: 0/038E2678, received at: 2020-06-15 06:48:23.774098+00,desc: CLEAN, relation: rel 1663/13586/16652 fork main blk 0 >> 2020-06-15 06:48:54.773 UTC [7088] LOG: about to interrupt pid: 7088, backend_type: client backend, state: active, wait_event_type:Timeout, wait_event: PgSleep, query_start: 2020-06-15 06:48:14.87672+00 > I'm concerned that the above information on the process who is about > to be interrupted is very detailed but I'm not sure it will be helpful > for the users. If the blocker is waiting on something lock, the > information should be logged by log_lock_waits. The blocker could also be in “idle in transaction”, say for example, on the standby (with hot_standby_feedback set to off): standby> begin; standby> SET TRANSACTION ISOLATION LEVEL REPEATABLE READ; standby> select * from bdt; on the master: master> update bdt set generate_series = 15; master> vacuum verbose bdt; would produce: 2020-07-01 09:18:55.256 UTC [32751] LOG: about to interrupt pid: 32751, backend_type: client backend, state: idle in transaction, wait_event_type: Client, wait_event: ClientRead, query_start: 2020-07-01 09:18:17.390572+00 2020-07-01 09:18:55.256 UTC [32751] DETAIL: statement: select * from bdt; I think those information are useful to have (might get rid of wait_event_type though: done in the new attached patch). > Also the canceled > backend will emit the ERROR log with the message "canceling statement > due to conflict with recovery”, and its pid can be logged at the log > prefix. The user can know who has been canceled by recovery conflicts > and the query. Right, we can also get rid of the pid and the query in the new log message too (done in the new attached patch). > Probably we need to consider having a time threshold (or boolean to > turn on/off) to emit this information to the server logs. Otherwise, > we will end up writing every conflict information, making the log > dirty needlessly. good point, I do agree (done in the new attached patch). Bertrand > > Regards, > > -- > Masahiko Sawada http://www.2ndQuadrant.com/ > PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
pgsql-hackers by date: