Thread: log messages for archive recovery progress
log messages for archive recovery progress
From
"Satoshi Nagayasu / Uptime Technologies, LLC."
Date:
Hi, When I look into archive recovery deeply as DBA point of view, I found that it's difficult to know (1) when the recovery process switched reading WAL segments files from archive directory to pg_xlog directory, and (2) whether it succeeded applying the latest WAL segments in the pg_xlog directory. For example, when I had "47" WAL segment in the archive directory and "48" WAL segment in the pg_xlog directory, PostgreSQL said: > [2011-12-08 05:59:03 JST] 9003: LOG: restored log file "000000080000000000000046" from archive > [2011-12-08 05:59:03 JST] 9003: LOG: restored log file "000000080000000000000047" from archive > cp: cannot stat `/backups/archlog/000000080000000000000048': No such file or directory > [2011-12-08 05:59:03 JST] 9003: LOG: record with zero length at 0/489F8B74 > [2011-12-08 05:59:03 JST] 9003: LOG: redo done at 0/489F8B38 > [2011-12-08 05:59:03 JST] 9003: LOG: last completed transaction was at log time 2011-12-08 05:52:01.507063+09 > cp: cannot stat `/backups/archlog/00000009.history': No such file or directory > [2011-12-08 05:59:03 JST] 9003: LOG: selected new timeline ID: 9 > [2011-12-08 05:59:03 JST] 9003: LOG: restored log file "00000008.history" from archive > [2011-12-08 05:59:04 JST] 9003: LOG: archive recovery complete I felt it's difficult to determine whether PostgreSQL applied "48" WAL segment in the pg_xlog, or not. So, I want to propose new log messages to show archive log progress as reading WAL segments. With applying my tiny modification, the recovery process reports its progress for each WAL segment file, and also tells switching reading archive directory to pg_xlog directory explicitly as shown below. > [2011-12-08 15:14:36 JST] 16758: LOG: restored log file "000000080000000000000046" from archive > [2011-12-08 15:14:36 JST] 16758: LOG: recoverying 000000080000000000000046 > [2011-12-08 15:14:36 JST] 16758: LOG: restored log file "000000080000000000000047" from archive > [2011-12-08 15:14:36 JST] 16758: LOG: recoverying 000000080000000000000047 > cp: cannot stat `/backups/archlog/000000080000000000000048': No such file or directory > [2011-12-08 15:14:37 JST] 16758: LOG: could not restore file "000000080000000000000048" from archive > [2011-12-08 15:14:37 JST] 16758: LOG: attempting to look into pg_xlog > [2011-12-08 15:14:37 JST] 16758: LOG: recoverying 000000080000000000000048 > [2011-12-08 15:14:37 JST] 16758: LOG: record with zero length at 0/489F8B74 > [2011-12-08 15:14:37 JST] 16758: LOG: redo done at 0/489F8B38 > [2011-12-08 15:14:37 JST] 16758: LOG: last completed transaction was at log time 2011-12-08 05:52:01.507063+09 > cp: cannot stat `/backups/archlog/00000009.history': No such file or directory > [2011-12-08 15:14:37 JST] 16758: LOG: selected new timeline ID: 9 > [2011-12-08 15:14:37 JST] 16758: LOG: restored log file "00000008.history" from archive > [2011-12-08 15:14:38 JST] 16758: LOG: archive recovery complete Do you think this is useful? Thanks, -- Satoshi Nagayasu <snaga@uptime.jp> Uptime Technologies, LLC. http://www.uptime.jp
On 08-01-2012 11:59, Satoshi Nagayasu / Uptime Technologies, LLC. wrote: >> [2011-12-08 15:14:36 JST] 16758: LOG: restored log file "000000080000000000000046" from archive >> [2011-12-08 15:14:36 JST] 16758: LOG: recoverying 000000080000000000000046 >> [2011-12-08 15:14:36 JST] 16758: LOG: restored log file "000000080000000000000047" from archive >> [2011-12-08 15:14:36 JST] 16758: LOG: recoverying 000000080000000000000047 >> cp: cannot stat `/backups/archlog/000000080000000000000048': No such file or directory >> [2011-12-08 15:14:37 JST] 16758: LOG: could not restore file "000000080000000000000048" from archive >> [2011-12-08 15:14:37 JST] 16758: LOG: attempting to look into pg_xlog >> [2011-12-08 15:14:37 JST] 16758: LOG: recoverying 000000080000000000000048 > What about just 'restored log file "000000080000000000000048" from pg_xlog' instead of the last two messages? If you can't read from pg_xlog emit 'could not restore file "000000080000000000000048" from pg_xlog'. -- Euler Taveira de Oliveira - Timbira http://www.timbira.com.br/ PostgreSQL: Consultoria, Desenvolvimento, Suporte24x7 e Treinamento
2012/1/11 Euler Taveira de Oliveira <euler@timbira.com>: > On 08-01-2012 11:59, Satoshi Nagayasu / Uptime Technologies, LLC. wrote: >>> [2011-12-08 15:14:36 JST] 16758: LOG: restored log file "000000080000000000000046" from archive >>> [2011-12-08 15:14:36 JST] 16758: LOG: recoverying 000000080000000000000046 >>> [2011-12-08 15:14:36 JST] 16758: LOG: restored log file "000000080000000000000047" from archive >>> [2011-12-08 15:14:36 JST] 16758: LOG: recoverying 000000080000000000000047 >>> cp: cannot stat `/backups/archlog/000000080000000000000048': No such file or directory >>> [2011-12-08 15:14:37 JST] 16758: LOG: could not restore file "000000080000000000000048" from archive >>> [2011-12-08 15:14:37 JST] 16758: LOG: attempting to look into pg_xlog >>> [2011-12-08 15:14:37 JST] 16758: LOG: recoverying 000000080000000000000048 >> > What about just 'restored log file "000000080000000000000048" from pg_xlog' > instead of the last two messages? If you can't read from pg_xlog emit 'could > not restore file "000000080000000000000048" from pg_xlog'. Yes, simple is better. We already have a message if the file is not present, we just need one if we switch to using pg_xlog. Please look at this. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
2012/01/11 19:56, Simon Riggs wrote: > 2012/1/11 Euler Taveira de Oliveira<euler@timbira.com>: >> On 08-01-2012 11:59, Satoshi Nagayasu / Uptime Technologies, LLC. wrote: >>>> [2011-12-08 15:14:36 JST] 16758: LOG: restored log file "000000080000000000000046" from archive >>>> [2011-12-08 15:14:36 JST] 16758: LOG: recoverying 000000080000000000000046 >>>> [2011-12-08 15:14:36 JST] 16758: LOG: restored log file "000000080000000000000047" from archive >>>> [2011-12-08 15:14:36 JST] 16758: LOG: recoverying 000000080000000000000047 >>>> cp: cannot stat `/backups/archlog/000000080000000000000048': No such file or directory >>>> [2011-12-08 15:14:37 JST] 16758: LOG: could not restore file "000000080000000000000048" from archive >>>> [2011-12-08 15:14:37 JST] 16758: LOG: attempting to look into pg_xlog >>>> [2011-12-08 15:14:37 JST] 16758: LOG: recoverying 000000080000000000000048 >>> >> What about just 'restored log file "000000080000000000000048" from pg_xlog' >> instead of the last two messages? If you can't read from pg_xlog emit 'could >> not restore file "000000080000000000000048" from pg_xlog'. > > Yes, simple is better. > > We already have a message if the file is not present, we just need one > if we switch to using pg_xlog. > > Please look at this. Thanks for a patch. I agree that simple is better. However, I'm a bit afraid that it will confuse DBA if we use "restored" under the pg_xlog replay context, because we have already used "restored" that means a WAL file as successfully "copied" (not "replayed") from archive directory into pg_xlog directory under the archive recovery context. So, to determine the status of copying WAL files from archive directory, I think we can use "restored", or "could not restore" on failure. And to determine the status of replaying WAL files in pg_xlog directory (even if a WAL is copied from archive), we have to use "recover" or "replay". I'll try to revise my proposed log messages again. Thanks, -- Satoshi Nagayasu <snaga@uptime.jp> Uptime Technologies, LLC. http://www.uptime.jp
On Wed, Jan 11, 2012 at 1:54 PM, Satoshi Nagayasu <snaga@uptime.jp> wrote: > However, I'm a bit afraid that it will confuse DBA if we use > "restored" under the pg_xlog replay context, because we have > already used "restored" that means a WAL file as successfully > "copied" (not "replayed") from archive directory into pg_xlog > directory under the archive recovery context. > > So, to determine the status of copying WAL files from > archive directory, I think we can use "restored", or > "could not restore" on failure. > > And to determine the status of replaying WAL files > in pg_xlog directory (even if a WAL is copied from archive), > we have to use "recover" or "replay". Agreed. I can change "restored" to "using", so we have two message types LOG: restored log file "000000080000000000000047" from archive LOG: using pre-existing log file "000000080000000000000047" from pg_xlog -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Jan 11, 2012 at 9:01 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > On Wed, Jan 11, 2012 at 1:54 PM, Satoshi Nagayasu <snaga@uptime.jp> wrote: > >> However, I'm a bit afraid that it will confuse DBA if we use >> "restored" under the pg_xlog replay context, because we have >> already used "restored" that means a WAL file as successfully >> "copied" (not "replayed") from archive directory into pg_xlog >> directory under the archive recovery context. >> >> So, to determine the status of copying WAL files from >> archive directory, I think we can use "restored", or >> "could not restore" on failure. >> >> And to determine the status of replaying WAL files >> in pg_xlog directory (even if a WAL is copied from archive), >> we have to use "recover" or "replay". > > Agreed. I can change "restored" to "using", so we have two message types > > LOG: restored log file "000000080000000000000047" from archive > LOG: using pre-existing log file "000000080000000000000047" from pg_xlog using seems pretty fuzzy to me. replaying? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Jan 12, 2012 at 2:13 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Jan 11, 2012 at 9:01 AM, Simon Riggs <simon@2ndquadrant.com> wrote: >> On Wed, Jan 11, 2012 at 1:54 PM, Satoshi Nagayasu <snaga@uptime.jp> wrote: >> >>> However, I'm a bit afraid that it will confuse DBA if we use >>> "restored" under the pg_xlog replay context, because we have >>> already used "restored" that means a WAL file as successfully >>> "copied" (not "replayed") from archive directory into pg_xlog >>> directory under the archive recovery context. >>> >>> So, to determine the status of copying WAL files from >>> archive directory, I think we can use "restored", or >>> "could not restore" on failure. >>> >>> And to determine the status of replaying WAL files >>> in pg_xlog directory (even if a WAL is copied from archive), >>> we have to use "recover" or "replay". >> >> Agreed. I can change "restored" to "using", so we have two message types >> >> LOG: restored log file "000000080000000000000047" from archive >> LOG: using pre-existing log file "000000080000000000000047" from pg_xlog > > using seems pretty fuzzy to me. replaying? That was my first thought, but the message relates to which file has been selected, and how. Once it has been selected it will be replayed. The idea is to have the two messages look similar. The original message was "restored log file..." and says nothing about replaying. We could change the old message (ugh! backwards compatibility alert) LOG: replaying log file "000000080000000000000047" after restore from archiveLOG: replaying log file "000000080000000000000047"already in pg_xlog which doesn't sound much stronger to me... not sure. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Jan 12, 2012 at 10:04 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > On Thu, Jan 12, 2012 at 2:13 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Wed, Jan 11, 2012 at 9:01 AM, Simon Riggs <simon@2ndquadrant.com> wrote: >>> On Wed, Jan 11, 2012 at 1:54 PM, Satoshi Nagayasu <snaga@uptime.jp> wrote: >>> >>>> However, I'm a bit afraid that it will confuse DBA if we use >>>> "restored" under the pg_xlog replay context, because we have >>>> already used "restored" that means a WAL file as successfully >>>> "copied" (not "replayed") from archive directory into pg_xlog >>>> directory under the archive recovery context. >>>> >>>> So, to determine the status of copying WAL files from >>>> archive directory, I think we can use "restored", or >>>> "could not restore" on failure. >>>> >>>> And to determine the status of replaying WAL files >>>> in pg_xlog directory (even if a WAL is copied from archive), >>>> we have to use "recover" or "replay". >>> >>> Agreed. I can change "restored" to "using", so we have two message types >>> >>> LOG: restored log file "000000080000000000000047" from archive >>> LOG: using pre-existing log file "000000080000000000000047" from pg_xlog >> >> using seems pretty fuzzy to me. replaying? > > That was my first thought, but the message relates to which file has > been selected, and how. Once it has been selected it will be replayed. > The idea is to have the two messages look similar. > > The original message was "restored log file..." and says nothing about > replaying. > > We could change the old message (ugh! backwards compatibility alert) > > LOG: replaying log file "000000080000000000000047" after restore from archive > LOG: replaying log file "000000080000000000000047" already in pg_xlog > > which doesn't sound much stronger to me... not sure. Hmm, I don't know. But that phrasing does at least have the advantage of being clearly parallel, which I like. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
2012/01/13 0:13, Robert Haas wrote: > On Thu, Jan 12, 2012 at 10:04 AM, Simon Riggs<simon@2ndquadrant.com> wrote: >> On Thu, Jan 12, 2012 at 2:13 PM, Robert Haas<robertmhaas@gmail.com> wrote: >>> On Wed, Jan 11, 2012 at 9:01 AM, Simon Riggs<simon@2ndquadrant.com> wrote: >>>> On Wed, Jan 11, 2012 at 1:54 PM, Satoshi Nagayasu<snaga@uptime.jp> wrote: >>>> >>>>> However, I'm a bit afraid that it will confuse DBA if we use >>>>> "restored" under the pg_xlog replay context, because we have >>>>> already used "restored" that means a WAL file as successfully >>>>> "copied" (not "replayed") from archive directory into pg_xlog >>>>> directory under the archive recovery context. >>>>> >>>>> So, to determine the status of copying WAL files from >>>>> archive directory, I think we can use "restored", or >>>>> "could not restore" on failure. >>>>> >>>>> And to determine the status of replaying WAL files >>>>> in pg_xlog directory (even if a WAL is copied from archive), >>>>> we have to use "recover" or "replay". >>>> >>>> Agreed. I can change "restored" to "using", so we have two message types >>>> >>>> LOG: restored log file "000000080000000000000047" from archive >>>> LOG: using pre-existing log file "000000080000000000000047" from pg_xlog >>> >>> using seems pretty fuzzy to me. replaying? >> >> That was my first thought, but the message relates to which file has >> been selected, and how. Once it has been selected it will be replayed. >> The idea is to have the two messages look similar. >> >> The original message was "restored log file..." and says nothing about >> replaying. >> >> We could change the old message (ugh! backwards compatibility alert) >> >> LOG: replaying log file "000000080000000000000047" after restore from archive >> LOG: replaying log file "000000080000000000000047" already in pg_xlog >> >> which doesn't sound much stronger to me... not sure. > > Hmm, I don't know. But that phrasing does at least have the advantage > of being clearly parallel, which I like. It seems difficult to keep backward compatibility. :) Anyway, how about this one? If we have 47 in archive, and 48 in pg_xlog, (1) LOG: restored log file "000000080000000000000047" from archive (2) LOG: replaying log file "000000080000000000000047" (3) LOG: could not restore file "000000080000000000000048" from archive (4) LOG: replaying log file "000000080000000000000048" In this case, "(4) replying" after "(3) could not restore from archive" would means that it has started replaying a WAL from pg_xlog. And if we have 47 in archive, and do not have 48 in pg_xlog, (5) LOG: restored log file "000000080000000000000047" from archive (6) LOG: replaying log file "000000080000000000000047" (7) LOG: could not restore file "000000080000000000000048" from archive (8) LOG: could not replay file "000000080000000000000048" In this case, "(8) could not replay file" after "(7) could not restore from archive" would means that "48" is not found both archive and pg_xlog, so that the latest WAL would gone. I just got another option in my mind. Telling both two numbers of WAL files, from archive and pg_xlog directory, those have been applied during archive recovery would make sense? How about this one? LOG: XXX file(s) from archive, YYY file(s) from pg_xlog successfully applied. Thanks, -- Satoshi Nagayasu <snaga@uptime.jp> Uptime Technologies, LLC. http://www.uptime.jp
On Fri, Jan 13, 2012 at 12:58 AM, Satoshi Nagayasu <snaga@uptime.jp> wrote: > Anyway, how about this one? > > If we have 47 in archive, and 48 in pg_xlog, > > (1) LOG: restored log file "000000080000000000000047" from archive > (2) LOG: replaying log file "000000080000000000000047" > (3) LOG: could not restore file "000000080000000000000048" from archive > (4) LOG: replaying log file "000000080000000000000048" > > In this case, "(4) replying" after "(3) could not restore from archive" > would means that it has started replaying a WAL from pg_xlog. Looks confusing. In this way, you always need to look at more than one messages to understand the meaning of one message. Which seems not good design. What if lots of log files exist in pg_xlog and you got the following message? To figure out where the log file 98 comes from, you need to find (3) from lots of log messages. LOG: replaying log file "000000080000000000000098" What about the following simpler way? (1) LOG: restored log file "000000080000000000000047" from archive (2) LOG: replaying log file "000000080000000000000047"from archive (4) LOG: replaying log file "000000080000000000000048" from pg_xlog In file-base log-shipping with standby_mode=on case, restoring from archive is repeated periodically, which would flood the log file with the messages like (3). So I don't like emitting the message like (3) with LOG level. > I just got another option in my mind. > > Telling both two numbers of WAL files, from archive and > pg_xlog directory, those have been applied during archive recovery > would make sense? > > How about this one? > > LOG: XXX file(s) from archive, YYY file(s) from pg_xlog successfully > applied. What's the use case of this message? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center