Thread: warm-standby errors

warm-standby errors

From: sramirez
Hello,
    We have a warm-standby of one of our databases, and by this I mean a
server in constant recovery mode applying logs being shipped from a
primary to the warm-standby. Recently we had to bounce the standby
instance and I saw this error in our logs:

2009-04-27 07:11:21.213 GMT,,,,8261,,,1, //  LOG:  database system was
interrupted while in recovery at log time 2009-04-27 06:55:08 GMT
2009-04-27 07:11:21.213 GMT,,,,8261,,,2, //  HINT:  If this has occurred
more than once some data may be corrupted and you may need to choose an
earlier recovery target.
2009-04-27 07:11:21.213 GMT,,,,8261,,,3, //  LOG:  starting archive recovery

The log message did not appear again until the instance was bounced
again. Short of copying the data files elsewhere and doing a row-level
comparison of the data, is there any way I can check to see if there is
actual corruption in the warm standby server? How can I prevent this
error from occurring?

  Thanks,
   -Said

Re: warm-standby errors

From: Simon Riggs
On Mon, 2009-05-11 at 13:50 -0400, sramirez wrote:

> Short of copying the data files elsewhere and doing a row-level
> comparison of the data, is there any way I can check to see if there is
> actual corruption in the warm standby server?

Right now, Warm Standby has the same functionality as the equivalent
Oracle feature - i.e. no way to confirm absence of corruption. However, WAL
records contain CRC checks that ensure the transferred data is correct,
which is more than most other replication techniques possess. Hot Standby
will allow access to data blocks to allow them to be read and checked,
though that is also possible with an external utility to some extent.
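
The kind of check a per-record CRC provides can be sketched as below. This
is only an illustration of the principle: the record layout, the function
names, and the use of Python's zlib.crc32 are assumptions made for the
example and do not reflect PostgreSQL's actual WAL format or CRC variant.

```python
import struct
import zlib


def pack_record(payload: bytes) -> bytes:
    """Prefix the payload with a 4-byte big-endian CRC-32 of its contents."""
    return struct.pack(">I", zlib.crc32(payload)) + payload


def verify_record(record: bytes) -> bool:
    """Recompute the CRC-32 over the payload and compare with the stored value."""
    if len(record) < 4:
        return False  # too short to even hold the checksum
    (stored_crc,) = struct.unpack(">I", record[:4])
    return zlib.crc32(record[4:]) == stored_crc


def flip_bit(record: bytes, index: int) -> bytes:
    """Simulate damage in transit by flipping the low bit of one byte."""
    damaged = bytearray(record)
    damaged[index] ^= 0x01
    return bytes(damaged)
```

A record that arrives unchanged verifies, while one damaged in transit does
not. Note that a check like this only covers the shipped record itself; it
says nothing about the state of the standby's data files after replay.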

It probably isn't practical with any replication system to confirm the
exact contents of both nodes while replication is running at reasonable
speed. Some heuristics may be possible.

Do you have anything in mind, other than "detect corruption"?

> How can I prevent this
> error from occurring?

You haven't shown us the error, just what happens afterwards.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


Re: warm-standby errors

From: sramirez
> Right now, Warm Standby has the same functionality as the equivalent
> Oracle feature - i.e. no way to confirm absence of corruption. However, WAL
> records contain CRC checks that ensure the transferred data is correct,
> which is more than most other replication techniques possess. Hot Standby
> will allow access to data blocks to allow them to be read and checked,
> though that is also possible with an external utility to some extent.

Do you have a link to documentation on how to do this?

> It probably isn't practical with any replication system to confirm the
> exact contents of both nodes while replication is running at reasonable
> speed. Some heuristics may be possible.

Agreed.

>
> Do you have anything in mind, other than "detect corruption"?

What I am really after is being able to say 'yes, our replication is as
error-free as it can be' with as much certainty as possible.

>> How can I prevent this
>> error from occurring?
>
> You haven't shown us the error, just what happens afterwards.

I might have written too fast. I am curious to know what causes the
message to appear in the logs. It only appears when an instance is
shut down and then restarted. Is there something I can do so that
the message isn't triggered when I restart the warm-standby instance?
Could it be a setting that I have missed?

For reference, here is the head of the 2 log files created when the
instance was restarted

$ ggrep -A 1 -B 1 HINT *
edb-2009-04-07_012241.log-2009-04-07 01:22:41.361 GMT,,,,1750,,,1, //
LOG:  database system was interrupted while in recovery at log time
2009-04-02 17:04:54 GMT
edb-2009-04-07_012241.log:2009-04-07 01:22:41.361 GMT,,,,1750,,,2, //
HINT:  If this has occurred more than once some data may be corrupted
and you may need to choose an earlier recovery target.
edb-2009-04-07_012241.log-2009-04-07 01:22:41.362 GMT,,,,1750,,,3, //
LOG:  starting archive recovery
--
edb-2009-04-07_013609.log-2009-04-07 01:36:09.424 GMT,,,,1920,,,1, //
LOG:  database system was interrupted while in recovery at log time
2009-04-02 17:04:54 GMT
edb-2009-04-07_013609.log:2009-04-07 01:36:09.424 GMT,,,,1920,,,2, //
HINT:  If this has occurred more than once some data may be corrupted
and you may need to choose an earlier recovery target.
edb-2009-04-07_013609.log-2009-04-07 01:36:09.424 GMT,,,,1920,,,3, //
LOG:  starting archive recovery
--
edb-2009-04-27_071121.log-2009-04-27 07:11:21.213 GMT,,,,8261,,,1, //
LOG:  database system was interrupted while in recovery at log time
2009-04-27 06:55:08 GMT
edb-2009-04-27_071121.log:2009-04-27 07:11:21.213 GMT,,,,8261,,,2, //
HINT:  If this has occurred more than once some data may be corrupted
and you may need to choose an earlier recovery target.
edb-2009-04-27_071121.log-2009-04-27 07:11:21.213 GMT,,,,8261,,,3, //
LOG:  starting archive recovery
--
edb-2009-04-27_071747.log-2009-04-27 07:17:47.819 GMT,,,,8328,,,1, //
LOG:  database system was interrupted while in recovery at log time
2009-04-27 06:55:08 GMT
edb-2009-04-27_071747.log:2009-04-27 07:17:47.819 GMT,,,,8328,,,2, //
HINT:  If this has occurred more than once some data may be corrupted
and you may need to choose an earlier recovery target.
edb-2009-04-27_071747.log-2009-04-27 07:17:47.819 GMT,,,,8328,,,3, //
LOG:  starting archive recovery
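
Since the HINT only matters if the interruption "has occurred more than
once", it may help to tally these startup messages per reported log time.
The following is a rough sketch, not an official tool: the function name
and regular expression are assumptions based only on the message text
shown in this thread, and whitespace is collapsed first so that lines
wrapped by a mail client still match.

```python
import re
from collections import Counter

# Matches the startup message and captures the reported log time.
INTERRUPTED = re.compile(
    r"database system was interrupted while in recovery at log time "
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} \w+)"
)


def interrupted_recoveries(log_text: str) -> Counter:
    """Count interrupted-recovery startups, keyed by the reported log time."""
    # Collapse newlines and runs of spaces so wrapped lines become one line.
    flattened = " ".join(log_text.split())
    return Counter(INTERRUPTED.findall(flattened))
```

Fed the grep output above, this would report each of the two log times
(2009-04-02 17:04:54 GMT and 2009-04-27 06:55:08 GMT) twice, one count per
restart.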

Thanks,
  -Said