Re: Recovery bug - Mailing list pgsql-bugs
From | Jeff Davis |
---|---|
Subject | Re: Recovery bug |
Date | |
Msg-id | 1287355728.8516.383.camel@jdavis |
In response to | Recovery bug (Jeff Davis <pgsql@j-davis.com>) |
Responses | Re: Recovery bug (Fujii Masao <masao.fujii@gmail.com>); Re: Recovery bug (Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>) |
List | pgsql-bugs |
On Fri, 2010-10-15 at 15:58 -0700, Jeff Davis wrote:
> I don't have a fix yet, because I think it requires a little discussion.
> For instance, it seems to be dangerous to assume that we're starting up
> from a backup with access to the archive when it might have been a crash
> of the primary system. This is obviously wrong in the case of an
> automatic restart, or one with no restore_command. Fixing this issue
> might also remove the annoying "If you are not restoring from a backup,
> try removing..." PANIC error message.
>
> Also, in general we should do more logging during recovery, at least the
> first stages, indicating what WAL segments it's looking for to get
> started, why it thinks it needs that segment (from backup or control
> data), etc. Ideally we would verify that the necessary files exist (at
> least the initial ones) before making permanent changes. It was pretty
> painful trying to work backwards on this problem from the final
> controldata (where checkpoint and prior checkpoint are the same, and
> redo is before both), a crash, a PANIC, a backup_label.old, and not much
> else.
>

Here's a proposed fix. I didn't solve the problem of determining whether
we really are restoring a backup, or if there's just a backup_label file
left around.

I did two things:

1. If reading a checkpoint from the backup_label location, verify that
the REDO location for that checkpoint exists in addition to the
checkpoint itself. If not, elog with a FATAL immediately.

2. Change the error that happens when the checkpoint location referenced
in the backup_label doesn't exist to a FATAL. If it can happen due to a
normal crash, a FATAL seems more appropriate than a PANIC.

The benefit of this patch is that it won't continue on, corrupting the
pg_controldata along the way. And it also tells the administrator
exactly what's going on and how to correct it, rather than leaving them
with a PANIC and bogus controldata after they crashed in the middle of a
backup.
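The two checks above can be sketched roughly as follows. This is a minimal Python model of the decision logic only, not the attached patch and not PostgreSQL's actual startup code. Every name here (`Checkpoint`, `Decision`, `check_backup_label_start`, the two callbacks) is hypothetical, WAL locations are modeled as plain integers, and the message strings are illustrative rather than real server messages.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Outcome(Enum):
    PROCEED = "proceed"  # safe to start replay from the REDO location
    FATAL = "fatal"      # stop before any permanent change is made


@dataclass(frozen=True)
class Checkpoint:
    location: int  # where the checkpoint record itself lives (hypothetical int)
    redo: int      # REDO location recorded in that checkpoint


@dataclass(frozen=True)
class Decision:
    outcome: Outcome
    message: str = ""


def check_backup_label_start(
    label_checkpoint_loc: int,
    read_checkpoint: Callable[[int], Optional[Checkpoint]],
    record_exists: Callable[[int], bool],
) -> Decision:
    """Validate the starting point named in backup_label before recovery proceeds."""
    # Point 2: a missing checkpoint record is reported as FATAL, not PANIC,
    # because it can happen after an ordinary crash during a backup.
    ckpt = read_checkpoint(label_checkpoint_loc)
    if ckpt is None:
        return Decision(
            Outcome.FATAL,
            "checkpoint referenced by backup_label not found (illustrative text)",
        )

    # Point 1: the checkpoint alone is not enough; its REDO location must
    # also be readable, otherwise fail immediately.
    if not record_exists(ckpt.redo):
        return Decision(
            Outcome.FATAL,
            "REDO location of backup_label checkpoint not found (illustrative text)",
        )

    return Decision(Outcome.PROCEED)
```

The point of returning a decision up front is that both failures are detected before anything is written, so the control data is left untouched when the required WAL is missing.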
I still think it would be nice if postgres knew whether it was restoring
a backup or recovering from a crash, otherwise it's hard to
automatically recover from failures. I thought about using the presence
of recoveryRestoreCommand or PrimaryConnInfo to determine that. But it
seemed potentially dangerous if the person restoring a backup simply
forgot to set those, and then it tries restoring from the controldata
instead (which is unsafe to do during a backup).

Comments?

Regards,
	Jeff Davis
Attachment