Thread: Serious problem: media recovery fails after system or PostgreSQL crash
Hello, Although this may have to be posted to pgsql-bugs or pgsql-general, let me ask you here because the problem probably needs PostgreSQL's code fix. [Problem] I'm using PostgreSQL 9.1.6 on Linux. I encountered a serious problem that media recovery failed showing the following message: FATAL: archive file "000000010000008000000028" has wrong size: 7340032 instead of 16777216 I'm using normal cp command to archive WAL files. That is: archive_command = '/path/to/my_script.sh "%p" "/backup/archive_log/%f"' <<my_script.sh>> -------------------------------------------------- #!/bin/sh some processing... cp "$1" "$2" other processing... -------------------------------------------------- The media recovery was triggered by power failure. The disk drive that stored $PGDATA failed after a power failure. So I replaced the failed disk, and performed media recovery by creating recovery.conf and running pg_ctl start. However, pg_ctl failed with the above error message. [Cause] The cause is clear from the message. PostgreSQL refuses to continue media recovery when it finds an archived WAL file whose size is not 16 MB. The relevant code is in src/backend/access/transam/xlog.c: -------------------------------------------------- if (expectedSize > 0 && stat_buf.st_size != expectedSize) { int elevel; /* * If we find a partial file in standby mode, we assume it's * because it's just being copied to the archive,and keep * trying. * * Otherwise treat a wrong-sized file as FATAL to ensure the * DBA would notice it,but is that too strong? We could try * to plow ahead with a local copy of the file ... but the * problem is thatthere probably isn't one, and we'd * incorrectly conclude we've reached the end of WAL and we're * done recovering... */ if (StandbyMode && stat_buf.st_size < expectedSize) elevel = DEBUG1; else elevel = FATAL; ereport(elevel, (errmsg("archive file \"%s\" has wrong size: %lu instead of %lu", xlogfname, (unsigned long)stat_buf.st_size, (unsigned long) expectedSize))); return false; } -------------------------------------------------- [How to fix] Archived files can become smaller than their expected sizes for some reasons: 1. The power fails while archive_command is copying files (as in my case). 2. Immediate shutdown (pg_ctl stop -mi) is performed while archive_command is copying files. In this case, cp or equivalent copying command is cancelled by SIGQUIT sent by postmaster. Therefore, I think postgres must continue recovery by fetching files from pg_xlog/ when it encounters a partially filled archive files. In addition, it may be necessary to remove the partially filled archived files, because they might prevent media recovery in the future (is this true?). I mean we need the following fix. What do you think? -------------------------------------------------- if (expectedSize > 0 && stat_buf.st_size != expectedSize) { int elevel; ... if (StandbyMode && stat_buf.st_size < expectedSize) elevel = DEBUG1; else { elevel = LOG; unlink(xlogpath); } ereport(elevel, (errmsg("archive file \"%s\" has wrong size: %lu instead of %lu", xlogfname, (unsigned long) stat_buf.st_size, (unsigned long) expectedSize))); return false; } -------------------------------------------------- I've heard that the next minor release is scheduled during this weekend. I really wish this problem will be fixed in that release. If you wish, I'll post the patch tomorrow or the next day. Could you include the fix in the weekend release? Regards MauMau
"MauMau" <maumau307@gmail.com> writes: > I'm using PostgreSQL 9.1.6 on Linux. I encountered a serious problem that > media recovery failed showing the following message: > FATAL: archive file "000000010000008000000028" has wrong size: 7340032 > instead of 16777216 Well, that's unfortunate, but it's not clear that automatic recovery is possible. The only way out of it would be if an undamaged copy of the segment was in pg_xlog/ ... but if I recall the logic correctly, we'd not even be trying to fetch from the archive if we had a local copy. > Therefore, I think postgres must continue recovery by fetching files from > pg_xlog/ when it encounters a partially filled archive files. In addition, > it may be necessary to remove the partially filled archived files, because > they might prevent media recovery in the future (is this true?). I mean we > need the following fix. What do you think? I think having PG automatically destroy archive files is bordering on insane. It might be reasonable for the archiving process to do something like this, if it has a full-size copy of the file available to replace the damaged copy with. But otherwise you're just throwing away what's probably the only copy of useful data. > I've heard that the next minor release is scheduled during this weekend. I > really wish this problem will be fixed in that release. If you wish, I'll > post the patch tomorrow or the next day. Could you include the fix in the > weekend release? Even if this were a good and uncontroversial idea, I'm afraid you're several days too late. The release is today, not next week. regards, tom lane
From: "Tom Lane" <tgl@sss.pgh.pa.us> > Well, that's unfortunate, but it's not clear that automatic recovery is > possible. The only way out of it would be if an undamaged copy of the > segment was in pg_xlog/ ... but if I recall the logic correctly, we'd > not even be trying to fetch from the archive if we had a local copy. No, PG will try to fetch the WAL file from pg_xlog when it cannot get it from archive. XLogFileReadAnyTLI() does that. Also, PG manual contains the following description: http://www.postgresql.org/docs/9.1/static/continuous-archiving.html#BACKUP-ARCHIVING-WAL WAL segments that cannot be found in the archive will be sought in pg_xlog/; this allows use of recent un-archived segments. However, segments that are available from the archive will be used in preference to files in pg_xlog/. So, continuing recovery by changing the emode to LOG would work. What do you think about this fix? > I think having PG automatically destroy archive files is bordering on > insane. I agree. Before that, PG cannot know the archive location, so PG cannot delete the partially filled archive files. Regards MauMau
Hi, On 6.12.2012 23:45, MauMau wrote: > From: "Tom Lane" <tgl@sss.pgh.pa.us> >> Well, that's unfortunate, but it's not clear that automatic recovery is >> possible. The only way out of it would be if an undamaged copy of the >> segment was in pg_xlog/ ... but if I recall the logic correctly, we'd >> not even be trying to fetch from the archive if we had a local copy. > > No, PG will try to fetch the WAL file from pg_xlog when it cannot get it > from archive. XLogFileReadAnyTLI() does that. Also, PG manual contains > the following description: > > http://www.postgresql.org/docs/9.1/static/continuous-archiving.html#BACKUP-ARCHIVING-WAL > > > WAL segments that cannot be found in the archive will be sought in > pg_xlog/; this allows use of recent un-archived segments. However, > segments that are available from the archive will be used in preference > to files in pg_xlog/. So why don't you use an archive command that does not create such incomplete files? I mean something like this: archive_command = 'cp %p /arch/%f.tmp && mv /arch/%f.tmp /arch/%f' Until the file is renamed, it's considered 'incomplete'. Tomas
On Thu, Dec 6, 2012 at 9:33 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > "MauMau" <maumau307@gmail.com> writes: >> I'm using PostgreSQL 9.1.6 on Linux. I encountered a serious problem that >> media recovery failed showing the following message: >> FATAL: archive file "000000010000008000000028" has wrong size: 7340032 >> instead of 16777216 > > Well, that's unfortunate, but it's not clear that automatic recovery is > possible. The only way out of it would be if an undamaged copy of the > segment was in pg_xlog/ ... but if I recall the logic correctly, we'd > not even be trying to fetch from the archive if we had a local copy. I'm inclined to agree with this: I've had a case much like the original poster (too-short WAL segments because of media issues), except in my case the archiver had archived a bogus copy of the data also (short length and all), so our attempt to recover from archives on a brand new system was met with obscure failure[0]. And, rather interestingly, the WAL disk was able to "write" bogusly without error for many minutes, which made for a fairly exotic recovery based on pg_resetxlog: I grabbed quite a few minutes of WAL of various sub-16MB sizes to spot check the situation. It never occurred to me there was a way to really fix this that still involves the archiver reading from a file system: what can one do when one no longer trusts read() to get what was write()d? [0]: Postgres wasn't very good about reporting the failure: in the case bogus files have been restored from archives, it seems to just bounce through timelines it knows about searching for a WAL it likes, without any real error messaging like got "corrupt wal from archive". That could probably be fixed. -- fdr
Tomas wrote: > So why don't you use an archive command that does not create such > incomplete files? I mean something like this: > > archive_command = 'cp %p /arch/%f.tmp && mv /arch/%f.tmp /arch/%f' > > Until the file is renamed, it's considered 'incomplete'. That is a good idea. This would surely solve my problem. Don't you think this must be described in PG manual? Now there are two solutions given. I think it would be better to implement both of these. 1. Tomas's idea Instruct the method like below in the PG manual. archive_command = 'cp %p /arch/%f.tmp && mv /arch/%f.tmp /arch/%f' [Merits] The incomplete file will not eventually remain. The successful archival erases the incomplete file created during the previous interrupted archival. [Demerits] The administrator or developer of existing systems or products using PostgreSQL will not be likely to notice this change in the PG manual. So they might face the media recovery failure. 2. My idea PG continues media recovery when it encounters a partially filled WAL segment file. It ereport(LOG)'s the unexpected size, and try to fetch the WAL file from pg_xlog/. This is logically natural. If the archived WAL file is smaller than the expected size, it (almost always) means that archive processing was interrupted and the unarchived WAL file is in pg_xlog/. So PG should instead the same WAL file from pg_xlog/ and continue recovery. PG will then try to fetch the subsequent WAL files from archive again, but will hardly find those files there and fetch them from pg_xlog/. This behavior complies with the description in the PG manual. I mean just changing FATAL to LOG in the code in RestoreArchivedFile(): -------------------------------------------------- if (StandbyMode && stat_buf.st_size < expectedSize) elevel = DEBUG1; else elevel = FATAL; ereport(elevel, (errmsg("archive file \"%s\" has wrong size: %lu instead of %lu", xlogfname, (unsigned long) stat_buf.st_size, (unsigned long) expectedSize))); return false; -------------------------------------------------- [Merits] The administrator or developer of existing systems or products using PostgreSQL will benefit from this fix. [Demerits] The incomplete archive file(s) might be left forever unless the DBA delete them. That occurs if pg_xlog/ is lost and the incomplete archive files will not be overwritten. Could you give me opinions what to do? I'm willing to submit these fixes. Regards MauMau
On Thu, Dec 6, 2012 at 3:52 PM, Tomas Vondra <tv@fuzzy.cz> wrote: > Hi, > > On 6.12.2012 23:45, MauMau wrote: >> From: "Tom Lane" <tgl@sss.pgh.pa.us> >>> Well, that's unfortunate, but it's not clear that automatic recovery is >>> possible. The only way out of it would be if an undamaged copy of the >>> segment was in pg_xlog/ ... but if I recall the logic correctly, we'd >>> not even be trying to fetch from the archive if we had a local copy. >> >> No, PG will try to fetch the WAL file from pg_xlog when it cannot get it >> from archive. XLogFileReadAnyTLI() does that. Also, PG manual contains >> the following description: >> >> http://www.postgresql.org/docs/9.1/static/continuous-archiving.html#BACKUP-ARCHIVING-WAL >> >> >> WAL segments that cannot be found in the archive will be sought in >> pg_xlog/; this allows use of recent un-archived segments. However, >> segments that are available from the archive will be used in preference >> to files in pg_xlog/. > > So why don't you use an archive command that does not create such > incomplete files? I mean something like this: > > archive_command = 'cp %p /arch/%f.tmp && mv /arch/%f.tmp /arch/%f' > > Until the file is renamed, it's considered 'incomplete'. Wouldn't having the incomplete file be preferable over having none of it at all? It seems to me you need considerable expertise to figure out how to do optimal recovery (i.e. losing the least transactions) in this situation, and that that expertise cannot be automated. Do you trust a partial file from a good hard drive, or a complete file from a partially melted pg_xlog? Cheers, Jeff
On 8.12.2012 03:08, Jeff Janes wrote: > On Thu, Dec 6, 2012 at 3:52 PM, Tomas Vondra <tv@fuzzy.cz> wrote: >> Hi, >> >> On 6.12.2012 23:45, MauMau wrote: >>> From: "Tom Lane" <tgl@sss.pgh.pa.us> >>>> Well, that's unfortunate, but it's not clear that automatic recovery is >>>> possible. The only way out of it would be if an undamaged copy of the >>>> segment was in pg_xlog/ ... but if I recall the logic correctly, we'd >>>> not even be trying to fetch from the archive if we had a local copy. >>> >>> No, PG will try to fetch the WAL file from pg_xlog when it cannot get it >>> from archive. XLogFileReadAnyTLI() does that. Also, PG manual contains >>> the following description: >>> >>> http://www.postgresql.org/docs/9.1/static/continuous-archiving.html#BACKUP-ARCHIVING-WAL >>> >>> >>> WAL segments that cannot be found in the archive will be sought in >>> pg_xlog/; this allows use of recent un-archived segments. However, >>> segments that are available from the archive will be used in preference >>> to files in pg_xlog/. >> >> So why don't you use an archive command that does not create such >> incomplete files? I mean something like this: >> >> archive_command = 'cp %p /arch/%f.tmp && mv /arch/%f.tmp /arch/%f' >> >> Until the file is renamed, it's considered 'incomplete'. > > Wouldn't having the incomplete file be preferable over having none of it at all? > > It seems to me you need considerable expertise to figure out how to do > optimal recovery (i.e. losing the least transactions) in this > situation, and that that expertise cannot be automated. Do you trust > a partial file from a good hard drive, or a complete file from a > partially melted pg_xlog? It clearly is a rather complex issue, no doubt about that. And yes, reliability of the devices with pg_xlog on them is an important detail. Alghough if the WAL is not written in a reliable way, you're hosed anyway I guess. The recommended archive command is based on the assumption that the local pg_xlog is intact (e.g. because it's located on a reliable RAID1 array), which seems to be the assumption of the OP too. In my opinion it's more likely to meet an incomplete copy of WAL in the archive than a corrupted local WAL. And if it really is corrupted, it would be identified during replay. Tomas
On Sun, Dec 16, 2012 at 8:38 AM, Tomas Vondra <tv@fuzzy.cz> wrote: > On 8.12.2012 03:08, Jeff Janes wrote: >> >> It seems to me you need considerable expertise to figure out how to do >> optimal recovery (i.e. losing the least transactions) in this >> situation, and that that expertise cannot be automated. Do you trust >> a partial file from a good hard drive, or a complete file from a >> partially melted pg_xlog? > > It clearly is a rather complex issue, no doubt about that. And yes, > reliability of the devices with pg_xlog on them is an important detail. > Alghough if the WAL is not written in a reliable way, you're hosed > anyway I guess. > > The recommended archive command is based on the assumption that the > local pg_xlog is intact (e.g. because it's located on a reliable RAID1 > array), which seems to be the assumption of the OP too. > > In my opinion it's more likely to meet an incomplete copy of WAL in the > archive than a corrupted local WAL. And if it really is corrupted, it > would be identified during replay. Wouldn't the way it would be identified be for it to fail a checksum, assume it was garbage left over from the previous WAL file which was the process of being overwritten at the time of crash, and so terminate recovery and open the database? Assuming your goal is to recover all the transactions you possibly can (rather than restore to a known point), I think you would want to try recovery both ways and keep whichever one got the furthest. Cheers, Jeff