Thread: Should mdxxx functions(e.g. mdread, mdwrite, mdsync etc) PANIC instead of ERROR when I/O failed?
Should mdxxx functions(e.g. mdread, mdwrite, mdsync etc) PANIC instead of ERROR when I/O failed?
From
"Jacky Leng"
Date:
Recently, when I was running my application on 8.3.7, my data got corrupted. The symptom was an error like this: "invalid memory alloc request size ....". I investigated the damaged data and found that one sector of a db-block had become all-zero (I confirmed the reason later: my disk had gone bad).

I also checked the postmaster log and found 453 ERROR messages that said "could not read block XXX of relation XXX: ??", where XXX was the db-block that the bad sector resided in. After these 453 failed read operations, a read by postmaster succeeded, but returned an all-zero sector! (I don't know why the operating system allows this to happen, but it did.)

My question is: shouldn't the mdxxx functions (e.g. mdread, mdwrite, mdsync) just report PANIC instead of ERROR when I/O fails? IMO, since the data is already corrupted, reporting ERROR just leaves us with a very curious situation later, which does more harm than benefit.
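The kind of damage described here (one all-zero sector inside an otherwise populated block) can be looked for offline. The following is a minimal sketch, not a PostgreSQL tool: it assumes the default 8192-byte block size and 512-byte disk sectors, and the function names are made up for illustration. It is only a heuristic, because free space inside a healthy page can legitimately be all zeros, so every hit needs manual inspection.

```python
BLOCK_SIZE = 8192   # assumption: PostgreSQL's default block size (BLCKSZ)
SECTOR_SIZE = 512   # assumption: 512-byte disk sectors


def zero_sectors(block, sector_size=SECTOR_SIZE):
    """Return the indexes of all-zero sectors in a block that is not entirely zero."""
    if not any(block):
        # A completely zeroed block is a different case (for example an unused page).
        return []
    return [
        offset // sector_size
        for offset in range(0, len(block), sector_size)
        if not any(block[offset:offset + sector_size])
    ]


def scan_file(path, block_size=BLOCK_SIZE):
    """Map block number -> zeroed sector indexes for each suspicious block in a file."""
    suspects = {}
    with open(path, "rb") as f:
        blkno = 0
        while True:
            block = f.read(block_size)
            if not block:
                break
            hits = zero_sectors(block)
            if hits:
                suspects[blkno] = hits
            blkno += 1
    return suspects
```

Running `scan_file` on a copy of a relation segment file (never on a file the server has open for writing) yields candidate blocks to examine by hand.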
Re: Should mdxxx functions(e.g. mdread, mdwrite, mdsync etc) PANIC instead of ERROR when I/O failed?
From
Martijn van Oosterhout
Date:
On Mon, Jun 15, 2009 at 04:41:42PM +0800, Jacky Leng wrote:
> My question is: should not mdxxx functions(e.g. mdread, mdwrite, mdsync)
> just report PANIC instead of ERROR when I/O failed? IMO, since the data has
> already corrupted, reporting ERROR will just leave us a very curious scene
> later -- which does more harm that benefit.

I think the reasoning is that if those functions reported a PANIC, the chance that you could recover your data would be zero, because you need the database system to read the other (good) data. With an ERROR you can investigate the problem and save what can be saved...

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.
Re: Should mdxxx functions(e.g. mdread, mdwrite, mdsync etc) PANIC instead of ERROR when I/O failed?
From
Tom Lane
Date:
Martijn van Oosterhout <kleptog@svana.org> writes:
> On Mon, Jun 15, 2009 at 04:41:42PM +0800, Jacky Leng wrote:
>> My question is: should not mdxxx functions(e.g. mdread, mdwrite, mdsync)
>> just report PANIC instead of ERROR when I/O failed? IMO, since the data has
>> already corrupted, reporting ERROR will just leave us a very curious scene
>> later -- which does more harm that benefit.

> I think the reasoning is that if those functions reported a PANIC the
> chance you could recover your data is zero, because you need the
> database system to read the other (good) data.

Also, in the case you're complaining about, the problem was that there wasn't any O/S error report that we could have PANIC'd about anyhow.

But Martijn is correct that a PANIC here would reduce the system's overall stability without any clear benefit. We already do refuse to read a page into shared buffers if there's a read error on it, so it's not clear to me how you think that an ERROR leaves things in an unstable state.

regards, tom lane
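The argument for ERROR over PANIC can be illustrated with a small sketch: if a failed read only aborts the request that hit it, a salvage pass can still collect every readable block. This is a simplified model, not PostgreSQL code. `salvage` and `make_flaky_reader` are hypothetical names, and the "storage layer" is a stand-in that raises `OSError` with `EIO` for chosen block numbers.

```python
import errno


def salvage(read_block, nblocks):
    """Read every block, keeping the good ones and recording the ones that fail."""
    good, bad = {}, []
    for blkno in range(nblocks):
        try:
            good[blkno] = read_block(blkno)
        except OSError as exc:
            # ERROR-like behaviour: only this request fails, the loop carries on.
            bad.append((blkno, exc.errno))
    return good, bad


def make_flaky_reader(bad_blocks, block_size=8192):
    """Hypothetical stand-in for a storage layer with a few unreadable blocks."""
    def read_block(blkno):
        if blkno in bad_blocks:
            raise OSError(errno.EIO, "Input/output error")
        return bytes([blkno % 256]) * block_size
    return read_block
```

With a PANIC-like policy the first `OSError` would end the whole process, and the blocks after the bad one would never be read at all.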
Re: Should mdxxx functions(e.g. mdread, mdwrite, mdsync etc) PANIC instead of ERROR when I/O failed?
From
"Jacky Leng"
Date:
>> I think the reasoning is that if those functions reported a PANIC the
>> chance you could recover your data is zero, because you need the
>> database system to read the other (good) data.

I do not see why a PANIC would reduce the chance of recovering my data. AFAICS, my data is already corrupted (because of the bad block here); whether we PANIC or not, a read operation on the bad block should get the same result.

> Also, in the case you're complaining about, the problem was that there
> wasn't any O/S error report that we could have PANIC'd about anyhow.

No, the O/S did report the error, which led to the 453 ERROR messages from postgres. The O/S error messages (obtained with dmesg) look like this:

    end_request: I/O error, dev sda, sector 504342711
    ata1: EH complete
    SCSI device sda: 976773168 512-byte hdwr sectors (500108 MB)
    sda: Write Protect is off
    sda: Mode Sense: 00 3a 00 00
    SCSI device sda: drive cache: write back
    ata1.00: exception Emask 0x0 SAct 0x1 SErr 0x0action 0x0
    ata1.00: (irq_stat 0x40000008)
    ata1.00: cmd 60/08:00:b0:a8:0f/00:00:1e:00:00/40 tag 0 cdb 0x0 data 4096 in
             res 41/40:08:b7:a8:0f/06:00:1e:00:00/00 Emask 0x9 (media error)
    ata1.00: ata_hpa_resize 1: sectors = 976773168,hpa_sectors = 976773168
    ata1.00: ata_hpa_resize 1: sectors = 976773168, hpa_sectors = 976773168

> We already do refuse
> to read a page into shared buffers if there's a read error on it,
> so it's not clear to me how you think that an ERROR leaves things
> in an unstable state.

In my case, it seems that the O/S does not guarantee that once an I/O operation (read, write, sync, etc.) on a block has failed, all later I/O operations on that block will also fail. For example:

1. As I noted before, although reads of the bad db-block failed 453 times, the 454th read operation succeeded (but part of the data, the bad sector, had been set to all-zero). So even though the 453 failed I/Os reported ERROR, there is still a chance that the bad db-block gets read into shared buffers.

2. Besides, I have noticed a case like this:

   1) An mdsync operation failed with the message "ERROR: could not fsync segment XXX of relation XXX: ??". The O/S error message (obtained with the dmesg command) looks like this:

          Buffer I/O error on device ^AXX205503,logical block 43837786
          lost page write due to I/O error on ^AXX205503

   2) This left a half-written db-block in my data. But the page could still be read into shared buffers successfully later, which led to a curious error: "ERROR: could not access status of transaction XXXXX"
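The behaviour described in point 1 (a block that fails to read many times and then "succeeds" with zeroed data) can be modelled in a few lines. The sketch below is not how PostgreSQL behaves; it only illustrates the stricter policy being asked for, where a block that has ever returned an I/O error is distrusted from then on. `StickyErrorReader` and `fail_then_zero` are hypothetical names, and the failure memory lives only in process memory, so it would be lost on restart.

```python
import errno


class StickyErrorReader:
    """Wrap a block reader and distrust any block that has ever failed to read."""

    def __init__(self, read_block):
        self._read_block = read_block  # hypothetical callable: blkno -> bytes
        self.failed = set()            # block numbers that returned an I/O error

    def read(self, blkno):
        if blkno in self.failed:
            # Refuse even if the device would now claim success.
            raise OSError(errno.EIO, "block %d failed earlier; not trusted" % blkno)
        try:
            return self._read_block(blkno)
        except OSError:
            self.failed.add(blkno)
            raise


def fail_then_zero(failures, block_size=8192):
    """Hypothetical device: errors `failures` times, then 'succeeds' with zeros."""
    remaining = {"n": failures}

    def read_block(blkno):
        if remaining["n"] > 0:
            remaining["n"] -= 1
            raise OSError(errno.EIO, "Input/output error")
        return bytes(block_size)

    return read_block
```

The trade-off is the one raised earlier in the thread: the stricter the refusal, the less of the remaining data can be reached through the database itself.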