Thread: Should mdxxx functions (e.g. mdread, mdwrite, mdsync, etc.) PANIC instead of ERROR when I/O fails?

  Recently, while I was running my application on 8.3.7, my data got
corrupted. The symptom was this error: "invalid memory alloc request size ...."
 I investigated the bad data and found that one sector of a db-block had
become all-zero (I confirmed the cause later: my disk had gone bad).
 I also checked the postmaster log and found 453 ERROR messages saying
"could not read block XXX of relation XXX: ??", where XXX was the db-block
containing the bad sector. After these 453 failed read operations, a read
succeeded, but it returned an all-zero sector! (I don't know why the
operating system allows this to happen, but it did.)
 My question is: shouldn't the mdxxx functions (e.g. mdread, mdwrite, mdsync)
report PANIC instead of ERROR when I/O fails? IMO, since the data is
already corrupted, reporting ERROR just leaves us with a very puzzling
situation later, which does more harm than good.




On Mon, Jun 15, 2009 at 04:41:42PM +0800, Jacky Leng wrote:
>   My question is: shouldn't the mdxxx functions (e.g. mdread, mdwrite, mdsync)
> report PANIC instead of ERROR when I/O fails? IMO, since the data is
> already corrupted, reporting ERROR just leaves us with a very puzzling
> situation later, which does more harm than good.

I think the reasoning is that if those functions reported a PANIC the
chance you could recover your data is zero, because you need the
database system to read the other (good) data.

With an ERROR you can investigate the problem and save what can be
saved...

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.

Martijn van Oosterhout <kleptog@svana.org> writes:
> On Mon, Jun 15, 2009 at 04:41:42PM +0800, Jacky Leng wrote:
>> My question is: shouldn't the mdxxx functions (e.g. mdread, mdwrite, mdsync)
>> report PANIC instead of ERROR when I/O fails? IMO, since the data is
>> already corrupted, reporting ERROR just leaves us with a very puzzling
>> situation later, which does more harm than good.

> I think the reasoning is that if those functions reported a PANIC the
> chance you could recover your data is zero, because you need the
> database system to read the other (good) data.

Also, in the case you're complaining about, the problem was that there
wasn't any O/S error report that we could have PANIC'd about anyhow.

But Martijn is correct that a PANIC here would reduce the system's
overall stability without any clear benefit.  We already do refuse
to read a page into shared buffers if there's a read error on it,
so it's not clear to me how you think that an ERROR leaves things
in an unstable state.
        regards, tom lane
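
For readers following the thread, the behavior described above (report the
failure, do not hand the page to the buffer cache, keep the process running)
can be illustrated with a minimal sketch. This is not PostgreSQL source code:
the helper name read_block_checked, the status enum, and the 8192-byte block
size are all assumptions made for illustration.

```c
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

#define BLOCK_SIZE 8192   /* assumed block size for this sketch */

typedef enum
{
    BLOCK_READ_OK,      /* a full block was read; the buffer may be used */
    BLOCK_READ_SHORT,   /* fewer bytes than one block were returned */
    BLOCK_READ_FAILED   /* the OS reported an error; errno is set */
} BlockReadStatus;

/*
 * Hypothetical helper: read block number blockno from fd into dest.
 * On any failure the caller gets a status code and decides what to do;
 * nothing here aborts the process, which corresponds to reporting an
 * ERROR for one operation rather than a PANIC for the whole server.
 */
BlockReadStatus
read_block_checked(int fd, unsigned long blockno, char *dest)
{
    off_t   offset = (off_t) blockno * BLOCK_SIZE;
    ssize_t nread = pread(fd, dest, BLOCK_SIZE, offset);

    if (nread < 0)
        return BLOCK_READ_FAILED;   /* e.g. EIO from a bad sector */
    if (nread != BLOCK_SIZE)
        return BLOCK_READ_SHORT;    /* partial block or end of file */
    return BLOCK_READ_OK;
}
```

A caller would only treat dest as a valid page when the result is
BLOCK_READ_OK. Note that this cannot catch the case raised in this thread,
where the OS reports success but returns wrong data.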


>> I think the reasoning is that if those functions reported a PANIC the
>> chance you could recover your data is zero, because you need the
>> database system to read the other (good) data.

I do not see why a PANIC would reduce the chance of recovering my data. AFAICS,
my data is already corrupted (because of the bad block here); whether we
PANIC or not, a read operation on the bad block should get the same result.


> Also, in the case you're complaining about, the problem was that there
> wasn't any O/S error report that we could have PANIC'd about anyhow.

No, the O/S did report the error, which led to the 453 ERROR messages from
postgres. The O/S error messages (obtained with dmesg) look like this:
  end_request: I/O error, dev sda, sector 504342711
  ata1: EH complete
  SCSI device sda: 976773168 512-byte hdwr sectors (500108 MB)
  sda: Write Protect is off
  sda: Mode Sense: 00 3a 00 00
  SCSI device sda: drive cache: write back
  ata1.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0
  ata1.00: (irq_stat 0x40000008)
  ata1.00: cmd 60/08:00:b0:a8:0f/00:00:1e:00:00/40 tag 0 cdb 0x0 data 4096 in
           res 41/40:08:b7:a8:0f/06:00:1e:00:00/00 Emask 0x9 (media error)
  ata1.00: ata_hpa_resize 1: sectors = 976773168, hpa_sectors = 976773168
  ata1.00: ata_hpa_resize 1: sectors = 976773168, hpa_sectors = 976773168


> We already do refuse
> to read a page into shared buffers if there's a read error on it,
> so it's not clear to me how you think that an ERROR leaves things
> in an unstable state.
>

In my case, it seems that the O/S does not guarantee that if an I/O operation
(read, write, sync, etc.) on a block fails, then all later I/O operations
on that block will also fail. For example:
1. As I noted before, although reads of the bad db-block in my data failed
   453 times, the 454th read operation succeeded (but some of the data, the
   bad sector, had been set to all-zero). So even though the 453 failed I/Os
   reported ERROR, there is still a chance that the bad db-block can be read
   into shared buffers.
2. Besides, I have noticed a case like this:
   1) An mdsync operation failed with the message "ERROR: could not fsync
      segment XXX of relation XXX: ??".
      The O/S error messages (obtained with the dmesg command) look like this:
        Buffer I/O error on device ^AXX205503, logical block 43837786
        lost page write due to I/O error on ^AXX205503
   2) This leaves a half-written db-block in my data. But the page can still
      be read into shared buffers successfully later, which leads to a
      puzzling error: "ERROR:  could not access status of transaction XXXXX"
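
The first case above (a read that "succeeds" but returns an all-zero sector)
could in principle be spotted by scanning the block after the read. The
following is only a sketch under stated assumptions: the function name
first_zero_sector is hypothetical, and the 8192-byte block and 512-byte
sector sizes are assumed (512 matches the dmesg output earlier in the
thread). It is not something the md layer is known to do.

```c
#include <stdbool.h>
#include <stddef.h>

#define BLOCK_SIZE  8192   /* assumed db-block size */
#define SECTOR_SIZE 512    /* sector size, as in the dmesg output */

/*
 * Hypothetical check: return the index of the first sector within the
 * block that consists entirely of zero bytes, or -1 if there is none.
 */
int
first_zero_sector(const unsigned char *block)
{
    for (int s = 0; s < BLOCK_SIZE / SECTOR_SIZE; s++)
    {
        const unsigned char *p = block + (size_t) s * SECTOR_SIZE;
        bool        all_zero = true;

        for (size_t i = 0; i < SECTOR_SIZE; i++)
        {
            if (p[i] != 0)
            {
                all_zero = false;
                break;
            }
        }
        if (all_zero)
            return s;
    }
    return -1;
}
```

This is a heuristic at best: a sector can legitimately be all zeros (for
example, unused space inside a block), so a hit would be a reason to look
closer rather than proof of corruption.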