Thread: IDE Drives and fsync

IDE Drives and fsync

From
"scott.marlowe"
Date:
OK, I've done some more testing on our IDE drive machine.

First, some background.  The hard drives we're using are Seagate 
drives, model number ST380023A.  Firmware version is 3.33.  The machine 
they are in is running RH9.  The setup string I'm feeding them on startup 
right now is:  hdparm -c3 -f -W1 /dev/hdx

where:

-c3 sets I/O to 32 bit w/sync (uh huh, sure...)
-f sets the drive to flush buffer cache on exit
-W1 turns on write caching

The drives come up using DMA.  Turning unmask IRQ on / off has no effect 
on the tests I've been performing.

Without the -f switch, data corruption due to sudden power down is 
almost certain.  Running 'pgbench -c 5 -t 1000000' and pulling the plug 
will result in recovery failing with the typical invalid page type 
messages.

The pgbench database was originally initialized with -s 1.

If I turn off write caching (-W0) then the data is coherent no matter how 
many concurrent clients I'm running, but performance is abysmal (drops 
from ~200 tps down to 45, or 10 if I'm using /dev/md0, a mirror set).  
This is all on a single drive.

If I use -W1 and -f, then I get corruption on about every 4th test or so 
if the number of parallel beaters is 50 or so.  If I crank it up to 200, or 
increase the size of the database by using -s 10 during initialization, 
corruption shows up more often.  Note that EITHER a larger test database 
OR a larger number of clients seems to increase the chance of corruption.

I'm guessing that with -W1 and -f, what's happening is that at higher 
levels of parallel access, or with a larger data set, the time between when 
the drive reports an fsync as done and when it actually writes the data out 
is climbing, and it is more likely that data in transit to the WAL is 
getting lost when the power plug is pulled.
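
One rough way to check whether a drive acknowledges fsync before the data 
is on the platter is to count how many fsyncs per second it completes when 
the same block is rewritten over and over.  The sketch below is not from 
the tests above; the 7200 RPM figure is an assumption about the drive, and 
the function names are made up for illustration.  The idea is that an 
honest drive can rewrite one sector at most about once per revolution, so 
a rate far above rpm/60 suggests a cache is answering early.

```python
import os
import tempfile
import time


def measure_fsync_rate(directory, iterations=200):
    """Return completed write+fsync cycles per second on one small file."""
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        for _ in range(iterations):
            os.lseek(fd, 0, os.SEEK_SET)  # rewrite the same block each time
            os.write(fd, b"x" * 512)
            os.fsync(fd)                  # should block until data is durable
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
        os.unlink(path)
    return iterations / max(elapsed, 1e-9)


def max_honest_rate(rpm):
    """Rough upper bound on same-sector rewrites/sec: one per revolution."""
    return rpm / 60.0


if __name__ == "__main__":
    rate = measure_fsync_rate(".")
    bound = max_honest_rate(7200)  # assumed spindle speed, adjust per drive
    print(f"{rate:.0f} fsyncs/sec (rotational bound about {bound:.0f}/sec)")
    if rate > bound:
        print("fsync returns faster than the platter turns; "
              "a write cache is probably acknowledging early")
```

Run it from a directory on the drive under test.  Filesystem overhead can 
push an honest result well below the bound, so only a rate clearly above 
it is informative.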

Tom, you had mentioned adding a delay of some kind to the fsync logic, and 
I'd be more than willing to try out any patch you'd like to toss out to me 
to see if we can get a semi-stable behaviour out of IDE drives with the 
-W1 and -f switches turned on.  As it is, the performance is quite good, 
and under low to medium loads, it seems to be capable of surviving the 
power plug being pulled, so I'm wondering if we can come up with a slight 
delay, that might drop the performance some small percentage while 
greatly decreasing the chance of data corruption.

Is this worth looking into?  I can see plenty of uses for a machine that 
runs on IDE for cost savings, while still providing a reasonable amount of 
data security in case of power failure, but I'm not sure if we can get rid 
of the problem completely or not.



Re: IDE Drives and fsync

From
Manfred Spraul
Date:
scott.marlowe wrote:

>OK, I've done some more testing on our IDE drive machine.
>
>First, some background.  The hard drives we're using are Seagate 
>drives, model number ST380023A.  Firmware version is 3.33.  The machine 
>they are in is running RH9.  The setup string I'm feeding them on startup 
>right now is:  hdparm -c3 -f -W1 /dev/hdx
>
>where:
>
>-c3 sets I/O to 32 bit w/sync (uh huh, sure...)
>
The sync here has nothing to do with sync to disk. It means reading from 
three magic I/O ports before transferring data to or from the device.


>-f sets the drive to flush buffer cache on exit
>
-f shouldn't have any effect: it means that the buffer cache in the OS 
is flushed after hdparm exits; it has no long-term effect on the disk.

>-W1 turns on write caching
>
That's the problem: turning on write caching causes corruption.
What's needed is partial write caching: write cache on, and fsync() 
sends a barrier to the disk and returns only after the disk reports that 
the barrier has completed.
I consider that an OS/driver problem, not a problem for postgres.
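
The barrier behaviour described above can be illustrated with a toy 
model.  This is only a sketch: ToyDisk, fsync() and commit_survives() are 
hypothetical names, not a real driver or kernel interface.  It shows why 
an fsync() that returns without waiting for a flush loses an acknowledged 
commit when the power goes.

```python
class ToyDisk:
    """Toy model of a drive with a volatile write cache (not a real driver)."""

    def __init__(self, write_cache=True):
        self.write_cache = write_cache
        self.cache = {}    # sector -> data, lost on power failure
        self.platter = {}  # sector -> data, survives power failure

    def write(self, sector, data):
        # With caching on, the drive acknowledges once the data is in its RAM.
        if self.write_cache:
            self.cache[sector] = data
        else:
            self.platter[sector] = data

    def flush(self):
        # Barrier: returns only after every cached sector is on the platter.
        self.platter.update(self.cache)
        self.cache.clear()

    def power_loss(self):
        # Everything still in the volatile cache is gone.
        self.cache.clear()


def fsync(disk, use_barrier):
    """Hypothetical OS-level fsync: optionally flush before returning."""
    if use_barrier:
        disk.flush()


def commit_survives(use_barrier):
    disk = ToyDisk(write_cache=True)
    disk.write(0, "WAL commit record")
    fsync(disk, use_barrier)  # the database now believes the commit is durable
    disk.power_loss()
    return 0 in disk.platter


if __name__ == "__main__":
    print("no barrier:  ", commit_survives(False))
    print("with barrier:", commit_survives(True))
```

In this model the commit is lost without the barrier and kept with it, 
while ordinary writes between fsyncs still benefit from the cache.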

>The drives come up using DMA.  Turning unmask IRQ on / off has no effect 
>on the tests I've been performing.
>  
>
Of course. IRQ unmasking is about interrupt latency if DMA is not used: 
DMA off and IRQ unmasking off results in dropped bytes on serial links.

>Without the -f switch, data corruption due to sudden power down is 
>almost certain.
>
It's odd that adding -f reduces the corruptions - probably it changes 
available memory, and thus the writeback of data from kernel to disk.

>Tom, you had mentioned adding a delay of some kind to the fsync logic, and 
>I'd be more than willing to try out any patch you'd like to toss out to me 
>to see if we can get a semi-stable behaviour out of IDE drives with the 
>-W1 and -f switches turned on.
>
I'm not aware that there is any safe delay. Disks with write caches 
reorder I/O operations, and some hold back write operations indefinitely.

Unfortunately Linux doesn't implement write barriers, and the support in 
some IDE disks is missing, too :-(

--   Manfred



Re: IDE Drives and fsync

From
Tom Lane
Date:
"scott.marlowe" <scott.marlowe@ihs.com> writes:
> Tom, you had mentioned adding a delay of some kind to the fsync logic, and 
> I'd be more than willing to try out any patch you'd like to toss out to me 
> to see if we can get a semi-stable behaviour out of IDE drives with the 
> -W1 and -f switches turned on.

I'd suggest experimenting with the delay in mdsync() in
src/backend/storage/smgr/md.c.  A larger delay should theoretically make
things more reliable.

If you see signs of corruption of the WAL itself, another knob you could
fool with is the wal_sync_method setting in postgresql.conf.  I have no
idea whether different sync methods would improve the odds of getting
the drive to write WAL sectors in the right order, but it'd be worth
experimenting with.
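
Before changing wal_sync_method it can be useful to see how the 
underlying system calls behave on the drive in question.  The sketch 
below is an outside-the-database approximation, not what the backend 
actually does: time_sync_method() is a hypothetical helper, the four 
method names mirror the wal_sync_method values, and the 8192-byte block 
size is an assumption.  It only measures speed; it says nothing about 
write ordering, which still has to be tested by pulling the plug.

```python
import os
import tempfile
import time


def time_sync_method(method, directory, writes=50, block=8192):
    """Seconds taken for `writes` synced block writes, or None if unsupported."""
    flags = os.O_WRONLY | os.O_CREAT
    if method == "open_sync":
        if not hasattr(os, "O_SYNC"):
            return None
        flags |= os.O_SYNC
    elif method == "open_datasync":
        if not hasattr(os, "O_DSYNC"):
            return None
        flags |= os.O_DSYNC
    elif method == "fdatasync":
        if not hasattr(os, "fdatasync"):
            return None
    elif method != "fsync":
        raise ValueError(f"unknown method: {method}")

    path = os.path.join(directory, f"synctest_{method}_{os.getpid()}")
    fd = os.open(path, flags, 0o600)
    try:
        data = b"\0" * block
        start = time.perf_counter()
        for _ in range(writes):
            os.write(fd, data)
            if method == "fsync":
                os.fsync(fd)
            elif method == "fdatasync":
                os.fdatasync(fd)
            # open_sync / open_datasync: the write call itself is synchronous
        return time.perf_counter() - start
    finally:
        os.close(fd)
        os.unlink(path)


if __name__ == "__main__":
    for name in ("fsync", "fdatasync", "open_sync", "open_datasync"):
        secs = time_sync_method(name, tempfile.gettempdir())
        print(name, "unsupported" if secs is None else f"{secs:.3f} s")
```

Point the directory argument at a filesystem on the drive being tested 
rather than the temporary directory used in the example.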

I dunno whether you have the ability to experiment with a dual-drive
machine, but it would certainly be worth revisiting all these tests
on a setup with WAL on a separate drive.
        regards, tom lane