Re: Filesystem benchmarking for pg 8.3.3 server - Mailing list pgsql-performance

From Ron Mayer
Subject Re: Filesystem benchmarking for pg 8.3.3 server
Date
Msg-id 48A36090.6070309@cheapcomplexdevices.com
In response to Re: Filesystem benchmarking for pg 8.3.3 server  ("Scott Marlowe" <scott.marlowe@gmail.com>)
Responses Re: Filesystem benchmarking for pg 8.3.3 server  (Greg Smith <gsmith@gregsmith.com>)
List pgsql-performance
Scott Marlowe wrote:
>IDE came up corrupted every single time.
Greg Smith wrote:
> you've drank the kool-aid ... completely
> ridiculous ...unsafe fsync ... md0 RAID-1
> array (aren't there issues with md and the barriers?)

Alright - I'll eat my words.  Or mostly.

I still haven't found IDE drives that lie; but
from the testing I've done today, I'm starting to
think that:

   1a) ext3 fsync() seems to lie badly.
   1b) but ext3 can be tricked not to lie (but not
       in the way you might think).
   2a) md raid1 fsync() sometimes doesn't actually
       sync
   2b) I can't trick it not to.
   3a) some IDE drives don't even pretend to support
       letting you know when their cache is flushed
   3b) but the kernel will happily tell you about
       any such devices, including md raid ones.

In more detail.  I tested on a number of systems
and disks including new (this year) and old (1997)
IDE drives; and EXT3 with and without the "barrier=1"
mount option.


First off - some IDE drives don't even support the
relatively recent ATA command that apparently lets
the software know when a cache flush is complete.
Apparently on those you will get messages in your
system logs:
   %dmesg | grep 'disabling barriers'
   JBD: barrier-based sync failed on md1 - disabling barriers
   JBD: barrier-based sync failed on hda3 - disabling barriers
and
   %hdparm -I /dev/hdf | grep FLUSH_CACHE_EXT
will not show you anything on those devices.
IMHO that's cool; and doesn't count as a lying IDE drive
since it didn't claim to support this.
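
If you'd rather check for that warning from a program than by eye, here is
a rough sketch in C. The function name barrier_disabled_device() is made up
for illustration; the only thing taken from above is the text of the JBD
message, and other kernel versions may word it differently.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical helper (not part of any kernel or libc API): if `line`
 * contains the JBD warning quoted above, copy the device name into `dev`
 * (truncated to devlen-1 characters) and return 1; otherwise return 0.
 */
int barrier_disabled_device(const char *line, char *dev, size_t devlen)
{
    static const char marker[] = "barrier-based sync failed on ";
    const char *p;
    size_t n;

    assert(line != NULL && dev != NULL);

    /* strstr() rather than a prefix match, so timestamped lines also work. */
    p = strstr(line, marker);
    if (p == NULL || devlen == 0 || strstr(p, "disabling barriers") == NULL)
        return 0;

    p += sizeof marker - 1;        /* skip past the marker text */
    n = strcspn(p, " \t\n");       /* device name ends at first whitespace */
    if (n >= devlen)
        n = devlen - 1;
    memcpy(dev, p, n);
    dev[n] = '\0';
    return 1;
}
```

Feed it each line of dmesg output (for instance read with fgets() from a
pipe) and it picks out names such as md1 and hda3 from the two messages
shown above.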

Second of all - ext3 fsync() appears to me to
be *extremely* stupid.   It only seems to flush
(and wait for) the drive's cache when a file's
inode has changed.
For example, the test program below will happily
do a real fsync (i.e. the program takes a couple seconds
to run) so long as the "fchmod()" statements are in
there.   It will *NOT* wait on my system if I comment those
fchmod()'s out. Sadly, I get the same behavior with and
without the ext3 barrier=1 mount option. :(
==========================================================
/*
** based on http://article.gmane.org/gmane.linux.file-systems/21373
** http://thread.gmane.org/gmane.linux.kernel/646040
*/
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>

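int main(int argc, char *argv[]) {
   if (argc < 2) {
     fprintf(stderr, "usage: fs <filename>\n");
     exit(1);
   }
   int fd = open(argv[1], O_RDWR | O_CREAT | O_TRUNC, 0666);
   if (fd < 0) {
     perror("open");
     exit(1);
   }
   int i;
   for (i = 0; i < 100; i++) {
     char byte = 'x';                      /* initialized; the value itself doesn't matter */
     if (pwrite(fd, &byte, 1, 0) != 1) {   /* rewrite the same byte at offset 0 */
       perror("pwrite");
       exit(1);
     }
     fchmod(fd, 0644); fchmod(fd, 0664);   /* dirty the inode; comment out to see fsync() return early */
     fsync(fd);                            /* should block until the data is really on disk */
   }
   close(fd);
   return 0;
}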
==========================================================
Since it does indeed wait when the inode's touched, I think
it suggests that it's not the hard drive that's lying, but
rather ext3.
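
To put numbers on that instead of eyeballing "a couple seconds", one could
time the same loop both ways. The following is only a sketch:
time_fsync_loop() and its parameters are names I've made up, and it assumes
a POSIX system with clock_gettime(). As a rough yardstick, assuming a
7200 rpm drive, one rotation is about 8.3 ms, so 100 honest flushes of the
same sector should take somewhere around a second or more.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper: run `iterations` write+fsync cycles on `path`,
 * optionally touching the inode with fchmod() before each fsync().
 * Returns elapsed wall-clock seconds, or -1.0 on any error.
 */
double time_fsync_loop(const char *path, int iterations, int touch_inode)
{
    struct timespec start, end;
    char byte = 'x';
    int fd;

    assert(path != NULL && iterations > 0);

    fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0666);
    if (fd < 0)
        return -1.0;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < iterations; i++) {
        /* Same pattern as the test program: one byte at offset 0. */
        if (pwrite(fd, &byte, 1, 0) != 1)
            goto fail;
        /* Flip the mode back and forth so the inode is dirty. */
        if (touch_inode &&
            (fchmod(fd, 0644) != 0 || fchmod(fd, 0664) != 0))
            goto fail;
        if (fsync(fd) != 0)
            goto fail;
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    close(fd);

    return (double)(end.tv_sec - start.tv_sec)
         + (double)(end.tv_nsec - start.tv_nsec) / 1e9;

fail:
    close(fd);
    return -1.0;
}
```

Calling it as time_fsync_loop(file, 100, 1) and then
time_fsync_loop(file, 100, 0) on the same filesystem gives the two timings
to compare; a large gap between them would be consistent with fsync() only
waiting when the inode is dirty.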

So I take back what I said about linux and write barriers
being sane.   They're not.

But AFAICT, all the (6 different) IDE drives I've seen work
as advertised, and the kernel happily spews boot
messages when it finds one that doesn't support knowing
when a cache flush finished.

