Disk performance, was Re: tablespaces and DB administration
From: Marty Scholes
Subject: Disk performance, was Re: tablespaces and DB administration
Msg-id: 40BD1A82.7010001@outputservices.com
List: pgsql-performance
There was a lively debate on what is faster, single spindles or RAID. This matters, because I keep running into people who do not understand the performance dynamics of an RDBMS like Oracle or Pg. Pg and Oracle make a zillion tiny reads and writes and fsync() regularly. If your drive will copy a 4 GB file at a sustained rate of 72 MB/s, that tells you nothing about how it will do with an RDBMS.

I will throw in my experience on RAID vs. spindles. With the RAID write cache disabled, a well balanced set of spindles will kill a RAID system any day. Enable the cache, and the RAID starts inching ahead. My experience is that no one can continuously keep I/O properly balanced across several spindles on a production system: things change, the I/O mix changes, and then the RAID is outperforming the spindles. If you want to spend the rest of your career constantly balancing I/O across spindles, then do so. For the rest of us, a hardware RAID with a write cache wins hands down over the long haul.

It might make sense to provide some sort of benchmarking tool so that we can predict I/O performance on various systems. Run the following code both on a hardware RAID and on a single spindle.

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <time.h>

void makefile(int fs)
{
    int i;
    char buf[8192];
    int ld;
    int blocks = 4096;
    int pos;
    time_t stim;
    time_t etim;
    float avg;

    memset(buf, 0, sizeof(buf));  /* write defined bytes, not stack garbage */
    unlink("dump.out");
    /* O_CREAT requires a mode argument */
    ld = open("dump.out", O_WRONLY | O_CREAT, 0644);

    printf("Writing %d blocks sequentially\n", blocks);
    time(&stim);
    for (i = 0; i < blocks; i++) {
        write(ld, buf, sizeof(buf));
        if (fs) {
            fsync(ld);
        }
    }
    time(&etim);
    avg = (float) blocks / (etim - stim);
    printf("Took %ld seconds, avg %f iops\n\n", (long) (etim - stim), avg);

    /* purge the write cache */
    fsync(ld);

    printf("Writing %d blocks (somewhat randomly)\n", blocks);
    time(&stim);
    for (i = 0; i < blocks; i++) {
        pos = (rand() % blocks) * sizeof(buf);
        lseek(ld, pos, SEEK_SET);
        write(ld, buf, sizeof(buf));
        if (fs) {
            fsync(ld);
        }
    }
    time(&etim);
    avg = (float) blocks / (etim - stim);
    printf("Took %ld seconds, avg %f iops\n\n", (long) (etim - stim), avg);

    close(ld);
    unlink("dump.out");
}

int main()
{
    printf("No fsync()\n");
    makefile(0);
    printf("With fsync()\n");
    makefile(1);
    return 0;
}

The first operation shows how well the OS write cache is doing. The second shows how poorly everything runs with fsync(), which is what Pg and Oracle do.

My RAID produced the following, but it was also serving production load when I ran it:

No fsync()
Writing 4096 blocks sequentially
Took 1 seconds, avg 4096.000000 iops
Writing 4096 blocks (somewhat randomly)
Took 4 seconds, avg 1024.000000 iops

With fsync()
Writing 4096 blocks sequentially
Took 40 seconds, avg 102.400002 iops
Writing 4096 blocks (somewhat randomly)
Took 66 seconds, avg 62.060608 iops

When I ran this on a decent fibre channel drive, I got:

No fsync()
Writing 4096 blocks sequentially
Took 1 seconds, avg 4096.000000 iops
Writing 4096 blocks (somewhat randomly)
Took 7 seconds, avg 585.142883 iops

With fsync()
Writing 4096 blocks sequentially
Took 106 seconds, avg 38.641510 iops
Writing 4096 blocks (somewhat randomly)
Took 115 seconds, avg 35.617390 iops

You can see that the RAID array really helps out with small writes. At 8 KB per write, the single drive's 35.6 iops with fsync() works out to roughly 0.3 MB/s of random write throughput, which is why a 72 MB/s sequential copy figure tells you so little.
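For anyone who wants to reproduce this, one way to build and run it follows; this is a sketch only, and the gcc invocation and the file name iobench.c are my assumptions, not something from the original post:

gcc -o iobench iobench.c
./iobench

Run it on the same filesystem the database will live on, since the numbers depend as much on mount options and controller cache as on the drive itself.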