Thread: Solaris Performance (Again)
This is a well-worn thread title - apologies, but these results seemed
interesting, and hopefully useful in the quest to get better performance
on Solaris.

I was curious to see if the rather uninspiring pgbench performance
obtained from a Sun 280R (see "General: ATA Disks and RAID controllers
for database servers") could be improved if more time was spent tuning.
With the help of a fellow workmate who is a bit of a Solaris guy, we
decided to have a go.

The major performance killer appeared to be mounting the filesystem with
the logging option. The next most significant seemed to be the choice of
sync_method for Pg - the default (open_datasync), which we initially
thought should be the best, appears noticeably slower than fdatasync. We
also tried changing some of the tuneable filesystem options using tunefs,
without any measurable effect.

Are Pg/Solaris folks out there running with logging on and the default
sync_method? Or have most of you been through this already?

Pgbench results (no. of clients and transactions/s):

Setup 1: filesystem mounted with logging

    No.   tps
    -----------
     1     17
     2     17
     4     22
     8     22
    16     28
    32     32
    64     37

Setup 2: filesystem mounted without logging

    No.   tps
    -----------
     1     48
     2     55
     4     57
     8     62
    16     65
    32     82
    64     95

Setup 3: filesystem mounted without logging, Pg sync_method = fdatasync

    No.   tps
    -----------
     1     89
     2     94
     4     95
     8     93
    16     99
    32    115
    64    122

Note: the pgbench runs were conducted using -s 10 and -t 1000, -c 1->64;
2-3 runs of each setup were performed (averaged figures shown).

Mark
On Wed, 10 Dec 2003 18:56:38 +1300, Mark Kirkwood <markir@paradise.net.nz> wrote:

> The major performance killer appeared to be mounting the filesystem
> with the logging option. The next most significant seemed to be the
> choice of sync_method for Pg - the default (open_datasync), which we
> initially thought should be the best, appears noticeably slower than
> fdatasync.

Some interesting stuff, I'll have to play with it. Currently I'm pleased
with my Solaris performance.

What version of PG?

If it is before 7.4, PG compiles with _NO_ optimization by default, and
that was a huge part of the slowness of PG on Solaris.

--
Jeff Trout <jeff@jefftrout.com>
http://www.jefftrout.com/
http://www.stuarthamm.net/
Mark Kirkwood <markir@paradise.net.nz> writes:
> Note: the pgbench runs were conducted using -s 10 and -t 1000, -c
> 1->64; 2-3 runs of each setup were performed (averaged figures
> shown).

FYI, the pgbench docs state:

    NOTE: scaling factor should be at least as large as the largest
    number of clients you intend to test; else you'll mostly be
    measuring update contention.

-Neil
Good point - it is Pg 7.4beta1, compiled with

    CFLAGS += -O2 -funroll-loops -fexpensive-optimizations

Jeff wrote:
> What version of PG?
>
> If it is before 7.4, PG compiles with _NO_ optimization by default, and
> that was a huge part of the slowness of PG on Solaris.
Yes - originally I was going to stop at 8 clients, but once the bit was
between the teeth... If I get another box to myself I will try -s 50 or
100 and see what that shows up.

cheers

Mark

Neil Conway wrote:
> FYI, the pgbench docs state:
>
>     NOTE: scaling factor should be at least as large as the largest
>     number of clients you intend to test; else you'll mostly be
>     measuring update contention.
>
> -Neil
Mark Kirkwood wrote:
> This is a well-worn thread title - apologies, but these results seemed
> interesting, and hopefully useful in the quest to get better performance
> on Solaris.
>
> I was curious to see if the rather uninspiring pgbench performance
> obtained from a Sun 280R (see "General: ATA Disks and RAID controllers
> for database servers") could be improved if more time was spent tuning.
>
> With the help of a fellow workmate who is a bit of a Solaris guy, we
> decided to have a go.
>
> The major performance killer appeared to be mounting the filesystem with
> the logging option. The next most significant seemed to be the choice of
> sync_method for Pg - the default (open_datasync), which we initially
> thought should be the best, appears noticeably slower than fdatasync.

I thought the default was fdatasync, but looking at the code it seems
the default is open_datasync if O_DSYNC is available.

I assume the logic is that we usually do only one write() before
fsync(), so open_datasync should be faster. Why do we not prefer O_FSYNC
over fsync()?

Looking at the code:

    #if defined(O_SYNC)
    #define OPEN_SYNC_FLAG       O_SYNC
    #else
    #if defined(O_FSYNC)
    #define OPEN_SYNC_FLAG       O_FSYNC
    #endif
    #endif

    #if defined(OPEN_SYNC_FLAG)
    #if defined(O_DSYNC) && (O_DSYNC != OPEN_SYNC_FLAG)
    #define OPEN_DATASYNC_FLAG   O_DSYNC
    #endif
    #endif

    #if defined(OPEN_DATASYNC_FLAG)
    #define DEFAULT_SYNC_METHOD_STR    "open_datasync"
    #define DEFAULT_SYNC_METHOD        SYNC_METHOD_OPEN
    #define DEFAULT_SYNC_FLAGBIT       OPEN_DATASYNC_FLAG
    #else
    #if defined(HAVE_FDATASYNC)
    #define DEFAULT_SYNC_METHOD_STR    "fdatasync"
    #define DEFAULT_SYNC_METHOD        SYNC_METHOD_FDATASYNC
    #define DEFAULT_SYNC_FLAGBIT       0
    #else
    #define DEFAULT_SYNC_METHOD_STR    "fsync"
    #define DEFAULT_SYNC_METHOD        SYNC_METHOD_FSYNC
    #define DEFAULT_SYNC_FLAGBIT       0
    #endif
    #endif

I think the problem is that we prefer O_DSYNC over fdatasync, but do not
prefer O_FSYNC over fsync.

Running the attached test program shows on BSD/OS 4.3:

    write                 0.000360
    write & fsync         0.001391
    write, close & fsync  0.001308
    open o_fsync, write   0.000924

showing O_FSYNC faster than fsync().

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

/*
 * test_fsync.c
 *    tests if fsync can be done from another process than the original write
 */

#include <sys/types.h>
#include <sys/time.h>        /* for gettimeofday() and struct timeval */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

void die(char *str);
void print_elapse(struct timeval start_t, struct timeval elapse_t);

int
main(int argc, char *argv[])
{
    struct timeval start_t;
    struct timeval elapse_t;
    int     tmpfile;
    /* 200-byte payload */
    char   *strout = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa";

    /* write only */
    gettimeofday(&start_t, NULL);
    if ((tmpfile = open("/var/tmp/test_fsync.out", O_RDWR | O_CREAT, 0600)) == -1)
        die("can't open /var/tmp/test_fsync.out");
    write(tmpfile, strout, 200);
    close(tmpfile);
    gettimeofday(&elapse_t, NULL);
    unlink("/var/tmp/test_fsync.out");
    printf("write                 ");
    print_elapse(start_t, elapse_t);
    printf("\n");

    /* write & fsync */
    gettimeofday(&start_t, NULL);
    if ((tmpfile = open("/var/tmp/test_fsync.out", O_RDWR | O_CREAT, 0600)) == -1)
        die("can't open /var/tmp/test_fsync.out");
    write(tmpfile, strout, 200);
    fsync(tmpfile);
    close(tmpfile);
    gettimeofday(&elapse_t, NULL);
    unlink("/var/tmp/test_fsync.out");
    printf("write & fsync         ");
    print_elapse(start_t, elapse_t);
    printf("\n");

    /* write, close & fsync */
    gettimeofday(&start_t, NULL);
    if ((tmpfile = open("/var/tmp/test_fsync.out", O_RDWR | O_CREAT, 0600)) == -1)
        die("can't open /var/tmp/test_fsync.out");
    write(tmpfile, strout, 200);
    close(tmpfile);
    /* reopen file and fsync through the new descriptor */
    if ((tmpfile = open("/var/tmp/test_fsync.out", O_RDWR | O_CREAT, 0600)) == -1)
        die("can't open /var/tmp/test_fsync.out");
    fsync(tmpfile);
    close(tmpfile);
    gettimeofday(&elapse_t, NULL);
    unlink("/var/tmp/test_fsync.out");
    printf("write, close & fsync  ");
    print_elapse(start_t, elapse_t);
    printf("\n");

    /* open o_fsync, write */
    gettimeofday(&start_t, NULL);
    if ((tmpfile = open("/var/tmp/test_fsync.out", O_RDWR | O_CREAT | O_FSYNC, 0600)) == -1)
        die("can't open /var/tmp/test_fsync.out");
    write(tmpfile, strout, 200);
    close(tmpfile);
    gettimeofday(&elapse_t, NULL);
    unlink("/var/tmp/test_fsync.out");
    printf("open o_fsync, write   ");
    print_elapse(start_t, elapse_t);
    printf("\n");

    return 0;
}

void
print_elapse(struct timeval start_t, struct timeval elapse_t)
{
    if (elapse_t.tv_usec < start_t.tv_usec)
    {
        elapse_t.tv_sec--;
        elapse_t.tv_usec += 1000000;
    }

    printf("%ld.%06ld", (long) (elapse_t.tv_sec - start_t.tv_sec),
           (long) (elapse_t.tv_usec - start_t.tv_usec));
}

void
die(char *str)
{
    fprintf(stderr, "%s", str);
    exit(1);
}
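
As an illustration of the trade-off discussed above, here is a minimal
sketch (not the PostgreSQL source - the file names and the 8k block are
invented, and O_DSYNC and fdatasync() are not available on every
platform) of the two commit paths that wal_sync_method chooses between:
an implicit sync on every write() via O_DSYNC, versus a plain write()
followed by an explicit fdatasync().

/*
 * sync_paths.c - sketch of the open_datasync vs. fdatasync commit paths.
 * File names and block size are illustrative only.
 */
#include <fcntl.h>
#include <unistd.h>

#define BLOCKSZ 8192

static char block[BLOCKSZ];

/* open_datasync: the sync is implicit in every write() */
static void
commit_open_datasync(const char *path)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_DSYNC, 0600);

    write(fd, block, sizeof(block));    /* returns only once data is on stable storage */
    close(fd);
}

/* fdatasync: plain write(), then an explicit data-only flush */
static void
commit_fdatasync(const char *path)
{
    int fd = open(path, O_WRONLY | O_CREAT, 0600);

    write(fd, block, sizeof(block));    /* may only dirty the OS cache   */
    fdatasync(fd);                      /* flush data, not file metadata */
    close(fd);
}

int
main(void)
{
    commit_open_datasync("/var/tmp/wal_open_datasync.tmp");
    commit_fdatasync("/var/tmp/wal_fdatasync.tmp");
    return 0;
}

With a single 8k write per commit the open_datasync path saves a system
call; with several writes between sync points the explicit fdatasync
avoids paying for a flush on every write, which is the trade-off the
rest of this thread argues about.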
Manfred Spraul <manfred@colorfullife.com> writes:
> One advantage of a separate write and fsync call is better performance
> for the writes that are triggered within AdvanceXLInsertBuffer: I'm not
> sure how often that's necessary, but it's a write while holding both the
> WALWriteLock and WALInsertLock. If every write contains an implicit
> sync, that call would be much more expensive than necessary.

Ideally that path isn't taken very often. But I'm currently having a
discussion off-list with a CMU student who seems to be seeing a case
where it happens a lot. (She reports that both WALWriteLock and
WALInsertLock are causes of a lot of process blockages, which seems to
mean that a lot of the WAL I/O is being done with both held, which would
have to mean that AdvanceXLInsertBuffer is doing the I/O. More when we
figure out what's going on exactly...)

			regards, tom lane
Bruce Momjian wrote:
>     write                 0.000360
>     write & fsync         0.001391
>     write, close & fsync  0.001308
>     open o_fsync, write   0.000924

That's 1 millisecond vs. 1.3 milliseconds. Neither value is realistic -
I guess the hw cache is on and the OS doesn't issue cache flush commands.
Realistic values are probably 5 ms vs. 5.3 ms (add roughly 4 ms for an
actual platter write to both) - a 6% difference, not 30%. How large is
the syscall latency on BSD/OS 4.3?

One advantage of a separate write and fsync call is better performance
for the writes that are triggered within AdvanceXLInsertBuffer: I'm not
sure how often that's necessary, but it's a write while holding both the
WALWriteLock and WALInsertLock. If every write contains an implicit
sync, that call would be much more expensive than necessary.

--
	Manfred
I have been poking around with our fsync default options to see if I can
improve them. One issue is that we never default to O_SYNC, but default
to O_DSYNC if it exists, which seems strange.

What I did was to beef up my test program and get it into CVS for folks
to run. What I found was that different operating systems have different
optimal defaults. On BSD/OS and FreeBSD, fdatasync/fsync was better, but
on Linux, O_DSYNC/O_SYNC was faster.

BSD/OS 4.3:

    Simple write timing:
            write                   0.000055
    Compare fsync before and after write's close:
            write, fsync, close     0.000707
            write, close, fsync     0.000808
    Compare one o_sync write to two:
            one 16k o_sync write    0.009762
            two 8k o_sync writes    0.008799
    Compare file sync methods with one 8k write:
            (o_dsync unavailable)
            open o_sync, write      0.000658
            (fdatasync unavailable)
            write, fsync            0.000702
    Compare file sync methods with 2 8k writes:
    (The fastest should be used for wal_sync_method)
            (o_dsync unavailable)
            open o_sync, write      0.010402
            (fdatasync unavailable)
            write, fsync            0.001025

This shows terrible O_SYNC performance for 2 8k writes, but it is faster
for a single 8k write. Strange.

FreeBSD 4.9:

    Simple write timing:
            write                   0.000083
    Compare fsync before and after write's close:
            write, fsync, close     0.000412
            write, close, fsync     0.000453
    Compare one o_sync write to two:
            one 16k o_sync write    0.000409
            two 8k o_sync writes    0.000993
    Compare file sync methods with one 8k write:
            (o_dsync unavailable)
            open o_sync, write      0.000683
            (fdatasync unavailable)
            write, fsync            0.000405
    Compare file sync methods with 2 8k writes:
            (o_dsync unavailable)
            open o_sync, write      0.000789
            (fdatasync unavailable)
            write, fsync            0.000414

This shows fsync to be fastest in both cases.

Linux 2.4.9:

    Simple write timing:
            write                   0.000061
    Compare fsync before and after write's close:
            write, fsync, close     0.000398
            write, close, fsync     0.000407
    Compare one o_sync write to two:
            one 16k o_sync write    0.000570
            two 8k o_sync writes    0.000340
    Compare file sync methods with one 8k write:
            (o_dsync unavailable)
            open o_sync, write      0.000166
            write, fdatasync        0.000462
            write, fsync            0.000447
    Compare file sync methods with 2 8k writes:
            (o_dsync unavailable)
            open o_sync, write      0.000334
            write, fdatasync        0.000445
            write, fsync            0.000447

This shows O_SYNC to be fastest, even for 2 8k writes.

This unapplied patch:

    ftp://candle.pha.pa.us/pub/postgresql/mypatches/fsync

adds DEFAULT_OPEN_SYNC to the bsdi/freebsd/linux template files, which
controls the default for those platforms. Platforms with no template
default to fdatasync/fsync.

Would other users run src/tools/fsync and report their findings so I can
update the template files for their OS's? This is a process similar to
our thread testing.

Thanks.

---------------------------------------------------------------------------
Bruce Momjian wrote:
> Mark Kirkwood wrote:
> > This is a well-worn thread title - apologies, but these results seemed
> > interesting, and hopefully useful in the quest to get better performance
> > on Solaris.
> >
> > I was curious to see if the rather uninspiring pgbench performance
> > obtained from a Sun 280R (see "General: ATA Disks and RAID controllers
> > for database servers") could be improved if more time was spent tuning.
> >
> > With the help of a fellow workmate who is a bit of a Solaris guy, we
> > decided to have a go.
> >
> > The major performance killer appeared to be mounting the filesystem with
> > the logging option. The next most significant seemed to be the choice of
> > sync_method for Pg - the default (open_datasync), which we initially
> > thought should be the best, appears noticeably slower than fdatasync.
>
> I thought the default was fdatasync, but looking at the code it seems
> the default is open_datasync if O_DSYNC is available.
>
> I assume the logic is that we usually do only one write() before
> fsync(), so open_datasync should be faster. Why do we not prefer O_FSYNC
> over fsync()?
>
> Looking at the code:
>
>     #if defined(O_SYNC)
>     #define OPEN_SYNC_FLAG       O_SYNC
>     #else
>     #if defined(O_FSYNC)
>     #define OPEN_SYNC_FLAG       O_FSYNC
>     #endif
>     #endif
>
>     #if defined(OPEN_SYNC_FLAG)
>     #if defined(O_DSYNC) && (O_DSYNC != OPEN_SYNC_FLAG)
>     #define OPEN_DATASYNC_FLAG   O_DSYNC
>     #endif
>     #endif
>
>     #if defined(OPEN_DATASYNC_FLAG)
>     #define DEFAULT_SYNC_METHOD_STR    "open_datasync"
>     #define DEFAULT_SYNC_METHOD        SYNC_METHOD_OPEN
>     #define DEFAULT_SYNC_FLAGBIT       OPEN_DATASYNC_FLAG
>     #else
>     #if defined(HAVE_FDATASYNC)
>     #define DEFAULT_SYNC_METHOD_STR    "fdatasync"
>     #define DEFAULT_SYNC_METHOD        SYNC_METHOD_FDATASYNC
>     #define DEFAULT_SYNC_FLAGBIT       0
>     #else
>     #define DEFAULT_SYNC_METHOD_STR    "fsync"
>     #define DEFAULT_SYNC_METHOD        SYNC_METHOD_FSYNC
>     #define DEFAULT_SYNC_FLAGBIT       0
>     #endif
>     #endif
>
> I think the problem is that we prefer O_DSYNC over fdatasync, but do not
> prefer O_FSYNC over fsync.
>
> Running the attached test program shows on BSD/OS 4.3:
>
>     write                 0.000360
>     write & fsync         0.001391
>     write, close & fsync  0.001308
>     open o_fsync, write   0.000924
>
> showing O_FSYNC faster than fsync().
>
> [test_fsync.c attachment snipped - see the earlier message]

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
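
The "one 16k o_sync write vs. two 8k o_sync writes" comparison above can
be reproduced outside the tree with a few lines of C. The following is
only a sketch of that single measurement, not the actual src/tools/fsync
code: the file name, iteration count and use of O_SYNC (substitute
O_FSYNC where that is all the platform offers) are assumptions.

/*
 * osync_probe.c - time one 16k O_SYNC write vs. two 8k O_SYNC writes.
 * Averages over LOOPS iterations rather than timing a single write.
 */
#include <sys/time.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define LOOPS 100

static char buf[16384];

static double
elapsed(struct timeval a, struct timeval b)
{
    return (b.tv_sec - a.tv_sec) + (b.tv_usec - a.tv_usec) / 1000000.0;
}

int
main(void)
{
    struct timeval  start, stop;
    int             fd, i;

    if ((fd = open("/var/tmp/osync_probe.out",
                   O_RDWR | O_CREAT | O_SYNC, 0600)) == -1)
        return 1;

    /* one 16k O_SYNC write per iteration */
    gettimeofday(&start, NULL);
    for (i = 0; i < LOOPS; i++)
    {
        lseek(fd, 0, SEEK_SET);
        write(fd, buf, 16384);
    }
    gettimeofday(&stop, NULL);
    printf("one 16k o_sync write   %f\n", elapsed(start, stop) / LOOPS);

    /* two 8k O_SYNC writes per iteration */
    gettimeofday(&start, NULL);
    for (i = 0; i < LOOPS; i++)
    {
        lseek(fd, 0, SEEK_SET);
        write(fd, buf, 8192);
        write(fd, buf + 8192, 8192);
    }
    gettimeofday(&stop, NULL);
    printf("two 8k o_sync writes   %f\n", elapsed(start, stop) / LOOPS);

    close(fd);
    unlink("/var/tmp/osync_probe.out");
    return 0;
}

Averaging over a loop instead of timing a single write is also what Tom
asks for further down the thread.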
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> I have been poking around with our fsync default options to see if I can
> improve them. One issue is that we never default to O_SYNC, but default
> to O_DSYNC if it exists, which seems strange.

As I recall, that was based on testing on some different platforms.
It's not particularly "strange": O_SYNC implies writing at least two
places on the disk (file and inode). O_DSYNC or fdatasync should
theoretically be the fastest alternatives, O_SYNC and fsync the worst.

> Compare fsync before and after write's close:
>         write, fsync, close     0.000707
>         write, close, fsync     0.000808

What does that mean? You can't fsync a closed file.

> This shows terrible O_SYNC performance for 2 8k writes, but it is faster
> for a single 8k write. Strange.

I'm not sure I believe these numbers at all... my experience is that
getting trustworthy disk I/O numbers is *not* easy.

			regards, tom lane
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > I have been poking around with our fsync default options to see if I can
> > improve them. One issue is that we never default to O_SYNC, but default
> > to O_DSYNC if it exists, which seems strange.
>
> As I recall, that was based on testing on some different platforms.
> It's not particularly "strange": O_SYNC implies writing at least two
> places on the disk (file and inode). O_DSYNC or fdatasync should
> theoretically be the fastest alternatives, O_SYNC and fsync the worst.

But why prefer O_DSYNC over fdatasync if you don't prefer O_SYNC over
fsync?

> > Compare fsync before and after write's close:
> >         write, fsync, close     0.000707
> >         write, close, fsync     0.000808
>
> What does that mean? You can't fsync a closed file.

You reopen and fsync.

> > This shows terrible O_SYNC performance for 2 8k writes, but it is faster
> > for a single 8k write. Strange.
>
> I'm not sure I believe these numbers at all... my experience is that
> getting trustworthy disk I/O numbers is *not* easy.

These numbers were reproducible on all the platforms I tested.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Tom Lane wrote:
>> As I recall, that was based on testing on some different platforms.

> But why prefer O_DSYNC over fdatasync if you don't prefer O_SYNC over
> fsync?

It's what tested out as the best bet. I think we were using pgbench
as the test platform, which as you know I have doubts about, but at
least it is testing one actual write/sync pattern Postgres can generate.
The choice between the open flags and fdatasync/fsync depends a whole
lot on your writing patterns (how much data you tend to write between
fsync points), so I don't have a lot of faith in randomly-chosen test
programs as a guide to what to use for Postgres.

>> What does that mean? You can't fsync a closed file.

> You reopen and fsync.

Um. I just looked at that test program, and I think it needs a whole
lot of work yet.

* Some of the test cases count open()/close() overhead, some don't.
This is bad, especially on platforms like Solaris where open() is
notoriously expensive.

* You really cannot put any faith in measuring a single write,
especially on a machine that's not *completely* idle otherwise.
I'd feel somewhat comfortable if you wrote, say, 1000 8K blocks and
measured the time for that. (And you have to think about how far apart
the fsyncs are in that sequence; you probably want to repeat the
measurement with several different fsync spacings.) It would also be
a good idea to compare writing 1000 successive blocks with rewriting
the same block 1000 times --- if the latter does not happen roughly
at the disk RPM rate, then we know the drive is lying and all the
numbers should be discarded as meaningless.

* The program is claimed to test whether you can write from one process
and fsync from another, but it does no such thing AFAICS.

BTW, rather than hard-wiring the test file name, why don't you let it be
specified on the command line? That would make it lots easier for
people to compare the performance of several disk drives, if they have
'em.

			regards, tom lane
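
Tom's "rewrite the same block 1000 times and compare against the disk
RPM rate" check can be done with a small stand-alone probe. This is a
hedged sketch, not part of src/tools/fsync: the file path, the WRITES
count and the 7200 rpm reference figure are assumptions.

/*
 * rewrite_probe.c - rewrite the same 8k block repeatedly, syncing each
 * time.  If the per-write rate is far above the disk's rotation rate,
 * the drive (or controller) is acknowledging writes from its cache and
 * the other timings can't be trusted.
 */
#include <sys/time.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define WRITES  1000
#define BLCKSZ  8192

int
main(void)
{
    static char     block[BLCKSZ];
    struct timeval  start, stop;
    double          secs;
    int             fd, i;

    if ((fd = open("/var/tmp/rewrite_probe.out", O_RDWR | O_CREAT, 0600)) == -1)
        return 1;

    gettimeofday(&start, NULL);
    for (i = 0; i < WRITES; i++)
    {
        lseek(fd, 0, SEEK_SET);     /* rewrite the very same 8k block */
        write(fd, block, BLCKSZ);
        fsync(fd);                  /* force it out each time */
    }
    gettimeofday(&stop, NULL);

    secs = (stop.tv_sec - start.tv_sec) +
           (stop.tv_usec - start.tv_usec) / 1000000.0;

    /* a 7200 rpm disk can commit the same sector at most ~120 times/sec */
    printf("%d rewrites in %.3f secs = %.0f rewrites/sec (7200 rpm => ~120)\n",
           WRITES, secs, WRITES / secs);

    close(fd);
    unlink("/var/tmp/rewrite_probe.out");
    return 0;
}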
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > But why prefer O_DSYNC over fdatasync if you don't prefer O_SYNC over
> > fsync?
>
> It's what tested out as the best bet. I think we were using pgbench
> as the test platform, which as you know I have doubts about, but at
> least it is testing one actual write/sync pattern Postgres can generate.
> The choice between the open flags and fdatasync/fsync depends a whole
> lot on your writing patterns (how much data you tend to write between
> fsync points), so I don't have a lot of faith in randomly-chosen test
> programs as a guide to what to use for Postgres.

I assume pgbench has so much variance that trying to see fsync changes
in there would be hopeless.

> >> What does that mean? You can't fsync a closed file.
>
> > You reopen and fsync.
>
> Um. I just looked at that test program, and I think it needs a whole
> lot of work yet.
>
> * Some of the test cases count open()/close() overhead, some don't.
> This is bad, especially on platforms like Solaris where open() is
> notoriously expensive.

The only one I saw that had an extra open() was the fsync-after-close
test. I added a do-nothing open/close to the previous test so they are
the same.

> * You really cannot put any faith in measuring a single write,
> especially on a machine that's not *completely* idle otherwise.
> I'd feel somewhat comfortable if you wrote, say, 1000 8K blocks and
> measured the time for that.

OK, it now measures a loop of 1000.

> (And you have to think about how far apart the fsyncs are in that
> sequence; you probably want to repeat the measurement with several
> different fsync spacings.) It would also be a good idea to compare
> writing 1000 successive blocks with rewriting the same block 1000
> times --- if the latter does not happen roughly at the disk RPM rate,
> then we know the drive is lying and all the numbers should be
> discarded as meaningless.
>
> * The program is claimed to test whether you can write from one process
> and fsync from another, but it does no such thing AFAICS.

It really just shows whether the fsync after the close has similar
timing to the one before the close. That was the best way I could think
of to test it.

> BTW, rather than hard-wiring the test file name, why don't you let it be
> specified on the command line? That would make it lots easier for
> people to compare the performance of several disk drives, if they have
> 'em.

I have updated the test program in CVS.

New BSD/OS results:

    Simple write timing:
            write                   0.034801
    Compare fsync times on write() and non-write() descriptor:
    (If the times are similar, fsync() can sync data written on a
    different descriptor.)
            write, fsync, close     0.868831
            write, close, fsync     0.717281
    Compare one o_sync write to two:
            one 16k o_sync write    10.121422
            two 8k o_sync writes    4.405151
    Compare file sync methods with one 8k write:
            (o_dsync unavailable)
            open o_sync, write      1.542213
            (fdatasync unavailable)
            write, fsync            1.703689
    Compare file sync methods with 2 8k writes:
    (The fastest should be used for wal_sync_method)
            (o_dsync unavailable)
            open o_sync, write      4.498607
            (fdatasync unavailable)
            write, fsync            2.473842

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Tom Lane wrote:
>> It's what tested out as the best bet. I think we were using pgbench
>> as the test platform, which as you know I have doubts about, but at
>> least it is testing one actual write/sync pattern Postgres can generate.

> I assume pgbench has so much variance that trying to see fsync changes
> in there would be hopeless.

The results were fairly reproducible, as I recall; else we'd have looked
for another test method. You may want to go back and consult the
pghackers archives.

>> * Some of the test cases count open()/close() overhead, some don't.

> The only one I saw that had an extra open() was the fsync-after-close
> test. I added a do-nothing open/close to the previous test so they are
> the same.

Why is it sensible to include open/close overhead in the "simple write"
case and not in the "o_sync write" cases, for instance? Doesn't seem
like a fair comparison to me. Adding the open overhead to all cases
might make it "fair", but it would also make it not what we want to
measure.

>> * The program is claimed to test whether you can write from one process
>> and fsync from another, but it does no such thing AFAICS.

> It really just shows whether the fsync after the close has similar
> timing to the one before the close. That was the best way I could think
> of to test it.

Sure, but where's the "separate process" part? What this seems to test
is whether a single process can sync its own writes through a different
file descriptor; which is interesting but by no means the only thing we
need to be sure of if we want to make the bgwriter handle syncing.

			regards, tom lane
Tom Lane wrote:
> > It really just shows whether the fsync after the close has similar
> > timing to the one before the close. That was the best way I could think
> > of to test it.
>
> Sure, but where's the "separate process" part? What this seems to test
> is whether a single process can sync its own writes through a different
> file descriptor; which is interesting but by no means the only thing we
> need to be sure of if we want to make the bgwriter handle syncing.

I am not sure how to easily test if a separate process can do the same.
I am sure it can be done, but for me it was enough to see that it works
in a single process. Unix isn't very process-centered for I/O, so I
don't think it would make much of a difference. Now, Win32, that might
be an issue.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
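
One way to probe the separate-process case Tom raises is sketched below.
It is only a sketch under assumptions (the file name and sizes are made
up, and like the single-process test it is a timing proxy rather than a
durability proof): the child write()s without syncing, then the parent
fsync()s through its own descriptor and times it. If that fsync takes
about as long as syncing the parent's own write, the kernel is flushing
the child's dirty data too.

/*
 * fork_fsync_probe.c - can a parent fsync data written by its child?
 */
#include <sys/time.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define PROBE_FILE "/var/tmp/fsync_fork_probe.out"

static double
elapsed(struct timeval a, struct timeval b)
{
    return (b.tv_sec - a.tv_sec) + (b.tv_usec - a.tv_usec) / 1000000.0;
}

int
main(void)
{
    static char     block[8192];
    struct timeval  start, stop;
    int             fd;
    pid_t           pid;

    if ((pid = fork()) == 0)
    {
        /* child: dirty the file, but do NOT sync it */
        fd = open(PROBE_FILE, O_RDWR | O_CREAT, 0600);
        write(fd, block, sizeof(block));
        close(fd);
        _exit(0);
    }
    waitpid(pid, NULL, 0);

    /* parent: fsync the child's write through a different descriptor */
    fd = open(PROBE_FILE, O_RDWR);
    gettimeofday(&start, NULL);
    fsync(fd);
    gettimeofday(&stop, NULL);
    printf("fsync of child's write from parent: %f secs\n",
           elapsed(start, stop));

    close(fd);
    unlink(PROBE_FILE);
    return 0;
}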
On Thu, Mar 18, 2004 at 01:50:32PM -0500, Bruce Momjian wrote:
> > I'm not sure I believe these numbers at all... my experience is that
> > getting trustworthy disk I/O numbers is *not* easy.
>
> These numbers were reproducible on all the platforms I tested.

Just because they are reproducible doesn't mean they tell you anything
about the real world.

Kurt