Thread: WAL and commit_delay
I want to give some background on commit_delay, its initial purpose, and
possible options. First, looking at the process that happens during a
commit:

    write() - copy WAL dirty page to kernel disk buffer
    fsync() - force WAL kernel disk buffer to disk platter

fsync() takes much longer than write().

What Vadim doesn't want is:

    time    backend 1    backend 2
    ----    ---------    ---------
    0       write()
    1       fsync()      write()
    2                    fsync()

This would be better as:

    time    backend 1    backend 2
    ----    ---------    ---------
    0       write()
    1                    write()
    2       fsync()      fsync()

This was the purpose of the commit_delay. Having two fsync()'s is not a
problem because only one will see there are dirty buffers. The other
will probably either return right away, or wait for the other's fsync()
to complete.

With the delay, it looks like:

    time    backend 1    backend 2
    ----    ---------    ---------
    0       write()
    1       sleep()      write()
    2       fsync()      sleep()
    3                    fsync()

Which shows the second fsync() doing nothing, which is good, because
there are no dirty buffers at that time. However, a very possible
circumstance is:

    time    backend 1    backend 2    backend 3
    ----    ---------    ---------    ---------
    0       write()
    1       sleep()      write()
    2       fsync()      sleep()      write()
    3                    fsync()      sleep()
    4                                 fsync()

In this case, the fsync() by backend 2 does indeed do some work because
it fsyncs backend 3's write(). Frankly, I don't see how the sleep does
much except delay things because it doesn't have any smarts about when
the delay is useful, and when it is useless. Without that feedback, I
recommend removing the entire setting. For single backends, the sleep
is clearly a loser.

Another situation it cannot deal with is:

    time    backend 1    backend 2
    ----    ---------    ---------
    0       write()
    1       sleep()
    2       fsync()      write()
    3                    sleep()
    4                    fsync()

My solution can't deal with this either.

---------------------------------------------------------------------------

The quick fix is to remove the commit_delay code.
A more elaborate performance boost would be to have each backend get
feedback from other backends, so they can block and wait for other
about-to-fsync backends before fsync(). This allows the write()'s to
bunch up before the fsync().

Here is the single backend case, which experiences no delays:

    time    backend 1       backend 2
    ----    ---------       ---------
    0       get_shlock()
    1       write()
    2       rel_shlock()
    3       get_exlock()
    4       rel_exlock()
    5       fsync()

Here is the two-backend case, which shows both write()'s completing
before the fsync()'s:

    time    backend 1       backend 2
    ----    ---------       ---------
    0       get_shlock()
    1       write()
    2       rel_shlock()    get_shlock()
    3       get_exlock()    write()
    4                       rel_shlock()
    5       rel_exlock()
    6       fsync()         get_exlock()
    7                       rel_exlock()
    8                       fsync()

Contrast that with the first two-backend case presented above:

    time    backend 1    backend 2
    ----    ---------    ---------
    0       write()
    1       fsync()      write()
    2                    fsync()

Now, it is my understanding that instead of just shared locking around
the write()'s, we could block the entire commit code, so the backend can
signal to other about-to-fsync backends to wait.

I believe our existing lock code can be used for the locking/unlocking.
We can just lock a random, unused table like pg_log or something.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> With the delay, it looks like:
>
>     time    backend 1    backend 2
>     ----    ---------    ---------
>     0       write()
>     1       sleep()      write()
>     2       fsync()      sleep()
>     3                    fsync()

Actually ... take a close look at the code. The delay is done in
xact.c between XLogInsert(commitrecord) and XLogFlush(). As near
as I can tell, both the write() and the fsync() will happen in
XLogFlush(). This means the delay is just plain broken: placed
there, it cannot do anything except waste time.

Another thing I am wondering about is why we're not using fdatasync(),
where available, instead of fsync(). The whole point of preallocating
the WAL files is to make fdatasync safe, no?

			regards, tom lane
I wrote: > Actually ... take a close look at the code. The delay is done in > xact.c between XLogInsert(commitrecord) and XLogFlush(). As near > as I can tell, both the write() and the fsync() will happen in > XLogFlush(). This means the delay is just plain broken: placed > there, it cannot do anything except waste time. Uh ... scratch that ... nevermind. The point is that we've inserted our commit record into the WAL output buffer. Now we are sleeping in the hope that some other backend will do both the write and the fsync for us, and that when we eventually call XLogFlush() it will find nothing to do. So the delay is not in the wrong place. > Another thing I am wondering about is why we're not using fdatasync(), > where available, instead of fsync(). The whole point of preallocating > the WAL files is to make fdatasync safe, no? This still looks like it'd be a win, by reducing the number of seeks needed to complete a WAL logfile flush. Right now, each XLogFlush requires writing both the file's data area and its inode. regards, tom lane
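The corrected reading above (insert the commit record, sleep, then find that someone else has already flushed it) can be sketched as a small single-process simulation. This is only an illustration under stated assumptions, not the code in xact.c or xlog.c: the names wal_inserted_upto, wal_flushed_upto, physical_flushes, wal_insert and wal_flush are hypothetical, and the real write() and fsync() calls are replaced by a counter.

```c
#include <assert.h>

/* Hypothetical stand-ins for shared WAL state; not PostgreSQL's real variables. */
static long wal_inserted_upto = 0;  /* end of the last record placed in the WAL buffer */
static long wal_flushed_upto = 0;   /* everything up to here is considered on disk */
static int  physical_flushes = 0;   /* counts simulated write()+fsync() pairs */

/* Place a commit record of `len` bytes in the buffer; return its end position. */
static long wal_insert(long len)
{
    assert(len > 0);
    wal_inserted_upto += len;
    return wal_inserted_upto;
}

/* Make sure the log is on disk at least up to `upto`. */
static void wal_flush(long upto)
{
    if (upto <= wal_flushed_upto)
        return;                 /* another backend already flushed our record */

    /* A real backend would write() and fsync() here. In this sketch the
     * flush covers everything in the buffer, not just the caller's record. */
    wal_flushed_upto = wal_inserted_upto;
    physical_flushes++;
}
```

If backend 1 calls wal_insert() and then sleeps, and backend 2 inserts its own record and calls wal_flush() in the meantime, backend 1's later wal_flush() returns immediately: two commits, one physical flush.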
> Actually ... take a close look at the code. The delay is done in
> xact.c between XLogInsert(commitrecord) and XLogFlush(). As near
> as I can tell, both the write() and the fsync() will happen in
> XLogFlush(). This means the delay is just plain broken: placed
> there, it cannot do anything except waste time.

I see. :-(

> Another thing I am wondering about is why we're not using fdatasync(),
> where available, instead of fsync(). The whole point of preallocating
> the WAL files is to make fdatasync safe, no?

I don't have fdatasync() here. How does it compare to fsync()?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
> > Another thing I am wondering about is why we're not using fdatasync(),
> > where available, instead of fsync(). The whole point of preallocating
> > the WAL files is to make fdatasync safe, no?
>
> This still looks like it'd be a win, by reducing the number of seeks
> needed to complete a WAL logfile flush. Right now, each XLogFlush
> requires writing both the file's data area and its inode.

Don't we have to fsync the inode too? Actually, I was hoping sequential
fsync's could sit on the WAL disk track, but I can imagine it has to
seek around to hit both areas.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
Bruce Momjian <pgman@candle.pha.pa.us> writes: > Another thing I am wondering about is why we're not using fdatasync(), > where available, instead of fsync(). The whole point of preallocating > the WAL files is to make fdatasync safe, no? > Don't we have to fsync the inode too? Actually, I was hoping sequential > fsync's could sit on the WAL disk track, but I can imagine it has to > seek around to hit both areas. That's the point: we're trying to get things set up so that successive writes/fsyncs in the WAL file do the minimum amount of seeking. The WAL code tries to preallocate the whole log file (incorrectly, but that's easily fixed, see below) so that we should not need to update the file metadata when we write into the file. > I don't have fdatasync() here. How does it compare to fsync(). HPUX's man page says : fdatasync() causes all modified data and file attributes of fildes : required to retrieve the data to be written to disk. : fsync() causes all modified data and all file attributes of fildes : (including access time, modification time and status change time) to : be written to disk. The implication is that the only thing you can lose after fdatasync is the highly-inessential file mod time. However, I have been told that on some implementations, fdatasync only flushes data blocks, and never writes the inode or indirect blocks. That would mean that if you had allocated new disk space to the file, fdatasync would not guarantee that that allocation was reflected on disk. This is the reason for preallocating the WAL log file (and doing a full fsync *at that time*). Then you know the inode block pointers and indirect blocks are down on disk, and so fdatasync is sufficient even if you have the cheesy version of fdatasync. Right now the WAL preallocation code (XLogFileInit) is not good enough because it does lseek to the 16MB position and then writes 1 byte there. 
On an implementation that supports holes in files (which is most Unixen) that doesn't cause physical allocation of the intervening space. We'd have to actually write zeroes into all 16MB to ensure the space is allocated ... but that's just a couple more lines of code. regards, tom lane
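A minimal sketch of the zero-fill preallocation described above, offered as an illustration rather than the actual XLogFileInit code. The function names prealloc_zero_fill and allocated_blocks, the 8192-byte buffer, and the file mode are assumptions made for the example. The point is that every byte is physically written, and the full fsync() happens once, at creation time.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Create `path` and write `size` bytes of zeroes into it, then fsync() so
 * that data and file metadata reach disk. Returns 0 on success, -1 on
 * failure. Hypothetical helper, not the real XLogFileInit.
 */
static int prealloc_zero_fill(const char *path, off_t size)
{
    char  zbuf[8192];
    off_t done = 0;
    int   fd;

    assert(size > 0);
    memset(zbuf, 0, sizeof(zbuf));

    fd = open(path, O_RDWR | O_CREAT | O_EXCL, 0600);
    if (fd < 0)
        return -1;

    while (done < size)
    {
        size_t  chunk = sizeof(zbuf);
        ssize_t n;

        if ((off_t) chunk > size - done)
            chunk = (size_t) (size - done);
        n = write(fd, zbuf, chunk);     /* real writes, so no hole is left */
        if (n <= 0)
        {
            close(fd);
            unlink(path);
            return -1;
        }
        done += n;
    }

    if (fsync(fd) != 0)                 /* full fsync once, at creation time */
    {
        close(fd);
        unlink(path);
        return -1;
    }
    return close(fd);
}

/* Allocated block count as reported in st_blocks, or -1 on error. */
static long long allocated_blocks(const char *path)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return -1;
    return (long long) st.st_blocks;
}
```

Comparing allocated_blocks() for a file made this way against one made by seeking to the end and writing a single byte is one way to check whether a given filesystem leaves a hole.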
> Right now the WAL preallocation code (XLogFileInit) is not good enough
> because it does lseek to the 16MB position and then writes 1 byte there.
> On an implementation that supports holes in files (which is most Unixen)
> that doesn't cause physical allocation of the intervening space. We'd
> have to actually write zeroes into all 16MB to ensure the space is
> allocated ... but that's just a couple more lines of code.

Are OS's smart enough to not allocate zero-written blocks? Do we need
to write non-zeros?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
* Bruce Momjian <pgman@candle.pha.pa.us> [010217 14:46]:
> > Right now the WAL preallocation code (XLogFileInit) is not good enough
> > because it does lseek to the 16MB position and then writes 1 byte there.
> > On an implementation that supports holes in files (which is most Unixen)
> > that doesn't cause physical allocation of the intervening space. We'd
> > have to actually write zeroes into all 16MB to ensure the space is
> > allocated ... but that's just a couple more lines of code.
>
> Are OS's smart enough to not allocate zero-written blocks? Do we need
> to write non-zeros?

I don't believe so. writing Zeros is valid.

--
Larry Rosenman                      http://www.lerctr.org/~ler
Phone: +1 972-414-9812                 E-Mail: ler@lerctr.org
US Mail: 1905 Steamboat Springs Drive, Garland, TX 75044-6749
> * Bruce Momjian <pgman@candle.pha.pa.us> [010217 14:46]:
> > > Right now the WAL preallocation code (XLogFileInit) is not good enough
> > > because it does lseek to the 16MB position and then writes 1 byte there.
> > > On an implementation that supports holes in files (which is most Unixen)
> > > that doesn't cause physical allocation of the intervening space. We'd
> > > have to actually write zeroes into all 16MB to ensure the space is
> > > allocated ... but that's just a couple more lines of code.
> >
> > Are OS's smart enough to not allocate zero-written blocks? Do we need
> > to write non-zeros?
> I don't believe so. writing Zeros is valid.

The reason I ask is because I know you get zeros when trying to read
data from a file with holes, so it seems some OS could actually drop
those blocks from storage.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
* Bruce Momjian <pgman@candle.pha.pa.us> [010217 14:50]:
> > * Bruce Momjian <pgman@candle.pha.pa.us> [010217 14:46]:
> > > > Right now the WAL preallocation code (XLogFileInit) is not good enough
> > > > because it does lseek to the 16MB position and then writes 1 byte there.
> > > > On an implementation that supports holes in files (which is most Unixen)
> > > > that doesn't cause physical allocation of the intervening space. We'd
> > > > have to actually write zeroes into all 16MB to ensure the space is
> > > > allocated ... but that's just a couple more lines of code.
> > >
> > > Are OS's smart enough to not allocate zero-written blocks? Do we need
> > > to write non-zeros?
> > I don't believe so. writing Zeros is valid.
>
> The reason I ask is because I know you get zeros when trying to read
> data from a file with holes, so it seems some OS could actually drop
> those blocks from storage.

I've written swap files and such with:

    dd if=/dev/zero of=SWAPFILE bs=512 count=204800

and all the blocks are allocated.

LER

--
Larry Rosenman                      http://www.lerctr.org/~ler
Phone: +1 972-414-9812                 E-Mail: ler@lerctr.org
US Mail: 1905 Steamboat Springs Drive, Garland, TX 75044-6749
Larry Rosenman <ler@lerctr.org> writes: > I've written swap files and such with: > dd if=/dev/zero of=SWAPFILE bs=512 count=204800 > and all the blocks are allocated. I've also confirmed that writing zeroes is sufficient on HPUX (du shows that the correct amount of space is allocated, unlike the current seek-to-the-end method). Some poking around the net shows that pre-2.4 Linux kernels implement fdatasync() as fsync(), and we already knew that BSD hasn't got it at all. So distinguishing fdatasync from fsync won't be helpful for very many people yet --- but I still think we should do it. I'm playing with a test setup in which I just changed pg_fsync to call fdatasync instead of fsync, and on HPUX I'm seeing pgbench tps values around 17, as opposed to 13 yesterday. (The HPUX man page warns that these calls are inefficient for large files, and I wouldn't be surprised if a lot of the run time is now being spent in the kernel scanning through all the buffers that belong to the logfile. 2.4 Linux is apparently reasonably smart about this case, and only looks at the actually dirty buffers.) Is anyone out there running a 2.4 Linux kernel? Would you try pgbench with current sources, commit_delay=0, -B at least 1024, no -F, and see how the results change when pg_fsync is made to call fdatasync instead of fsync? (It's in src/backend/storage/file/fd.c) regards, tom lane
On Sat, Feb 17, 2001 at 03:45:30PM -0500, Bruce Momjian wrote: > > Right now the WAL preallocation code (XLogFileInit) is not good enough > > because it does lseek to the 16MB position and then writes 1 byte there. > > On an implementation that supports holes in files (which is most Unixen) > > that doesn't cause physical allocation of the intervening space. We'd > > have to actually write zeroes into all 16MB to ensure the space is > > allocated ... but that's just a couple more lines of code. > > Are OS's smart enough to not allocate zero-written blocks? No, but some disks are. Writing zeroes is a bit faster on smart disks. This has no real implications for PG, but it is one of the reasons that writing zeroes doesn't really wipe a disk, for forensic purposes. Nathan Myers ncm@zembu.com
On Sat, 17 Feb 2001, Tom Lane wrote: > Another thing I am wondering about is why we're not using fdatasync(), > where available, instead of fsync(). The whole point of preallocating > the WAL files is to make fdatasync safe, no? Linux/x86 fdatasync(2) manpage: BUGS Currently (Linux 2.0.23) fdatasync is equivalent to fsync. -- Dominic J. Eidson "Baruk Khazad! Khazad ai-menu!" - Gimli ------------------------------------------------------------------------------- http://www.the-infinite.org/ http://www.the-infinite.org/~dominic/
On 17 Feb 2001 at 17:56 (-0500), Tom Lane wrote:

[snipped]

| Is anyone out there running a 2.4 Linux kernel? Would you try pgbench
| with current sources, commit_delay=0, -B at least 1024, no -F, and see
| how the results change when pg_fsync is made to call fdatasync instead
| of fsync? (It's in src/backend/storage/file/fd.c)

I've not run this requested test, but glibc-2.2 provides this bit
of code for fdatasync, so it /appears/ to me that kernel version
will not affect the test case.

[glibc-2.2/sysdeps/generic/fdatasync.c]

  int
  fdatasync (int fildes)
  {
    return fsync (fildes);
  }

hth.
  brent

--
"We want to help, but we wouldn't want to deprive you of a valuable
learning experience."
  http://openbsd.org/mail.html
On Sat, Feb 17, 2001 at 06:30:12PM -0500, Brent Verner wrote:
> On 17 Feb 2001 at 17:56 (-0500), Tom Lane wrote:
>
> [snipped]
>
> | Is anyone out there running a 2.4 Linux kernel? Would you try pgbench
> | with current sources, commit_delay=0, -B at least 1024, no -F, and see
> | how the results change when pg_fsync is made to call fdatasync instead
> | of fsync? (It's in src/backend/storage/file/fd.c)
>
> I've not run this requested test, but glibc-2.2 provides this bit
> of code for fdatasync, so it /appears/ to me that kernel version
> will not affect the test case.
>
> [glibc-2.2/sysdeps/generic/fdatasync.c]
>
>   int
>   fdatasync (int fildes)
>   {
>     return fsync (fildes);
>   }

In the 2.4 kernel it says (fs/buffer.c)

    /* this needs further work, at the moment it is identical to fsync() */
    down(&inode->i_sem);
    err = file->f_op->fsync(file,dentry);
    up(&inode->i_sem);

We can probably expect this to be fixed in an upcoming 2.4.x, i.e.
well before 2.6.

This is moot, though, if you're writing to a raw volume, which
you will be if you are really serious. Then, fsync really is
equivalent to fdatasync.

Nathan Myers
ncm@zembu.com
On 17 Feb 2001 at 15:53 (-0800), Nathan Myers wrote:
| On Sat, Feb 17, 2001 at 06:30:12PM -0500, Brent Verner wrote:
| > On 17 Feb 2001 at 17:56 (-0500), Tom Lane wrote:
| >
| > [snipped]
| >
| > | Is anyone out there running a 2.4 Linux kernel? Would you try pgbench
| > | with current sources, commit_delay=0, -B at least 1024, no -F, and see
| > | how the results change when pg_fsync is made to call fdatasync instead
| > | of fsync? (It's in src/backend/storage/file/fd.c)
| >
| > I've not run this requested test, but glibc-2.2 provides this bit
| > of code for fdatasync, so it /appears/ to me that kernel version
| > will not affect the test case.
| >
| > [glibc-2.2/sysdeps/generic/fdatasync.c]
| >
| >   int
| >   fdatasync (int fildes)
| >   {
| >     return fsync (fildes);
| >   }
|
| In the 2.4 kernel it says (fs/buffer.c)
|
|     /* this needs further work, at the moment it is identical to fsync() */
|     down(&inode->i_sem);
|     err = file->f_op->fsync(file, dentry);
|     up(&inode->i_sem);
|
| We can probably expect this to be fixed in an upcoming 2.4.x, i.e.
| well before 2.6.

2.4.0-ac11 already has provisions for fdatasync

[fs/buffer.c]

    asmlinkage long sys_fsync(unsigned int fd)
    {
    ...
        down(&inode->i_sem);
        filemap_fdatasync(inode->i_mapping);
        err = file->f_op->fsync(file, dentry, 0);
        filemap_fdatawait(inode->i_mapping);
        up(&inode->i_sem);

    asmlinkage long sys_fdatasync(unsigned int fd)
    {
    ...
        down(&inode->i_sem);
        filemap_fdatasync(inode->i_mapping);
        err = file->f_op->fsync(file, dentry, 1);
        filemap_fdatawait(inode->i_mapping);
        up(&inode->i_sem);

ext2 does use this third param of its fsync() operation to (potentially)
bypass a call to ext2_sync_inode(inode) b
ncm@zembu.com (Nathan Myers) writes: > In the 2.4 kernel it says (fs/buffer.c) > /* this needs further work, at the moment it is identical to fsync() */ > down(&inode->i_sem); > err = file->f_op->fsync(file, dentry); > up(&inode->i_sem); Hmm, that's the same code that's been there since 2.0 or before. I had trawled the Linux kernel mail lists and found patch submissions from several different people to make fdatasync really work, and what I thought was an indication that at least one had been applied. Evidently not. Oh well... regards, tom lane
On Sat, Feb 17, 2001 at 07:34:22PM -0500, Tom Lane wrote:
> ncm@zembu.com (Nathan Myers) writes:
> > In the 2.4 kernel it says (fs/buffer.c)
>
> >    /* this needs further work, at the moment it is identical to fsync() */
> >    down(&inode->i_sem);
> >    err = file->f_op->fsync(file, dentry);
> >    up(&inode->i_sem);
>
> Hmm, that's the same code that's been there since 2.0 or before.

Indeed. All xterms look alike, and I used one connected to the wrong box.
Here's what's in 2.4.0:

For fsync:

    filemap_fdatasync(inode->i_mapping);
    err = file->f_op->fsync(file, dentry, 0);
    filemap_fdatawait(inode->i_mapping);

and for fdatasync:

    filemap_fdatasync(inode->i_mapping);
    err = file->f_op->fsync(file, dentry, 1);
    filemap_fdatawait(inode->i_mapping);

(Notice the "1" vs. "0" difference?) So the actual file system
(ext2fs, reiserfs, etc.) has the option of equating the two, or not.
In fs/ext2/fsync.c, we have

    int ext2_fsync_inode(struct inode *inode, int datasync)
    {
        int err;

        err = fsync_inode_buffers(inode);
        if (!(inode->i_state & I_DIRTY))
            return err;
        if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
            return err;

        err |= ext2_sync_inode(inode);
        return err ? -EIO : 0;
    }

I.e. yes, Linux 2.4.0 and ext2 do implement the distinction.
Sorry for the misinformation.

Nathan Myers
ncm@zembu.com
ncm@zembu.com (Nathan Myers) writes: > I.e. yes, Linux 2.4.0 and ext2 do implement the distinction. > Sorry for the misinformation. Okay ... meanwhile I've got to report the reverse: I've just confirmed that on HPUX 10.20, there is *not* a distinction between fsync and fdatasync. I was misled by what was apparently an outlier result on my first try with fdatasync plugged in ... but when I couldn't reproduce that, some digging led to the fact that the fsync and fdatasync symbols in libc are at the same place :-(. Still, using fdatasync for the WAL file seems like a forward-looking thing to do, and it'll just take another couple of lines of configure code, so I'll go ahead and plug it in. regards, tom lane
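A hedged sketch of what "plug it in" might look like: use fdatasync() where a configure-style macro says it exists, and fall back to fsync() otherwise. The macro HAVE_FDATASYNC and the function names wal_sync and wal_append_and_sync are assumptions made for this example. They are not taken from the PostgreSQL sources, and the actual change may be structured differently.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Flush a WAL file descriptor. HAVE_FDATASYNC is assumed to be defined by a
 * configure test on platforms that provide fdatasync(); wal_sync is a
 * hypothetical name.
 */
static int wal_sync(int fd)
{
    assert(fd >= 0);
#ifdef HAVE_FDATASYNC
    return fdatasync(fd);   /* may skip attributes not needed to read the data */
#else
    return fsync(fd);       /* portable fallback: data plus all file attributes */
#endif
}

/* Append `len` bytes to `path` and sync them; 0 on success, -1 on error. */
static int wal_append_and_sync(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0600);
    int rc;

    if (fd < 0)
        return -1;
    if (write(fd, buf, len) != (ssize_t) len)
    {
        close(fd);
        return -1;
    }
    rc = wal_sync(fd);
    if (close(fd) != 0)
        rc = -1;
    return rc;
}
```

As the messages above show, the presence of fdatasync() does not guarantee it is cheaper than fsync() on a given platform (HPUX 10.20 and glibc's generic version map it to the same thing), so any gain still has to be measured.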
fdatasync() is available on Tru64 and according to the man-page behaves as Tom expects. So it should be a win for us. What do other commercial unixes say? Adriaan
Adriaan Joubert <a.joubert@albourne.com> writes: > fdatasync() is available on Tru64 and according to the man-page behaves > as Tom expects. So it should be a win for us. Careful ... HPUX's man page also claims that fdatasync does something useful, but it doesn't. I'd recommend an experiment. Does today's snapshot run any faster for you (without -F) than before? regards, tom lane
* Tom Lane <tgl@sss.pgh.pa.us> [010218 10:53]: > Adriaan Joubert <a.joubert@albourne.com> writes: > > fdatasync() is available on Tru64 and according to the man-page behaves > > as Tom expects. So it should be a win for us. > > Careful ... HPUX's man page also claims that fdatasync does something > useful, but it doesn't. I'd recommend an experiment. Does today's > snapshot run any faster for you (without -F) than before? BTW, UnixWare 7.1.1 does *NOT* have fdatasync. What standard created this one? > > regards, tom lane -- Larry Rosenman http://www.lerctr.org/~ler Phone: +1 972-414-9812 E-Mail: ler@lerctr.org US Mail: 1905 Steamboat Springs Drive, Garland, TX 75044-6749
Tom Lane <tgl@sss.pgh.pa.us> writes: > The implication is that the only thing you can lose after fdatasync is > the highly-inessential file mod time. However, I have been told that > on some implementations, fdatasync only flushes data blocks, and never > writes the inode or indirect blocks. That would mean that if you had > allocated new disk space to the file, fdatasync would not guarantee > that that allocation was reflected on disk. This is the reason for > preallocating the WAL log file (and doing a full fsync *at that time*). > Then you know the inode block pointers and indirect blocks are down > on disk, and so fdatasync is sufficient even if you have the cheesy > version of fdatasync. Actually, there is also a performance reason. Indeed, fdatasync would not perform any better than fsync if the log file was not preallocated: the file length would change each time a record is appended, and therefore the inode would have to be updated. -- Jerome
Larry Rosenman <ler@lerctr.org> writes: > BTW, UnixWare 7.1.1 does *NOT* have fdatasync. What standard created > this one? HP's manpage quoth: STANDARDS CONFORMANCE fsync(): AES, SVID3, XPG3, XPG4, POSIX.4 fdatasync(): POSIX.4 regards, tom lane
Jerome Vouillon <vouillon@saul.cis.upenn.edu> writes: > Actually, there is also a performance reason. Indeed, fdatasync would > not perform any better than fsync if the log file was not > preallocated: the file length would change each time a record is > appended, and therefore the inode would have to be updated. Good point, but seeking to the 16-meg position and writing one byte was already sufficient to take care of that issue. I think that there may be a performance advantage to pre-filling the logfile even so, assuming that file allocation info is stored in a Berkeley/McKusik-like fashion (note: I have no idea what ext2 or reiserfs actually do). Namely, we'll only sync the file's indirect blocks once, in the fsync() at the end of XLogFileInit. A correct fdatasync implementation would have to sync the last indirect block each time a new filesystem block is added to the logfile, so it would end up doing a lot of seeks for that purpose even if it rarely touches the inode itself. Another point is that if the logfile is pre-filled over a short interval, its blocks are more likely to be allocated close to each other than if it grows to full size over a longer interval. Not much point in avoiding seeks outside the file data if the file data itself is scattered all over the place :-(. Basically we're trading more work in XLogFileInit (which we hope is not time-critical) for less work in typical transaction commits. regards, tom lane
On Sun, Feb 18, 2001 at 11:51:50AM -0500, Tom Lane wrote: > Adriaan Joubert <a.joubert@albourne.com> writes: > > fdatasync() is available on Tru64 and according to the man-page behaves > > as Tom expects. So it should be a win for us. > > Careful ... HPUX's man page also claims that fdatasync does something > useful, but it doesn't. I'd recommend an experiment. Does today's > snapshot run any faster for you (without -F) than before? It's worth noting in documentation that systems that don't have fdatasync(), or that have the phony implementation, can get the same benefit by using a raw volume (partition) for the log file. This applies even on Linux 2.0 and 2.2 without the "raw-i/o" patch. Using raw volumes would have other performance benefits, even on systems that do fully support fdatasync, through bypassing the buffer cache. (The above assumes I understood correctly Vadim's postings about changes he made to support putting logs on raw volumes.) Nathan Myers ncm@zembu.com
On Sun, 18 Feb 2001, Tom Lane wrote:

> I think that there may be a performance advantage to pre-filling the
> logfile even so, assuming that file allocation info is stored in a
> Berkeley/McKusik-like fashion (note: I have no idea what ext2 or
> reiserfs actually do).

ext2 is a lot like [UF]FS. reiserfs is very different, but does have
similar hole semantics.

BTW, I have attached two patches which streamline log initialisation
a little. The first (xlog-sendfile.diff) adds support for Linux's
sendfile system call. FreeBSD and HP/UX have sendfile() too, but the
prototype is different. If it's interesting, someone will have to
come up with a configure test, as autoconf scares me.

The second removes a further three syscalls from the log init path.
There are a couple of things to note here:

 * I don't know why link/unlink is currently preferred over
   rename. POSIX offers strong guarantees on the semantics
   of the latter.
 * I have assumed that the close/rename/reopen stuff is only
   there for the benefit of Windows users, and ifdeffed it
   for everyone else.

Matthew.
On Mon, 19 Feb 2001, Matthew Kirkwood wrote: > BTW, I have attached two patches which streamline log initialisation > a little. The first (xlog-sendfile.diff) adds support for Linux's > sendfile system call. Whoops, don't use this. It looks like Linux won't sendfile() from /dev/zero. I'll endeavour to get this fixed, but it looks like it'll be rather harder to use sendfile for this. Bah. Matthew.
Matthew Kirkwood <matthew@hairy.beasts.org> writes: > BTW, I have attached two patches which streamline log initialisation > a little. The first (xlog-sendfile.diff) adds support for Linux's > sendfile system call. FreeBSD and HP/UX have sendfile() too, but the > prototype is different. If it's interesting, someone will have to > come up with a configure test, as autoconf scares me. I think we don't want to mess with something as unportable as that at this late stage of the release cycle (quite aside from your later note that it doesn't work ;-)). > The second removes a further three syscalls from the log init path. > There are a couple of things to note here: > * I don't know why link/unlink is currently preferred over > rename. POSIX offers strong guarantees on the semantics > of the latter. > * I have assumed that the close/rename/reopen stuff is only > there for the benefit of Windows users, and ifdeffed it > for everyone else. The reason for avoiding rename() is that the POSIX guarantees are the wrong ones: specifically, rename promises to overwrite an existing destination, which is exactly what we *don't* want. In theory two backends cannot be executing this code in parallel, but if they were, we would not want to destroy a logfile that perhaps already contains WAL entries by the time we finish preparing our own logfile. link() will fail if the destination name exists, which is a lot safer. I'm not sure about the close/reopen stuff; I agree it looks unnecessary. But this function is going to be so I/O bound (particularly now that it fills the file) that two more kernel calls are insignificant. regards, tom lane
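The link()-versus-rename() point can be illustrated with a small helper that installs a prepared file under its final name only if that name is not already taken. This is a sketch under stated assumptions, not the code in xlog.c: install_no_clobber and its return convention are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Give the file at `tmppath` the name `destpath` without ever replacing an
 * existing file. Returns 0 on success, 1 if `destpath` already exists (it is
 * left untouched), and -1 on any other error.
 * Hypothetical helper for illustration only.
 */
static int install_no_clobber(const char *tmppath, const char *destpath)
{
    assert(tmppath != NULL && destpath != NULL);

    if (link(tmppath, destpath) != 0)
        return (errno == EEXIST) ? 1 : -1;  /* link() refuses to overwrite */

    /* The file now has both names; drop the temporary one. */
    return (unlink(tmppath) == 0) ? 0 : -1;
}
```

With rename(tmppath, destpath) the second case could not be detected: an existing destpath would be silently replaced, which is exactly the outcome described above as unwanted.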
Tom Lane wrote:
> Adriaan Joubert <a.joubert@albourne.com> writes:
> > fdatasync() is available on Tru64 and according to the man-page behaves
> > as Tom expects. So it should be a win for us.
>
> Careful ... HPUX's man page also claims that fdatasync does something
> useful, but it doesn't. I'd recommend an experiment. Does today's
> snapshot run any faster for you (without -F) than before?

IIRC your HPUX manpage states that fdatasync() updates only required
information to find back the data. It sounded to me that HPUX
distinguishes between irrelevant inode info (like modtime) and
important things (like blocks).

But maybe I'm confused by HP and they can still tell me an X for an U.

Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #