Thread: performance: use pread instead of lseek+read
Hi all,

postgresql tries very hard to avoid calling lseek if not needed, probably to avoid doing unnecessary syscalls. What about removing lseek entirely and using the p{read,write}?

pread is identical to the normal read syscall, except that it has one additional parameter: the position from which the data should be read. All recent unices support that, it's part of POSIX.1c.

Attached is a patch vs the cvs tree.
It seems to work - 7.3.2 with the patch applied passes the regression test suite on RH Linux. Untested with cvs-head, preproc.y causes a parser overflow.

What do you think?
- configure: I test for existence of pread, and assume that pwrite will exist, too. Acceptable?
- Are you interested in further patches? xlog.c contains a few lseeks, but I doubt that they are in the critical path.

--
    Manfred

diff -u -u -r -x configure pgsql.orig/configure.in pgsql/configure.in
--- pgsql.orig/configure.in 2003-02-19 05:04:04.000000000 +0100
+++ pgsql/configure.in 2003-02-23 09:38:46.000000000 +0100
@@ -786,7 +786,7 @@
 # SunOS doesn't handle negative byte comparisons properly with +/- return
 AC_FUNC_MEMCMP

-AC_CHECK_FUNCS([cbrt fcvt getpeereid memmove pstat setproctitle setsid sigprocmask sysconf waitpid dlopen fdatasync utime utimes])
+AC_CHECK_FUNCS([cbrt fcvt getpeereid memmove pstat setproctitle setsid sigprocmask sysconf waitpid dlopen fdatasync utime utimes pread])

 AC_CHECK_DECLS(fdatasync, [], [], [#include <unistd.h>])

diff -u -u -r -x configure pgsql.orig/src/backend/storage/file/fd.c pgsql/src/backend/storage/file/fd.c
--- pgsql.orig/src/backend/storage/file/fd.c 2002-09-02 08:11:42.000000000 +0200
+++ pgsql/src/backend/storage/file/fd.c 2003-02-23 09:45:41.000000000 +0100
@@ -391,7 +391,9 @@
     Delete(file);

     /* save the seek position */
+#ifndef HAVE_PREAD
     vfdP->seekPos = (long) lseek(vfdP->fd, 0L, SEEK_CUR);
+#endif
     Assert(vfdP->seekPos != -1L);

     /* close the file */
@@ -462,6 +464,7 @@
         ++nfile;
     }

+#ifndef HAVE_PREAD
     /* seek to the right position */
     if (vfdP->seekPos != 0L)
     {
@@ -470,6 +473,7 @@
         returnValue = (long) lseek(vfdP->fd, vfdP->seekPos, SEEK_SET);
         Assert(returnValue != -1L);
     }
+#endif
     }

     /*
@@ -877,11 +881,17 @@
                VfdCache[file].seekPos, amount, buffer));

     FileAccess(file);
+#if HAVE_PREAD
+    returnCode = pread(VfdCache[file].fd, buffer, amount, VfdCache[file].seekPos);
+    if (returnCode > 0)
+        VfdCache[file].seekPos += returnCode;
+#else
     returnCode = read(VfdCache[file].fd, buffer, amount);
     if (returnCode > 0)
         VfdCache[file].seekPos += returnCode;
     else
         VfdCache[file].seekPos = FileUnknownPos;
+#endif

     return returnCode;
 }
@@ -900,16 +910,25 @@
     FileAccess(file);

     errno = 0;
+#if HAVE_PREAD
+    returnCode = pwrite(VfdCache[file].fd, buffer, amount, VfdCache[file].seekPos);
+#else
     returnCode = write(VfdCache[file].fd, buffer, amount);
+#endif

     /* if write didn't set errno, assume problem is no disk space */
     if (returnCode != amount && errno == 0)
         errno = ENOSPC;

+#if HAVE_PREAD
+    if (returnCode > 0)
+        VfdCache[file].seekPos += returnCode;
+#else
     if (returnCode > 0)
         VfdCache[file].seekPos += returnCode;
     else
         VfdCache[file].seekPos = FileUnknownPos;
+#endif

     return returnCode;
 }
@@ -951,12 +970,20 @@
             case SEEK_SET:
                 if (offset < 0)
                     elog(ERROR, "FileSeek: invalid offset: %ld", offset);
+#ifdef HAVE_PREAD
+                VfdCache[file].seekPos = offset;
+#else
                 if (VfdCache[file].seekPos != offset)
                     VfdCache[file].seekPos = lseek(VfdCache[file].fd, offset, whence);
+#endif
                 break;
             case SEEK_CUR:
+#ifdef HAVE_PREAD
+                VfdCache[file].seekPos += offset;
+#else
                 if (offset != 0 || VfdCache[file].seekPos == FileUnknownPos)
                     VfdCache[file].seekPos = lseek(VfdCache[file].fd, offset, whence);
+#endif
                 break;
             case SEEK_END:
                 VfdCache[file].seekPos = lseek(VfdCache[file].fd, offset, whence);
diff -u -u -r -x configure pgsql.orig/src/include/pg_config.h.in pgsql/src/include/pg_config.h.in
--- pgsql.orig/src/include/pg_config.h.in 2003-02-19 05:04:04.000000000 +0100
+++ pgsql/src/include/pg_config.h.in 2003-02-23 09:59:47.000000000 +0100
@@ -544,6 +544,9 @@
 /* Define if the standard header unistd.h declares fdatasync() */
 #undef HAVE_DECL_FDATASYNC

+/* Define if you have pread(). Implies pwrite, too. */
+#undef HAVE_PREAD
+
 /* Set to 1 if you have libz.a */
 #undef HAVE_LIBZ
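For readers unfamiliar with the call, here is a minimal sketch of the two ways to read a block at a given position that the proposal contrasts. It assumes a POSIX system; the helper names (read_at_lseek, read_at_pread, open_scratch_file) and the scratch path are hypothetical and are not part of the patch.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700   /* expose pread() and mkstemp() under strict compilers */
#endif
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Two syscalls: move the file offset, then read from wherever it points. */
static ssize_t read_at_lseek(int fd, void *buf, size_t len, off_t pos)
{
    assert(buf != NULL);
    if (lseek(fd, pos, SEEK_SET) == (off_t) -1)
        return -1;
    return read(fd, buf, len);
}

/* One syscall: the position travels as the fourth argument. */
static ssize_t read_at_pread(int fd, void *buf, size_t len, off_t pos)
{
    assert(buf != NULL);
    return pread(fd, buf, len, pos);
}

/* Hypothetical helper: create an already-unlinked scratch file holding text. */
static int open_scratch_file(const char *text)
{
    char path[] = "/tmp/pread_demo_XXXXXX";
    size_t len = strlen(text);
    int fd = mkstemp(path);

    if (fd < 0)
        return -1;
    unlink(path);               /* the file disappears once fd is closed */
    if (write(fd, text, len) != (ssize_t) len)
    {
        close(fd);
        return -1;
    }
    return fd;
}
```

Both helpers should return the same bytes for the same position; whether the single-call variant is measurably faster is exactly what the rest of the thread debates.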
Manfred Spraul <manfred@colorfullife.com> writes:
> What about removing lseek entirely and using the p{read,write}?

Portability.

$ man pread
No manual entry for pread.
$

It seems unlikely to me that eliminating lseek on some platforms would be worth the hassle of maintaining two code paths. lseek is mighty cheap as system calls go. What's worse, a series of preads (as opposed to reads without intervening lseek) might not trigger kernel read-ahead optimizations, in which case this would be a tremendous disimprovement.

> Attached is a patch vs the cvs tree.
> It seems to work - 7.3.2 with the patch applied passes the regression

Can you measure any performance benefit?

regards, tom lane
Tom Lane wrote:
>Manfred Spraul <manfred@colorfullife.com> writes:
>>What about removing lseek entirely and using the p{read,write}?
>
>Portability.
>
>$ man pread
>No manual entry for pread.
>$

Which OS? Google finds manpages for Tru64 4.0, HP-UX 11i, Solaris 8 and AIX 4.3. Linux has had it since 2.2, FreeBSD at least since 4.0; it's even listed in the SVR-4 emulation of FreeBSD.

>It seems unlikely to me that eliminating lseek on some platforms would
>be worth the hassle of maintaining two code paths. lseek is mighty
>cheap as system calls go.

It was considered expensive enough to write a syscall avoidance layer that caches the file pointer and skips lseek if fpos==offset. A kernel must perform quite a lot of parameter validation and synchronization - think about a multithreaded app where close and lseek could race.

>>Attached is a patch vs the cvs tree.
>>It seems to work - 7.3.2 with the patch applied passes the regression
>
>Can you measure any performance benefit?

What would be an interesting benchmark?
If you want a microbenchmark: lseek+read(,,8192) is around 10% slower than pread(,,,8192) with hot cpu caches on my Celeron Mobile Laptop, running Linux-2.4.20.

The read-ahead question is impossible to answer without looking at the sources. Linux does readahead; actually, read is implemented as pread(,,,file->f_pos).

--
    Manfred
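A microbenchmark of the kind described above could look roughly like the sketch below. It is not the program used for the 10% figure, which was not posted. The block size matches the 8192 bytes mentioned; the function names, scratch path, and the choice of CLOCK_MONOTONIC are assumptions made for illustration.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700   /* expose pread(), mkstemp() and clock_gettime() */
#endif
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

#define BLOCK_SIZE 8192

static double now_seconds(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double) ts.tv_sec + (double) ts.tv_nsec / 1e9;
}

/* Hypothetical helper: unlinked scratch file of nblocks filled blocks. */
static int make_block_file(int nblocks)
{
    char path[] = "/tmp/pread_bench_XXXXXX";
    char block[BLOCK_SIZE];
    int fd = mkstemp(path);

    if (fd < 0)
        return -1;
    unlink(path);
    memset(block, 'x', sizeof block);
    for (int i = 0; i < nblocks; i++)
    {
        if (write(fd, block, sizeof block) != (ssize_t) sizeof block)
        {
            close(fd);
            return -1;
        }
    }
    return fd;
}

/*
 * Time `iterations` block reads that cycle through the file.
 * use_pread selects pread(); otherwise lseek() followed by read().
 * Returns elapsed seconds, or -1.0 on an error or short read.
 */
static double time_block_reads(int fd, int nblocks, long iterations, int use_pread)
{
    char buf[BLOCK_SIZE];
    double start;

    assert(nblocks > 0);
    start = now_seconds();
    for (long i = 0; i < iterations; i++)
    {
        off_t pos = (off_t) (i % nblocks) * BLOCK_SIZE;
        ssize_t got;

        if (use_pread)
            got = pread(fd, buf, sizeof buf, pos);
        else
        {
            if (lseek(fd, pos, SEEK_SET) == (off_t) -1)
                return -1.0;
            got = read(fd, buf, sizeof buf);
        }
        if (got != (ssize_t) sizeof buf)
            return -1.0;
    }
    return now_seconds() - start;
}
```

Calling time_block_reads twice on a small file that fits in the page cache, once with use_pread set to 0 and once with 1, measures only syscall overhead with hot caches. It says nothing about read-ahead behaviour or about performance inside a running database.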
Manfred Spraul <manfred@colorfullife.com> writes:
> Tom Lane wrote:
>> It seems unlikely to me that eliminating lseek on some platforms would
>> be worth the hassle of maintaining two code paths. lseek is mighty
>> cheap as system calls go.
>>
> It was considered expensive enough to write a syscall avoidance layer
> that caches the file pointer and skips lseek if fpos==offset.

You're missing the point: that layer is mostly there to ensure that we don't foul up the kernel's readahead recognition for sequential fetches. It's nice that Linux doesn't care, but Linux is not the only platform we worry about.

regards, tom lane
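The position-caching layer under discussion can be sketched roughly as follows. The sketch mirrors the non-pread branch of the patch earlier in the thread: remember the offset, skip lseek when the requested position already matches, and forget the offset when a read does not succeed. CachedFile, cached_seek_set, cached_read, cached_open_scratch and the lseek_calls counter are hypothetical names for illustration, not PostgreSQL's actual fd.c code.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700   /* expose mkstemp() under strict compilers */
#endif
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define UNKNOWN_POS ((off_t) -1)

typedef struct
{
    int   fd;
    off_t seek_pos;     /* cached file offset, or UNKNOWN_POS */
    long  lseek_calls;  /* demo counter: lseek() calls actually issued */
} CachedFile;

/* Seek to an absolute offset, skipping the syscall if we are already there. */
static off_t cached_seek_set(CachedFile *f, off_t offset)
{
    assert(offset >= 0);
    if (f->seek_pos != offset)
    {
        f->seek_pos = lseek(f->fd, offset, SEEK_SET);
        f->lseek_calls++;
    }
    return f->seek_pos;
}

/* Read from the current offset and keep the cached position in step. */
static ssize_t cached_read(CachedFile *f, void *buf, size_t len)
{
    ssize_t got = read(f->fd, buf, len);

    if (got > 0)
        f->seek_pos += got;
    else
        f->seek_pos = UNKNOWN_POS;  /* forces a real lseek next time */
    return got;
}

/* Hypothetical helper: scratch file holding text, rewound to offset 0. */
static int cached_open_scratch(CachedFile *f, const char *text)
{
    char path[] = "/tmp/seek_cache_XXXXXX";
    size_t len = strlen(text);
    int fd = mkstemp(path);

    if (fd < 0)
        return -1;
    unlink(path);
    /* the rewind below is setup and is deliberately not counted */
    if (write(fd, text, len) != (ssize_t) len || lseek(fd, 0, SEEK_SET) != 0)
    {
        close(fd);
        return -1;
    }
    f->fd = fd;
    f->seek_pos = 0;
    f->lseek_calls = 0;
    return 0;
}
```

With this layer a purely sequential scan issues only read calls, so the kernel sees the same syscall pattern as a plain sequential reader. That is the property the reply above says the layer exists to preserve.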
Tom Lane wrote:
>Manfred Spraul <manfred@colorfullife.com> writes:
>>Tom Lane wrote:
>>>It seems unlikely to me that eliminating lseek on some platforms would
>>>be worth the hassle of maintaining two code paths. lseek is mighty
>>>cheap as system calls go.
>>
>>It was considered expensive enough to write a syscall avoidance layer
>>that caches the file pointer and skips lseek if fpos==offset.
>
>You're missing the point: that layer is mostly there to ensure that we
>don't foul up the kernel's readahead recognition for sequential fetches.
>It's nice that Linux doesn't care, but Linux is not the only platform
>we worry about.

Do you know that empty lseeks foul up readahead recognition on some OS? If yes, which OS? I've checked FreeBSD and Linux, they don't do it.

Actually I would be really surprised if pread would cause readahead problems - for example samba uses it if possible.

What about my other questions:
- which benchmark would be interesting?
- which OS did you use when you got 'no manpage for pread'?

--
    Manfred
Manfred Spraul <manfred@colorfullife.com> writes:
> Do you know that empty lseeks foul up readahead recognition on some OS?
> If yes, which OS? I've checked FreeBSD and Linux, they don't do it.

Who knows? But it would be folly to extrapolate from those two datapoints to all the platforms we support.

> - which benchmark would be interesting?

Something that measures the performance "in context", that is as part of normal database activity, not just the syscall overhead. pgbench is notoriously hard to get reproducible numbers out of, but you could try it if you like.

> - which OS did you use when you got 'no manpage for pread'?

HPUX 10.20.

regards, tom lane
BSD/OS doesn't have pread either. Isn't pread() just a case of merging two system calls into one? Does a single system call cause that much overhead? I didn't think so.

Doesn't pread() do the lseek() internally anyway?

---------------------------------------------------------------------------

Tom Lane wrote:
> Manfred Spraul <manfred@colorfullife.com> writes:
> > Do you know that empty lseeks foul up readahead recognition on some OS?
> > If yes, which OS? I've checked FreeBSD and Linux, they don't do it.
>
> Who knows? But it would be folly to extrapolate from those two
> datapoints to all the platforms we support.
>
> > - which benchmark would be interesting?
>
> Something that measures the performance "in context", that is as part of
> normal database activity, not just the syscall overhead. pgbench is
> notoriously hard to get reproducible numbers out of, but you could try
> it if you like.
>
> > - which OS did you use when you got 'no manpage for pread'?
>
> HPUX 10.20.
>
> regards, tom lane

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
Bruce Momjian wrote:
>BSD/OS doesn't have pread either. Isn't pread() just a case of merging
>two system calls into one? Does a single system call cause that much
>overhead? I didn't think so.

As I wrote, in a microbenchmark lseek+read(,8192) was 10% slower than pread(,,8192).

>Doesn't pread() do the lseek() internally anyway.

No. pread doesn't use the file pointer at all.
This is a huge advantage if fds are shared: two threads/processes can read simultaneously from the same fd. This is impossible without pread - there is only one file pointer, so the threads would trash each other's state.

Since postgresql doesn't share fds, the only advantage for postgresql is the lower syscall overhead.

>>>- which benchmark would be interesting?
>>
>>Something that measures the performance "in context", that is as part of
>>normal database activity, not just the syscall overhead. pgbench is
>>notoriously hard to get reproducible numbers out of, but you could try
>>it if you like.

I'll try that.

--
    Manfred
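The point about the file pointer can be checked with a small experiment, sketched below under the assumption of a POSIX system. A descriptor duplicated with dup() shares one file offset with the original; a pread through the duplicate leaves that offset alone, while a plain read moves it for both descriptors. The names probe_offsets and open_scratch_file, the scratch path, and the specific offsets are illustrative only.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700   /* expose pread() and mkstemp() under strict compilers */
#endif
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: create an already-unlinked scratch file holding text. */
static int open_scratch_file(const char *text)
{
    char path[] = "/tmp/pread_offset_XXXXXX";
    size_t len = strlen(text);
    int fd = mkstemp(path);

    if (fd < 0)
        return -1;
    unlink(path);
    if (write(fd, text, len) != (ssize_t) len)
    {
        close(fd);
        return -1;
    }
    return fd;
}

/*
 * Park fd at offset 3, then read 4 bytes through a dup()ed twin, first
 * with pread() at position 10 and then with plain read().  Reports fd's
 * offset after each step.  The file must hold at least 14 bytes.
 * Returns 0 on success, -1 on any error.
 */
static int probe_offsets(int fd, off_t *after_pread, off_t *after_read)
{
    char buf[4];
    int twin = dup(fd);         /* shares the same file offset as fd */
    int rc = -1;

    assert(after_pread != NULL && after_read != NULL);
    if (twin < 0)
        return -1;
    if (lseek(fd, 3, SEEK_SET) == 3 &&
        pread(twin, buf, sizeof buf, 10) == (ssize_t) sizeof buf)
    {
        *after_pread = lseek(fd, 0, SEEK_CUR);   /* untouched by pread */
        if (read(twin, buf, sizeof buf) == (ssize_t) sizeof buf)
        {
            *after_read = lseek(fd, 0, SEEK_CUR); /* advanced by read */
            rc = 0;
        }
    }
    close(twin);
    return rc;
}
```

On a 16-byte file the first reported offset stays at 3 and the second becomes 7. This is why concurrent readers on one shared descriptor need pread, and why a backend that never shares descriptors gains only the saved syscall.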
Manfred Spraul wrote:
> Bruce Momjian wrote:
>
> >BSD/OS doesn't have pread either. Isn't pread() just a case of merging
> >two system calls into one? Does a single system call cause that much
> >overhead? I didn't think so.
>
> As I wrote, in a microbenchmark lseek+read(,8192) was 10% slower than
> pread(,,8192).
>
> >Doesn't pread() do the lseek() internally anyway.
>
> No. pread doesn't use the file pointer at all.
> This is a huge advantage if fds are shared: Two threads/processes can
> read simultaneously from the same fd. This is impossible without pread -
> there is only one file pointer, the threads would trash each others state.
>
> Since postgresql doesn't share fds, the only advantage for postgresql is
> the lower syscall overhead.

Yes, I can imagine that having file descriptors shared like that would be a big win, and I guess that's why it is called pread (pthread).

Anyway, for us, it does seem like just a merged lseek()/read() call, and because we can avoid the lseek() sometimes, I wonder if our code may be faster sometimes. I can also imagine the separate lseek()/read() calls being better optimized by the kernel, because a read without an lseek is more clearly sequential.

--
  Bruce Momjian