Thread: performance: use pread instead of lseek+read

performance: use pread instead of lseek+read

From

Manfred Spraul

Date:

24 February 2003, 13:50:33

Hi all,

postgresql tries very hard to avoid calling lseek if not needed,
probably to avoid doing unnecessary syscalls.
What about removing lseek entirely and using the p{read,write}?

pread is identical to the normal read syscall, except that it has one
additional parameter: the position from which the data should be read.
All recent unices support that, it's part of POSIX.1c.
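
To make the comparison concrete, here is a minimal sketch (not taken from the attached patch) of the two ways to read a block at a given position. The helper names read_at_lseek, read_at_pread and compare_read_paths are made up for illustration; only lseek, read, pread and mkstemp are real calls.

```c
#define _XOPEN_SOURCE 700   /* expose pread()/mkstemp() under strict -std= modes */
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Two system calls: move the descriptor's file offset, then read from it. */
static ssize_t
read_at_lseek(int fd, void *buf, size_t len, off_t pos)
{
    assert(buf != NULL);
    if (lseek(fd, pos, SEEK_SET) == (off_t) -1)
        return -1;
    return read(fd, buf, len);
}

/* One system call: the position is passed along with the request. */
static ssize_t
read_at_pread(int fd, void *buf, size_t len, off_t pos)
{
    assert(buf != NULL);
    return pread(fd, buf, len, pos);
}

/*
 * Hypothetical self-check: both helpers must return the same bytes.
 * Returns 0 on success, -1 on any failure.
 */
static int
compare_read_paths(void)
{
    char path[] = "/tmp/pread_demoXXXXXX";
    char a[4], b[4];
    int fd = mkstemp(path);
    int ok;

    if (fd < 0)
        return -1;
    unlink(path);               /* the file disappears once fd is closed */

    ok = write(fd, "0123456789", 10) == 10
        && read_at_lseek(fd, a, sizeof(a), 3) == 4
        && read_at_pread(fd, b, sizeof(b), 3) == 4
        && memcmp(a, "3456", 4) == 0
        && memcmp(a, b, 4) == 0;

    close(fd);
    return ok ? 0 : -1;
}
```

Calling compare_read_paths() from a small main() should return 0 on a system that provides pread.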

Attached is a patch vs the cvs tree.
It seems to work - 7.3.2 with the patch applied passes the regression
test suite on RH Linux.
Untested with cvs-head, preproc.y causes a parser overflow.

What do you think?
- configure: I test for existence of pread, and assume that pwrite will
exist, too. Acceptable?
- Are you interested in further patches? xlog.c contains a few lseeks,
but I doubt that they are in the critical path.
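
For illustration only: one way the HAVE_PREAD switch could be confined to a pair of wrappers instead of being spread over several call sites. This is not what the attached patch does. portable_pread, portable_pwrite and portable_io_roundtrip are hypothetical names, and HAVE_PREAD is assumed to be defined by configure as in the patch.

```c
#define _XOPEN_SOURCE 700   /* expose pread()/pwrite()/mkstemp() under strict -std= modes */
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical wrapper: HAVE_PREAD is assumed to come from configure. */
static ssize_t
portable_pread(int fd, void *buf, size_t len, off_t pos)
{
    assert(buf != NULL);
#ifdef HAVE_PREAD
    return pread(fd, buf, len, pos);
#else
    /* Fallback: unlike pread, this moves the descriptor's file offset. */
    if (lseek(fd, pos, SEEK_SET) == (off_t) -1)
        return -1;
    return read(fd, buf, len);
#endif
}

/* As in the patch, pwrite is assumed to exist wherever pread does. */
static ssize_t
portable_pwrite(int fd, const void *buf, size_t len, off_t pos)
{
    assert(buf != NULL);
#ifdef HAVE_PREAD
    return pwrite(fd, buf, len, pos);
#else
    if (lseek(fd, pos, SEEK_SET) == (off_t) -1)
        return -1;
    return write(fd, buf, len);
#endif
}

/* Hypothetical self-check: write at offset 100, read it back. 0 on success. */
static int
portable_io_roundtrip(void)
{
    char path[] = "/tmp/portable_ioXXXXXX";
    char buf[5];
    int fd = mkstemp(path);
    int ok;

    if (fd < 0)
        return -1;
    unlink(path);               /* the file disappears once fd is closed */

    ok = portable_pwrite(fd, "hello", 5, 100) == 5
        && portable_pread(fd, buf, sizeof(buf), 100) == 5
        && memcmp(buf, "hello", 5) == 0;

    close(fd);
    return ok ? 0 : -1;
}
```

Built without -DHAVE_PREAD this exercises the lseek fallback; built with it, the same callers use pread and pwrite.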

--
   Manfred
diff -u -u -r -x configure pgsql.orig/configure.in pgsql/configure.in
--- pgsql.orig/configure.in    2003-02-19 05:04:04.000000000 +0100
+++ pgsql/configure.in    2003-02-23 09:38:46.000000000 +0100
@@ -786,7 +786,7 @@
 # SunOS doesn't handle negative byte comparisons properly with +/- return
 AC_FUNC_MEMCMP

-AC_CHECK_FUNCS([cbrt fcvt getpeereid memmove pstat setproctitle setsid sigprocmask sysconf waitpid dlopen fdatasync utime utimes])
+AC_CHECK_FUNCS([cbrt fcvt getpeereid memmove pstat setproctitle setsid sigprocmask sysconf waitpid dlopen fdatasync utime utimes pread])

 AC_CHECK_DECLS(fdatasync, [], [], [#include <unistd.h>])

diff -u -u -r -x configure pgsql.orig/src/backend/storage/file/fd.c pgsql/src/backend/storage/file/fd.c
--- pgsql.orig/src/backend/storage/file/fd.c    2002-09-02 08:11:42.000000000 +0200
+++ pgsql/src/backend/storage/file/fd.c    2003-02-23 09:45:41.000000000 +0100
@@ -391,7 +391,9 @@
     Delete(file);

     /* save the seek position */
+#ifndef HAVE_PREAD
     vfdP->seekPos = (long) lseek(vfdP->fd, 0L, SEEK_CUR);
+#endif
     Assert(vfdP->seekPos != -1L);

     /* close the file */
@@ -462,6 +464,7 @@
             ++nfile;
         }

+#ifndef HAVE_PREAD
         /* seek to the right position */
         if (vfdP->seekPos != 0L)
         {
@@ -470,6 +473,7 @@
             returnValue = (long) lseek(vfdP->fd, vfdP->seekPos, SEEK_SET);
             Assert(returnValue != -1L);
         }
+#endif
     }

     /*
@@ -877,11 +881,17 @@
                VfdCache[file].seekPos, amount, buffer));

     FileAccess(file);
+#if HAVE_PREAD
+    returnCode = pread(VfdCache[file].fd, buffer, amount, VfdCache[file].seekPos);
+    if (returnCode > 0)
+        VfdCache[file].seekPos += returnCode;
+#else
     returnCode = read(VfdCache[file].fd, buffer, amount);
     if (returnCode > 0)
         VfdCache[file].seekPos += returnCode;
     else
         VfdCache[file].seekPos = FileUnknownPos;
+#endif

     return returnCode;
 }
@@ -900,16 +910,25 @@
     FileAccess(file);

     errno = 0;
+#if HAVE_PREAD
+    returnCode = pwrite(VfdCache[file].fd, buffer, amount, VfdCache[file].seekPos);
+#else
     returnCode = write(VfdCache[file].fd, buffer, amount);
+#endif

     /* if write didn't set errno, assume problem is no disk space */
     if (returnCode != amount && errno == 0)
         errno = ENOSPC;

+#if HAVE_PREAD
+    if (returnCode > 0)
+        VfdCache[file].seekPos += returnCode;
+#else
     if (returnCode > 0)
         VfdCache[file].seekPos += returnCode;
     else
         VfdCache[file].seekPos = FileUnknownPos;
+#endif

     return returnCode;
 }
@@ -951,12 +970,20 @@
             case SEEK_SET:
                 if (offset < 0)
                     elog(ERROR, "FileSeek: invalid offset: %ld", offset);
+#ifdef HAVE_PREAD
+                VfdCache[file].seekPos = offset;
+#else
                 if (VfdCache[file].seekPos != offset)
                     VfdCache[file].seekPos = lseek(VfdCache[file].fd, offset, whence);
+#endif
                 break;
             case SEEK_CUR:
+#ifdef HAVE_PREAD
+                VfdCache[file].seekPos += offset;
+#else
                 if (offset != 0 || VfdCache[file].seekPos == FileUnknownPos)
                     VfdCache[file].seekPos = lseek(VfdCache[file].fd, offset, whence);
+#endif
                 break;
             case SEEK_END:
                 VfdCache[file].seekPos = lseek(VfdCache[file].fd, offset, whence);
diff -u -u -r -x configure pgsql.orig/src/include/pg_config.h.in pgsql/src/include/pg_config.h.in
--- pgsql.orig/src/include/pg_config.h.in    2003-02-19 05:04:04.000000000 +0100
+++ pgsql/src/include/pg_config.h.in    2003-02-23 09:59:47.000000000 +0100
@@ -544,6 +544,9 @@
 /* Define if the standard header unistd.h declares fdatasync() */
 #undef HAVE_DECL_FDATASYNC

+/* Define if you have pread(). Implies pwrite, too. */
+#undef HAVE_PREAD
+
 /* Set to 1 if you have libz.a */
 #undef HAVE_LIBZ

Re: performance: use pread instead of lseek+read

From

Tom Lane

Date:

24 February 2003, 18:27:16

Manfred Spraul <manfred@colorfullife.com> writes:
> What about removing lseek entirely and using the p{read,write}?

Portability.

$ man pread
No manual entry for pread.
$

It seems unlikely to me that eliminating lseek on some platforms would
be worth the hassle of maintaining two code paths.  lseek is mighty
cheap as system calls go.  What's worse, a series of preads (as opposed
to reads without intervening lseek) might not trigger kernel read-ahead
optimizations, in which case this would be a tremendous disimprovement.

> Attached is a patch vs the cvs tree.
> It seems to work - 7.3.2 with the patch applied passes the regression

Can you measure any performance benefit?

            regards, tom lane

Re: performance: use pread instead of lseek+read

From

Manfred Spraul

Date:

24 February 2003, 19:16:37

Tom Lane wrote:

>Manfred Spraul <manfred@colorfullife.com> writes:
>
>
>>What about removing lseek entirely and using the p{read,write}?
>>
>>
>
>Portability.
>
>$ man pread
>No manual entry for pread.
>
Which OS? google finds manpages for Tru64 4.0, HP UX 11i, solaris 8, aix 4.3
Linux has it since 2.2, FreeBSD at least since 4.0, it's even listed in
the SVR-4 emulation of FreeBSD.

>$
>
>It seems unlikely to me that eliminating lseek on some platforms would
>be worth the hassle of maintaining two code paths.  lseek is mighty
>cheap as system calls go.
>
It was considered expensive enough to write a syscall avoidance layer
that caches the file pointer and skips lseek if fpos==offset. A kernel
must perform quite a lot of parameter validation and synchronization -
think about a multithreaded app where close and lseek could race.
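
A rough sketch of such an avoidance layer, loosely modelled on the seekPos handling visible in the fd.c diff but not copied from it. CachedFile, cached_seek, cached_read and count_lseeks_demo are made-up names, and the lseek_calls counter exists only to make the effect visible.

```c
#define _XOPEN_SOURCE 700   /* expose pwrite()/mkstemp() under strict -std= modes */
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define POS_UNKNOWN ((off_t) -1)

typedef struct CachedFile
{
    int   fd;
    off_t seek_pos;     /* last known kernel file offset, or POS_UNKNOWN */
    long  lseek_calls;  /* real lseek() calls issued, for illustration only */
} CachedFile;

/* Issue lseek() only when the cached position differs from the target. */
static off_t
cached_seek(CachedFile *f, off_t offset)
{
    assert(f != NULL && offset >= 0);
    if (f->seek_pos != offset)
    {
        f->seek_pos = lseek(f->fd, offset, SEEK_SET);
        f->lseek_calls++;
    }
    return f->seek_pos;
}

/* Keep the cache in step with read(); forget the position on EOF or error. */
static ssize_t
cached_read(CachedFile *f, void *buf, size_t len)
{
    ssize_t n = read(f->fd, buf, len);

    if (n > 0)
        f->seek_pos += n;
    else
        f->seek_pos = POS_UNKNOWN;
    return n;
}

/*
 * Two sequential 4-byte reads followed by one out-of-order read.
 * Returns the number of real lseek() calls, or -1 on failure.
 */
static long
count_lseeks_demo(void)
{
    char path[] = "/tmp/cached_seekXXXXXX";
    char a[4], b[4], c[4];
    CachedFile f;
    int ok;

    f.fd = mkstemp(path);
    if (f.fd < 0)
        return -1;
    unlink(path);
    f.seek_pos = 0;             /* a freshly opened file starts at offset 0 */
    f.lseek_calls = 0;

    /* pwrite() fills the file without moving the file offset. */
    ok = pwrite(f.fd, "0123456789", 10, 0) == 10
        && cached_seek(&f, 0) == 0 && cached_read(&f, a, 4) == 4
        && cached_seek(&f, 4) == 4 && cached_read(&f, b, 4) == 4
        && cached_seek(&f, 2) == 2 && cached_read(&f, c, 4) == 4
        && memcmp(a, "0123", 4) == 0
        && memcmp(b, "4567", 4) == 0
        && memcmp(c, "2345", 4) == 0;

    close(f.fd);
    return ok ? f.lseek_calls : -1;
}
```

In this sketch only the out-of-order read costs a real lseek; the two sequential reads reuse the cached position.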

>
>
>>Attached is a patch vs the cvs tree.
>>It seems to work - 7.3.2 with the patch applied passes the regression
>>
>>
>
>Can you measure any performance benefit?
>
>
What would be an interesting benchmark?
If you want a microbenchmark: lseek+read(,,8192) is around 10% slower
than pread(,,,8192) with hot cpu caches on my Celeron Mobile Laptop.
Linux-2.4.20.
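
A microbenchmark along these lines could look like the sketch below. It is not the program used for the 10% figure; the function names, the single 8192-byte temp file and the choice of clock_gettime(CLOCK_MONOTONIC) are assumptions, and no particular timing result is implied.

```c
#define _XOPEN_SOURCE 700   /* expose pread()/mkstemp()/clock_gettime() */
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <sys/types.h>
#include <unistd.h>

#define BLOCK_SIZE 8192

static double
now_seconds(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double) ts.tv_sec + (double) ts.tv_nsec / 1e9;
}

/* Time `iterations` rounds of lseek+read; -1.0 if a call fails or is short. */
static double
time_lseek_read(int fd, long iterations)
{
    char   buf[BLOCK_SIZE];
    double start = now_seconds();
    long   i;

    for (i = 0; i < iterations; i++)
    {
        if (lseek(fd, 0, SEEK_SET) == (off_t) -1 ||
            read(fd, buf, sizeof(buf)) != BLOCK_SIZE)
            return -1.0;
    }
    return now_seconds() - start;
}

/* Time `iterations` rounds of pread; -1.0 if a call fails or is short. */
static double
time_pread(int fd, long iterations)
{
    char   buf[BLOCK_SIZE];
    double start = now_seconds();
    long   i;

    for (i = 0; i < iterations; i++)
    {
        if (pread(fd, buf, sizeof(buf), 0) != BLOCK_SIZE)
            return -1.0;
    }
    return now_seconds() - start;
}

/* Create an unlinked temp file holding one block; returns fd or -1. */
static int
open_one_block_file(void)
{
    char path[] = "/tmp/pread_benchXXXXXX";
    char block[BLOCK_SIZE];
    int  fd = mkstemp(path);

    if (fd < 0)
        return -1;
    unlink(path);
    memset(block, 'x', sizeof(block));
    if (write(fd, block, sizeof(block)) != BLOCK_SIZE)
    {
        close(fd);
        return -1;
    }
    return fd;
}
```

Both loops reread the same cached block, so this measures syscall overhead only, not disk I/O or read-ahead behaviour.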

Read-ahead is impossible to answer without looking at the sources. Linux
does readahead, actually read is implemented as pread(,,,file->f_pos).

--
    Manfred

Re: performance: use pread instead of lseek+read

From

Tom Lane

Date:

24 February 2003, 21:57:50

Manfred Spraul <manfred@colorfullife.com> writes:
> Tom Lane wrote:
>> It seems unlikely to me that eliminating lseek on some platforms would
>> be worth the hassle of maintaining two code paths.  lseek is mighty
>> cheap as system calls go.
>>
> It was considered expensive enough to write a syscall avoidance layer
> that caches the file pointer and skips lseek if fpos==offset.

You're missing the point: that layer is mostly there to ensure that we
don't foul up the kernel's readahead recognition for sequential fetches.
It's nice that Linux doesn't care, but Linux is not the only platform
we worry about.

            regards, tom lane

Re: performance: use pread instead of lseek+read

From

Manfred Spraul

Date:

25 February 2003, 05:09:03

Tom Lane wrote:

>Manfred Spraul <manfred@colorfullife.com> writes:
>
>
>>Tom Lane wrote:
>>
>>
>>>It seems unlikely to me that eliminating lseek on some platforms would
>>>be worth the hassle of maintaining two code paths.  lseek is mighty
>>>cheap as system calls go.
>>>
>>>
>>>
>>It was considered expensive enough to write a syscall avoidance layer
>>that caches the file pointer and skips lseek if fpos==offset.
>>
>>
>
>You're missing the point: that layer is mostly there to ensure that we
>don't foul up the kernel's readahead recognition for sequential fetches.
>It's nice that Linux doesn't care, but Linux is not the only platform
>we worry about.
>
>
Do you know that empty lseeks foul up readahead recognition on some OS?
If yes, which OS? I've checked FreeBSD and Linux, they don't do it.

Actually I would be really surprised if pread would cause readahead
problems - for example samba uses it if possible.

What about my other questions:
- which benchmark would be interesting?
- which OS did you use when you got 'no manpage for pread'?

--
    Manfred

Re: performance: use pread instead of lseek+read

From

Tom Lane

Date:

25 February 2003, 09:36:20

Manfred Spraul <manfred@colorfullife.com> writes:
> Do you know that empty lseeks foul up readahead recognition on some OS?
> If yes, which OS? I've checked FreeBSD and Linux, they don't do it.

Who knows?  But it would be folly to extrapolate from those two
datapoints to all the platforms we support.

> - which benchmark would be interesting?

Something that measures the performance "in context", that is as part of
normal database activity, not just the syscall overhead.  pgbench is
notoriously hard to get reproducible numbers out of, but you could try
it if you like.

> - which OS did you use when you got 'no manpage for pread'?

HPUX 10.20.

            regards, tom lane

Re: performance: use pread instead of lseek+read

From

Bruce Momjian

Date:

06 March 2003, 14:02:18

BSD/OS doesn't have pread either.  Isn't pread() just a case of merging
two system calls into one?  Does a single system call cause that much
overhead?  I didn't think so.

Doesn't pread() do the lseek() internally anyway?

---------------------------------------------------------------------------

Tom Lane wrote:
> Manfred Spraul <manfred@colorfullife.com> writes:
> > Do you know that empty lseeks foul up readahead recognition on some OS?
> > If yes, which OS? I've checked FreeBSD and Linux, they don't do it.
>
> Who knows?  But it would be folly to extrapolate from those two
> datapoints to all the platforms we support.
>
> > - which benchmark would be interesting?
>
> Something that measures the performance "in context", that is as part of
> normal database activity, not just the syscall overhead.  pgbench is
> notoriously hard to get reproducible numbers out of, but you could try
> it if you like.
>
> > - which OS did you use when you got 'no manpage for pread'?
>
> HPUX 10.20.
>
>             regards, tom lane

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: performance: use pread instead of lseek+read

From

Manfred Spraul

Date:

06 March 2003, 14:50:03

Bruce Momjian wrote:

>BSD/OS doesn't have pread either.  Isn't pread() just a case of merging
>two system calls into one?  Does a single system call cause that much
>overhead?  I didn't think so.
>
>
As I wrote, in a microbenchmark lseek+read(,8192) was 10% slower than
pread(,,8192).

>Doesn't pread() do the lseek() internally anyway?
>
No. pread doesn't use the file pointer at all.
This is a huge advantage if fds are shared: Two threads/processes can
read simultaneously from the same fd. This is impossible without pread -
there is only one file pointer, the threads would trash each other's state.

Since postgresql doesn't share fds, the only advantage for postgresql is
the lower syscall overhead.
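
The claim that pread leaves the file pointer alone can be checked with a small sketch like this one. offset_after_pread is a made-up helper name; it returns the file offset observed after a pread at an unrelated position.

```c
#define _XOPEN_SOURCE 700   /* expose pread()/mkstemp() under strict -std= modes */
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Park the file offset at 2, pread at position 7, then report where the
 * file offset is afterwards. Returns (off_t) -1 on any failure.
 */
static off_t
offset_after_pread(void)
{
    char  path[] = "/tmp/pread_offsetXXXXXX";
    char  buf[3];
    off_t after = (off_t) -1;
    int   fd = mkstemp(path);

    if (fd < 0)
        return (off_t) -1;
    unlink(path);               /* the file disappears once fd is closed */

    if (write(fd, "0123456789", 10) == 10       /* offset is now 10 */
        && lseek(fd, 2, SEEK_SET) == 2          /* move it to 2 */
        && pread(fd, buf, sizeof(buf), 7) == 3  /* read "789" at position 7 */
        && memcmp(buf, "789", 3) == 0)
        after = lseek(fd, 0, SEEK_CUR);         /* query the current offset */

    close(fd);
    return after;
}
```

The returned offset is still 2, the value set by lseek before the pread.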

>>
>>
>>>- which benchmark would be interesting?
>>>
>>>
>>Something that measures the performance "in context", that is as part of
>>normal database activity, not just the syscall overhead.  pgbench is
>>notoriously hard to get reproducible numbers out of, but you could try
>>it if you like.
>>
>>
I'll try that.

--
    Manfred

Re: performance: use pread instead of lseek+read

From

Bruce Momjian

Date:

06 March 2003, 16:07:30

Manfred Spraul wrote:
> Bruce Momjian wrote:
>
> >BSD/OS doesn't have pread either.  Isn't pread() just a case of merging
> >two system calls into one?  Does a single system call cause that much
> >overhead?  I didn't think so.
> >
> >
> As I wrote, in a microbenchmark lseek+read(,8192) was 10% slower than
> pread(,,8192).
>
> >Doesn't pread() do the lseek() internally anyway?
> >
> No. pread doesn't use the file pointer at all.
> This is a huge advantage if fds are shared: Two threads/processes can
> read simultaneously from the same fd. This is impossible without pread -
> there is only one file pointer, the threads would trash each other's state.
>
> Since postgresql doesn't share fds, the only advantage for postgresql is
> the lower syscall overhead.

Yes, I can imagine having file descriptors shared like that would be a
big win, and I guess that's why it is called pread (pthread).  Anyway,
for us, it does seem like just a merged lseek/read() call, and because
we can avoid the lseek() sometimes, I wonder if our code may be faster
sometimes.  I can also imagine the separate lseek()/read() calls to be
better optimized by the kernel because a read without an lseek is more
clearly sequential.
