[PATCH] Using pread instead of lseek (with analysis) - Mailing list pgsql-patches

From Martijn van Oosterhout
Subject [PATCH] Using pread instead of lseek (with analysis)
Date
Msg-id 20051008154852.GC30988@svana.org
Whole thread Raw
Responses Re: [PATCH] Using pread instead of lseek (with analysis)  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-patches
Hi,

Below is a patch that tries to determine (for linux) if the pread is
emulated or not (using an autoconf test), and if so use it instead of
lseek. It doesn't remove all uses of lseek. For example, PostgreSQL
uses lseek to find the end of the file when extending. So in a bulk
load, there is one seek per block written anyway. Also, the xlog seems
to be completely seperate code.

Short summary: Actual benefit, (on Linux anyway) not distinguishable
from the noise.

Longer summary: It reduces CPU usage slightly (few percent user,
overall almost nothing), but the benefit is usually swamped by the
actual cost of the read from disk (system cache miss). I would actually
make the change because it simplifies the code a bit.

Comparisons on other systems may be worthwhile.

What follows is the detailed analysis of the change. My conclusion is
that on Linux at least where the syscall overhead is so low, it's not
worth the change for performance. It is cleaner code-wise IMHO.

----
Initially, the table is 1.5 million rows of random integer data. Worst
case scenario: ordered output run with seqscan disabled. Table is 66MB,
Index is 56MB and includes all three data columns. The queries took 98
+/- 1 seconds. Over the entire query it consistantly loaded about 55MB
of data from disk (about 50% system cache miss rate).

Query: select count(*) from (select * from bigtable order by a,b,c)

Exhibit A below is the vmstat output with 10 second averages. The
interesting columns are cpu us (user) and cpu sy (system). Note, system
is defined as stuff not user and not idle. Waiting for the I/O to
complete is counted as system, even though it's not doing all that
much. It shows a very, very slight reduction in user time, though it is
measurable and repeatable (both run a number of times with same
results).

What this tells me is that lseek time is swamped by actual read time,
so not a major benefit in most cases.

Exhibit B is the same but with only 1 million rows (Table 48MB, Index
29MB), this fitting in cache memory. The original index had some slack
from deleting records (I did a vacuum full though). The queries now
take 60 +/- 1 seconds. As you can see even without accessing the disk,
it still isn't making a big difference.

Exhibit C is the output of strace -c on the second test. As you can see
there is a difference, but marginal.

Exhibit D is the output of a seq scan, in case anyone figured I might
be buggering the queries somehow. As you can see, it shows that the
index scan is extremely expensive. Only 5900 reads for the seq scan
(enough to read the whole table) instead of 890000. The shared buffers
is the default 1000. Upping this would close the gap somewhat but that
is irrelevent for this test (lseek vs pread).

Exhibit E is the strace -c output, for completeness. lseek doesn't
appear in the top 10.

If you read this far, thanks for the attention! If you're running
another system, it would be nice to test. Remember to update the
autoconf code for your system!

Have a nice day,

Misc details:

command line:
  time echo "select count(*) from(select * from bigtable order by a,b,c) x" | \
    ./postgres -f s -D /var/lib/postgres/data test
System:
  Linux kleptog 2.6.8-1-686 #1 Thu Nov 25 04:34:30 UTC 2004 i686 GNU/Linux
  128MB RAM
  Intel(R) Celeron(TM) CPU 1000MHz

Exhibit (A): Index scan, dataset > memory

Using ordinary llseek/read:

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 1  0      4   1372    940 107564    0    0   534    18 1070   145 18 56  0 27
 1  0      4   1648    992 107260    0    0   548     3 1071   147 18 63  0 19
 1  0      4   1552    984 107348    0    0   542     4 1069   144 20 69  0 11
 1  0      4   1432    980 107496    0    0   550     6 1071   146 19 69  0 12
 1  0      4   1648    664 107600    0    0   560     3 1071   148 19 67  0 15
 1  0      4   1576    632 107728    0    0   597     3 1076   157 19 68  0 13
 2  0      4   1768    640 107528    0    0   618    14 1080   163 19 67  0 14

Using pread():

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 1  0      4   1680    600 107616    0    0   534    18 1070   145 17 56  0 28
 1  0      4   1656    644 107644    0    0   543     5 1070   147 16 64  0 19
 1  0      4   1344    624 107976    0    0   540     2 1068   142 19 70  0 11
 1  0      4   1656    632 107652    0    0   555     6 1071   147 17 70  0 13
 1  0      4   1608    540 107796    0    0   564    13 1074   149 18 68  0 15
 1  0      4   1392    504 108068    0    0   598     4 1076   158 17 69  0 14
 0  1      4   1368    460 108120    0    0   626     3 1079   165 17 69  0 14

Exhibit (B): Index scan, dataset < memory

Using ordinary llseek/read:

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 1  0      4   7500   3084  98000    0    0     0    25 1002     8 22 78  0  0
 1  0      4   7484   3100  98000    0    0     0     3 1001     5 22 78  0  0
 1  0      4   7468   3116  98000    0    0     0     3 1001     5 22 78  0  0
 1  0      4   7444   3140  98000    0    0     0    10 1003     6 22 78  0  0

Using pread():

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 1  0      4   7116   3452  98000    0    0     0    18 1002     8 21 79  0  0
 1  0      4   7100   3468  98000    0    0     0     8 1001     5 21 79  0  0
 1  0      4   7080   3488  98000    0    0     0     4 1001     5 21 79  0  0
 1  0      4   7064   3504  98000    0    0     0    10 1003     5 22 78  0  0

Exhibit (C): strace -c output of backend

Using ordinary llseek/read:

 % time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 85.27   51.906570          58    891526           read
 14.71    8.956573          10    887770           _llseek
  0.00    0.001983          17       114        29 open

Using pread():

 % time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 99.98   52.977031          59    891479           pread
  0.00    0.002164         114        19           write

Exhibit (D): sequential scan: dataset < memory

Using ordinary llseek/read:

$ time echo "select count(*) from(select * from bigtable) x" | ./postgres.read -D /var/lib/postgres/data test

PostgreSQL stand-alone backend 8.1beta2
backend>         1: count       (typeid = 20, len = 8, typmod = -1, byval = f)
        ----
         1: count = "1080838"   (typeid = 20, len = 8, typmod = -1, byval = f)
        ----
backend>
real    0m1.264s
user    0m0.921s
sys     0m0.328s

Using pread():

$ time echo "select count(*) from(select * from bigtable) x" | ./postgres -D /var/lib/postgres/data test

PostgreSQL stand-alone backend 8.1beta2
backend>         1: count       (typeid = 20, len = 8, typmod = -1, byval = f)
        ----
         1: count = "1080838"   (typeid = 20, len = 8, typmod = -1, byval = f)
        ----
backend>
real    0m1.213s
user    0m0.827s
sys     0m0.371s

Exhibit (E): strace -c of sequential scan case

Using ordinary llseek/read:

 % time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 96.78    0.318179          53      5950           read
  0.68    0.002249         118        19           write
  0.57    0.001879          17       109        29 open
<snip>
  0.06    0.000197           5        40           _llseek

Using pread():

 % time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 95.92    0.316512          54      5903           pread
  0.70    0.002326         122        19           write
  0.62    0.002032         406         5           fsync


--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Attachment

pgsql-patches by date:

Previous
From: Bruce Momjian
Date:
Subject: Re: [HACKERS] Patching dblink.c to avoid warning about open
Next
From: Bruce Momjian
Date:
Subject: test, please ignore