Thread: Sequential scan speed, mmap, disk I/O
Someone was complaining about sequential scan speed, so I decided to run
a test.

I ran the test by performing a total read-through of a 177MB table. This
exceeds all my cache sizes by over two times, so the cache is totally
useless ("cache wipe"). I timed PostgreSQL's sequential scan (returning
no rows), and various Unix methods of reading a 177MB file.

I found that a sequential scan by PostgreSQL is almost as fast as, or
faster than, various other Unix methods of reading files. In fact,
mmap() is very slow, perhaps because you are changing the process
virtual table maps for each chunk you read in, and faulting them in,
rather than using the file system for I/O. Basically, we beat 'wc',
which is pretty good considering how little work 'wc' does.

My conclusion from this is that we really are not going to gain a lot of
speed by exploring some async solution, because if the data we need is
not in the cache, we really are going to spend most of our time waiting
for disk I/O.

Comments?

---------------------------------------------------------------------------

177MB file, BSD/OS 3.1, 64MB RAM, PostgreSQL current

    wc                                                      41 sec
    wc -l                                                   31 sec
    dd if=/u/pg/data/base/test/testv of=/dev/null bs=512    32 sec
    dd if=/u/pg/data/base/test/testv of=/dev/null bs=8k     31 sec
    dd if=/u/pg/data/base/test/testv of=/dev/null bs=256k   31 sec
    dd if=/u/pg/data/base/test/testv of=/dev/null bs=1m     30 sec

    mmap() of file in 8k chunks                             99 sec
    mmap() of file in 8mb chunks                            40 sec
    mmap() of file in 32mb chunks                           56 sec

    PostgreSQL sequential scan                              37 sec

---------------------------------------------------------------------------

/* mmap() test program */
#include <stdio.h>
#include <fcntl.h>
#include <assert.h>
#include <sys/types.h>
#include <sys/mman.h>

#define MMAP_SIZE 8192          /* chunk size */

int main(int argc, char *argv[], char *envp[])
{
    int i, j, fd, spaces = 0;
    int off;
    char *addr;

    fd = open("/u/pg/data/base/test/testv", O_RDONLY, 0);
    assert(fd != -1);

    /* note: nothing stops this loop at end-of-file */
    for (off = 0; 1; off += MMAP_SIZE)
    {
        addr = mmap(0, MMAP_SIZE, PROT_READ, MAP_SHARED, fd, off);
        assert(addr != MAP_FAILED);
        for (j = 0; j < MMAP_SIZE; j++)
            if (*(addr + j) != ' ')
                spaces++;
        munmap(addr, MMAP_SIZE);
    }
    printf("%d\n", spaces);
    return 0;
}

-- 
Bruce Momjian                          |  830 Blythe Avenue
maillist@candle.pha.pa.us              |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,     |  (610) 353-9879(w)
  +  Christ can be your backup.        |  (610) 853-3000(h)
> Someone was complaining about sequential scan speed, so I decided to run
> a test.
>
>     wc                                                      41 sec
>     wc -l                                                   31 sec
>     dd if=/u/pg/data/base/test/testv of=/dev/null bs=512    32 sec
>     dd if=/u/pg/data/base/test/testv of=/dev/null bs=8k     31 sec
>     dd if=/u/pg/data/base/test/testv of=/dev/null bs=256k   31 sec
>     dd if=/u/pg/data/base/test/testv of=/dev/null bs=1m     30 sec
>
>     mmap() of file in 8k chunks                             99 sec
>     mmap() of file in 8mb chunks                            40 sec
>     mmap() of file in 32mb chunks                           56 sec
>
>     PostgreSQL sequential scan                              37 sec

Let me add: these times are on a PP200, with SCSI Ultra Barracuda
drives, BSD/OS 3.1, 64MB RAM.

-- 
Bruce Momjian                          |  830 Blythe Avenue
maillist@candle.pha.pa.us              |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,     |  (610) 353-9879(w)
  +  Christ can be your backup.        |  (610) 853-3000(h)
> > Someone was complaining about sequential scan speed, so I decided to run
> > a test.
> >
> >     wc                                      41 sec
> >     [...]
> >     PostgreSQL sequential scan              37 sec
>
> Let me add: these times are on a PP200, with SCSI Ultra Barracuda
> drives, BSD/OS 3.1, 64MB RAM.

Very interesting. Is it possible to get the schema, the query, and a
sample of the data or a generator program for the data? I am quite
surprised to see us do so well; I would have guessed that the per-row
overhead would have us down far below wc.

Although, on second thought, dd has to write the data as well as read
it, and we don't; and wc has to examine every character, where if the
"where clause" applies to only a portion of the row, we don't. Still,
it would be nice to see more info about this test.

Btw, I hope to post tomorrow some hopefully interesting results about
the speed of TAS and S_LOCK...

-dg

David Gould            dg@illustra.com           510.628.3783 or 510.305.9468
Informix Software  (No, really)         300 Lakeside Drive  Oakland, CA 94612
"Of course, someone who knows more about this will correct me if I'm
 wrong, and someone who knows less will correct me if I'm right."
               --David Palmer (palmer@tybalt.caltech.edu)
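For readers who haven't run into them: TAS is the test-and-set primitive
that PostgreSQL's spinlocks are built on, and S_LOCK spins on it until
the lock is acquired. A minimal sketch of the idea follows, using a
modern GCC builtin rather than the per-platform assembly PostgreSQL
actually uses; it is an illustration of the concept, not the real
implementation.

/* The idea behind TAS/S_LOCK: spin on an atomic test-and-set until
 * the lock is free.  A sketch using a GCC builtin, not PostgreSQL's
 * actual per-platform assembly. */
typedef volatile unsigned char slock_t;

static int
tas(slock_t *lock)
{
    /* returns the previous value: nonzero means someone holds it */
    return __sync_lock_test_and_set(lock, 1);
}

static void
s_lock(slock_t *lock)
{
    while (tas(lock))
        ;                       /* spin until acquired */
}

static void
s_unlock(slock_t *lock)
{
    __sync_lock_release(lock);  /* store 0 with release semantics */
}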
> > Someone was complaining about sequential scan speed, so I decided to run
> > a test.
> > [...]
>
> Let me add: these times are on a PP200, with SCSI Ultra Barracuda
> drives, BSD/OS 3.1, 64MB RAM.

Also, the rows were very small, with just two ints, a char(10), and a
varchar(50), so PostgreSQL was processing most of the 177MB of data, in
the sense of having to look at most of each block.

-- 
Bruce Momjian                          |  830 Blythe Avenue
maillist@candle.pha.pa.us              |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,     |  (610) 353-9879(w)
  +  Christ can be your backup.        |  (610) 853-3000(h)
> Very interesting. Is it possible to get the schema, the query, and a
> sample of the data or a generator program for the data? I am quite
> surprised to see us do so well; I would have guessed that the per-row
> overhead would have us down far below wc.

Sure.

    create table test (x1 int, x2 int, x3 char(10), x4 varchar(50));
    insert into test values (3, 8, 'asdf', 'asdf');

    insert into test select * from test;    <- continue until test is large

    select * from test where x1 = 23423;    <- this is what I timed

-- 
Bruce Momjian                          |  830 Blythe Avenue
maillist@candle.pha.pa.us              |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,     |  (610) 353-9879(w)
  +  Christ can be your backup.        |  (610) 853-3000(h)
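A generator program is another way to build the test data, if the
repeated self-insert doubling is inconvenient. The sketch below is a
hypothetical helper, not something from the thread: it emits
tab-separated rows matching the schema above, which could then be loaded
with COPY (or psql's \copy).

/* Hypothetical data generator for the test table: writes tab-separated
 * rows suitable for COPY.  Row values mimic Bruce's sample row. */
#include <stdio.h>

int main(void)
{
    long i;

    /* roughly two million copies of the sample row gives a table in
     * the neighborhood of Bruce's 177MB */
    for (i = 0; i < 2000000; i++)
        printf("3\t8\tasdf\tasdf\n");
    return 0;
}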
Bruce Momjian wrote:

> In fact, mmap() is very slow, perhaps because you are changing the
> process virtual table maps for each chunk you read in, and faulting
> them in, rather than using the file system for I/O.

Huh, very slow? I wouldn't agree. I rewrote your mmap program to allow
for using reads or mmaps. I tested it on a 111MB file. I decided to use
an 8192-byte buffer size (the standard postgres page size). My system is
Linux, P166, 64MB of RAM (note that I have a lot of software running
currently, so the cache size is less than 25MB). I also changed the
for(j..) step size to j += 256, just to make sure that the scan itself
won't influence the results too much and you will see the difference
better. mmap was run with (PROT_READ, MAP_SHARED).

Average results are (for sequential reading):

    Using reads: total time - 21.39 (0.44user, 6.09system, 31%CPU)
    Using mmaps: total time - 21.10 (0.57user, 4.92system, 25%CPU)

Note that in the case of reads the program spends much more time in
system calls and uses more CPU. You may notice that on Linux, using mmap
is about 20% cheaper than read. In the case of random reading it's
slightly more than 20%, as I remember. Total time is similar in both
cases because both hit the throughput limit of my HD.

BTW, are you sure that your program was counting mmaps properly? When I
ran it on my system it counted much more than it should: the offset
crossed over the file's boundary, and then it worked a minute or more
before it stopped. I attach my version (with a hardcoded 111MB file size
to prevent this; of course you may change it).

Mike

-- 
WWW: http://www.lodz.pdi.net/~mimo  tel: Int. Acc. Code + 48 42 148340
add: Michal Mosiewicz * Bugaj 66 m.54 * 95-200 Pabianice * POLAND
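Michal's version went out as an attachment, which is not preserved in
this archive. A reconstruction sketched from his description above
(8192-byte chunks, the j += 256 stride, PROT_READ/MAP_SHARED, and the
hardcoded 111MB file size) might look like the following; the file path
and the command-line mode switch are assumptions.

/* Sketch of a read-vs-mmap benchmark, reconstructed from the
 * description above; the path and argv mode switch are assumptions. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <assert.h>
#include <sys/types.h>
#include <sys/mman.h>

#define CHUNK     8192          /* standard postgres page size */
#define FILE_SIZE 111329280     /* hardcoded, as in Michal's version */

int main(int argc, char *argv[])
{
    int fd, j, spaces = 0;
    long off;
    char buf[CHUNK], *addr;
    int use_mmap = (argc > 1 && strcmp(argv[1], "mmap") == 0);

    fd = open("/u/pg/data/base/test/testv", O_RDONLY, 0);
    assert(fd != -1);

    for (off = 0; off < FILE_SIZE; off += CHUNK)
    {
        if (use_mmap)
        {
            addr = mmap(0, CHUNK, PROT_READ, MAP_SHARED, fd, off);
            assert(addr != MAP_FAILED);
        }
        else
        {
            if (read(fd, buf, CHUNK) <= 0)
                break;
            addr = buf;
        }
        /* step by 256 so the scan itself doesn't dominate the timing */
        for (j = 0; j < CHUNK; j += 256)
            if (addr[j] != ' ')
                spaces++;
        if (use_mmap)
            munmap(addr, CHUNK);
    }
    printf("%d\n", spaces);
    return 0;
}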
Bruce Momjian wrote:

> My conclusion from this is that we really are not going to gain a lot of
> speed by exploring some async solution, because if the data we need is
> not in the cache, we really are going to spend most of our time waiting
> for disk I/O.
>
> Comments?

Well, I've just found an interesting article on APF (Asynchronous
Prefetch) at the Sybase site:

    http://www.sybase.com/Partners/sun/apftech.html

What is worth noting, and you seem to forget: if your app is spending
its time waiting on one I/O operation at a time, you won't save
anything. However, if you manage to have multiple I/O requests served
asynchronously, you may get better performance on RAID systems; also,
your I/O hardware may work better, since the controllers may batch
requests, requeue them and optimise them (of course not in the case of
IDE disks).

Also, somebody asked about clustered indexes. I was looking for
information on this technique at Sybase (which is a great source of
information on various DB hints). If you read the above document between
the lines, the conclusion is that a clustered index is something that
allows the data from the table to be mixed with the index. I suppose
that index pages are clustered with data pages, so if you find the
appropriate record in the index, the data that this index entry points
to is on the same page or close to it.

At Sybase I have also found some interesting materials on bitmap indexes
(the idea is relatively simple) which look very interesting in the case
of some types of queries.

Mike

-- 
WWW: http://www.lodz.pdi.net/~mimo  tel: Int. Acc. Code + 48 42 148340
add: Michal Mosiewicz * Bugaj 66 m.54 * 95-200 Pabianice * POLAND
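To make the asynchronous-prefetch idea concrete: the point is to overlap
the read of block n+1 with the processing of block n. A minimal
double-buffered sketch using POSIX AIO follows; this is purely an
illustration of the technique, not anything PostgreSQL or Sybase did,
and the file path is made up.

/* Double-buffered sequential scan with POSIX AIO: start reading the
 * next block while the current one is being processed.  A hypothetical
 * sketch; error handling is minimal. */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>
#include <aio.h>

#define BLCKSZ 8192

int main(void)
{
    static char bufs[2][BLCKSZ];
    struct aiocb cb;
    int fd, cur = 0;
    long off = 0;
    ssize_t n;

    fd = open("/u/pg/data/base/test/testv", O_RDONLY);
    if (fd < 0)
        return 1;

    /* prime the pipeline with the first block */
    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf = bufs[cur];
    cb.aio_nbytes = BLCKSZ;
    cb.aio_offset = off;
    if (aio_read(&cb) != 0)
        return 1;

    for (;;)
    {
        const struct aiocb *list[1] = { &cb };

        /* wait for the outstanding read to finish */
        while (aio_error(&cb) == EINPROGRESS)
            aio_suspend(list, 1, NULL);
        n = aio_return(&cb);
        if (n <= 0)
            break;

        /* kick off the read for the next block ... */
        off += BLCKSZ;
        cur = 1 - cur;
        memset(&cb, 0, sizeof(cb));
        cb.aio_fildes = fd;
        cb.aio_buf = bufs[cur];
        cb.aio_nbytes = BLCKSZ;
        cb.aio_offset = off;
        aio_read(&cb);

        /* ... while this block (bufs[1 - cur]) is processed here */
    }
    close(fd);
    return 0;
}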
> > mmap() is very slow, perhaps because you are changing the process
> > virtual table maps for each chunk you read in, and faulting them in,
> > rather than using the file system for I/O.
>
> Huh, very slow? I wouldn't agree. I rewrote your mmap program to allow
> for using reads or mmaps.
> [...]
> Average results are (for sequential reading):
>
>     Using reads: total time - 21.39 (0.44user, 6.09system, 31%CPU)
>     Using mmaps: total time - 21.10 (0.57user, 4.92system, 25%CPU)
>
> [...]
> BTW, are you sure that your program was counting mmaps properly?

OK, here are my results using your test program. Basically, Linux is
double my speed for 8k mmap'ed chunks. Around 32k chunks, I get closer,
and 8mb chunks are the same. Glad to hear Linux has optimized mmap()
recently, because BSD/OS looks much slower than Linux on this.

Now, why does PostgreSQL sequentially scan a 160MB file in 37 seconds,
using its standard 8k buffers, when even your read test for me with 8k
buffers takes 54 seconds? In storage/file/fd.c, I see it using read(),
and I assume they are 8k chunks being read:

    returnCode = read(VfdCache[file].fd, buffer, amount);

Also attached is a modified version of my mmap() program that uses
fstat() to check the file size, so it knows when to stop. However, I
have also modified it to use a hard-coded file size to match your file
size.

Not sure what to conclude from these numbers.

---------------------------------------------------------------------------

    mmap,  8k                        47.81 real     0.66 user    33.12 sys
    read,  8k                        54.60 real     0.51 user    46.80 sys

    mmap, 32k                        29.80 real     0.23 user    13.81 sys
    read, 32k                        26.80 real     0.12 user    14.82 sys

    mmap, 8mb                        21.25 real     0.03 user     5.49 sys
    read, 8mb                        20.43 real     0.14 user     3.60 sys

    my mmap, 8k, your file size      64.67 real    15.99 user    34.00 sys
    my mmap, 32k, your file size     43.12 real    15.95 user    14.29 sys
    my mmap, 8mb, your file size     34.31 real    15.88 user     5.39 sys

---------------------------------------------------------------------------

#include <stdio.h>
#include <fcntl.h>
#include <assert.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h>

#define MMAP_SIZE (8192 * 1024)     /* chunk size; varied per test */

int main(int argc, char *argv[], char *envp[])
{
    int i, j, fd, spaces = 0;
    int off;
    char *addr;
    struct stat filestat;

    fd = open("/u/pg/data/base/test/test", O_RDONLY, 0);
    assert(fd != -1);
    assert(fstat(fd, &filestat) == 0);
    filestat.st_size = 111329280;   /* override to match Michal's file size */

    for (off = 0; 1; off += MMAP_SIZE)
    {
        addr = mmap(0, MMAP_SIZE, PROT_READ, MAP_SHARED, fd, off);
        assert(addr != MAP_FAILED);
        madvise(addr, MMAP_SIZE, MADV_SEQUENTIAL);
        for (j = 0; j < MMAP_SIZE; j++)
        {
            if (*(addr + j) != ' ')
                spaces++;
            if (off + j + 1 == filestat.st_size)
                goto done;
        }
        munmap(addr, MMAP_SIZE);
    }
done:
    printf("%d\n", spaces);
    return 0;
}

-- 
Bruce Momjian                          |  830 Blythe Avenue
maillist@candle.pha.pa.us              |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,     |  (610) 353-9879(w)
  +  Christ can be your backup.        |  (610) 853-3000(h)
> What is worth noting, and you seem to forget: if your app is spending
> its time waiting on one I/O operation at a time, you won't save
> anything. However, if you manage to have multiple I/O requests served
> asynchronously, you may get better performance on RAID systems; also,
> your I/O hardware may work better, since the controllers may batch
> requests, requeue them and optimise them (of course not in the case of
> IDE disks).

Yes, perhaps using readv would be a win, and perhaps easy to do. (A
sketch of the idea follows below.)

> Also, somebody asked about clustered indexes. I was looking for
> information on this technique at Sybase (which is a great source of
> information on various DB hints). If you read the above document
> between the lines, the conclusion is that a clustered index is
> something that allows the data from the table to be mixed with the
> index. I suppose that index pages are clustered with data pages, so if
> you find the appropriate record in the index, the data that this index
> entry points to is on the same page or close to it.

Sounds a lot like ISAM to me. And ISAM is a big win for static tables if
you need throughput. Remember the word-fragment issue, which CLUSTER
fixed. I had an Ingres word-fragment app that was terrible on btree, and
Marteen experienced the same. CLUSTER does simulate that for static
tables, so it may not be such a big win. I suppose if you needed such
performance, and the table changed a lot, it may be good.

> At Sybase I have also found some interesting materials on bitmap
> indexes (the idea is relatively simple) which look very interesting in
> the case of some types of queries.

-- 
Bruce Momjian                          |  830 Blythe Avenue
maillist@candle.pha.pa.us              |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,     |  (610) 353-9879(w)
  +  Christ can be your backup.        |  (610) 853-3000(h)
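For the readv idea above, the gain would be batching: one system call
fills several 8k page buffers at once. A small sketch follows; it is an
illustration only, the path and block count are arbitrary, and note that
readv by itself is still synchronous, so it reduces system-call overhead
rather than overlapping I/O with computation.

/* Pull eight 8k pages off the disk with a single readv() call.
 * An illustration of the idea, not anything in the PostgreSQL source. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/uio.h>

#define BLCKSZ  8192
#define NBLOCKS 8

int main(void)
{
    static char pages[NBLOCKS][BLCKSZ];
    struct iovec iov[NBLOCKS];
    ssize_t n;
    int i, fd;

    fd = open("/u/pg/data/base/test/testv", O_RDONLY);
    if (fd < 0)
        return 1;

    for (i = 0; i < NBLOCKS; i++)
    {
        iov[i].iov_base = pages[i];
        iov[i].iov_len = BLCKSZ;
    }

    /* one kernel crossing instead of eight */
    while ((n = readv(fd, iov, NBLOCKS)) > 0)
        printf("read %ld bytes\n", (long) n);

    close(fd);
    return 0;
}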
> Basically, Linux is double my speed for 8k mmap'ed chunks. Around 32k
> chunks, I get closer, and 8mb chunks are the same. Glad to hear Linux
> has optimized mmap() recently, because BSD/OS looks much slower than
> Linux on this.

Well Bruce, don't be too happy. Most people aren't yet running the
optimized kernel; I don't know if any of the benchmarks came from
someone running a bleeding-edge development version, which is what
2.1.99 would be; first feature-freeze release in preparation for v2.2,
afaik :)

And scrappy, no need to note that _all_ Linux kernels are bleeding-edge
releases :)

                    - Tom (from his Linux box...)
> > Basically, Linux is double my speed for 8k mmap'ed chunks. Around 32k
> > chunks, I get closer, and 8mb chunks are the same. Glad to hear Linux
> > has optimized mmap() recently, because BSD/OS looks much slower than
> > Linux on this.
>
> Well Bruce, don't be too happy. Most people aren't yet running the
> optimized kernel; I don't know if any of the benchmarks came from
> someone running a bleeding-edge development version, which is what
> 2.1.99 would be; first feature-freeze release in preparation for v2.2,
> afaik :)
>
> And scrappy, no need to note that _all_ Linux kernels are bleeding-edge
> releases :)

[FYI: the Linux dev. release is beating BSDI for 8k mmap()s by 2x.]

I must say, when I saw Linux beating BSDI by 2x, I started wondering if
the great BSDI engineers were sleeping or something. Now that I
understand that this improvement is a somewhat new effort, I feel a
little better.

-- 
Bruce Momjian                          |  830 Blythe Avenue
maillist@candle.pha.pa.us              |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,     |  (610) 353-9879(w)
  +  Christ can be your backup.        |  (610) 853-3000(h)
On Sat, 16 May 1998, Thomas G. Lockhart wrote:

> And scrappy, no need to note that _all_ Linux kernels are bleeding-edge
> releases :)

Moi? *innocent look*

Marc G. Fournier
Systems Administrator @ hub.org
primary: scrappy@hub.org    secondary: scrappy@{freebsd|postgresql}.org
I'm running a relatively 'bleeding edge' FreeBSD on hub.org; do you want
to try the same tests there? Not sure how much memory it will require,
but I'm running pretty much the same revision at home as at the
office... How are you generating your 177Meg file for testing? I'm
willing to run through the tests here and report on it...

Marc G. Fournier
Systems Administrator @ hub.org
primary: scrappy@hub.org    secondary: scrappy@{freebsd|postgresql}.org
Michal Mosiewicz wrote a few weeks ago:

> Also, somebody asked about clustered indexes. I was looking for
> information on this technique at Sybase (which is a great source of
> information on various DB hints). If you read the above document
> between the lines, the conclusion is that a clustered index is
> something that allows the data from the table to be mixed with the
> index. I suppose that index pages are clustered with data pages, so if
> you find the appropriate record in the index, the data that this index
> entry points to is on the same page or close to it.

Sybase clustered indexes are pretty standard stuff. Our B-tree indexes
have the index leaf pages storing index rows containing a key and a
'tid' that points to a data row in a separate heap. A clustered index is
the same in the upper levels as our B-tree, but the leaf pages contain
the actual data rows. Thus the data is maintained in sorted order for
the clustering key.

Also, in Sybase, all table pages are chained together. This has the side
effect of possibly speeding up sequential scans. Very nice, except that
key updates, or even index splits, cause rows to move, so all the
secondary indexes must be updated. And maintaining the page chain links
not only adds overhead, but also proves to be very error-prone, leading
to crosslinked tables and other horrors.

Still, for a huge class of applications clustered indexes are a big win.
It would be well worth adding this to pgsql. I would not do the page
chaining though.

-dg

David Gould            dg@illustra.com           510.628.3783 or 510.305.9468
Informix Software  (No, really)         300 Lakeside Drive  Oakland, CA 94612
Error 605 [tm] is a trademark of Sybase Inc.
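To make the layout difference concrete, here is a rough sketch of the
two leaf-entry shapes in C. All struct names and sizes are invented for
illustration and match neither Sybase's nor PostgreSQL's actual on-page
format.

/* Rough sketch of the two leaf page layouts described above.
 * All names and sizes are invented for illustration. */
#include <stdint.h>

#define PAGE_SIZE 8192

typedef struct {                /* tid: points at a row in a separate heap */
    uint32_t page_no;
    uint16_t line_no;
} Tid;

/* Plain B-tree leaf entry: key + pointer; the data lives elsewhere in
 * the heap, so fetching the row costs an extra page access. */
typedef struct {
    int32_t key;
    Tid     heap_row;
} BtreeLeafEntry;

/* Clustered leaf entry: the entry *is* the row, kept in key order, so
 * an index scan touches no other pages and sequential key ranges are
 * physically adjacent on disk. */
typedef struct {
    int32_t key;
    char    row_data[60];       /* rest of the row stored inline */
} ClusteredLeafEntry;

typedef struct {
    uint32_t next_page;         /* Sybase-style page chain link (the
                                 * error-prone part David mentions) */
    uint16_t nentries;
    char     entries[PAGE_SIZE - 8];
} LeafPage;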
> > Very interesting. Is it possible to get the schema, the query, and a
> > sample of the data or a generator program for the data? I am quite
> > surprised to see us do so well; I would have guessed that the per-row
> > overhead would have us down far below wc.
>
> Sure.
>
>     create table test (x1 int, x2 int, x3 char(10), x4 varchar(50));
>     insert into test values (3, 8, 'asdf', 'asdf');
>
>     insert into test select * from test;    <- continue until test is large
>
>     select * from test where x1 = 23423;    <- this is what I timed

So I finally got around to playing with this a little, and here is what
I get on my P133 (HX mb), 32MB mem, Linux 2.0.32 (glibc), with a Quantum
Atlas 2.1G disk on NCR810 SCSI.

For test at 1048576 rows, the file size is 80281600 bytes.

- time cat pg/test/data/base/dg/test >/dev/null
    0.02user 3.38system 0:14.34elapsed 23%CPU = 5467 KB per second.

- time wc pg/test/data/base/dg/test
    9.12user 2.83system 0:15.38elapsed 77%CPU = 5098 KB per second.

- time psql -c "select * from test where x1 = 23423;"
    0:30.59elapsed (cpu for psql not meaningful, but top said 95% for
    postgres) = 2563 KB per second. Not bad!

- time psql -c "select count(*) from test;"
    0:50.46elapsed = 1554 KB per second.
    (a trivial aggregate adds 20 seconds, or 65%)

- time psql -c "select count(*) from test where x1 = 3;"
    1:03.22elapsed = 1240 KB per second.
    (a trivial where clause adds another 13 seconds)

- time psql -c "select count(*) from test where x4 = 'asdf';"
    1:10.96elapsed = 1105 KB per second.
    (a varchar compare vs. an int compare adds only 7.7 seconds)

Btw, during all this, the disk hardly even made any noise (seeking).
ext2 seems to lay things out pretty well. The data dir right now is on
/home, which is 86% full, and it still managed to stream the 'cat' at
about full disk bandwidth.

-dg

David Gould            dg@illustra.com           510.628.3783 or 510.305.9468
Informix Software  (No, really)         300 Lakeside Drive  Oakland, CA 94612
"I believe OS/2 is destined to be the most important operating system, and
 possibly program, of all time" - Bill Gates, Nov, 1987.
> So I finally got around to playing with this a little, and here is what
> I get on my P133 (HX mb), 32MB mem, Linux 2.0.32 (glibc), with a
> Quantum Atlas 2.1G disk on NCR810 SCSI.
>
> For test at 1048576 rows, the file size is 80281600 bytes.
>
> - time cat pg/test/data/base/dg/test >/dev/null
>     0.02user 3.38system 0:14.34elapsed 23%CPU = 5467 KB per second.
>
> - time wc pg/test/data/base/dg/test
>     9.12user 2.83system 0:15.38elapsed 77%CPU = 5098 KB per second.
>
> - time psql -c "select * from test where x1 = 23423;"
>     0:30.59elapsed = 2563 KB per second. Not bad!
> [...]

OK, I have a Barracuda drive, which is probably the same speed as the
Atlas (Ultra SCSI), but I have a PP200, which may be why my PostgreSQL
could keep up better with the disks. My dd's showed ~6,000 KB/sec,
postgresql was 4,800 KB/sec, and wc was 4,500 KB/sec.

Interesting how the speed fell off with the count(). That is executor
overhead, I am sure.

-- 
Bruce Momjian                          |  830 Blythe Avenue
maillist@candle.pha.pa.us              |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,     |  (610) 353-9879(w)
  +  Christ can be your backup.        |  (610) 853-3000(h)