Thread: Sequential scan speed, mmap, disk I/O
Someone was complaining about sequential scan speed, so I decided to run
a test.

I ran the test by performing a total read-through of a 177MB table. This
exceeds all my cache sizes by over two times, so the cache is totally
useless ("cache wipe"). I timed PostgreSQL's sequential scan (returning
no rows), and various Unix methods of reading a 177MB file.

I found that a sequential scan by PostgreSQL is almost as fast as, or
faster than, various other Unix methods of reading files. In fact,
mmap() is very slow, perhaps because you are changing the process
virtual table maps for each chunk you read in, and faulting them in,
rather than using the file system for I/O. Basically, we beat 'wc',
which is pretty good considering how little work 'wc' does.

My conclusion from this is that we really are not going to gain a lot of
speed by exploring some async solution, because if the data we need is
not in the cache, we really are going to spend most of our time waiting
for disk I/O.

Comments?

---------------------------------------------------------------------------

177MB file, BSD/OS 3.1, 64MB RAM, PostgreSQL current

    wc                                                      41 sec
    wc -l                                                   31 sec
    dd if=/u/pg/data/base/test/testv of=/dev/null bs=512    32 sec
    dd if=/u/pg/data/base/test/testv of=/dev/null bs=8k     31 sec
    dd if=/u/pg/data/base/test/testv of=/dev/null bs=256k   31 sec
    dd if=/u/pg/data/base/test/testv of=/dev/null bs=1m     30 sec

    mmap() of file in 8k chunks                             99 sec
    mmap() of file in 8mb chunks                            40 sec
    mmap() of file in 32mb chunks                           56 sec

    PostgreSQL sequential scan                              37 sec

---------------------------------------------------------------------------

/* mmap() test program */
#include <stdio.h>
#include <fcntl.h>
#include <assert.h>
#include <sys/types.h>
#include <sys/mman.h>

#define MMAP_SIZE 8192          /* chunk size */

int main(int argc, char *argv[], char *envp[])
{
    int i, j, fd, spaces = 0;
    int off;
    char *addr;

    fd = open("/u/pg/data/base/test/testv", O_RDONLY, 0);
    assert(fd != -1);

    /* note: nothing stops this loop at end-of-file */
    for (off = 0; 1; off += MMAP_SIZE)
    {
        addr = mmap(0, MMAP_SIZE, PROT_READ, MAP_SHARED, fd, off);
        assert(addr != MAP_FAILED);
        for (j = 0; j < MMAP_SIZE; j++)
            if (*(addr + j) != ' ')
                spaces++;
        munmap(addr, MMAP_SIZE);
    }
    printf("%d\n", spaces);
    return 0;
}

-- 
Bruce Momjian                          |  830 Blythe Avenue
maillist@candle.pha.pa.us              |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,     |  (610) 353-9879(w)
  +  Christ can be your backup.        |  (610) 853-3000(h)
> Someone was complaining about sequential scan speed, so I decided to run
> a test.
>
>     wc                                                      41 sec
>     wc -l                                                   31 sec
>     dd if=/u/pg/data/base/test/testv of=/dev/null bs=512    32 sec
>     dd if=/u/pg/data/base/test/testv of=/dev/null bs=8k     31 sec
>     dd if=/u/pg/data/base/test/testv of=/dev/null bs=256k   31 sec
>     dd if=/u/pg/data/base/test/testv of=/dev/null bs=1m     30 sec
>
>     mmap() of file in 8k chunks                             99 sec
>     mmap() of file in 8mb chunks                            40 sec
>     mmap() of file in 32mb chunks                           56 sec
>
>     PostgreSQL sequential scan                              37 sec

Let me add: these times are on a PP200, with SCSI Ultra Barracuda
drives, BSD/OS 3.1, 64MB RAM.

-- 
Bruce Momjian                          |  830 Blythe Avenue
maillist@candle.pha.pa.us              |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,     |  (610) 353-9879(w)
  +  Christ can be your backup.        |  (610) 853-3000(h)
> > Someone was complaining about sequential scan speed, so I decided to run
> > a test.
> >
> >     wc                                      41 sec
> >     [...]
> >     PostgreSQL sequential scan              37 sec
>
> Let me add: these times are on a PP200, with SCSI Ultra Barracuda
> drives, BSD/OS 3.1, 64MB RAM.

Very interesting. Is it possible to get the schema, the query, and a
sample of the data or a generator program for the data? I am quite
surprised to see us do so well; I would have guessed that the per-row
overhead would have us down far below wc.

Although, on second thought, dd has to write the data as well as read
it, and we don't; and wc has to examine every character, where if the
"where clause" applies to only a portion of the row, we don't. Still,
it would be nice to see more info about this test.

Btw, I hope to post tomorrow some hopefully interesting results about
the speed of TAS and S_LOCK...

-dg

David Gould            dg@illustra.com           510.628.3783 or 510.305.9468
Informix Software  (No, really)         300 Lakeside Drive  Oakland, CA 94612
"Of course, someone who knows more about this will correct me if I'm
 wrong, and someone who knows less will correct me if I'm right."
               --David Palmer (palmer@tybalt.caltech.edu)
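For readers who haven't run into them: TAS is the test-and-set primitive
that PostgreSQL's spinlocks are built on, and S_LOCK spins on it until
the lock is acquired. A minimal sketch of the idea follows, using a
modern GCC builtin rather than the per-platform assembly PostgreSQL
actually uses; it is an illustration of the concept, not the real
implementation.

/* The idea behind TAS/S_LOCK: spin on an atomic test-and-set until
 * the lock is free.  A sketch using a GCC builtin, not PostgreSQL's
 * actual per-platform assembly. */
typedef volatile unsigned char slock_t;

static int
tas(slock_t *lock)
{
    /* returns the previous value: nonzero means someone holds it */
    return __sync_lock_test_and_set(lock, 1);
}

static void
s_lock(slock_t *lock)
{
    while (tas(lock))
        ;                       /* spin until acquired */
}

static void
s_unlock(slock_t *lock)
{
    __sync_lock_release(lock);  /* store 0 with release semantics */
}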
> > Someone was complaining about sequential scan speed, so I decided to run
> > a test.
> > [...]
>
> Let me add: these times are on a PP200, with SCSI Ultra Barracuda
> drives, BSD/OS 3.1, 64MB RAM.

Also, the rows were very small, with just two ints, a char(10), and a
varchar(50), so PostgreSQL was processing most of the 177MB of data, in
the sense of having to look at most of each block.

-- 
Bruce Momjian                          |  830 Blythe Avenue
maillist@candle.pha.pa.us              |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,     |  (610) 353-9879(w)
  +  Christ can be your backup.        |  (610) 853-3000(h)
> Very interesting. Is it possible to get the schema, the query, and a
> sample of the data or a generator program for the data? I am quite
> surprised to see us do so well; I would have guessed that the per-row
> overhead would have us down far below wc.

Sure.

    create table test (x1 int, x2 int, x3 char(10), x4 varchar(50));
    insert into test values (3, 8, 'asdf', 'asdf');

    insert into test select * from test;    <- continue until test is large

    select * from test where x1 = 23423;    <- this is what I timed

-- 
Bruce Momjian                          |  830 Blythe Avenue
maillist@candle.pha.pa.us              |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,     |  (610) 353-9879(w)
  +  Christ can be your backup.        |  (610) 853-3000(h)
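A generator program is another way to build the test data, if the
repeated self-insert doubling is inconvenient. The sketch below is a
hypothetical helper, not something from the thread: it emits
tab-separated rows matching the schema above, which could then be loaded
with COPY (or psql's \copy).

/* Hypothetical data generator for the test table: writes tab-separated
 * rows suitable for COPY.  Row values mimic Bruce's sample row. */
#include <stdio.h>

int main(void)
{
    long i;

    /* roughly two million copies of the sample row gives a table in
     * the neighborhood of Bruce's 177MB */
    for (i = 0; i < 2000000; i++)
        printf("3\t8\tasdf\tasdf\n");
    return 0;
}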
Bruce Momjian wrote:

> In fact, mmap() is very slow, perhaps because you are changing the
> process virtual table maps for each chunk you read in, and faulting
> them in, rather than using the file system for I/O.

Huh, very slow? I wouldn't agree. I rewrote your mmap program to allow
for using reads or mmaps. I tested it on a 111MB file. I decided to use
an 8192-byte buffer size (the standard postgres page size). My system is
Linux, P166, 64MB of RAM (note that I have a lot of software running
currently, so the cache size is less than 25MB). I also changed the
for(j..) step size to j += 256, just to make sure that the scan itself
won't influence the results too much and you will see the difference
better. mmap was run with (PROT_READ, MAP_SHARED).

Average results are (for sequential reading):

    Using reads: total time - 21.39 (0.44user, 6.09system, 31%CPU)
    Using mmaps: total time - 21.10 (0.57user, 4.92system, 25%CPU)

Note that in the case of reads the program spends much more time in
system calls and uses more CPU. You may notice that on Linux, using mmap
is about 20% cheaper than read. In the case of random reading it's
slightly more than 20%, as I remember. Total time is similar in both
cases because both hit the throughput limit of my HD.

BTW, are you sure that your program was counting mmaps properly? When I
ran it on my system it counted much more than it should: the offset
crossed over the file's boundary, and then it worked a minute or more
before it stopped. I attach my version (with a hardcoded 111MB file size
to prevent this; of course you may change it).

Mike

-- 
WWW: http://www.lodz.pdi.net/~mimo  tel: Int. Acc. Code + 48 42 148340
add: Michal Mosiewicz * Bugaj 66 m.54 * 95-200 Pabianice * POLAND
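Michal's version went out as an attachment, which is not preserved in
this archive. A reconstruction sketched from his description above
(8192-byte chunks, the j += 256 stride, PROT_READ/MAP_SHARED, and the
hardcoded 111MB file size) might look like the following; the file path
and the command-line mode switch are assumptions.

/* Sketch of a read-vs-mmap benchmark, reconstructed from the
 * description above; the path and argv mode switch are assumptions. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <assert.h>
#include <sys/types.h>
#include <sys/mman.h>

#define CHUNK     8192          /* standard postgres page size */
#define FILE_SIZE 111329280     /* hardcoded, as in Michal's version */

int main(int argc, char *argv[])
{
    int fd, j, spaces = 0;
    long off;
    char buf[CHUNK], *addr;
    int use_mmap = (argc > 1 && strcmp(argv[1], "mmap") == 0);

    fd = open("/u/pg/data/base/test/testv", O_RDONLY, 0);
    assert(fd != -1);

    for (off = 0; off < FILE_SIZE; off += CHUNK)
    {
        if (use_mmap)
        {
            addr = mmap(0, CHUNK, PROT_READ, MAP_SHARED, fd, off);
            assert(addr != MAP_FAILED);
        }
        else
        {
            if (read(fd, buf, CHUNK) <= 0)
                break;
            addr = buf;
        }
        /* step by 256 so the scan itself doesn't dominate the timing */
        for (j = 0; j < CHUNK; j += 256)
            if (addr[j] != ' ')
                spaces++;
        if (use_mmap)
            munmap(addr, CHUNK);
    }
    printf("%d\n", spaces);
    return 0;
}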
Bruce Momjian wrote:

> My conclusion from this is that we really are not going to gain a lot of
> speed by exploring some async solution, because if the data we need is
> not in the cache, we really are going to spend most of our time waiting
> for disk I/O.
>
> Comments?

Well, I've just found an interesting article on APF (Asynchronous
Prefetch) at the Sybase site:

    http://www.sybase.com/Partners/sun/apftech.html

What is worth noting, and you seem to forget: if your app is spending
its time waiting on one I/O operation at a time, you won't save
anything. However, if you manage to have multiple I/O requests served
asynchronously, you may get better performance on RAID systems; also,
your I/O hardware may work better, since the controllers may batch
requests, requeue them and optimise them (of course not in the case of
IDE disks).

Also, somebody asked about clustered indexes. I was looking for
information on this technique at Sybase (which is a great source of
information on various DB hints). If you read the above document between
the lines, the conclusion is that a clustered index is something that
allows the data from the table to be mixed with the index. I suppose
that index pages are clustered with data pages, so if you find the
appropriate record in the index, the data that this index entry points
to is on the same page or close to it.

At Sybase I have also found some interesting materials on bitmap indexes
(the idea is relatively simple) which look very interesting in the case
of some types of queries.

Mike

-- 
WWW: http://www.lodz.pdi.net/~mimo  tel: Int. Acc. Code + 48 42 148340
add: Michal Mosiewicz * Bugaj 66 m.54 * 95-200 Pabianice * POLAND
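To make the asynchronous-prefetch idea concrete: the point is to overlap
the read of block n+1 with the processing of block n. A minimal
double-buffered sketch using POSIX AIO follows; this is purely an
illustration of the technique, not anything PostgreSQL or Sybase did,
and the file path is made up.

/* Double-buffered sequential scan with POSIX AIO: start reading the
 * next block while the current one is being processed.  A hypothetical
 * sketch; error handling is minimal. */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>
#include <aio.h>

#define BLCKSZ 8192

int main(void)
{
    static char bufs[2][BLCKSZ];
    struct aiocb cb;
    int fd, cur = 0;
    long off = 0;
    ssize_t n;

    fd = open("/u/pg/data/base/test/testv", O_RDONLY);
    if (fd < 0)
        return 1;

    /* prime the pipeline with the first block */
    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf = bufs[cur];
    cb.aio_nbytes = BLCKSZ;
    cb.aio_offset = off;
    if (aio_read(&cb) != 0)
        return 1;

    for (;;)
    {
        const struct aiocb *list[1] = { &cb };

        /* wait for the outstanding read to finish */
        while (aio_error(&cb) == EINPROGRESS)
            aio_suspend(list, 1, NULL);
        n = aio_return(&cb);
        if (n <= 0)
            break;

        /* kick off the read for the next block ... */
        off += BLCKSZ;
        cur = 1 - cur;
        memset(&cb, 0, sizeof(cb));
        cb.aio_fildes = fd;
        cb.aio_buf = bufs[cur];
        cb.aio_nbytes = BLCKSZ;
        cb.aio_offset = off;
        aio_read(&cb);

        /* ... while this block (bufs[1 - cur]) is processed here */
    }
    close(fd);
    return 0;
}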
> > mmap() is very slow, perhaps because you are changing the process
> > virtual table maps for each chunk you read in, and faulting them in,
> > rather than using the file system for I/O.
>
> Huh, very slow? I wouldn't agree. I rewrote your mmap program to allow
> for using reads or mmaps.
> [...]
> Average results are (for sequential reading):
>
>     Using reads: total time - 21.39 (0.44user, 6.09system, 31%CPU)
>     Using mmaps: total time - 21.10 (0.57user, 4.92system, 25%CPU)
>
> [...]
> BTW, are you sure that your program was counting mmaps properly?

OK, here are my results using your test program. Basically, Linux is
double my speed for 8k mmap'ed chunks. Around 32k chunks, I get closer,
and 8mb chunks are the same. Glad to hear Linux has optimized mmap()
recently, because BSD/OS looks much slower than Linux on this.

Now, why does PostgreSQL sequentially scan a 160MB file in 37 seconds,
using its standard 8k buffers, when even your read test for me with 8k
buffers takes 54 seconds? In storage/file/fd.c, I see it using read(),
and I assume they are 8k chunks being read:

    returnCode = read(VfdCache[file].fd, buffer, amount);

Also attached is a modified version of my mmap() program that uses
fstat() to check the file size, so it knows when to stop. However, I
have also modified it to use a hard-coded file size to match your file
size.

Not sure what to conclude from these numbers.

---------------------------------------------------------------------------

    mmap,  8k                        47.81 real     0.66 user    33.12 sys
    read,  8k                        54.60 real     0.51 user    46.80 sys

    mmap, 32k                        29.80 real     0.23 user    13.81 sys
    read, 32k                        26.80 real     0.12 user    14.82 sys

    mmap, 8mb                        21.25 real     0.03 user     5.49 sys
    read, 8mb                        20.43 real     0.14 user     3.60 sys

    my mmap, 8k, your file size      64.67 real    15.99 user    34.00 sys
    my mmap, 32k, your file size     43.12 real    15.95 user    14.29 sys
    my mmap, 8mb, your file size     34.31 real    15.88 user     5.39 sys

---------------------------------------------------------------------------

#include <stdio.h>
#include <fcntl.h>
#include <assert.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h>

#define MMAP_SIZE (8192 * 1024)     /* chunk size; varied per test */

int main(int argc, char *argv[], char *envp[])
{
    int i, j, fd, spaces = 0;
    int off;
    char *addr;
    struct stat filestat;

    fd = open("/u/pg/data/base/test/test", O_RDONLY, 0);
    assert(fd != -1);
    assert(fstat(fd, &filestat) == 0);
    filestat.st_size = 111329280;   /* override to match Michal's file size */

    for (off = 0; 1; off += MMAP_SIZE)
    {
        addr = mmap(0, MMAP_SIZE, PROT_READ, MAP_SHARED, fd, off);
        assert(addr != MAP_FAILED);
        madvise(addr, MMAP_SIZE, MADV_SEQUENTIAL);
        for (j = 0; j < MMAP_SIZE; j++)
        {
            if (*(addr + j) != ' ')
                spaces++;
            if (off + j + 1 == filestat.st_size)
                goto done;
        }
        munmap(addr, MMAP_SIZE);
    }
done:
    printf("%d\n", spaces);
    return 0;
}

-- 
Bruce Momjian                          |  830 Blythe Avenue
maillist@candle.pha.pa.us              |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,     |  (610) 353-9879(w)
  +  Christ can be your backup.        |  (610) 853-3000(h)
> What is worth noting, and you seem to forget: if your app is spending
> its time waiting on one I/O operation at a time, you won't save
> anything. However, if you manage to have multiple I/O requests served
> asynchronously, you may get better performance on RAID systems; also,
> your I/O hardware may work better, since the controllers may batch
> requests, requeue them and optimise them (of course not in the case of
> IDE disks).

Yes, perhaps using readv would be a win, and perhaps easy to do. (A
sketch of the idea follows below.)

> Also, somebody asked about clustered indexes. I was looking for
> information on this technique at Sybase (which is a great source of
> information on various DB hints). If you read the above document
> between the lines, the conclusion is that a clustered index is
> something that allows the data from the table to be mixed with the
> index. I suppose that index pages are clustered with data pages, so if
> you find the appropriate record in the index, the data that this index
> entry points to is on the same page or close to it.

Sounds a lot like ISAM to me. And ISAM is a big win for static tables if
you need throughput. Remember the word-fragment issue, which CLUSTER
fixed. I had an Ingres word-fragment app that was terrible on btree, and
Marteen experienced the same. CLUSTER does simulate that for static
tables, so it may not be such a big win. I suppose if you needed such
performance, and the table changed a lot, it may be good.

> At Sybase I have also found some interesting materials on bitmap
> indexes (the idea is relatively simple) which look very interesting in
> the case of some types of queries.

-- 
Bruce Momjian                          |  830 Blythe Avenue
maillist@candle.pha.pa.us              |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,     |  (610) 353-9879(w)
  +  Christ can be your backup.        |  (610) 853-3000(h)
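For the readv idea above, the gain would be batching: one system call
fills several 8k page buffers at once. A small sketch follows; it is an
illustration only, the path and block count are arbitrary, and note that
readv by itself is still synchronous, so it reduces system-call overhead
rather than overlapping I/O with computation.

/* Pull eight 8k pages off the disk with a single readv() call.
 * An illustration of the idea, not anything in the PostgreSQL source. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/uio.h>

#define BLCKSZ  8192
#define NBLOCKS 8

int main(void)
{
    static char pages[NBLOCKS][BLCKSZ];
    struct iovec iov[NBLOCKS];
    ssize_t n;
    int i, fd;

    fd = open("/u/pg/data/base/test/testv", O_RDONLY);
    if (fd < 0)
        return 1;

    for (i = 0; i < NBLOCKS; i++)
    {
        iov[i].iov_base = pages[i];
        iov[i].iov_len = BLCKSZ;
    }

    /* one kernel crossing instead of eight */
    while ((n = readv(fd, iov, NBLOCKS)) > 0)
        printf("read %ld bytes\n", (long) n);

    close(fd);
    return 0;
}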
> Basically, Linux is double my speed for 8k mmap'ed chunks. Around 32k
> chunks, I get closer, and 8mb chunks are the same. Glad to hear Linux
> has optimized mmap() recently, because BSD/OS looks much slower than
> Linux on this.

Well Bruce, don't be too happy. Most people aren't yet running the
optimized kernel; I don't know if any of the benchmarks came from
someone running a bleeding-edge development version, which is what
2.1.99 would be; first feature-freeze release in preparation for v2.2,
afaik :)

And scrappy, no need to note that _all_ Linux kernels are bleeding-edge
releases :)

                    - Tom (from his Linux box...)
> > Basically, Linux is double my speed for 8k mmap'ed chunks. Around 32k
> > chunks, I get closer, and 8mb chunks are the same. Glad to hear Linux
> > has optimized mmap() recently, because BSD/OS looks much slower than
> > Linux on this.
>
> Well Bruce, don't be too happy. Most people aren't yet running the
> optimized kernel; I don't know if any of the benchmarks came from
> someone running a bleeding-edge development version, which is what
> 2.1.99 would be; first feature-freeze release in preparation for v2.2,
> afaik :)
>
> And scrappy, no need to note that _all_ Linux kernels are bleeding-edge
> releases :)

[FYI: the Linux dev. release is beating BSDI for 8k mmap()s by 2x.]

I must say, when I saw Linux beating BSDI by 2x, I started wondering if
the great BSDI engineers were sleeping or something. Now that I
understand that this improvement is a somewhat new effort, I feel a
little better.

-- 
Bruce Momjian                          |  830 Blythe Avenue
maillist@candle.pha.pa.us              |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,     |  (610) 353-9879(w)
  +  Christ can be your backup.        |  (610) 853-3000(h)
On Sat, 16 May 1998, Thomas G. Lockhart wrote:

> And scrappy, no need to note that _all_ Linux kernels are bleeding-edge
> releases :)

Moi? *innocent look*

Marc G. Fournier
Systems Administrator @ hub.org
primary: scrappy@hub.org    secondary: scrappy@{freebsd|postgresql}.org
I'm running a relatively 'bleeding edge' FreeBSD on hub.org; do you want
to try the same tests there? Not sure how much memory it will require,
but I'm running pretty much the same revision at home as at the
office... How are you generating your 177Meg file for testing? I'm
willing to run through the tests here and report on it...

Marc G. Fournier
Systems Administrator @ hub.org
primary: scrappy@hub.org    secondary: scrappy@{freebsd|postgresql}.org
Michal Mosiewicz wrote a few weeks ago:

> Also, somebody asked about clustered indexes. I was looking for
> information on this technique at Sybase (which is a great source of
> information on various DB hints). If you read the above document
> between the lines, the conclusion is that a clustered index is
> something that allows the data from the table to be mixed with the
> index. I suppose that index pages are clustered with data pages, so if
> you find the appropriate record in the index, the data that this index
> entry points to is on the same page or close to it.

Sybase clustered indexes are pretty standard stuff. Our B-tree indexes
have the index leaf pages storing index rows containing a key and a
'tid' that points to a data row in a separate heap. A clustered index is
the same in the upper levels as our B-tree, but the leaf pages contain
the actual data rows. Thus the data is maintained in sorted order for
the clustering key.

Also, in Sybase, all table pages are chained together. This has the side
effect of possibly speeding up sequential scans. Very nice, except that
key updates, or even index splits, cause rows to move, so all the
secondary indexes must be updated. And maintaining the page chain links
not only adds overhead, but also proves to be very error-prone, leading
to crosslinked tables and other horrors.

Still, for a huge class of applications clustered indexes are a big win.
It would be well worth adding this to pgsql. I would not do the page
chaining though.

-dg

David Gould            dg@illustra.com           510.628.3783 or 510.305.9468
Informix Software  (No, really)         300 Lakeside Drive  Oakland, CA 94612
Error 605 [tm] is a trademark of Sybase Inc.
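To make the layout difference concrete, here is a rough sketch of the
two leaf-entry shapes in C. All struct names and sizes are invented for
illustration and match neither Sybase's nor PostgreSQL's actual on-page
format.

/* Rough sketch of the two leaf page layouts described above.
 * All names and sizes are invented for illustration. */
#include <stdint.h>

#define PAGE_SIZE 8192

typedef struct {                /* tid: points at a row in a separate heap */
    uint32_t page_no;
    uint16_t line_no;
} Tid;

/* Plain B-tree leaf entry: key + pointer; the data lives elsewhere in
 * the heap, so fetching the row costs an extra page access. */
typedef struct {
    int32_t key;
    Tid     heap_row;
} BtreeLeafEntry;

/* Clustered leaf entry: the entry *is* the row, kept in key order, so
 * an index scan touches no other pages and sequential key ranges are
 * physically adjacent on disk. */
typedef struct {
    int32_t key;
    char    row_data[60];       /* rest of the row stored inline */
} ClusteredLeafEntry;

typedef struct {
    uint32_t next_page;         /* Sybase-style page chain link (the
                                 * error-prone part David mentions) */
    uint16_t nentries;
    char     entries[PAGE_SIZE - 8];
} LeafPage;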
> > Very interesting. Is it possible to get the schema, the query, and a
> > sample of the data or a generator program for the data? I am quite
> > surprised to see us do so well; I would have guessed that the per-row
> > overhead would have us down far below wc.
>
> Sure.
>
>     create table test (x1 int, x2 int, x3 char(10), x4 varchar(50));
>     insert into test values (3, 8, 'asdf', 'asdf');
>
>     insert into test select * from test;    <- continue until test is large
>
>     select * from test where x1 = 23423;    <- this is what I timed

So I finally got around to playing with this a little, and here is what
I get on my P133 (HX mb), 32MB mem, Linux 2.0.32 (glibc), with a Quantum
Atlas 2.1G disk on NCR810 SCSI.

For test at 1048576 rows, the file size is 80281600 bytes.

- time cat pg/test/data/base/dg/test >/dev/null
    0.02user 3.38system 0:14.34elapsed 23%CPU = 5467 KB per second.

- time wc pg/test/data/base/dg/test
    9.12user 2.83system 0:15.38elapsed 77%CPU = 5098 KB per second.

- time psql -c "select * from test where x1 = 23423;"
    0:30.59elapsed (cpu for psql not meaningful, but top said 95% for
    postgres) = 2563 KB per second. Not bad!

- time psql -c "select count(*) from test;"
    0:50.46elapsed = 1554 KB per second.
    (a trivial aggregate adds 20 seconds, or 65%)

- time psql -c "select count(*) from test where x1 = 3;"
    1:03.22elapsed = 1240 KB per second.
    (a trivial where clause adds another 13 seconds)

- time psql -c "select count(*) from test where x4 = 'asdf';"
    1:10.96elapsed = 1105 KB per second.
    (a varchar compare vs. an int compare adds only 7.7 seconds)

Btw, during all this, the disk hardly even made any noise (seeking).
ext2 seems to lay things out pretty well. The data dir right now is on
/home, which is 86% full, and it still managed to stream the 'cat' at
about full disk bandwidth.

-dg

David Gould            dg@illustra.com           510.628.3783 or 510.305.9468
Informix Software  (No, really)         300 Lakeside Drive  Oakland, CA 94612
"I believe OS/2 is destined to be the most important operating system, and
 possibly program, of all time" - Bill Gates, Nov, 1987.
> So I finally got around to playing with this a little, and here is what
> I get on my P133 (HX mb), 32MB mem, Linux 2.0.32 (glibc), with a
> Quantum Atlas 2.1G disk on NCR810 SCSI.
>
> For test at 1048576 rows, the file size is 80281600 bytes.
>
> - time cat pg/test/data/base/dg/test >/dev/null
>     0.02user 3.38system 0:14.34elapsed 23%CPU = 5467 KB per second.
>
> - time wc pg/test/data/base/dg/test
>     9.12user 2.83system 0:15.38elapsed 77%CPU = 5098 KB per second.
>
> - time psql -c "select * from test where x1 = 23423;"
>     0:30.59elapsed = 2563 KB per second. Not bad!
> [...]

OK, I have a Barracuda drive, which is probably the same speed as the
Atlas (Ultra SCSI), but I have a PP200, which may be why my PostgreSQL
could keep up better with the disks. My dd's showed ~6,000 KB/sec,
postgresql was 4,800 KB/sec, and wc was 4,500 KB/sec.

Interesting how the speed fell off with the count(). That is executor
overhead, I am sure.

-- 
Bruce Momjian                          |  830 Blythe Avenue
maillist@candle.pha.pa.us              |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,     |  (610) 353-9879(w)
  +  Christ can be your backup.        |  (610) 853-3000(h)