Re: [HACKERS] Sequential scan speed, mmap, disk i/o - Mailing list pgsql-hackers

From Bruce Momjian
Subject Re: [HACKERS] Sequential scan speed, mmap, disk i/o
Msg-id 199805160106.VAA26055@candle.pha.pa.us
In response to Re: [HACKERS] Sequential scan speed, mmap, disk i/o  (Michal Mosiewicz <mimo@interdata.com.pl>)
Responses Re: [HACKERS] Sequential scan speed, mmap, disk i/o
List pgsql-hackers
> > mmap() is very slow, perhaps because you are changing the process
> > virtual table maps for each chunk you read in, and faulting them in,
> > rather than using the file system for I/O.
>
> Huh, very slow? I wouldn't agree. I rewrote your mmap program so that it
> can use either read() or mmap().
>
> I tested it on a 111MB file. I decided to use an 8192-byte buffer
> (the standard Postgres page size). My system is Linux, P166, 64MB of RAM
> (note that I have a lot of software running currently, so the cache size
> is less than 25MB). I also changed the for(j..) step size to j+=256 just
> to make sure that it won't influence the results too much and you will
> see the difference better. mmap was run with (PROT_READ, MAP_SHARED).
>
> Average results are (for sequential reading):
> Using reads: total time - 21.39 (0.44user, 6.09system, 31%CPU)
> Using mmaps: total time - 21.10 (0.57user, 4.92system, 25%CPU)
>
> Note that in the case of reads the program spends much more time in system
> calls and uses more CPU. You may notice that on Linux, using mmap
> is about 20% cheaper than read. For random reading it's slightly
> more than 20%, as I remember. Total time is similar in both cases because
> of the throughput limit of my HD.
>
> BTW, are you sure that your program was counting mmaps properly? When I
> run it on my system it counts much more than it should. On my
> system the offset went past the end of the file, and then it kept running
> for a minute or more before it stopped. I attach my version (with a
> hardcoded 111MB file size to prevent this; of course you may change it).
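Michal's program is not included in this message. As a rough illustration of the read() side of his comparison, here is a minimal sketch assuming a hypothetical helper `scan_with_read()`; the name and the `chunk`/`step` parameters are illustrative, not taken from his code. It reads the file in fixed-size chunks and samples one byte every `step` bytes, so `step = 256` mirrors his j+=256 change.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not Michal's actual program): scan an open file from
 * the start with read() calls of `chunk` bytes, sample one byte every `step`
 * bytes within each chunk, and return how many sampled bytes are not
 * spaces, or -1 on error.
 */
long scan_with_read(int fd, size_t chunk, size_t step)
{
    long nonspaces = 0;
    char *buf;

    assert(chunk > 0 && step > 0);
    if (lseek(fd, 0, SEEK_SET) == (off_t) -1)
        return -1;
    if ((buf = malloc(chunk)) == NULL)
        return -1;
    for (;;)
    {
        ssize_t n = read(fd, buf, chunk);

        if (n == 0)
            break;                      /* end of file */
        if (n < 0)
        {
            if (errno == EINTR)
                continue;               /* interrupted by a signal: retry */
            free(buf);
            return -1;
        }
        for (size_t j = 0; j < (size_t) n; j += step)
            if (buf[j] != ' ')
                nonspaces++;
    }
    free(buf);
    return nonspaces;
}
```

Timing `scan_with_read(fd, 8192, 256)` under time(1) against an mmap()-based scan of the same file is the kind of comparison being discussed; the returned count only exists so the compiler cannot optimize the loop away.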

OK, here are my results using your test program:

Basically, Linux is twice as fast as my system for 8k mmap'ed chunks.
Around 32k chunks I get closer, and with 8MB chunks the times are the
same.  Glad to hear Linux has optimized mmap() recently, because BSD/OS
looks much slower than Linux on this.

Now, why does PostgreSQL sequentially scan a 160MB file in 37 seconds,
using its standard 8k buffers, when even your read test using 8k
buffers takes 54 seconds for me?

In storage/file/fd.c, I see it using read(), and I assume they are 8k
chunks being read:

    returnCode = read(VfdCache[file].fd, buffer, amount);
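fd.c hands the whole `amount` to a single read() call. For a regular file, read() normally returns the full amount except at end of file, but POSIX permits shorter counts, so a standalone block reader usually loops. A minimal sketch, assuming a hypothetical `read_block()` helper and an 8192-byte block size (neither is PostgreSQL's actual code):

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

#define BLOCK_SIZE 8192         /* assumed equal to the 8k PostgreSQL page */

/*
 * Hypothetical helper, not PostgreSQL code: read block `blkno` of an open
 * file into `buf` (at least BLOCK_SIZE bytes).  A single read() may return
 * fewer bytes than requested, so keep reading until the block is full or
 * end of file is reached.  Returns the byte count (short only at EOF) or -1.
 */
ssize_t read_block(int fd, long blkno, char *buf)
{
    size_t done = 0;

    assert(blkno >= 0);
    if (lseek(fd, (off_t) blkno * BLOCK_SIZE, SEEK_SET) == (off_t) -1)
        return -1;
    while (done < BLOCK_SIZE)
    {
        ssize_t n = read(fd, buf + done, BLOCK_SIZE - done);

        if (n == 0)
            break;                      /* end of file */
        if (n < 0)
        {
            if (errno == EINTR)
                continue;               /* interrupted by a signal: retry */
            return -1;
        }
        done += (size_t) n;
    }
    return (ssize_t) done;
}
```

A sequential scan in this style is just `read_block(fd, 0, buf)`, `read_block(fd, 1, buf)`, ... until the return value is 0, i.e. one 8k system call per page, which is the pattern being compared with the 8k read test above.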


Also attached is a modified version of my mmap() program that uses
fstat() to check the file size to know when to stop.  However, I have
also modified it to use a hardcoded file size matching your file.
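For comparison, a minimal sketch of a purely fstat()-bounded version, assuming a hypothetical helper `scan_with_mmap()` (not part of the attached program). It takes the real size from fstat(), rounds the window up to whole pages because mmap() offsets must be page-aligned, and clips the last mapping at end of file, so it never touches a page lying wholly past EOF (which would raise SIGBUS). It uses POSIX posix_madvise() in place of the BSD/Linux madvise().

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: count the non-space bytes of an open file by mapping
 * it in windows of about `chunk` bytes.  Returns -1 on error.
 */
long scan_with_mmap(int fd, size_t chunk)
{
    struct stat st;
    long pagesize = sysconf(_SC_PAGESIZE);
    long nonspaces = 0;

    assert(chunk > 0 && pagesize > 0);
    /* mmap() offsets must be page-aligned: round chunk up to whole pages */
    chunk = (chunk + (size_t) pagesize - 1) / (size_t) pagesize * (size_t) pagesize;

    if (fstat(fd, &st) != 0)
        return -1;
    for (off_t off = 0; off < st.st_size; off += (off_t) chunk)
    {
        /* clip the last window to the bytes that actually exist */
        size_t left = (size_t) (st.st_size - off);
        size_t len = left < chunk ? left : chunk;
        char *addr = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, off);

        if (addr == MAP_FAILED)         /* failure is MAP_FAILED, not NULL */
            return -1;
        posix_madvise(addr, len, POSIX_MADV_SEQUENTIAL);   /* a hint only */
        for (size_t j = 0; j < len; j++)
            if (addr[j] != ' ')
                nonspaces++;
        munmap(addr, len);
    }
    return nonspaces;
}
```

Calling it with `chunk` set to 8192, 32768, or 8192 * 1024 on a descriptor opened O_RDONLY corresponds to the 8k, 32k, and 8MB cases below.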

Not sure what to conclude from these numbers.

---------------------------------------------------------------------------

mmap, 8k
       47.81 real         0.66 user        33.12 sys

read, 8k
       54.60 real         0.51 user        46.80 sys

mmap, 32k
       29.80 real         0.23 user        13.81 sys

read, 32k
       26.80 real         0.12 user        14.82 sys

mmap, 8MB
       21.25 real         0.03 user         5.49 sys

read, 8MB
       20.43 real         0.14 user         3.60 sys


my mmap, 8k, your file size
       64.67 real        15.99 user        34.00 sys

my mmap, 32k, your file size
       43.12 real        15.95 user        14.29 sys

my mmap, 8MB, your file size
       34.31 real        15.88 user         5.39 sys


---------------------------------------------------------------------------

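#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h>

#define MMAP_SIZE (8192 * 1024)     /* map the file in 8MB windows */

int main(void)
{
    int j, fd, nonspaces = 0;
    off_t off;
    char *addr;
    struct stat filestat;

    fd = open("/u/pg/data/base/test/test", O_RDONLY, 0);
    /* keep calls with side effects out of assert(); NDEBUG would drop them */
    if (fd == -1 || fstat(fd, &filestat) != 0)
    {
        perror("/u/pg/data/base/test/test");
        return 1;
    }

    /* override the real size to match Michal's 111MB file; the file must be
     * at least this large, or the loop will read past EOF (SIGBUS) */
    filestat.st_size = 111329280;

    for (off = 0; ; off += MMAP_SIZE)
    {
        addr = mmap(0, MMAP_SIZE, PROT_READ, MAP_SHARED, fd, off);
        if (addr == MAP_FAILED)     /* mmap() fails with MAP_FAILED, not NULL */
        {
            perror("mmap");
            return 1;
        }
        madvise(addr, MMAP_SIZE, MADV_SEQUENTIAL);

        for (j = 0; j < MMAP_SIZE; j++)
        {
            if (addr[j] != ' ')     /* counts non-space bytes */
                nonspaces++;
            /* stop after the last byte of the (overridden) file size */
            if (off + j + 1 == filestat.st_size)
            {
                munmap(addr, MMAP_SIZE);
                goto done;
            }
        }
        munmap(addr, MMAP_SIZE);
    }
done:
    printf("%d\n", nonspaces);
    close(fd);
    return 0;
}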

--
Bruce Momjian                          |  830 Blythe Avenue
maillist@candle.pha.pa.us              |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,     |  (610) 353-9879(w)
  +  Christ can be your backup.        |  (610) 853-3000(h)
