Re: Sequential Scan Read-Ahead - Mailing list pgsql-hackers

From: Kyle
Subject: Re: Sequential Scan Read-Ahead
Msg-id: 15560.41493.529847.635632@doppelbock.patentinvestor.com
In response to: Re: Sequential Scan Read-Ahead  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses: Re: Sequential Scan Read-Ahead  (Bruce Momjian <pgman@candle.pha.pa.us>)
List: pgsql-hackers
Tom Lane wrote:
> ...
> Curt Sampson <cjs@cynic.net> writes:
> > 3. Proof by testing. I wrote a little ruby program to seek to a
> > random point in the first 2 GB of my raw disk partition and read
> > 1-8 8K blocks of data. (This was done as one I/O request.) (Using
> > the raw disk partition I avoid any filesystem buffering.)
> 
> And also ensure that you aren't testing the point at issue.
> The point at issue is that *in the presence of kernel read-ahead*
> it's quite unclear that there's any benefit to a larger request size.
> Ideally the kernel will have the next block ready for you when you
> ask, no matter what the request is.
> ...

I have to agree with Tom.  I think the numbers below show that with
kernel read-ahead, block size isn't an issue.

The big_file1 file used below holds 2.0 GB of random data, and the
machine has 512 MB of main memory.  This ensures that we're not
just reading back cached data.
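
For reference, a test file like that can be generated with something
along these lines (a sketch; /dev/urandom is an assumption, and GNU
dd spells the block size bs=1M):

    # 2048 x 1 MB of random bytes; /dev/urandom assumed available
    dd if=/dev/urandom of=big_file1 bs=1m count=2048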

foreach i (4k 8k 16k 32k 64k 128k)
    echo $i
    time dd bs=$i if=big_file1 of=/dev/null
end

and the results:

bs      user(s)  kernel(s)  elapsed
4k:     0.260    7.740      1:27.25
8k:     0.210    8.060      1:30.48
16k:    0.090    7.790      1:30.88
32k:    0.060    8.090      1:32.75
64k:    0.030    8.190      1:29.11
128k:   0.070    9.830      1:28.74

So with kernel read-ahead, we get basically the same elapsed (wall-clock)
time regardless of block size.  Sure, user time bottoms out at the 64k
block size, since larger blocks mean fewer read() calls, but kernel time
is increasing.
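
To make the user-time trend concrete, here is the syscall arithmetic
as a quick one-liner (assuming dd issues one read() per block over
the 2 GB file):

    # prints 524288 read() calls at 4k down to 16384 at 128k
    awk 'BEGIN { for (bs = 4096; bs <= 131072; bs *= 2) printf "%4dk: %7d read() calls\n", bs/1024, 2147483648/bs }'

Fewer trips through read() is where the user time goes.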


You could argue that this is a contrived example, since no other I/O is
being done.  So I created a second 2.0 GB file (big_file2) and ran two
simultaneous reads from the same disk.  Sure, performance went to hell,
but it shows that block size is still irrelevant in a multi-I/O
environment with sequential read-ahead.

foreach i (4k 8k 16k 32k 64k 128k)
    echo $i
    time dd bs=$i if=big_file1 of=/dev/null &
    time dd bs=$i if=big_file2 of=/dev/null &
    wait
end

        big_file1                       big_file2
bs      user(s)  kernel(s)  elapsed     user(s)  kernel(s)  elapsed
4k:     0.480    8.290      6:34.13     0.320    8.730      6:34.33
8k:     0.250    7.580      6:31.75     0.180    8.450      6:31.88
16k:    0.150    8.390      6:32.47     0.100    7.900      6:32.55
32k:    0.190    8.460      6:24.72     0.060    8.410      6:24.73
64k:    0.060    9.350      6:25.05     0.150    9.240      6:25.13
128k:   0.090   10.610      6:33.14     0.110   11.320      6:33.31


The differences in read times are basically in the mud.  Block size
just doesn't matter much with the kernel doing read-ahead.
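
If you want to see (or change) how aggressive the kernel's read-ahead
actually is, Linux exposes it through blockdev (the device name here
is just an example):

    # read-ahead window, in 512-byte sectors (device name is an example)
    blockdev --getra /dev/sda
    # widen it to 1 MB (2048 * 512 bytes); needs root
    blockdev --setra 2048 /dev/sda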

-Kyle

