Re: Sequential Scan Read-Ahead - Mailing list pgsql-hackers

From Curt Sampson
Subject Re: Sequential Scan Read-Ahead
Date
Msg-id Pine.NEB.4.43.0204251534590.3111-100000@angelic.cynic.net
In response to Re: Sequential Scan Read-Ahead  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Sequential Scan Read-Ahead  (Curt Sampson <cjs@cynic.net>)
Re: Sequential Scan Read-Ahead  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: Sequential Scan Read-Ahead  (Bruce Momjian <pgman@candle.pha.pa.us>)
List pgsql-hackers
On Wed, 24 Apr 2002, Tom Lane wrote:

> Curt Sampson <cjs@cynic.net> writes:
> > Grabbing bigger chunks is always optimal, AFAICT, if they're not
> > *too* big and you use the data. A single 64K read takes very little
> > longer than a single 8K read.
>
> Proof?

Well, there are various sorts of "proof" for this assertion. What
sort do you want?

Here are a few samples; if you're looking for something different
to satisfy you, let's discuss it.

1. Theoretical proof: two components of the delay in retrieving a
block from disk are the seek (moving the disk arm to the right
cylinder) and the rotational latency (waiting for the right block
to rotate under the head).

When retrieving, say, eight adjacent blocks, these will be spread
across no more than two cylinders (with luck, only one). The worst
case access time for a single block is the disk arm movement plus
the full rotational wait; this is the same as the worst case for
eight blocks if they're all on one cylinder. If they're not on one
cylinder, they're still on adjacent cylinders, requiring a very
short seek.
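
To put rough numbers on that (the drive parameters below are
assumptions for illustration, not measurements of any particular
disk):

    # Worst case: a full seek plus a full rotational wait, plus the
    # media transfer. All drive parameters here are assumed.
    seek_ms     = 8.5                  # assumed average seek time
    rotation_ms = 60_000.0 / 7200      # one rotation at 7200 RPM, ~8.3 ms
    mb_per_sec  = 20.0                 # assumed sustained transfer rate

    [8, 64].each do |kb|
      transfer_ms = kb / 1024.0 / mb_per_sec * 1000.0
      printf "%2dK read: %.1f ms\n", kb, seek_ms + rotation_ms + transfer_ms
    end
    # =>  8K read: 17.2 ms
    # => 64K read: 20.0 ms
    # The extra transfer time is small next to the seek and rotation
    # that both reads pay anyway.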

2. Proof by others using it: SQL Server uses 64K reads when doing
table scans, as they say their research indicates that the major
limitation is usually the number of I/O requests, not the I/O
capacity of the disk. BSD's FFS explicitly separates the optimum
allocation size for storage (1K fragments) from the optimum read
size (8K blocks) because performance was found to be much better
when larger blocks were read. Most filesystem vendors do read-ahead
for this very reason, too.

3. Proof by testing. I wrote a little ruby program that seeks to
a random point in the first 2 GB of my raw disk partition and reads
1, 2, 4, 8 or 16 adjacent 8K blocks of data, each group as a single
I/O request. (Using the raw disk partition avoids any filesystem
buffering.) Here are typical results:
 125 reads of 16x8K blocks: 1.9 sec, 66.04 req/sec. 15.1 ms/req, 0.946 ms/block
 250 reads of  8x8K blocks: 1.9 sec, 132.3 req/sec. 7.56 ms/req, 0.945 ms/block
 500 reads of  4x8K blocks: 2.5 sec, 199 req/sec.   5.03 ms/req, 1.26 ms/block
1000 reads of  2x8K blocks: 3.8 sec, 261.6 req/sec. 3.82 ms/req, 1.91 ms/block
2000 reads of  1x8K blocks: 6.4 sec, 310.4 req/sec. 3.22 ms/req, 3.22 ms/block
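
The program itself isn't much; here's a sketch of what it does (the
device path, argument handling and defaults are illustrative
assumptions, not necessarily what I ran):

    # Seek to a random 8K-aligned point in the first 2 GB of a raw
    # partition and read N adjacent 8K blocks as one I/O request.
    BLOCK  = 8 * 1024
    BLOCKS = 2 * 1024**3 / BLOCK          # 8K blocks in the first 2 GB

    n     = (ARGV[0] || 1).to_i           # blocks per request
    reads = (ARGV[1] || 2000).to_i        # number of requests

    dev = File.open('/dev/rwd0d', 'rb')   # raw device: no FS buffering
    start = Time.now
    reads.times do
      dev.sysseek(rand(BLOCKS - n) * BLOCK, IO::SEEK_SET)
      dev.sysread(n * BLOCK)              # one request for n blocks
    end
    sec = Time.now - start

    printf "%d reads of %dx8K blocks: %.1f sec, %.2f req/sec. " \
           "%.2f ms/req, %.3f ms/block\n",
           reads, n, sec, reads / sec, sec * 1000 / reads,
           sec * 1000 / (reads * n)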

The ratios of data retrieval speed per read for groups of adjacent
8K blocks, assuming a single 8K block reads in 1 time unit, are:
    1 block     1.00
    2 blocks    1.18
    4 blocks    1.56
    8 blocks    2.34
   16 blocks    4.68
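
That column is just the ms/req figures normalized to the single-block
case, i.e.:

    # Ratios from the measured ms/req, single-block = 1.00. (Tiny
    # differences from the table are rounding in the quoted figures.)
    ms_per_req = { 1 => 3.22, 2 => 3.82, 4 => 5.03, 8 => 7.56, 16 => 15.1 }
    ms_per_req.each do |blocks, ms|
      printf "%4d blocks  %.2f\n", blocks, ms / ms_per_req[1]
    end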

At less than 20% more expensive, certainly two-block read requests
could be considered to cost "very little more" than one-block read
requests. Even four-block read requests are only half-again as
expensive. And if you know you're really going to be using the
data, read in 8-block chunks and your cost per block (in terms of
time) drops to less than a third of the cost of single-block reads.

Let me put paid to comments about multiple simultaneous readers
making this invalid. Here's a typical result I get with four
instances of the program running simultaneously:

 125 reads of 16x8K blocks: 4.4 sec, 28.21 req/sec. 35.4 ms/req, 2.22 ms/block
 250 reads of  8x8K blocks: 3.9 sec, 64.88 req/sec. 15.4 ms/req, 1.93 ms/block
 500 reads of  4x8K blocks: 5.8 sec, 86.52 req/sec. 11.6 ms/req, 2.89 ms/block
1000 reads of  2x8K blocks:  10 sec, 100.2 req/sec. 9.98 ms/req, 4.99 ms/block
2000 reads of  1x8K blocks:  18 sec,   110 req/sec. 9.09 ms/req, 9.09 ms/block
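
Driving that is nothing fancy; something like this does it (the
script name and arguments are hypothetical):

    # Run four copies of the benchmark at once and wait for them all.
    pids = 4.times.map do
      fork { exec('ruby', 'readtest.rb', '8', '250') }
    end
    pids.each { |pid| Process.waitpid(pid) }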

Here's the ratio table again, with another column comparing the
aggregate number of requests per second for one process and four
processes:
    1 block     1.00        310 : 440
    2 blocks    1.10        262 : 401
    4 blocks    1.28        199 : 346
    8 blocks    1.69        132 : 260
   16 blocks    3.89         66 : 113

Note that here the relative cost of larger reads is even *better*,
until we get past 64K chunks. The overall throughput is higher,
too, of course: with more requests per second coming in, the disk
seek-ordering code has more to work with, and the average time
spent seeking, as opposed to reading, is reduced.
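
To illustrate the seek-ordering point with a toy example (the block
numbers are invented, and real schedulers are smarter than a plain
sort):

    # Serving queued requests in block order instead of arrival order
    # shortens the total arm travel.
    pending = [70_000, 1_200, 45_000, 3_000]   # arrival order
    travel = ->(order) { order.each_cons(2).sum { |a, b| (a - b).abs } }
    printf "arrival order: %d blocks of travel\n", travel.(pending)
    printf "sorted order:  %d blocks of travel\n", travel.(pending.sort)
    # => 154600 vs. 68800 blocks of arm travel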

You know, this is not rocket science; I'm sure there must be papers
all over the place about this. If anybody still disagrees that it's
a good thing to read chunks up to 64K or so when the blocks are
adjacent and you know you'll need the data, I'd like to see some
tangible evidence to support that.

cjs
-- 
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC


