Re: How to improve db performance with $7K? - Mailing list pgsql-performance

From: Kevin Brown
Subject: Re: How to improve db performance with $7K?
Msg-id: 20050414055655.GB19518@filer
In response to: Re: How to improve db performance with $7K? (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-performance
Tom Lane wrote:
> Greg Stark <gsstark@mit.edu> writes:
> > In any case the issue with the IDE protocol is that fundamentally you
> > can only have a single command pending. SCSI can have many commands
> > pending.
>
> That's the bottom line: the SCSI protocol was designed (twenty years ago!)
> to allow the drive to do physical I/O scheduling, because the CPU can
> issue multiple commands before the drive has to report completion of the
> first one.  IDE isn't designed to do that.  I understand that the latest
> revisions to the IDE/ATA specs allow the drive to do this sort of thing,
> but support for it is far from widespread.

My question is: why does this (physical I/O scheduling) seem to matter
so much?

Before you flame me for asking a terribly idiotic question, let me
provide some context.

The operating system maintains a (sometimes large) buffer cache, with
each buffer being mapped to a "physical" (which in the case of RAID is
really a virtual) location on the disk.  When the kernel needs to
flush the cache (e.g., during a sync(), or when it needs to free up
some pages), it doesn't write the pages in memory address order, it
writes them in *device* address order.  And it, too, maintains a queue
of disk write requests.
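The writeback ordering described above can be sketched as a toy one-way elevator pass. This is purely illustrative, not actual kernel code; the `Buffer` structure, the `head_position` parameter, and the single ascending sweep are all simplifying assumptions:

```python
from dataclasses import dataclass

@dataclass
class Buffer:
    device_block: int  # "physical" (possibly RAID-virtual) block address
    data: bytes

def flush_order(dirty_buffers, head_position=0):
    """Order dirty buffers the way a simple one-way elevator would
    write them: ascending device address from the current head
    position, then wrapping around to the lowest addresses."""
    ahead = sorted((b for b in dirty_buffers if b.device_block >= head_position),
                   key=lambda b: b.device_block)
    behind = sorted((b for b in dirty_buffers if b.device_block < head_position),
                    key=lambda b: b.device_block)
    return ahead + behind
```

The point is only that the sort key is the *device* address, not the memory address, so the resulting write stream is already largely seek-optimized before the controller ever sees it.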

Now, unless some of the blocks on the disk are remapped behind the
scenes such that an ordered list of blocks in the kernel translates to
an out-of-order list on the target disk (which should be rare, since
such remapping usually happens only when the target block is bad), how
can the fact that the disk controller doesn't do tagged queuing
*possibly* make any real difference, unless the kernel's disk
scheduling algorithm is suboptimal?  In fact, if the kernel's
scheduling algorithm is close to optimal, wouldn't the disk's queuing
mechanism *reduce* the overall efficiency of disk writes?  After all,
the kernel's queue is likely to be much larger than the disk
controller's, and the kernel has knowledge of things like the
filesystem layout that the disk controller and disks do not.  If the
controller can only execute a subset of the write commands that the
kernel has in its queue, at the very least it may end up leaving the
head(s) in a suboptimal position relative to the next batch of
commands it hasn't received yet, unless it simply writes the blocks in
the order it receives them.  (Admittedly, this is somewhat trivially
dealt with by having the controller exclude the first and last blocks
of the request from its internal sort.)
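The size of the win from sorting at all can be made concrete with a trivial head-travel model. Everything here is a hypothetical illustration (block numbers, a head that starts at 0, seek cost proportional to block distance); it just shows why whoever holds the *larger* queue gets the better sort:

```python
def head_travel(blocks, start=0):
    """Total head movement, in blocks, servicing requests in the given order."""
    pos, travel = start, 0
    for b in blocks:
        travel += abs(b - pos)
        pos = b
    return travel

requests = [500, 20, 480, 40, 460, 60]   # interleaved far/near requests

fifo_cost = head_travel(requests)          # service in arrival order -> 2700
swept_cost = head_travel(sorted(requests)) # one ascending kernel sweep -> 500
```

A controller that can only see (say) two of these requests at a time could never reconstruct the full ascending sweep, which is the intuition behind the argument that the kernel's larger queue should already capture most of the benefit.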


I can see how you might configure the RAID controller so that the
kernel's scheduling algorithm will screw things up horribly.  For
instance, if the controller has several RAID volumes configured in
such a way that the volumes share spindles, the kernel isn't likely to
know about that (since each volume appears as its own device), so
writes to multiple volumes can cause head movement that the kernel,
treating the volumes as completely independent, cannot anticipate.
But that just means you can't be dumb about how you configure your
RAID setup.


So what gives?  Given the above, why is SCSI so much more efficient
than plain, dumb SATA?  And why wouldn't you be much better off with a
set of dumb controllers in conjunction with (kernel-level) software
RAID?


--
Kevin Brown                          kevin@sysexperts.com
