Re: How to improve db performance with $7K? - Mailing list pgsql-performance
From: Kevin Brown
Subject: Re: How to improve db performance with $7K?
Msg-id: 20050414055655.GB19518@filer
In response to: Re: How to improve db performance with $7K? (Tom Lane <tgl@sss.pgh.pa.us>)
Responses: Re: How to improve db performance with $7K?
List: pgsql-performance
Tom Lane wrote:
> Greg Stark <gsstark@mit.edu> writes:
> > In any case the issue with the IDE protocol is that fundamentally you
> > can only have a single command pending. SCSI can have many commands
> > pending.
>
> That's the bottom line: the SCSI protocol was designed (twenty years
> ago!) to allow the drive to do physical I/O scheduling, because the CPU
> can issue multiple commands before the drive has to report completion
> of the first one. IDE isn't designed to do that. I understand that the
> latest revisions to the IDE/ATA specs allow the drive to do this sort
> of thing, but support for it is far from widespread.

My question is: why does this (physical I/O scheduling at the drive)
seem to matter so much?

Before you flame me for asking a terribly idiotic question, let me
provide some context.

The operating system maintains a (sometimes large) buffer cache, with
each buffer mapped to a "physical" location on the disk (which, in the
case of RAID, is really a virtual location). When the kernel needs to
flush the cache (e.g., during a sync(), or when it needs to free up
some pages), it doesn't write the pages in memory address order; it
writes them in *device* address order. And it, too, maintains a queue
of disk write requests.

Now, unless some of the blocks on the disk are remapped behind the
scenes, so that an ordered list of blocks in the kernel translates to
an out-of-order list on the target disk (which should be rare, since
such remapping usually happens only when the target block is bad), how
can the controller's lack of tagged queuing *possibly* make any real
difference, unless the kernel's disk scheduling algorithm is
suboptimal?

In fact, if the kernel's scheduling algorithm is close to optimal,
wouldn't the disk's queuing mechanism *reduce* the overall efficiency
of disk writes? After all, the kernel's queue is likely to be much
larger than the disk controller's, and the kernel has knowledge of
things like the filesystem layout that the disk controller and disks
do not have. If the controller can only execute a subset of the write
commands the kernel has queued, then at the very least the controller
may end up leaving the head(s) in a suboptimal position relative to
the next set of commands it hasn't received yet, unless it simply
writes the blocks in the order it receives them, right? (Admittedly,
this is somewhat trivially dealt with by having the controller exclude
the first and last blocks in the request from its internal sort.)

I can see how you might configure the RAID controller so that the
kernel's scheduling algorithm screws things up horribly. For instance,
if the controller has several RAID volumes configured in such a way
that the volumes share spindles, the kernel isn't likely to know about
that (since each volume appears as its own device), so writes to
multiple volumes can cause head movement that the kernel, treating the
volumes as completely independent, cannot anticipate. But that just
means you can't be dumb about how you configure your RAID setup.

So what gives? Given the above, why is SCSI so much more efficient
than plain, dumb SATA? And why wouldn't you be much better off with a
set of dumb controllers in conjunction with (kernel-level) software
RAID?

--
Kevin Brown                                     kevin@sysexperts.com
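
P.S. To make the kernel-side behavior concrete, here is a minimal C
sketch of the device-address-order flush described above: pending
writes get sorted by block number before being issued, so the head
sweeps the disk in one direction. The structure and function names are
invented for illustration; this is not code from any real kernel.

#include <stdio.h>
#include <stdlib.h>

/* One pending buffer-cache write: a page bound to a device block. */
struct write_req {
    unsigned long block;   /* "physical" block address on the device */
    const void   *data;    /* the page itself (unused in this sketch) */
};

/* qsort comparator: ascending device block address. */
static int by_block(const void *a, const void *b)
{
    const struct write_req *x = a, *y = b;
    if (x->block < y->block) return -1;
    if (x->block > y->block) return 1;
    return 0;
}

/* Flush the queue in *device* address order: one elevator sweep. */
static void flush_queue(struct write_req *q, size_t n)
{
    qsort(q, n, sizeof *q, by_block);
    for (size_t i = 0; i < n; i++)
        printf("write block %lu\n", q[i].block);  /* issue to the disk */
}

int main(void)
{
    /* Dirty pages arrive in memory (i.e., arbitrary) order... */
    struct write_req q[] = {
        { 9014, NULL }, { 112, NULL }, { 5530, NULL }, { 113, NULL },
    };

    /* ...but are issued to the disk sorted: 112, 113, 5530, 9014. */
    flush_queue(q, sizeof q / sizeof q[0]);
    return 0;
}

The point is simply that the sort key is the *device* address of the
block, not the memory address of the page holding it.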
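
P.P.S. And a toy model of the drive-side question: a drive that can
hold several outstanding commands can service whichever pending
request is nearest the head, while a queue depth of 1 (the old IDE
situation) forces strict FIFO. Everything here is hypothetical, a
deliberately crude approximation of tagged queuing, just to show the
size of the effect:

#include <stdio.h>
#include <stdlib.h>

/* Absolute distance between two track/block positions. */
static unsigned long dist(unsigned long a, unsigned long b)
{
    return a > b ? a - b : b - a;
}

/*
 * Total head movement when the drive may hold up to `depth`
 * outstanding commands and always services the pending one nearest
 * the head.  depth == 1 degenerates to strict FIFO.
 */
static unsigned long total_seek(const unsigned long *req, size_t n,
                                size_t depth)
{
    char *done = calloc(n, 1);
    unsigned long head = 0, moved = 0;

    for (size_t remaining = n; remaining > 0; remaining--) {
        size_t best = n, seen = 0;

        /* The drive only sees the oldest `depth` unfinished commands. */
        for (size_t i = 0; i < n && seen < depth; i++) {
            if (done[i])
                continue;
            seen++;
            if (best == n || dist(req[i], head) < dist(req[best], head))
                best = i;
        }
        moved += dist(req[best], head);
        head = req[best];
        done[best] = 1;
    }
    free(done);
    return moved;
}

int main(void)
{
    /* Interleaved requests to two far-apart regions of the disk. */
    unsigned long req[] = { 9000, 100, 9050, 150, 9100, 200 };
    size_t n = sizeof req / sizeof req[0];

    printf("queue depth 1 (IDE-style): %lu tracks\n", total_seek(req, n, 1));
    printf("queue depth 6 (TCQ-style): %lu tracks\n", total_seek(req, n, n));
    return 0;
}

With this sample list the FIFO case moves the head 53600 tracks, the
depth-6 case only 9100. That's the kind of win tagged queuing is
supposed to buy, and it's exactly what a kernel elevator sorting its
(much larger) queue should already be providing.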