Re: SCSI vs. IDE performance test - Mailing list pgsql-general
From | Tom Lane |
---|---|
Subject | Re: SCSI vs. IDE performance test |
Date | |
Msg-id | 18118.1067299544@sss.pgh.pa.us |
In response to | Re: SCSI vs. IDE performance test ("Rick Gigger" <rick@alpinenetworking.com>) |
List | pgsql-general |
"Rick Gigger" <rick@alpinenetworking.com> writes:
> ahhh. "lies about write order" is the phrase that I was looking for. That
> seemed to make sense but I didn't know if I could go directly from "lying
> about fsync" to that. Obviously I don't understand exactly what fsync is
> doing.

What we actually care about is write order: WAL entries have to hit the platter before the corresponding data-file changes do. Unfortunately we have no portable means of expressing that exact constraint to the kernel. We use fsync() (or related constructs) instead: issue the WAL writes, fsync the WAL file, then issue the data-file writes. This constrains the write ordering more than is really needed, but it's the best we can do in a portable Unix application.

The problem is that the kernel thinks fsync is done when the disk drive reports the writes are complete. When we say a drive lies about this, we mean it accepts a sector of data into its on-board RAM and then immediately claims write-complete, when in reality the data hasn't hit the platter yet and will be lost if power dies before the drive gets around to writing it.

So we can have a scenario where we think WAL is down to disk and go ahead with issuing data-file writes. These will also be shoved over to the drive and stored in its on-board RAM. Now the drive has multiple sectors pending write in its buffers. If it chooses to write these in some order other than the order they were given to it, it could write the data file updates to disk first. If power drops *now*, we lose, because the data files are inconsistent and there's no WAL entry to tell us to fix it.

Got it? It's really the combination of "lie about write completion" and "write pending sectors out of order" that can mess things up.

The reason IDE drives have to do this for reasonable performance is that the IDE interface is single-threaded: you can only have one read or write in process at a time, from the point of view of the kernel-to-drive interface.
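For readers who want to see the "WAL writes, fsync, then data-file writes" sequence described earlier in concrete form, here is a minimal sketch in Python. It is an illustration only, not PostgreSQL's actual code: the function names `write_all` and `wal_then_data`, the file paths, and the record formats are all hypothetical. Note that `os.fsync()` can only promise as much as the drive honestly reports, which is exactly the weakness discussed above.

```python
import os

# O_BINARY only exists on Windows; fall back to 0 elsewhere.
_O_BINARY = getattr(os, "O_BINARY", 0)


def write_all(fd, payload):
    """Write the whole payload, looping in case os.write() is partial."""
    view = memoryview(payload)
    while view:
        written = os.write(fd, view)
        view = view[written:]


def wal_then_data(wal_path, data_path, wal_record, data_page):
    """Hypothetical helper: log first, force the log out, then touch data."""
    # Step 1: issue the WAL write (append to the log file).
    wal_fd = os.open(
        wal_path, os.O_WRONLY | os.O_CREAT | os.O_APPEND | _O_BINARY, 0o600
    )
    try:
        write_all(wal_fd, wal_record)
        # Step 2: fsync the WAL file. This returns once the kernel believes
        # the data is on stable storage, i.e. once the drive *reports*
        # write-complete. A drive that lies defeats this step.
        os.fsync(wal_fd)
    finally:
        os.close(wal_fd)

    # Step 3: only now issue the data-file write. It is not fsync'd here;
    # the WAL record is what makes recovery possible if it is lost.
    data_fd = os.open(data_path, os.O_WRONLY | os.O_CREAT | _O_BINARY, 0o600)
    try:
        write_all(data_fd, data_page)
    finally:
        os.close(data_fd)
```

The fsync in step 2 orders far more than is strictly needed (it forces out every pending WAL write, not just the one this data page depends on), which is the over-constraint mentioned above.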
But in order to schedule reads and writes in a way that makes sense physically (minimizes seeks), the drive has to have multiple read and write requests pending that it can pick and choose from. The only possibility to do that in the IDE world is to let a write "complete" in interface terms before it's really done ... that is, lie.

The reason SCSI drives do *not* do this is that the SCSI interface is logically multi-threaded: you can have multiple reads or writes pending at once. When you want to write on a SCSI drive, you send over a command that says "write this data at this sector". Sometime later the drive sends back a status report "yessir boss, I done did that write". Similarly, a read consists of a command "read this sector", followed sometime later by a response that delivers the requested data. But you can send other commands to read or write other sectors meanwhile, and the drive is free to reorder them to suit its convenience.

So in the SCSI world, there is no need for the drive to lie in order to do its own read/write scheduling. The kernel knows the truth about whether a given sector has hit disk, and so it won't conclude that the WAL file has been completely fsync'd until it really is all down to the platter.

This is also why SCSI disks shine on the read side when you have lots of processes doing reads: in an IDE drive, there is no way for the drive to satisfy read requests in any order but the one they're issued in. If the kernel guesses wrong about the best ordering for a set of read requests, then everybody waits for the seeks needed to get the earlier processes' data. A SCSI drive can fetch the "nearest" data first, and then that requester is freed to make progress in the CPU while the other guys wait for their longer seeks. There's no win here with a single active user process (since it probably wants specific data in a specific order), but it's a huge win if lots of processes are making unrelated read requests.

Clear now?
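To put rough numbers on the read-side argument, here is a toy model in Python. Everything in it is an assumption made for illustration: the disk is treated as a one-dimensional line of sector numbers, seek cost is just the distance the head travels, and the "SCSI-style" drive uses a simple greedy nearest-first policy. Real firmware is more sophisticated, and the function names are hypothetical.

```python
def seek_distance_in_order(start, sectors):
    """Head travel when requests must be served strictly in issue order
    (the one-request-at-a-time case)."""
    pos, total = start, 0
    for sector in sectors:
        total += abs(sector - pos)
        pos = sector
    return total


def seek_distance_nearest_first(start, sectors):
    """Head travel when the drive holds all pending requests and always
    serves whichever one is closest to the current head position."""
    pending = list(sectors)
    pos, total = start, 0
    while pending:
        nearest = min(pending, key=lambda s: abs(s - pos))
        total += abs(nearest - pos)
        pos = nearest
        pending.remove(nearest)
    return total


# Five unrelated read requests, e.g. from five different processes,
# issued in an order that bounces across the disk.
requests = [900, 10, 850, 20, 800]

print(seek_distance_in_order(0, requests))       # 4240
print(seek_distance_nearest_first(0, requests))  # 900
```

With a single process issuing these reads one at a time and waiting for each, the drive never sees more than one pending request, so the two policies would behave identically. The gap only opens up when several requests are outstanding at once.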
(In a previous lifetime I wrote SCSI disk driver code ...)

regards, tom lane