Re: SCSI vs. IDE performance test - Mailing list pgsql-general
From | Rick Gigger |
---|---|
Subject | Re: SCSI vs. IDE performance test |
Date | |
Msg-id | 010e01c39ced$685c0870$0700a8c0@trogdor |
In response to | Re: SCSI vs. IDE performance test (Bruce Momjian <pgman@candle.pha.pa.us>) |
Responses | Re: SCSI vs. IDE performance test |
List | pgsql-general |
Thanks! Now it is much, much more clear. It leaves me with a few
additional questions though.

Question 1:
"we have no portable means of expressing that exact constraint to the
kernel"
Does this mean that specific operating systems have a better way of
dealing with this? Which ones and how? I'm guessing that it couldn't
make too big of a performance difference or it would probably be
implemented already.

Question 2:
Do serial ATA drives suffer from the same issue?

----- Original Message -----
From: "Tom Lane" <tgl@sss.pgh.pa.us>
To: "Rick Gigger" <rick@alpinenetworking.com>
Cc: <pgsql-general@postgresql.org>
Sent: Monday, October 27, 2003 5:05 PM
Subject: Re: [GENERAL] SCSI vs. IDE performance test

> "Rick Gigger" <rick@alpinenetworking.com> writes:
> > ahhh. "lies about write order" is the phrase that I was looking for. That
> > seemed to make sense but I didn't know if I could go directly from "lying
> > about fsync" to that. Obviously I don't understand exactly what fsync is
> > doing.
>
> What we actually care about is write order: WAL entries have to hit the
> platter before the corresponding data-file changes do. Unfortunately we
> have no portable means of expressing that exact constraint to the
> kernel. We use fsync() (or related constructs) instead: issue the WAL
> writes, fsync the WAL file, then issue the data-file writes. This
> constrains the write ordering more than is really needed, but it's the
> best we can do in a portable Unix application.
>
> The problem is that the kernel thinks fsync is done when the disk drive
> reports the writes are complete. When we say a drive lies about this,
> we mean it accepts a sector of data into its on-board RAM and then
> immediately claims write-complete, when in reality the data hasn't hit
> the platter yet and will be lost if power dies before the drive gets
> around to writing it.
>
> So we can have a scenario where we think WAL is down to disk and go
> ahead with issuing data-file writes. These will also be shoved over to
> the drive and stored in its on-board RAM. Now the drive has multiple
> sectors pending write in its buffers. If it chooses to write these in
> some order other than the order they were given to it, it could write
> the data file updates to disk first. If power drops *now*, we lose,
> because the data files are inconsistent and there's no WAL entry to tell
> us to fix it.
>
> Got it? It's really the combination of "lie about write completion" and
> "write pending sectors out of order" that can mess things up.
>
> The reason IDE drives have to do this for reasonable performance is that
> the IDE interface is single-threaded: you can only have one read or
> write in process at a time, from the point of view of the
> kernel-to-drive interface. But in order to schedule reads and writes in
> a way that makes sense physically (minimizes seeks), the drive has to
> have multiple read and write requests pending that it can pick and
> choose from. The only possibility to do that in the IDE world is to
> let a write "complete" in interface terms before it's really done ...
> that is, lie.
>
> The reason SCSI drives do *not* do this is that the SCSI interface is
> logically multi-threaded: you can have multiple reads or writes pending
> at once. When you want to write on a SCSI drive, you send over a
> command that says "write this data at this sector". Sometime later the
> drive sends back a status report "yessir boss, I done did that write".
> Similarly, a read consists of a command "read this sector", followed
> sometime later by a response that delivers the requested data. But you
> can send other commands to read or write other sectors meanwhile, and
> the drive is free to reorder them to suit its convenience. So in the
> SCSI world, there is no need for the drive to lie in order to do its own
> read/write scheduling. The kernel knows the truth about whether a given
> sector has hit disk, and so it won't conclude that the WAL file has been
> completely fsync'd until it really is all down to the platter.
>
> This is also why SCSI disks shine on the read side when you have lots of
> processes doing reads: in an IDE drive, there is no way for the drive to
> satisfy read requests in any order but the one they're issued in. If the
> kernel guesses wrong about the best ordering for a set of read requests,
> then everybody waits for the seeks needed to get the earlier processes'
> data. A SCSI drive can fetch the "nearest" data first, and then that
> requester is freed to make progress in the CPU while the other guys wait
> for their longer seeks. There's no win here with a single active user
> process (since it probably wants specific data in a specific order), but
> it's a huge win if lots of processes are making unrelated read requests.
>
> Clear now?
>
> (In a previous lifetime I wrote SCSI disk driver code ...)
>
> regards, tom lane
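
For readers who want to see the ordering rule from the quoted message ("issue the WAL writes, fsync the WAL file, then issue the data-file writes") in concrete form, here is a minimal sketch in Python. It is an illustration only and not PostgreSQL's actual code: the function names `durable_append` and `logged_update`, the file layout, and the text record format are all hypothetical. Note that `os.fsync` returns as soon as the kernel believes the drive has completed the writes, so on a drive that reports completion early the guarantee is only as good as the drive's honesty, which is exactly the failure mode described above.

```python
import os


def durable_append(path, payload):
    """Append payload to path and block until fsync() returns.

    Hypothetical helper: payloads are assumed small enough that a single
    os.write() call writes them completely.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        os.write(fd, payload)
        # Ask the kernel to push the log record to the drive before we go on.
        os.fsync(fd)
    finally:
        os.close(fd)


def logged_update(wal_path, data_path, offset, new_bytes):
    """Apply one change using the WAL-first ordering from the message."""
    # Step 1: issue the WAL write and fsync the WAL file.
    # The record format (offset, colon, hex payload) is made up for this sketch.
    record = b"%d:%s\n" % (offset, new_bytes.hex().encode("ascii"))
    durable_append(wal_path, record)

    # Step 2: only after fsync has returned, issue the data-file write.
    # This write is not fsync'd here; if it is lost in a crash, the WAL
    # record is what allows it to be redone.
    fd = os.open(data_path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        os.lseek(fd, offset, os.SEEK_SET)
        os.write(fd, new_bytes)
    finally:
        os.close(fd)
```

Calling `logged_update("wal.log", "data.bin", 4, b"abcd")` appends one record to the log, waits for fsync, and only then touches the data file. As the message says, this constrains ordering more than strictly necessary (it waits for the whole log write rather than just expressing "log before data"), but it is the portable way to get the guarantee.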