Thread: SCSI vs. IDE performance test
http://hardware.devchannel.org/hardwarechannel/03/10/20/1953249.shtml?tid=20&tid=38&tid=49 -- ----------------------------------------------------------------- Ron Johnson, Jr. ron.l.johnson@cox.net Jefferson, LA USA I can't make you have an abortion, but you can *make* me pay child support for 18 years? However, if I want the child (and all the expenses that entails) for the *rest*of*my*life*, and you don't want it for 9 months, tough luck???
The SCSI improvement over IDE seems overrated in the test. I would have expected at most a 30% improvement. Other reviews seem to point out that IDE performs just as well or better. See Tom's hardware: http://www20.tomshardware.com/storage/20020305/index.html Stephen "Ron Johnson" <ron.l.johnson@cox.net> wrote in message news:1066837102.12532.176.camel@haggis... > http://hardware.devchannel.org/hardwarechannel/03/10/20/1953249.shtml?tid=20&tid=38&tid=49
> -----Original Message----- > From: Stephen [mailto:jleelim@xxxxxx.com] > Sent: Wednesday, October 22, 2003 9:02 AM > To: pgsql-general@postgresql.org > Subject: Re: [GENERAL] SCSI vs. IDE performance test > > The SCSI improvement over IDE seems overrated in the test. I would have expected at most a 30% improvement. Other reviews seem to point out that IDE performs just as well or better. > > See Tom's hardware: > http://www20.tomshardware.com/storage/20020305/index.html My own tests show that 15K RPM Ultra320 SCSI drives are considerably faster than any IDE storage. This ATA drive: http://www.wdc.com/en/products/WD360GD.asp performs as well as or better than many SCSI drives and is not terribly expensive. It is therefore a very good choice if price/performance is more important than absolute performance. But if you need absolute horsepower, then one of these (or another 15K Ultra320 equivalent) won't be beaten: http://www.storagereview.com/articles/200304/200304068C073x0_1.html
Unwrap this link (if your newsreader folds it) and click on it for hard drive performance: http://www.storagereview.com/php/benchmark/compare_rtg_2001.php?typeID=10&testbedID=3&osID=4&raidconfigID=1&numDrives=1&devID_0=232&devID_1=237&devID_2=213&devID_3=221&devID_4=216&devID_5=249&devID_6=250&devCnt=7 The important part for databases is the "Server Suite".
On Wed, 2003-10-22 at 11:01, Stephen wrote: > The SCSI improvement over IDE seems overrated in the test. I would have > expected at most a 30% improvement. Other reviews seem to point out that IDE > performs just as well or better. > > See Tom's hardware: > http://www20.tomshardware.com/storage/20020305/index.html When TCQ becomes a reality in IDE drives, they'll have a fighting chance, but the slower seek times and rotational speeds will still do them in. Also, does an 8MB cache *really* make that much of a difference? After all, it can only cache 0.0067% of a 120GB drive, and 0.00267% of the new 300GB disks. Speaking of which, that 300GB HDD sounds like a dream for near-line storage, and even for nightly backups, if it is ever put in SBB-type packaging. http://www20.tomshardware.com/storage/20031008/index.html Imagine a scheme where you rapidly pg_dump to the 300GB drive, then, at leisure, tar the dump file to tape. Stripe a few together, and keep a month of backups on-line for quick recovery, along with the tape archives, in case the stripeset gets wasted, too. > "Ron Johnson" <ron.l.johnson@cox.net> wrote in message > news:1066837102.12532.176.camel@haggis... > > http://hardware.devchannel.org/hardwarechannel/03/10/20/1953249.shtml?tid=20&tid=38&tid=49 -- ----------------------------------------------------------------- Ron Johnson, Jr. ron.l.johnson@cox.net Jefferson, LA USA "Adventure is a sign of incompetence" Stephanson, great polar explorer
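The cache-size percentages quoted above can be checked with a few lines of arithmetic. This is only a sketch: the function name is made up for illustration, and it assumes decimal units (1 GB = 1000 MB), which is how drive vendors normally state capacity.

```python
def cache_fraction_percent(cache_mb: float, drive_gb: float) -> float:
    """Return the on-drive cache size as a percentage of drive capacity.

    Assumes decimal units (1 GB = 1000 MB); the name is illustrative only.
    """
    return cache_mb / (drive_gb * 1000) * 100


for drive_gb in (120, 300):
    pct = cache_fraction_percent(8, drive_gb)
    print(f"8MB cache on a {drive_gb}GB drive: {pct:.5f}%")
# prints:
# 8MB cache on a 120GB drive: 0.00667%
# 8MB cache on a 300GB drive: 0.00267%
```

Both figures match the ones given in the message, so the point stands: the cache covers only a tiny slice of the platter, whatever else it may be good for.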
Dann Corbit wrote: >Unwrap this link (if your newsreader folds it) and click on it for hard >drive performance: >http://www.storagereview.com/php/benchmark/compare_rtg_2001.php?typeID=10&testbedID=3&osID=4&raidconfigID=1&numDrives=1&devID_0=232&devID_1=237&devID_2=213&devID_3=221&devID_4=216&devID_5=249&devID_6=250&devCnt=7 > >The important part for database is "Server Suite" Fairly old data, but it shows AMAZING differences in head seek time. I didn't know head seeks were below 8ms for anything, even today. Also, from what I've read, SATA drives were nonexistent in those days? The earliest SATA drives I've read about were just SATA interfaces on OLDER IDE hardware - the manufacturers had not really signed up on the concept enough to put their good hardware underneath the interface. -- "You are behaving like a man", is an insult from some women, a compliment from a good woman.
I just ran some benchmarks against a 10K SCSI drive and 7200 RPM IDE drive here: http://fsbench.netnation.com/ The results vary quite a bit, and it seems the file system you use can make a huge difference. SCSI is obviously faster, but a 20% performance gain for 5x the cost is only worth it for a very small percentage of people, I would think. On Wed, 2003-10-22 at 09:01, Stephen wrote: > The SCSI improvement over IDE seems overrated in the test. I would have > expected at most a 30% improvement. Other reviews seem to point out that IDE > performs just as well or better. > > See Tom's hardware: > http://www20.tomshardware.com/storage/20020305/index.html -- Best Regards, Mike Benoit
Mike Benoit wrote: > I just ran some benchmarks against a 10K SCSI drive and 7200 RPM IDE > drive here: > > http://fsbench.netnation.com/ > > The results vary quite a bit, and it seems the file system you use > can make a huge difference. > > SCSI is obviously faster, but a 20% performance gain for 5x the cost is > only worth it for a very small percentage of people, I would think. Did you turn off the IDE write cache? If not, the SCSI drive is reliable in case of OS failure, while the IDE is not. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
It seems to me file system journaling should fix the whole problem by giving you a record of what was actually committed to disk and what was not. I must not understand journaling correctly. Can anyone explain to me how journaling works? ----- Original Message ----- From: "Bruce Momjian" <pgman@candle.pha.pa.us> To: <mikeb@netnation.com> Cc: "Stephen" <jleelim@xxxxxx.com>; <pgsql-general@postgresql.org> Sent: Monday, October 27, 2003 12:14 PM Subject: Re: [GENERAL] SCSI vs. IDE performance test > Did you turn off the IDE write cache? If not, the SCSI drive is > reliable in case of OS failure, while the IDE is not.
"Rick Gigger" <rick@alpinenetworking.com> writes: > It seems to me file system journaling should fix the whole problem by giving > you a record of what was actually commited to disk and what was not. Nope, a journaling FS has exactly the same problem Postgres does (because the underlying "WAL" concept is the same: write the log entries before you change the files they describe). If the drive lies about write order, the FS can be screwed just as badly. Now the FS code might have a low-level way to force write order that Postgres doesn't have access to ... but simply uttering the magic incantation "journaling file system" will not make this problem disappear. regards, tom lane
ahhh. "lies about write order" is the phrase that I was looking for. That seemed to make sense but I didn't know if I could go directly from "lying about fsync" to that. Obviously I don't understand exactly what fsync is doing. I assume this means that if you were to turn fsync off you would get considerably better performance but introduce the possibility of corrupting the files in your database. Thank you. This makes a lot more sense now.
Tom, this discussion brings up something that's been bugging me about the recommendations for getting more performance out of PG, in particular the one that suggests you put your WAL files on a different physical drive from the database. Consider the following scenario: Database on drive1, WAL on drive2. 1. PG write of some sort occurs. 2. PG writes out the WAL. 3. PG writes out the data. 4. PG updates the WAL to reflect data actually written. 5. System crashes/reboots/whatever. With the DB and the WAL on different drives, it seems possible to me that drive2 could've fsync()'d or otherwise properly written all of the data out, but drive1 could have failed somewhere along the way and not actually written the data to the DB. The next time PG is brought up, the WAL would indicate the transaction, as it were, was a success, but the data wouldn't actually be there. In the case of using only one drive, the rollback (from a FS perspective) couldn't possibly occur in such a way as to leave step 4 as a success, but step 3 as a failure -- worst case, the data would be written out but the WAL wouldn't have been updated (rolled back say by the FS) and thus PG will roll back the data itself, or use whatever mechanism it uses to ensure data integrity is consistent with the WAL. Am I smoking something here or is this a real, if rare in practice, risk that occurs when you have the WAL on a different drive than the data is on? At 17:39 10/27/2003, Tom Lane wrote: >"Rick Gigger" <rick@alpinenetworking.com> writes: > > It seems to me file system journaling should fix the whole problem by giving > > you a record of what was actually commited to disk and what was not. > >Nope, a journaling FS has exactly the same problem Postgres does >(because the underlying "WAL" concept is the same: write the log entries >before you change the files they describe). If the drive lies about >write order, the FS can be screwed just as badly. Now the FS code might >have a low-level way to force write order that Postgres doesn't have >access to ... but simply uttering the magic incantation "journaling file >system" will not make this problem disappear. > > regards, tom lane
"Rick Gigger" <rick@alpinenetworking.com> writes: > ahhh. "lies about write order" is the phrase that I was looking for. That > seemed to make sense but I didn't know if I could go directly from "lying > about fsync" to that. Obviously I don't understand exactly what fsync is > doing. What we actually care about is write order: WAL entries have to hit the platter before the corresponding data-file changes do. Unfortunately we have no portable means of expressing that exact constraint to the kernel. We use fsync() (or related constructs) instead: issue the WAL writes, fsync the WAL file, then issue the data-file writes. This constrains the write ordering more than is really needed, but it's the best we can do in a portable Unix application. The problem is that the kernel thinks fsync is done when the disk drive reports the writes are complete. When we say a drive lies about this, we mean it accepts a sector of data into its on-board RAM and then immediately claims write-complete, when in reality the data hasn't hit the platter yet and will be lost if power dies before the drive gets around to writing it. So we can have a scenario where we think WAL is down to disk and go ahead with issuing data-file writes. These will also be shoved over to the drive and stored in its on-board RAM. Now the drive has multiple sectors pending write in its buffers. If it chooses to write these in some order other than the order they were given to it, it could write the data file updates to disk first. If power drops *now*, we lose, because the data files are inconsistent and there's no WAL entry to tell us to fix it. Got it? It's really the combination of "lie about write completion" and "write pending sectors out of order" that can mess things up. The reason IDE drives have to do this for reasonable performance is that the IDE interface is single-threaded: you can only have one read or write in process at a time, from the point of view of the kernel-to-drive interface. 
But in order to schedule reads and writes in a way that makes sense physically (minimizes seeks), the drive has to have multiple read and write requests pending that it can pick and choose from. The only possibility to do that in the IDE world is to let a write "complete" in interface terms before it's really done ... that is, lie. The reason SCSI drives do *not* do this is that the SCSI interface is logically multi-threaded: you can have multiple reads or writes pending at once. When you want to write on a SCSI drive, you send over a command that says "write this data at this sector". Sometime later the drive sends back a status report "yessir boss, I done did that write". Similarly, a read consists of a command "read this sector", followed sometime later by a response that delivers the requested data. But you can send other commands to read or write other sectors meanwhile, and the drive is free to reorder them to suit its convenience. So in the SCSI world, there is no need for the drive to lie in order to do its own read/write scheduling. The kernel knows the truth about whether a given sector has hit disk, and so it won't conclude that the WAL file has been completely fsync'd until it really is all down to the platter. This is also why SCSI disks shine on the read side when you have lots of processes doing reads: in an IDE drive, there is no way for the drive to satisfy read requests in any order but the one they're issued in. If the kernel guesses wrong about the best ordering for a set of read requests, then everybody waits for the seeks needed to get the earlier processes' data. A SCSI drive can fetch the "nearest" data first, and then that requester is freed to make progress in the CPU while the other guys wait for their longer seeks. There's no win here with a single active user process (since it probably wants specific data in a specific order), but it's a huge win if lots of processes are making unrelated read requests. Clear now? 
(In a previous lifetime I wrote SCSI disk driver code ...) regards, tom lane
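The ordering described in the message above (issue the WAL writes, fsync the WAL file, then issue the data-file writes) can be sketched in a few lines. This is a minimal illustration, not PostgreSQL's actual code: the function names, file names, and record format are all invented for the example. It uses only `os.open`, `os.write`, `os.fsync`, and `os.close` from the Python standard library.

```python
import os
import tempfile


def append_and_fsync(path: str, payload: bytes) -> None:
    """Append payload to path, then ask the kernel to flush it to the device."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        os.write(fd, payload)
        # Returns once the drive *reports* the write complete; a drive with a
        # lying write cache can still lose this data on power failure.
        os.fsync(fd)
    finally:
        os.close(fd)


def apply_change(wal_path: str, data_path: str, record: bytes) -> None:
    """Log first, fsync the log, and only then issue the data-file write."""
    append_and_fsync(wal_path, record)       # WAL write + fsync
    with open(data_path, "ab") as data:      # data-file write, flushed later
        data.write(record)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        wal = os.path.join(tmp, "wal.log")       # hypothetical file names
        data = os.path.join(tmp, "table.dat")
        apply_change(wal, data, b"set balance=90\n")
        with open(wal, "rb") as f:
            print(f.read())
```

As the message points out, this constrains the write order more than strictly necessary, and it is only as trustworthy as the drive's completion report.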
On Mon, 2003-10-27 at 12:44, Mike Benoit wrote: > I just ran some benchmarks against a 10K SCSI drive and 7200 RPM IDE > drive here: > > http://fsbench.netnation.com/ > > The results vary quite a bit, and it seems the file system you use > can make a huge difference. > > SCSI is obviously faster, but a 20% performance gain for 5x the cost is > only worth it for a very small percentage of people, I would think. Running bonnie++ in 4 or 5 parallel runs would be interesting, to see how IDE & SCSI compare in a multi-user environment. -- ----------------------------------------------------------------- Ron Johnson, Jr. ron.l.johnson@cox.net Jefferson, LA USA "Why should we not accept all in favor of woman suffrage to our platform and association even though they be rabid pro-slavery Democrats." Susan B. Anthony, _History_of_Woman_Suffrage_ http://www.ifeminists.com/introduction/essays/introduction.html
Thanks! Now it is much, much more clear. It leaves me with a few additional questions though. Question 1: "we have no portable means of expressing that exact constraint to the kernel" Does this mean that specific operating systems have a better way of dealing with this? Which ones and how? I'm guessing that it couldn't make too big of a performance difference or it would probably be implemented already. Question 2: Do serial ATA drives suffer from the same issue?
On Mon, 2003-10-27 at 17:18, Rick Gigger wrote: > ahhh. "lies about write order" is the phrase that I was looking for. That > seemed to make sense but I didn't know if I could go directly from "lying > about fsync" to that. Obviously I don't understand exactly what fsync is > doing. I assume this means that if you were to turn fsync off you would get > considerably better performance but introduce the possibility of corrupting > the files in your database. Yes. There was a recent thread (in -general or -performance) regarding putting the WAL files on a different disk, and changing wal_sync_method to open_sync (or open_datasync, don't remember). This will allow the device(s) that the database is on to run asynchronously, while the WAL is synchronous, for safety. -- ----------------------------------------------------------------- Ron Johnson, Jr. ron.l.johnson@cox.net Jefferson, LA USA Some former UNSCOM officials are alarmed, however. Terry Taylor, a British senior UNSCOM inspector from 1993 to 1997, says the figure of 95 percent disarmament is "complete nonsense because inspectors never learned what 100 percent was. UNSCOM found a great deal and destroyed a great deal, but we knew [Iraq's] work was continuing while we were there, and I'm sure it continues," says Mr. Taylor, now head of the Washington http://www.csmonitor.com/2002/0829/p01s03-wosc.html
"Rick Gigger" <rick@alpinenetworking.com> writes: >> "we have no portable means of expressing that exact constraint to the >> kernel" > Does this mean that specific operating systems have a better way of dealing > with this? Which ones and how? I'm not aware of any that offer a way of expressing "write these particular blocks before those particular blocks". It doesn't seem like it would require rocket scientists to devise such an API, but no one's got round to it yet. Part of the problem is that the issue would have to be approached at multiple levels: there is no point in offering an OS-level API for this when the hardware underlying the bus-level API (IDE) is doing its level best to sabotage the entire semantics. > Do serial ATA drives suffer from the same issue? Um, not an expert, but I think ATA is the same as IDE except for bus width and transfer rate. If either one allows for multiple concurrent read/write transactions I'll be very surprised. regards, tom lane
On Tue, Oct 28, 2003 at 12:17:59AM -0500, Tom Lane wrote: > "Rick Gigger" <rick@alpinenetworking.com> writes: > > Do serial ATA drives suffer from the same issue? > > Um, not an expert, but I think ATA is the same as IDE except for bus > width and transfer rate. If either one allows for multiple concurrent > read/write transactions I'll be very surprised. Well, some googling around seems to indicate that Serial ATA I/ATA-6 has Tagged Command Queueing (TCQ), which adds this feature specifically. Whether it is a mandatory part of the spec I don't know. -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > "All that is needed for the forces of evil to triumph is for enough good > men to do nothing." - Edmond Burke > "The penalty good people pay for not being interested in politics is to be > governed by people worse than themselves." - Plato
Martijn van Oosterhout <kleptog@svana.org> writes: > Well, some googleing around seems to indicate that Serial ATA I/ATA-6 has > Tagged Command Queueing (TCQ) which is adding this feature specifically. > Whether it is a mandatory part of the spec I don't know. Yeah? If so, and *if fully implemented* on both sides of the interface, this would eliminate the architectural advantages I was just sketching for SCSI. I can't claim to be up on what's happening in the IDE/ATA world though... regards, tom lane
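The queueing advantage under discussion in this subthread (a drive that can hold several outstanding requests is free to service them in whatever order keeps head movement small) can be illustrated with a toy model. Everything here is an assumption made for illustration: track numbers stand in for physical position, total head travel stands in for seek time, and "nearest first" is just one simple policy, not a description of what any real drive firmware does.

```python
def total_seek_fifo(start: int, requests: list[int]) -> int:
    """Head travel when requests are serviced strictly in arrival order."""
    pos, total = start, 0
    for track in requests:
        total += abs(track - pos)
        pos = track
    return total


def total_seek_nearest_first(start: int, requests: list[int]) -> int:
    """Head travel when the drive may pick the closest pending request."""
    pending = list(requests)
    pos, total = start, 0
    while pending:
        nxt = min(pending, key=lambda track: abs(track - pos))
        total += abs(nxt - pos)
        pos = nxt
        pending.remove(nxt)
    return total


# Five unrelated read requests arriving from different processes.
requests = [900, 10, 850, 20, 800]
print(total_seek_fifo(0, requests))           # 4240
print(total_seek_nearest_first(0, requests))  # 900
```

With a single process issuing dependent reads there is nothing to reorder, which is consistent with the earlier remark that the win shows up only when many processes make unrelated requests.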
Allen Landsidel <all@biosys.net> writes: > Tom, this discussion brings up something that's been bugging me about the > recommendations for getting more performance out of PG.. in particular the > one that suggests you put your WAL files on a different physical drive from > the database. > ... > With the DB and the WAL on different drives, it seems possible to me that > drive2 could've fsync()'d or otherwise properly written all of the data > out, but drive1 could have failed somewhere along the way and not actually > written the data to the DB. Drive failure, in terms of losing something the drive claimed it had written successfully, is not something that we can protect against. For that, you go to your backup tapes. I don't see that it makes any difference whether the database is spread across one drive or several; you could still have a scenario where the claimed-complete write to a data file failed to happen and then we recorded a checkpoint anyway. Now, if the data drive fails to write and we can detect that, then we're OK, because we won't record a checkpoint. We can redo the write based on the contents of WAL after the problem's been fixed. This is another reason why the IDE lie-about-write-completion behavior is a Bad Idea: if the drive accepts data and then later has a problem writing it, there is no way for it to report that fact --- and it's too late anyhow since we've already taken other actions on the assumption that the write is done. I'm not at all sure what IDE drives do when they have a failure writing out cached buffers; anyone have experience with that? regards, tom lane
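The recovery idea mentioned above (if a failed data write is detected, no checkpoint is recorded, and the write can be redone from WAL) can be sketched as a simple replay loop. This is a toy model under stated assumptions, not PostgreSQL's recovery code: the `(lsn, page, value)` tuple layout, the dictionary of pages, and the function name are all hypothetical.

```python
def redo_from_wal(data_pages: dict, wal_records: list, checkpoint_lsn: int) -> dict:
    """Reapply every logged change newer than the last recorded checkpoint."""
    for lsn, page, value in wal_records:
        if lsn > checkpoint_lsn:
            # Replaying a record sets the same value every time, so redoing a
            # write that did reach the disk is harmless.
            data_pages[page] = value
    return data_pages


wal = [(1, "page_a", "v1"), (2, "page_b", "v2"), (3, "page_a", "v3")]
pages = {"page_a": "v1"}   # the writes for records 2 and 3 never made it
print(redo_from_wal(pages, wal, checkpoint_lsn=1))
# prints: {'page_a': 'v3', 'page_b': 'v2'}
```

The sketch only works because the log itself is assumed to be intact, which is exactly the guarantee a drive that lies about write completion takes away.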
Martijn van Oosterhout <kleptog@svana.org> writes: > On Tue, Oct 28, 2003 at 12:17:59AM -0500, Tom Lane wrote: > > "Rick Gigger" <rick@alpinenetworking.com> writes: > > > Do serial ATA drives suffer from the same issue? > > > > Um, not an expert, but I think ATA is the same as IDE except for bus > > width and transfer rate. If either one allows for multiple concurrent > > read/write transactions I'll be very surprised. > > Well, some googleing around seems to indicate that Serial ATA I/ATA-6 has > Tagged Command Queueing (TCQ) which is adding this feature specifically. > Whether it is a mandatory part of the spec I don't know. The post on linux-kernel from the maxtor guy seemed to indicate we would have to wait for ATA-7 drives (which are not out in the market yet) before the features we really need are there. Currently the linux-kernel folks are talking about how to integrate an IDE SYNC operation into the world. It looks like filesystems with journals will issue an IDE SYNC to checkpoint the journal, but it doesn't really look like they're planning to hook it into fsync unless people speak up and explain what databases need in that regard. However SYNC flushes the entire cache and means that all other writes are blocked until the SYNC completes. Apparently the feature needed to *really* implement fsync is called FUA which would give real feedback of the status of the write without preventing all other writes from proceeding. That's what isn't going to appear until ATA-7. All this is from a few posts on linux-kernel. e.g.: http://www.ussg.iu.edu/hypermail/linux/kernel/0304.1/0450.html -- greg
Tom Lane <tgl@sss.pgh.pa.us> writes:
> I'm not at all sure what IDE drives do when they have a failure writing out
> cached buffers; anyone have experience with that?

There's a looooong discussion about this too on linux-kernel, search for "blockbusting". I think the conclusion is "it depends".

Often write failures aren't detected until the block is subsequently read. In that case of course there's no hope. What's worse is the drive might not remap the block on a read, so the problem can stick around even after the error.

If the write failure is caused by a bad block and the drive detects this at the time it's written then the drive can actually remap that block to one of its spare blocks. This is invisible to the host.

If it runs out of spare blocks, then you're in trouble. And there's no warning that you're running low on spare blocks in any particular region unless you use special utilities to query the drive. Also if the failure is caused by environmental factors like vibrations or heat then you can be in trouble too.

--
greg
> It seems to me file system journaling should fix the whole problem by giving
> you a record of what was actually committed to disk and what was not. I must
> not understand journaling correctly. Can anyone explain to me how
> journaling works?

Journaling depends, absolutely critically, on the OS knowing what data has actually been written to disk. It can't be any other way; with an in-disk write cache the OS has no way to know when the *journal* has been written to disk, therefore journaling can't work.

--
Scott Ribe
scott_ribe@killerbytes.com
http://www.killerbytes.com/
(303) 665-7007 voice
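[Editorial illustration, not part of the original thread.] To make the ordering dependency concrete, here is a toy write-ahead journal in Python. It is only a sketch under stated assumptions: the function names and the JSON-lines file layout are invented for illustration and do not describe any real filesystem. The whole scheme rests on the fsync in log_intent genuinely reaching stable storage before the data file is touched.

```python
import json
import os


def log_intent(journal_path, key, value):
    """Step 1: append the intended change to the journal and force it to disk."""
    with open(journal_path, "a") as journal:
        journal.write(json.dumps({"key": key, "value": value}) + "\n")
        journal.flush()
        os.fsync(journal.fileno())  # if the drive lies here, recovery is meaningless


def write_state(data_path, state):
    """Step 2: rewrite the main data file with the new state."""
    with open(data_path, "w") as data_file:
        json.dump(state, data_file)
        data_file.flush()
        os.fsync(data_file.fileno())


def recover(journal_path):
    """Rebuild the state purely from the journal, e.g. after a crash."""
    state = {}
    if os.path.exists(journal_path):
        with open(journal_path) as journal:
            for line in journal:
                record = json.loads(line)
                state[record["key"]] = record["value"]
    return state


def journaled_update(journal_path, data_path, key, value):
    """Journal first, data second; never the other way round."""
    log_intent(journal_path, key, value)
    write_state(data_path, recover(journal_path))
```

A crash between the two steps is harmless because recover() can replay the journal. If the drive reports the journal write complete while it still sits in a volatile cache, that assumption no longer holds, which is the point made above.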
Greg Stark wrote: > Tom Lane <tgl@sss.pgh.pa.us> writes: > > > I'm not at all sure what IDE drives do when they have a failure writing out > > cached buffers; anyone have experience with that? > > There's a looooong discussion about this too on linux-kernel, search for > "blockbusting". I think the conclusion is "it depends". > > Often write failures aren't detected until the block is subsequently read. In > that case of course there's no hope. What's worse is the drive might not remap > the block on a read, so the problem can stick around even after the error. > > If the write failure is caused by a bad block and the drive detects this at > the time it's written then the drive can actually remap that block to one of > its spare blocks. This is invisible to the host. > > If it runs out of spare blocks, then you're in trouble. And there's no warning > that you're running low on spare blocks in any particular region unless you > use special utilities to query the drive. Also if the failure is caused by > environmental factors like vibrations or heat then you can be in trouble too. My Buslogic/Mylex plain SCSI controller would beep when it hit a bad block --- I didn't know why my computer was beeping for a while until I figured it out --- can't beat that service. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
> "Rick Gigger" <rick@alpinenetworking.com> writes: > >> "we have no portable means of expressing that exact constraint to the > >> kernel" > > Does this mean that specific operating systems have a better way of dealing > > with this? Which ones and how? > > I'm not aware of any that offer a way of expressing "write these > particular blocks before those particular blocks". It doesn't seem like > it would require rocket scientists to devise such an API, but no one's > got round to it yet. Part of the problem is that the issue would have > to be approached at multiple levels: there is no point in offering an > OS-level API for this when the hardware underlying the bus-level API > (IDE) is doing its level best to sabotage the entire semantics. But for those of us using scsi wouldn't it be possible to get a performance gain here? Would the gain be worth the effort?
On Tue, Oct 28, 2003 at 01:04:27PM -0500, Greg Stark wrote:
> If it runs out of spare blocks, then you're in trouble. And there's no warning
> that you're running low on spare blocks in any particular region unless you
> use special utilities to query the drive. Also if the failure is caused by
> environmental factors like vibrations or heat then you can be in trouble too.

Actually, drives have S.M.A.R.T. for reporting these kinds of issues. The idea is that a counter decrements every time a block is remapped. When it reaches a declared threshold the drive declares an error, and if it's in warranty that's enough to convince the manufacturer to send you a new disk.

Not that many people use this feature, but it is there.

--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> "All that is needed for the forces of evil to triumph is for enough good
> men to do nothing." - Edmund Burke
> "The penalty good people pay for not being interested in politics is to be
> governed by people worse than themselves." - Plato
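[Editorial illustration, not part of the original thread.] The decrement-until-threshold idea can be modelled as below. The class name, the starting value of 100, and the threshold of 36 are all hypothetical; real drives scale and report their attributes in vendor-specific ways, so treat this as a sketch of the logic only.

```python
from dataclasses import dataclass


@dataclass
class RemapCounter:
    """Toy model of a health counter that drops as blocks are remapped."""

    value: int = 100      # hypothetical starting value
    threshold: int = 36   # hypothetical failure threshold declared by the drive

    def record_remap(self):
        """Called once for every block moved to a spare."""
        if self.value > 0:
            self.value -= 1

    @property
    def failing(self):
        """True once the counter has reached the declared threshold."""
        return self.value <= self.threshold
```

With these made-up numbers the drive would report itself as failing after 64 remapped blocks, well before the spares are actually exhausted.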
From: list-pgsql-general@news.cistron.nl ("Miquel van Smoorenburg")
In article <87n0blgs9g.fsf@stark.dyndns.tv>, Greg Stark <gsstark@mit.edu> wrote: >Currently the linux-kernel folks are talking about how to integrate an IDE >SYNC operation into the world. It looks like filesystems with journals will >issue an IDE SYNC to checkpoint the journal, but it doesn't really look like >they're planning to hook it into fsync unless people speak up and explain what >databases need in that regard. However SYNC flushes the entire cache and means >that all other writes are blocked until the SYNC completes. > >http://www.ussg.iu.edu/hypermail/linux/kernel/0304.1/0450.html Also, if you're interested in this kind of stuff and what's going on in the Linux kernel development circles, Google for "ide write barrier". For example http://lkml.org/lkml/2003/10/13/87 Mike.
>>> "we have no portable means of expressing that exact constraint to the
>>> kernel"
>> Does this mean that specific operating systems have a better way of
>> dealing with this? Which ones and how?
>
> I'm not aware of any that offer a way of expressing "write these
> particular blocks before those particular blocks". It doesn't seem like
> it would require rocket scientists to devise such an API, but no one's
> got round to it yet. Part of the problem is that the issue would have
> to be approached at multiple levels: there is no point in offering an
> OS-level API for this when the hardware underlying the bus-level API
> (IDE) is doing its level best to sabotage the entire semantics.
[sNip]

Actually, NetWare is one OS that does this, and has been doing so since the 1980s with version 2 (version 6.5 is the current version today). They have a Patented caching algorithm called "Elevator Seeking" which both prolongs the life of the drive by reducing wear-and-tear and improves read/write performance by minimizing seek operations.

With IDE it seems that this caching algorithm is also beneficial, but it really shines with SCSI drives.

In all my experience, SCSI drives are much faster and far more reliable than IDE drives. I've always assumed that it boils down to "you get what you pay for."

--
Randolf Richardson - rr@8x.ca
Inter-Corporate Computer & Network Services, Inc.
Vancouver, British Columbia, Canada
http://www.8x.ca/

This message originated from within a secure, reliable, high-performance network ... a Novell NetWare network.
On Wed, Nov 19, 2003 at 09:29:21PM +0000, Randolf Richardson, DevNet SysOp 29 wrote:
> Actually, NetWare is one OS that does this, and has been doing so
> since the 1980s with version 2 (version 6.5 is the current version today).
> They have a Patented caching algorithm called "Elevator Seeking" which both
> prolongs the life of the drive by reducing wear-and-tear and improving
> read/write performance by minimizing seek operations.

Huh, is this different from your ordinary elevator algorithm? I'd be surprised if there was an OS which didn't use something like that ...

--
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
"The truth is not always pretty, but the hunger for it is"
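[Editorial illustration, not part of the original thread.] For readers who have not met it, the ordinary textbook elevator algorithm (the LOOK variant) looks roughly like this in Python. It is a sketch of the classic technique and makes no claim about what NetWare's "Elevator Seeking" actually does; the function names and the sample track numbers are assumptions chosen for illustration.

```python
def elevator_order(requests, head, direction="up"):
    """Order pending track requests the way an elevator serves floors.

    Serve everything in the current direction of travel first, then
    reverse and sweep back through the remaining requests.
    """
    lower = sorted(r for r in requests if r < head)
    upper = sorted(r for r in requests if r >= head)
    if direction == "up":
        return upper + lower[::-1]
    return lower[::-1] + upper


def seek_distance(order, head):
    """Total head movement, in tracks, when serving requests in the given order."""
    total = 0
    for track in order:
        total += abs(track - head)
        head = track
    return total


pending = [98, 183, 37, 122, 14, 124, 65, 67]
print(seek_distance(pending, 53))                      # arrival order: 640 tracks
print(seek_distance(elevator_order(pending, 53), 53))  # elevator order: 299 tracks
```

On this sample queue the head travels 640 tracks when requests are served in arrival order and 299 when they are sorted into a single sweep up and back, which is where both the performance and the reduced-wear arguments come from.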
"Randolf Richardson, DevNet SysOp 29" <rr@8x.ca> writes: > Actually, NetWare is one OS that does this, and has been doing so > since the 1980s with version 2 (version 6.5 is the current version today). > They have a Patented caching algorithm called "Elevator Seeking" which both They've managed to patent ye olde elevator algorithm?? The USPTO really is without a clue, isn't it :-( regards, tom lane
[sNip] > They've managed to patent ye olde elevator algorithm?? The USPTO really > is without a clue, isn't it :-( It's not the USPTO's fault -- the problem is that nobody objected to it while it was in the "Patent Pending" state. -- Randolf Richardson - rr@8x.ca Vancouver, British Columbia, Canada Please do not eMail me directly when responding to my postings in the newsgroups.
Randolf Richardson <rr@8x.ca> writes: >> They've managed to patent ye olde elevator algorithm?? The USPTO really >> is without a clue, isn't it :-( > It's not the USPTO's fault -- the problem is that nobody objected to it > while it was in the "Patent Pending" state. If their examiner had even *minimal* competency in the field, it would not have gotten to the "Patent Pending" state. Algorithms that are well documented in the standard textbooks of thirty years ago do not qualify as something people should have to stand guard against. Perhaps I should try to patent base-two arithmetic, and hope no one notices till it goes through ... certainly the USPTO won't notice ... regards, tom lane
Base-two arithmetic sounds pretty broad. If only you could come up with a scheme for division and multiplication by powers of two through bitshifting.....

On Wed, 26 Nov 2003, Tom Lane wrote:
> Randolf Richardson <rr@8x.ca> writes:
> >> They've managed to patent ye olde elevator algorithm?? The USPTO really
> >> is without a clue, isn't it :-(
>
> > It's not the USPTO's fault -- the problem is that nobody objected to it
> > while it was in the "Patent Pending" state.
>
> If their examiner had even *minimal* competency in the field, it would
> not have gotten to the "Patent Pending" state. Algorithms that are well
> documented in the standard textbooks of thirty years ago do not qualify
> as something people should have to stand guard against.
>
> Perhaps I should try to patent base-two arithmetic, and hope no one
> notices till it goes through ... certainly the USPTO won't notice ...
>
> regards, tom lane
Ben wrote:
> Base-two arithmetic sounds pretty broad. If only you could come up with a
> scheme for division and multiplication by powers of two through
> bitshifting.....

I already have that patent! :-)

--
Bruce Momjian                        | http://candle.pha.pa.us
pgman@candle.pha.pa.us               | (610) 359-1001
+ If your life is a hard drive,      | 13 Roberts Road
+ Christ can be your backup.         | Newtown Square, Pennsylvania 19073
Martijn van Oosterhout wrote: > On Tue, Oct 28, 2003 at 01:04:27PM -0500, Greg Stark wrote: > >>If it runs out of spare blocks, then you're in trouble. And there's no warning >>that you're running low on spare blocks in any particular region unless you >>use special utilities to query the drive. Also if the failure is caused by >>environmental factors like vibrations or heat then you can be in trouble too. > > > Actually, drives have S.M.A.R.T for reporting these kind of issues. The idea > being that a counter decrements every time a block is remapped. When it > reaches a declared threshold the drive declares an error and if it's in > warranty that's enough to convince the manufacturer to send you a new disk. > > Not that many people use this feature, but it is there. I used smartsuite (http://sourceforge.net/projects/smartsuite/) to view the status of the drives, but the relocated sector count appears only available on ide drives. Does anyone know if that is the nature of scsi drives or is it just a limitation of that tool?
Joseph Shraibman wrote: > > Actually, drives have S.M.A.R.T for reporting these kind of issues. The idea > > being that a counter decrements every time a block is remapped. When it > > reaches a declared threshold the drive declares an error and if it's in > > warranty that's enough to convince the manufacturer to send you a new disk. > > > > Not that many people use this feature, but it is there. > > I used smartsuite (http://sourceforge.net/projects/smartsuite/) to view > the status of the drives, but the relocated sector count appears only > available on ide drives. Does anyone know if that is the nature of scsi > drives or is it just a limitation of that tool? Do SCSI drives even do relocation? I had a Seagate SCSI drive that would beep every time I tried to access a bad block, basically telling me to replace the drive. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
On Fri, 2003-11-28 at 21:45, Bruce Momjian wrote: > Do SCSI drives even do relocation? I had a Seagate SCSI drive that > would beep every time I tried to access a bad block, basically telling > me to replace the drive. I'm pretty sure that SCSI drives, or at least more modern ones, do. The ones I've used have a list of bad blocks stored internally and will relocate blocks automatically. The drives allowed you to reset this list by running a low level format from the scsi controller. The drives would then clear the bad blocks list and recheck disk blocks again. -- Suchandra Thapa <s-thapa-11@alumni.uchicago.edu>
[sNip]
>> I used smartsuite (http://sourceforge.net/projects/smartsuite/) to view
>> the status of the drives, but the relocated sector count appears only
>> available on ide drives. Does anyone know if that is the nature of
>> scsi drives or is it just a limitation of that tool?
>
> Do SCSI drives even do relocation? I had a Seagate SCSI drive that
> would beep every time I tried to access a bad block, basically telling
> me to replace the drive.

Normally this should be handled by the OS, since a judgement can be made on data reliability whereas the hard drive wouldn't know which algorithm to use (e.g., CRC).

Perhaps the following would be "food for thought" on future table space implementation so as to do something that Oracle hasn't thought of...

On NetWare v2.x (c. 1980) through v6.5 (the current version, released in 2003) a section of each Partition was designated as a "HotFix" area (the percentage is configurable at the time of formatting) which is automatically used in place of bad blocks as they are discovered, and error messages are generated in system logs and on the System Console whenever one is found.

The default percentage originally started out at 2% but was eventually lowered to 0.2% over time due to a number of factors including the following:

1. Larger capacity hard drives; and,

2. Fewer defects on new hard drives -- in the old days (20 years ago definitely qualifies as "old days" in the computer industry) it was common for new hard drives to come with errors on the drive, but now all hard drives come with zero bad sectors (I assume this is due to improved techniques and practices at the manufacturing level).

I'd be quite happy to write the documentation explaining table spaces in PostgreSQL should it become a feature in a future release. In fact, I would really enjoy doing this, and so I believe that my contribution could be very helpful.
-- Randolf Richardson - rr@8x.ca Vancouver, British Columbia, Canada Please do not eMail me directly when responding to my postings in the newsgroups.
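[Editorial illustration, not part of the original thread.] A reserved redirection area of the kind described above can be modelled in Python as follows. This is a toy model only: the class and method names are hypothetical, the reserve is expressed in parts per thousand (2 for 0.2%, 20 for 2%) to keep the arithmetic in integers, and nothing here reflects NetWare's actual on-disk format.

```python
class HotFixPartition:
    """Toy partition that redirects bad blocks into a reserved spare area."""

    def __init__(self, total_blocks, reserve_permille=2):
        # Reserve a slice at the end of the partition for redirected blocks.
        self.reserved = total_blocks * reserve_permille // 1000
        self.data_blocks = total_blocks - self.reserved
        self.redirects = {}   # bad block number -> spare block number
        self.log = []         # stands in for the system log / console messages

    def mark_bad(self, block):
        """Redirect a newly discovered bad block and log the event."""
        if block in self.redirects:
            return self.redirects[block]
        if len(self.redirects) >= self.reserved:
            raise RuntimeError("HotFix area exhausted")
        spare = self.data_blocks + len(self.redirects)
        self.redirects[block] = spare
        self.log.append(f"block {block} redirected to spare {spare}")
        return spare

    def resolve(self, block):
        """Translate a logical block number to where the data really lives."""
        return self.redirects.get(block, block)
```

In this model a 1000-block partition with a 0.2% reserve has only two spare blocks, while a 2% reserve gives twenty; the log list plays the role of the console messages mentioned above.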
>> Base-two arithmetic sounds pretty broad. If only you could come up with a
>> scheme for division and multiplication by powers of two through
>> bitshifting.....
>
> I already have that patent! :-)

Please share your licensing agreement with the rest of us so that we may decide to applaud you or throw tomatoes at you (throwing tomatoes, as far as I'm aware, is a "process" which hasn't been Patented yet). =D

--
Randolf Richardson - rr@8x.ca
Vancouver, British Columbia, Canada

Please do not eMail me directly when responding to my postings in the newsgroups.