Thread: On Linux Filesystems
Bruce Momjian commented:

"Uh, the ext2 developers say it isn't 100% reliable" ... "I mentioned it while I was visiting Red Hat, and they didn't refute it."

1. Nobody has gone through any formal proofs, and there are few systems _anywhere_ that are 100% reliable. NASA has occasionally lost spacecraft to software bugs, so nobody will be making such rash claims about ext2.

2. Several projects have taken on the task of introducing journalled filesystems, most notably ext3 (sponsored by RHAT via Stephen Tweedie) and ReiserFS (oft sponsored by SuSE). (I leave off JFS/XFS since they existed long before they had any relationship with Linux.)

   Participants in such projects certainly have an interest in presenting the notion that they provide improved reliability over ext2.

3. There is no "apologist" for ext2 who will (stupidly and futilely) claim it to be flawless. Nor is there substantial interest in improving it; the sort of people who would be interested in that sort of thing are working on the other FSes.

   This also means that there's no one interested in going into the guaranteed-to-be-unsung effort involved in trying to prove ext2 to be "formally reliable."

4. It would be silly to minimize the impact of commercial interest. RHAT has been paying for the development of a would-be ext2 successor. For them to refute your comments wouldn't be in their interests.

Note that these are "warm and fuzzy" comments, the whole lot. The 80-some thousand lines of code involved in ext2, ext3, reiserfs, and jfs are no more amenable to absolute mathematical proof of reliability than the corresponding BSD FFS code.

6. Such efforts would be futile, anyways. Disks are mechanical devices, and, as such, suffer from substantial reliability issues irrespective of the reliability of the software. I have lost sleep on too many occasions due to failures of:
   a) Disk drives,
   b) Disk controllers [the worst Oracle failure I encountered resulted from this], and
   c) OS memory management.
I used ReiserFS back in its "bleeding edge" days, and find myself a lot more worried about losing data to flakey disk controllers.

It frankly seems insulting to focus on ext2 in this way when:

 a) There aren't _hard_ conclusions to point to, just soft ones;

 b) The reasons for you hearing vaguely negative things about ext2 are much more likely political than they are technical.

I wish there were more "hard and fast" conclusions to draw, to be able to conclusively say that one or another Linux filesystem was unambiguously preferable for use with PostgreSQL. There are not conclusive metrics, either in terms of speed or of some notion of "reliability." I'd expect ReiserFS to be the poorest choice, and for XFS to be the best, but I only have fuzzy reasons, as opposed to metrics.

The absence of measurable metrics of the sort is _NOT_ a proof that (say) FreeBSD is conclusively preferable, whatever your own preferences (I'll try to avoid characterizing it as "prejudices," as that would be unkind) may be. That would represent a quite separate debate, and one that doesn't belong here, certainly not on a thread where the underlying question was "Which Linux FS is preferred?"

If the OSDB TPC-like benchmarks can get "packaged" up well enough to easily run and rerun them, there's hope of getting better answers, perhaps even including performance metrics for *BSD. That, not Linux-baiting, is the answer...
--
select 'cbbrowne' || '@' || 'acm.org';
http://www.ntlug.org/~cbbrowne/sap.html
(eq? 'truth 'beauty) ; to avoid unassigned-var error, since compiled code
                     ; will pick up previous value to var set!-ed,
                     ; the unassigned object.
-- from BBN-CL's cl-parser.scm
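[Editor's sketch] The wish above for metrics that are easy to run and rerun can be illustrated with a very small probe. This is not the OSDB benchmark and says nothing about crash safety; it only times synchronous writes on whatever filesystem holds the target path. The function name `time_fsync_writes` and its parameters are hypothetical, and the 8192-byte block is chosen only because it matches PostgreSQL's default page size. Drives with volatile write caches may acknowledge an fsync early, so the numbers are best read as rough and relative.

```python
import os
import time


def time_fsync_writes(path: str, count: int = 100, size: int = 8192) -> list:
    """Write `count` blocks of `size` bytes, fsync after each, return seconds per write."""
    block = b"\0" * size
    timings = []
    # Create or truncate the probe file; it is hypothetical scratch data only.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        for _ in range(count):
            start = time.perf_counter()
            os.write(fd, block)
            os.fsync(fd)  # ask the kernel to push this write to stable storage
            timings.append(time.perf_counter() - start)
    finally:
        os.close(fd)
    return timings
```

Running the same probe against a file on each candidate filesystem, on the same hardware, gives numbers that can at least be compared with each other, for example via `sum(timings) / len(timings)`.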
As I remember, there were clear cases that ext2 would fail to recover, and it was known to be a limitation of the file system implementation. Some of the ext2 developers were in the room at Red Hat when I said that, so if it was incorrect, they would hopefully have spoken up. I addressed the comments directly to them.

To be recoverable, you have to be careful how you sync metadata to disk. All the journalling file systems and the BSD UFS do that. I am told ext2 does not. I don't know much more than that.

As I remember, years ago ext2 was faster than UFS, but that was because ext2 didn't guarantee failure recovery. Now, with UFS soft updates, they have similar performance characteristics, but UFS is still crash-safe.

However, I just tried google and couldn't find any documented evidence that ext2 isn't crash-safe, so maybe I am wrong.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
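[Editor's sketch] The point about being careful how metadata is synced concerns the filesystem's internals, but the same ordering concern shows up at application level. The sketch below assumes Linux or another POSIX system; `durable_write` is a hypothetical helper, not something PostgreSQL or any poster in this thread uses. It writes the data, syncs it, renames it into place, and then syncs the directory so that the rename (a metadata change) is also pushed to disk. How well this holds up after a crash still depends on the filesystem and the hardware underneath.

```python
import os
import tempfile


def durable_write(path: str, data: bytes) -> None:
    """Replace `path` with `data`, syncing file contents before directory metadata."""
    directory = os.path.dirname(os.path.abspath(path))
    # Temporary file lives in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # step 1: file contents reach stable storage
        os.replace(tmp_path, path)  # step 2: atomic rename on POSIX filesystems
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # Step 3: sync the directory so the new name itself is recorded on disk.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

The order matters: syncing the directory before the file contents could leave a name on disk that points at data which never arrived.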
Here is a post from 2002 describing ext2 corruption after a power failure:

http://groups.google.com/groups?q=ext2+corrupt+%22power+failure%22&hl=en&lr=&ie=UTF-8&selm=alvrj5%249in%241%40usc.edu&rnum=9

--
  Bruce Momjian
On Mon, Aug 11, 2003 at 10:58:18PM -0400, Christopher Browne wrote:

> 1. Nobody has gone through any formal proofs, and there are few
> systems _anywhere_ that are 100% reliable.

I think the problem is that ext2 is known to be not perfectly crash safe. That is, fsck on reboot after a crash can cause, in some extreme cases, recently-fsynced data to end up in lost+found/. The data may or may not be recoverable from there.

I don't think anyone would object to such a characterisation of ext2. It was not designed, ever, for perfect data safety -- it was designed as a reasonably good compromise for most cases. _Every_ filesystem entails some compromises. This happens to be the one entailed by ext2.

For production use with valuable data, for my money (or, more precisely, my time when a system panics for no good reason), it is always worth the additional speed penalty to use something like metadata journalling. Maybe others have more time to spare.

> perhaps even including performance metrics for *BSD. That, not
> Linux-baiting, is the answer...

I didn't see anyone Linux-baiting.

A

--
----
Andrew Sullivan                         204-4141 Yonge Street
Liberty RMS                           Toronto, Ontario Canada
<andrew@libertyrms.info>                              M2P 2A8
                                         +1 416 646 3304 x110
On Tue, 12 Aug 2003, Andrew Sullivan wrote:

> On Mon, Aug 11, 2003 at 10:58:18PM -0400, Christopher Browne wrote:
> > 1. Nobody has gone through any formal proofs, and there are few
> > systems _anywhere_ that are 100% reliable.
>
> I think the problem is that ext2 is known to be not perfectly crash
> safe. That is, fsck on reboot after a crash can cause, in some
> extreme cases, recently-fsynced data to end up in lost+found/. The
> data may or may not be recoverable from there.
>
> I don't think anyone would object to such a characterisation of ext2.
> It was not designed, ever, for perfect data safety -- it was designed
> as a reasonably good compromise for most cases. _Every_ filesystem
> entails some compromises. This happens to be the one entailed by
> ext2.
>
> For production use with valuable data, for my money (or, more
> precisely, my time when a system panics for no good reason), it is
> always worth the additional speed penalty to use something like
> metadata journalling. Maybe others have more time to spare.

I think the issue here is that if you are running with the async mount option, it is quite likely that your volume will be corrupted if there are writes going on and power fails. I'm pretty sure that as long as the partition is mounted sync, this isn't a problem.

I have seen reports where ext3, not ext2, caused the data corruption problem (old kernels, 2.4.4 and before, I believe). I.e. the addition of journaling caused data loss. Given that possibility, it may well have been at one time that ext2 was a safer bet than ext3.

> > perhaps even including performance metrics for *BSD. That, not
> > Linux-baiting, is the answer...
>
> I didn't see anyone Linux-baiting.

No more than the typical, light-hearted stuff we toss back and forth. I certainly wasn't upset by any of it.
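[Editor's sketch] Whether a volume is mounted sync or async can be read from the kernel's mount table. The sketch below assumes Linux, where /proc/mounts lists one mount per line as device, mount point, filesystem type and a comma-separated option string. Async is the default and is normally not listed, so the check looks for an explicit `sync` option. The helper names and the `SAMPLE` text (devices, paths) are made up for illustration.

```python
# Made-up mount table in /proc/mounts format, for illustration only.
SAMPLE = (
    "/dev/hda1 / ext2 rw,sync 0 0\n"
    "/dev/hda2 /var/lib/pgsql ext2 rw 0 0\n"
)


def mount_options(mounts_text: str, mount_point: str) -> list:
    """Return the option list of the first entry for `mount_point`."""
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, fstype, options, dump, pass
        if len(fields) >= 4 and fields[1] == mount_point:
            return fields[3].split(",")
    raise KeyError(mount_point)


def is_sync_mounted(mounts_text: str, mount_point: str) -> bool:
    """True when the entry carries an explicit `sync` option."""
    # Splitting on commas keeps `dirsync` from being mistaken for `sync`.
    return "sync" in mount_options(mounts_text, mount_point)
```

On a live system the text would come from `open("/proc/mounts").read()` rather than from `SAMPLE`.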
People:

> On Mon, Aug 11, 2003 at 10:58:18PM -0400, Christopher Browne wrote:
> > 1. Nobody has gone through any formal proofs, and there are few
> > systems _anywhere_ that are 100% reliable.
>
> I think the problem is that ext2 is known to be not perfectly crash
> safe. That is, fsck on reboot after a crash can cause, in some
> extreme cases, recently-fsynced data to end up in lost+found/. The
> data may or may not be recoverable from there.

Aside from that, as recently as eighteen months ago I had to manually fsck an ext2 system after an unexpected power-out. After my interactive session the system recovered and no data was lost. However, the client lost 3.5 hours of work time ... 2.5 hours for me to get to the site, and 1 hour to recover the server (mostly waiting time).

So it's a tradeoff of loss of performance vs. recovery time. In a server room with redundant backup power supplies, "clean room" security and fail-over services, I can certainly imagine that data journalling would not be needed. That is, however, the minority ...

--
Josh Berkus
Aglio Database Solutions
San Francisco
On Tue, Aug 12, 2003 at 09:36:21AM -0700, Josh Berkus wrote:

> So it's a tradeoff with loss of performance vs. recovery time. In
> a server room with redundant backup power supplies, "clean room"
> security and fail-over services, I can certainly imagine that data
> journalling would not be needed.

You can have all the redundant power, high availability hardware, and ultra-Robocop security going, and still have crashes: so far as I know, _nobody_ makes perfectly reliable hardware, and the harder you push it, the more likely you are to find trouble. And certainly, when you have a surprise outage because the CPU where the kernel happened to be burned itself up, an extra hour or two offline while you do fsck is liable to make you cry out variations of those four letters more than once. :-/

A

--
Andrew Sullivan
Christopher,

I appreciate your comments. In the end it comes down to personal experience with one file system or another. From mine, I can say that I have had good experiences with UFS, EXT2, and XFS. I had a catastrophic experience with ReiserFS (not during operation, but you are the loser when it fails, because the recovery methods are likely to be insufficient).

In the end, anybody who runs technical equipment, whether it's a computer or a chemical fab, has to accept that it can fail and make up their mind about contingencies. This is due even before you start operating the equipment.

So don't waste too much time thinking about the perfect file system. Instead, evaluate the potential damage that can result from failure. Develop a backup and recovery strategy and test it, test it, and test it again, so that you can do it blindly when it's needed.

Ciao,

Toni
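[Editor's sketch] The advice to test the recovery path can be turned into a repeatable check: restore the backup into a scratch location and compare it with what it is supposed to contain. The sketch below assumes a backup that is a plain directory tree; `tree_digests` and `restore_matches` are hypothetical names, and `shutil.copytree` only stands in for whatever the real restore procedure is. Note that a plain file copy of a running database's data directory is not a consistent backup, so for PostgreSQL the restore step would be the database's own dump or restore tooling.

```python
import hashlib
import os
import shutil
import tempfile


def tree_digests(root: str) -> dict:
    """Map each file path relative to `root` to the SHA-256 of its contents."""
    digests = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            with open(full, "rb") as fh:
                digest = hashlib.sha256(fh.read()).hexdigest()
            digests[os.path.relpath(full, root)] = digest
    return digests


def restore_matches(source: str, backup: str) -> bool:
    """Restore `backup` into a scratch directory and compare it with `source`."""
    with tempfile.TemporaryDirectory() as scratch:
        restored = os.path.join(scratch, "restored")
        shutil.copytree(backup, restored)  # stand-in for the real restore step
        return tree_digests(restored) == tree_digests(source)
```

Run on a schedule, a check like this catches a backup that has silently stopped being restorable before the day it is needed.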