Thread: New Linux xfs/reiser file systems
I was talking to a Linux user yesterday, and he said that performance using the xfs file system is pretty bad. He believes it has to do with the fact that fsync() on log-based file systems requires more writes. With a standard BSD/ext2 file system, WAL writes can stay on the same cylinder to perform fsync. Is that true of log-based file systems? I know xfs and reiser are both log based. Do we need to be concerned about PostgreSQL performance on these file systems? I use BSD FFS with soft updates here, so it doesn't affect me. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
> The "problem" with log based filesystems is that they most likely > do not know the consequences of a write so an fsync on a file may > require double writing to both the log and the "real" portion of > the disk. They can also exhibit the problem that an fsync may > cause all pending writes to require scheduling unless the log is > constructed on the fly rather than incrementally. Yes, this double-writing is a problem. Suppose you have your WAL on a separate drive. You can fsync() WAL with zero head movement. With a log based file system, you need two head movements, so you have gone from zero movements to two. -- Bruce Momjian
* Bruce Momjian <pgman@candle.pha.pa.us> [010502 14:01] wrote: > I was talking to a Linux user yesterday, and he said that performance > using the xfs file system is pretty bad. He believes it has to do with > the fact that fsync() on log-based file systems requires more writes. > > With a standard BSD/ext2 file system, WAL writes can stay on the same > cylinder to perform fsync. Is that true of log-based file systems? > > I know xfs and reiser are both log based. Do we need to be concerned > about PostgreSQL performance on these file systems? I use BSD FFS with > soft updates here, so it doesn't affect me. The "problem" with log based filesystems is that they most likely do not know the consequences of a write so an fsync on a file may require double writing to both the log and the "real" portion of the disk. They can also exhibit the problem that an fsync may cause all pending writes to require scheduling unless the log is constructed on the fly rather than incrementally. There was also the problem that was brought up recently that certain versions (maybe all?) of Linux perform fsync() in a very non-optimal manner, if the user is able to use the O_FSYNC option rather than fsync he may see a performance increase. But his guess is probably nearly as good as mine. :) -- -Alfred Perlstein - [alfred@freebsd.org] http://www.egr.unlv.edu/~slumos/on-netbsd.html
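The two approaches contrasted above, an explicit fsync() after the write versus opening the file with a synchronous-write flag, can be sketched roughly as follows. This is only an illustration in Python, whose os module is a thin wrapper over the same system calls. It assumes a POSIX system where the flag is spelled O_SYNC (the thread calls it O_FSYNC), and the function names are hypothetical, not anything PostgreSQL defines.

```python
import os


def write_then_fsync(path, data):
    """Buffered write followed by an explicit fsync() call."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        os.write(fd, data)
        os.fsync(fd)  # second system call: flush this file to stable storage
    finally:
        os.close(fd)


def write_with_o_sync(path, data):
    """Open with O_SYNC so each write() is itself synchronous."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_SYNC
    fd = os.open(path, flags, 0o600)
    try:
        os.write(fd, data)  # returns only after the data has been handed to the device
    finally:
        os.close(fd)
```

Both functions leave the same bytes in the file. Whether one is faster than the other depends on the kernel and file system, which is exactly the open question in this thread.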
* Bruce Momjian <pgman@candle.pha.pa.us> [010502 15:20] wrote: > > The "problem" with log based filesystems is that they most likely > > do not know the consequences of a write so an fsync on a file may > > require double writing to both the log and the "real" portion of > > the disk. They can also exhibit the problem that an fsync may > > cause all pending writes to require scheduling unless the log is > > constructed on the fly rather than incrementally. > > Yes, this double-writing is a problem. Suppose you have your WAL on a > separate drive. You can fsync() WAL with zero head movement. With a > log based file system, you need two head movements, so you have gone > from zero movements to two. It may be worse depending on how the filesystem actually does journalling. I wonder if an fsync() may cause ALL pending meta-data to be updated (even metadata not related to the postgresql files). Do you know if reiser or xfs have this problem? -- -Alfred Perlstein - [alfred@freebsd.org] Daemon News Magazine in your snail-mail! http://magazine.daemonnews.org/
> > Yes, this double-writing is a problem. Suppose you have your WAL on a > > separate drive. You can fsync() WAL with zero head movement. With a > > log based file system, you need two head movements, so you have gone > > from zero movements to two. > > It may be worse depending on how the filesystem actually does > journalling. I wonder if an fsync() may cause ALL pending > meta-data to be updated (even metadata not related to the > postgresql files). > > Do you know if reiser or xfs have this problem? I don't know, but the Linux user reported xfs was really slow. -- Bruce Momjian
Bruce Momjian wrote: > > I was talking to a Linux user yesterday, and he said that performance > using the xfs file system is pretty bad. He believes it has to do with > the fact that fsync() on log-based file systems requires more writes. > > With a standard BSD/ext2 file system, WAL writes can stay on the same > cylinder to perform fsync. Is that true of log-based file systems? > > I know xfs and reiser are both log based. Do we need to be concerned > about PostgreSQL performance on these file systems? I use BSD FFS with > soft updates here, so it doesn't affect me. I did see poor performance on reiserfs; I have not as yet ventured into using xfs. It occurs to me that journalizing file systems will almost always be slower on an application such as postgres. The journalizing file system is trying to maintain data integrity for an application which is also trying to maintain data integrity. There will always be extra work involved. This behavior raises the question about file system usage in Postgres. Many databases, such as Oracle, create table space files and operate directly on the raw blocks, bypassing the file system altogether. On one hand, Postgres is easy to use and maintain because it cooperates with the native file system; on the other hand it incurs the overhead of whatever silliness the file system wants to do. I would bet it is a huge amount of work to use a "table space" system and no one wants that. lol. However, it should be noted that a bit more control over database layout would make some great performance improvements. The ability to put indexes on a separate volume from data. The ability to put different tables on different volumes. And so on. In the short term, I think poor performance on a journalizing file system is to be expected, unless there is an IOCTL to tell the FS to leave the files alone (and postgres calls it). A Linux HOWTO which informs people that certain file systems will have performance issues and why should handle the problem. 
Perhaps we can convince the Linux community to create a "dbfs" which is a stripped down simple no nonsense file system designed for applications like databases? -- I'm not offering myself as an example; every life evolves by its own laws. ------------------------ http://www.mohawksoft.com
On Thu, 3 May 2001, mlw wrote: > I would bet it is a huge amount of work to use a "table space" system > and no one wants that. From some stracing of 7.1, the most common syscall issued by postgres is an lseek() to the end of the file, presumably to find its length, which seems to happen up to about a dozen times per (pgbench) transaction. Tablespaces would solve this (not that lseek is a particularly expensive operation, of course). > Perhaps we can convince the Linux community to create a "dbfs" which > is a stripped down simple no nonsense file system designed for > applications like databases? Sync-metadata ext2 should be fine. Filesystems fsck pretty quick when they contain only a few large files. Otherwise, something like "smugfs" (now obsolete) might do. Matthew.
Matthew Kirkwood <matthew@hairy.beasts.org> writes: > From some stracing of 7.1, the most common syscall issued by > postgres is an lseek() to the end of the file, presumably to > find its length, which seems to happen up to about a dozen > times per (pgbench) transaction. > Tablespaces would solve this (not that lseek is a particularly > expensive operation, of course). No, they wouldn't; or at least they'd just create a different problem. The reason for the lseek is that the file length may have changed since the current backend last checked it. To avoid lseek we'd need some shared data structure that maintains the current length of every active table, which would be a nuisance to maintain and probably a source of contention delays. (Of course, such a data structure would just be the tip of the iceberg of what we'd have to maintain for ourselves if we couldn't depend on the kernel to do it for us. Reimplementing a filesystem doesn't strike me as a profitable use of our time.) regards, tom lane
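The reason for the repeated lseek() can be illustrated with a small sketch. It is written in Python only for brevity (os.lseek maps directly onto the system call) and is not PostgreSQL code: the two file descriptors stand in for two backends, and the block size and the names file_length and stale_length_demo are assumptions made for this example.

```python
import os
import tempfile

BLOCK_SIZE = 8192  # illustrative block size, chosen for this sketch


def file_length(fd):
    """Ask the kernel for the current length by seeking to end of file."""
    return os.lseek(fd, 0, os.SEEK_END)


def stale_length_demo():
    """Return the length seen before and after another writer extends the file."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "relation")
        # Two descriptors stand in for two independent backends.
        writer = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
        reader = os.open(path, os.O_RDONLY)
        try:
            os.write(writer, b"\0" * BLOCK_SIZE)
            before = file_length(reader)  # a value a backend might be tempted to cache
            os.write(writer, b"\0" * BLOCK_SIZE)
            after = file_length(reader)   # only a fresh lseek() sees the growth
            return before, after
        finally:
            os.close(reader)
            os.close(writer)


if __name__ == "__main__":
    print(stale_length_demo())
```

A length cached after the first call would be one block short by the time of the second, which is why avoiding the lseek() would need some shared, always-current record of every table's length.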
> > I know xfs and reiser are both log based. Do we need to be concerned > > about PostgreSQL performance on these file systems? I use BSD FFS with > > soft updates here, so it doesn't affect me. > > I did see poor performance on reiserfs, I have not as yet ventured into using > xfs. > > I occurs to me that journalizing file systems will almost always be slower on > an application such as postgres. The journalizing file system is trying to > maintain data integrity for an application which is also trying to maintain > data integrity. There will always be extra work involved. Yes, the problem is that extra work is required on PostgreSQL's part. Log-based file systems make sure all the changes get onto the disk in an orderly way, but I believe it can delay what gets written to the drive. PostgreSQL wants to be sure all the data is on the disk, period. Unfortunately, the _orderly_ part makes the _fsync_ part do more work. By going from ext2 to a log-based file system, we are getting _farther_ from a raw device than if we just stayed with ext2. ext2 has serious problems with corrupt file systems after a crash, so I understand the need to move to another file system type. I have been waiting for Linux to get a more modern file system. Unfortunately, the new ones seem to be worse for PostgreSQL. > This behavior raises the question about file system usage in Postgres. Many > databases, such as Oracle, create table space files and operate directly on the > raw blocks, bypassing the file system altogether. OK, we have considered this, but frankly, the new, modern file systems like FFS/softupdates have i/o rates near raw speed, with all the advantages a file system gives us. I believe most commercial dbs are moving away from raw devices and toward file systems. In the old days the SysV file system was pretty bad at i/o & fragmentation, so they used raw devices. > The ability to put indexes on a separate volume from data. 
> The ability to put different tables on different volumes. > And so on. We certainly need that, but raw devices would not make this any easier, I think. > In the short term, I think poor performance on a journalizing file system is to > be expected, unless there is an IOCTL to tell the FS to leave the files alone > (and postgres calls it). A Linux HOWTO which informs people that certain file > systems will have performance issues and why should handle the problem. > > Perhaps we can convince the Linux community to create a "dbfs" which is a > stripped down simple no nonsense file system designed for applications like > databases? It could become a serious problem as people start using reiser/xfs for their file systems and don't understand the performance problems. Even more likely is that they will turn off fsync, thinking reiser doesn't need it, when in fact, I think it does. -- Bruce Momjian
> Matthew Kirkwood <matthew@hairy.beasts.org> writes: > > From some stracing of 7.1, the most common syscall issued by > > postgres is an lseek() to the end of the file, presumably to > > find its length, which seems to happen up to about a dozen > > times per (pgbench) transaction. > > > Tablespaces would solve this (not that lseek is a particularly > > expensive operation, of course). > > No, they wouldn't; or at least they'd just create a different problem. > The reason for the lseek is that the file length may have changed since > the current backend last checked it. To avoid lseek we'd need some > shared data structure that maintains the current length of every active > table, which would be a nuisance to maintain and probably a source of > contention delays. Seems we should cache the file lengths somehow. Not sure how to do it because our file system cache is local to each backend. > (Of course, such a data structure would just be the tip of the iceberg > of what we'd have to maintain for ourselves if we couldn't depend on the > kernel to do it for us. Reimplementing a filesystem doesn't strike me > as a profitable use of our time.) Ditto. The database is complicated enough. -- Bruce Momjian
> > kernel to do it for us. Reimplementing a filesystem doesn't strike me > > as a profitable use of our time.) > Ditto. The database is complicated enough. Maybe some kind of recommendation would be a good thing. That is, if the PostgreSQL community has enough knowledge. A section in the docs that discusses various file systems, so people can make an intelligent choice. -- Kaare Rasmussen --Linux, spil,-- Tlf: 3816 2582 Kaki Data tshirts, merchandize Fax: 3816 2501 Howitzvej 75 Åben 14.00-18.00 Web: www.suse.dk 2000 Frederiksberg Lørdag 11.00-17.00 Email: kar@webline.dk
> > This behavior raises the question about file system usage in Postgres. Many > > databases, such as Oracle, create table space files and operate directly on the > > raw blocks, bypassing the file system altogether. > > OK, we have considered this, but frankly, the new, modern file systems > like FFS/softupdates have i/o rates near raw speed, with all the > advantages a file system gives us. I believe most commercial dbs are > moving away from raw devices and toward file systems. In the old days > the SysV file system was pretty bad at i/o & fragmentation, so they used > raw devices. I'm starting to like the idea of raw FS for a few reasons: 1) Considering that postgresql now does WAL, a logging FS for the database seems less necessary (is it needed at all?). 2) Given the fact that postgresql is trying to support many OSs, depending on, for example, XFS on a linux system will cause many problems. What about solaris? How about BSD? Etc.. Using raw db MAY be easier than dealing with the problems that will arise from supporting multiple filesystems. That said, the ability to use the system's FS does have its advantages (backup, moving files, etc). Just some thoughts.. - Brandon b. palmer, bpalmer@crimelabs.net pgp: www.crimelabs.net/bpalmer.pgp5
On Thu, 3 May 2001, mlw wrote: > This behavior raises the question about file system usage in Postgres. Many > databases, such as Oracle, create table space files and operate directly on the > raw blocks, bypassing the file system altogether. > > On one hand, Postgres is easy to use and maintain because it cooperates with > the native file system, on the other hand it incurs the overhead of whatever > silliness the file system wants to do. It is not *that* hard to write a 'postgresfs' but you have to look at the problems it creates. One of the biggest problems facing sys admins of large sites is that the Oracle/DB2/etc DBA, having created the purpose-built database filesystem, has not allowed enough room for growth. Like I said, a basic file system is not difficult, but volume management tools and the maintenance of the whole thing are. Currently, postgres administrators are not faced with such a problem. There is, of course, the argument that pgfs need not be enforced. The problem is that many people would probably use it so as to have a 'superior' installation. This then entails the problems above, creating more work for core developers. Gavin
Just put a note in the installation docs that the place where the database is initialised to should be on a non-Reiser, non-XFS mount... Chris -----Original Message----- From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers-owner@postgresql.org]On Behalf Of mlw Sent: Thursday, 3 May 2001 8:09 PM To: Bruce Momjian; Hackers List Subject: [HACKERS] Re: New Linux xfs/reiser file systems
There might be a problem, but if no one mentions it to the maintainers of those fs's, it will not get fixed... Regards John
Well, arguably if you're setting up a database server then a reasonable DBA should think about such things... (My 2c) Chris -----Original Message----- From: Bruce Momjian [mailto:pgman@candle.pha.pa.us] Sent: Friday, 4 May 2001 9:42 AM To: Christopher Kings-Lynne Cc: mlw; Hackers List Subject: Re: [HACKERS] Re: New Linux xfs/reiser file systems
> Just put a note in the installation docs that the place where the database > is initialised to should be on a non-Reiser, non-XFS mount... Sure, we can do that now. What do we do when these are the default file systems for Linux? We can tell them to create other types of file systems, but that is a pretty big hurdle. I wonder if it would be easier to get reiser/xfs to make some modifications. -- Bruce Momjian
> Well, arguably if you're setting up a database server then a reasonable DBA > should think about such things... Yes, but people have trouble installing PostgreSQL. I can't imagine walking them through a newfs. -- Bruce Momjian
Bruce Momjian wrote: > > > Just put a note in the installation docs that the place where the database > > is initialised to should be on a non-Reiser, non-XFS mount... > > Sure, we can do that now. What do we do when these are the default file > systems for Linux? We can tell them to create other types of file > systems, but that is a pretty big hurdle. I wonder if it would be > easier to get reiser/xfs to make some modifications. I have looked at Reiser, and I don't think it is a file system suited for very large files, or applications such as postgres. The Linux crowd should lobby against any such trend. It is ok for many moderately small files. ReiserFS would be great for a cddb server, but poor for a database box. XFS is a real big file system project, I'd bet that there are file properties or management tools to tell it to leave directories and files alone. They should have addressed that years ago. One last mention.. Having better control over WHERE various files in a database are located can make it easier to deal with these things. Just a thought. ;-) -- I'm not offering myself as an example; every life evolves by its own laws. ------------------------ http://www.mohawksoft.com
> > > Just put a note in the installation docs that the place where the >database > > is initialised to should be on a non-Reiser, non-XFS mount... > >Sure, we can do that now. I still think this is not necessarily the right approach either. One major purpose of using a journaling fs is for fast boot up time after crash. If you have a 100 GB database you may wish to have the data on XFS. I do think that the WAL log should be on a separate disk and on a non-journaling fs for performance. Best Regards, Carl Garland
Here is a radical idea... What is it that is causing Postgres trouble? It is the file system's attempts to maintain some integrity. So I proposed a simple "dbfs" sort of thing which was the most basic sort of file system possible. I'm not sure, but I think we can test this hypothesis on the FAT32 file system on Linux. As far as I know, FAT32 (FAT in general) is a very simple file system and does very little during operation, except read and write the files and manage what's been allocated. Plus, the allocation table is very simple in comparison to all the other file systems. Would pgbench run on a system using ext2, Reiser, then FAT32 be sufficient to get a feeling for the type of performance Postgres would get, or am I just off the wall? If this idea has some merit, what would be the best way to test it? Move the pg_xlog directory first, then try base? What's the best methodology to try? -- I'm not offering myself as an example; every life evolves by its own laws. ------------------------ http://www.mohawksoft.com
On Thu, May 03, 2001 at 11:41:24AM -0400, Bruce Momjian wrote: > ext2 has serious problems with corrupt file systems after a crash, so I > understand the need to move to another file system type. I have been > waitin for Linux to get a more modern file system. Unfortunately, the > new ones seem to be worse for PostgreSQL. If you fsync() a directory in Linux, all the metadata within that directory will be written out to disk. As for filesystem corruption, I can say the e2fsck is among the best fsck programs out there, and I've only ever had 1 occasion where I've lost any data on an ext2 filesystem, and that was due to bad sectors causing me to lose the root directory. (Well, apart from human errors, but that doesn't count) > OK, we have considered this, but frankly, the new, modern file systems > like FFS/softupdates have i/o rates near raw speed, with all the > advantages a file system gives us. I believe most commercial dbs are > moving away from raw devices and toward file systems. In the old days > the SysV file system was pretty bad at i/o & fragmentation, so they used > raw devices. And Solaris' 1/01 media has better support for O_DIRECT (?), which they claim gives you 93% of the speed of a raw device. (Or something like that; I read this in marketing material a couple of months ago) Raw devices are designed to have filesystems on them. The only excuses for userland tools accessing them, are fs-specific tools (eg. dump, fsck, etc), or for non-unix filesystem tools, where the unix VFS doesn't handle things properly (hfstools). > > The ability to put indexes on a separate volume from data. > > The ability to put different tables on different volumes. > > And so on. > > We certainly need that, but raw devices would not make this any easier, > I think. It would be cool if either at compile time or at database creation time, we could specify a printf-like format for placing tables, indexes, etc. 
> It could become a serious problem as people start using reiser/xfs for > their file systems and don't understand the performance problems. Even > more likely is that they will turn off fsync, thinking reiser doesn't > need it, when in fact, I think it does. ReiserFS only supports metadata logging. The performance slowdown must be due to logging things like mtime or atime, because otherwise ReiserFS is a very high performance FS. (Although, I admittedly haven't used it since it was early in its development) -- Michael Samuel <michael@miknet.net>
Michael Samuel wrote: > > ReiserFS only supports metadata logging. The performance slowdown must be > due to logging things like mtime or atime, because otherwise ReiserFS is a > very high performance FS. (Although, I admittedly haven't used it since it > was early in it's development) The way I understand it is that ReiserFS does not attempt to separate files at the block level. Multiple files can live in the same disk block. This is cool if you have many small files, but the extra overhead for large files such as those used by a database, is a bit much. I read some stuff about a year ago, and my impressions forced me to conclude that ReiserFS was geared toward applications. Which is a pretty good thing for applications, but not for databases. I really think a simple low down dirty file system is just what the doctor ordered for postgres. Remember, general purpose file systems must do for files what Postgres is already doing for records. You will always have extra work. I am seriously thinking of trying a FAT32 as pg_xlog. I wonder if it will improve performance, or if there is just something fundamentally stupid about FAT32 that will make it worse? -- I'm not offering myself as an example; every life evolves by its own laws. ------------------------ http://www.mohawksoft.com
Before we get too involved in speculating, shouldn't we actually measure the performance of 7.1 on XFS and Reiserfs? Since it's easy to disable fsync, we can test whether that's the problem. I don't think that logging file systems must intrinsically give bad performance on fsync since they only log metadata changes. I don't have a machine with XFS installed and it will be at least a week before I could get around to a build. Any volunteers? Ken Hirsch
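A measurement along the lines suggested above could start from something like the following sketch. It is not pgbench and not PostgreSQL code, just a minimal timing loop in Python that appends fixed-size blocks with and without an fsync() after each write. The function name time_writes, the block size, and the block count are all assumptions chosen for illustration.

```python
import os
import sys
import time


def time_writes(path, count, block_size=8192, sync=True):
    """Append `count` blocks to `path`; return elapsed seconds.

    With sync=True every write is followed by fsync(), roughly the
    pattern a write-ahead log imposes; with sync=False the kernel is
    free to buffer the writes.
    """
    block = b"\0" * block_size
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        start = time.perf_counter()
        for _ in range(count):
            os.write(fd, block)
            if sync:
                os.fsync(fd)
        return time.perf_counter() - start
    finally:
        os.close(fd)


if __name__ == "__main__":
    # Point the target at a file on the file system under test.
    target = sys.argv[1] if len(sys.argv) > 1 else "fsync_test.dat"
    print("with fsync:   ", time_writes(target, 200, sync=True))
    print("without fsync:", time_writes(target, 200, sync=False))
    os.remove(target)
```

Running the same script against files on ext2, ReiserFS, and XFS mounts would show how much of any difference is attributable to fsync() itself, although a real comparison would still need an actual database workload.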
mlw <markw@mohawksoft.com> writes: > I have looked at Reiser, and I don't think it is a file system suited for very > large files, or applications such as postgres. What's the problem with big files? ReiserFS v2 doesn't seem to support them, while v3 (of the on-disk format) seems just fine. That said, I'm certainly looking forward to xfs - I believe it will be the most widely used of the current batch of journaling file systems (reiserfs, jfs, XFS and ext3, the latter mainly focusing on an easy migration path for existing systems) -- Trond Eivind Glomsrød Red Hat, Inc.
On Fri, May 04, 2001 at 08:02:17AM -0400, mlw wrote: > The way I understand it is that ReiserFS does not attempt to separate files at > the block level. Multiple files can live in the same disk block. This is cool > if you have many small files, but the extra overhead for large files such as > those used by a database, is a bit much. It should be at least as fast as other filesystems for large files. I suspect that it would be faster in fact. The only catch is that the performance of reiserfs sucks when it gets past 85% or so full. (ext2 has similar problems) You can read about all this stuff at http://www.namesys.com/ > I really think a simple low down dirty file system is just what the doctor > ordered for postgres. Traditional BSD FFS or Solaris UFS is probably the best bet for postgres. > Remember, general purpose file systems must do for files what Postgres is > already doing for records. You will always have extra work. I am seriously > thinking of trying a FAT32 as pg_xlog. I wonder if it will improve performance, > or if there is just something fundamentally stupid about FAT32 that will make > it worse? Well, for starters, file permissions... Ext2 would kick arse over FAT32 for performance. -- Michael Samuel <michael@miknet.net>
>>>>> "Bruce" == Bruce Momjian <pgman@candle.pha.pa.us> writes: >> Well, arguably if you're setting up a database server then a >> reasonable DBA should think about such things... Bruce> Yes, but people have trouble installing PostgreSQL. I Bruce> can't imagine walking them through a newfs. In most of linux-land, the DBA is probably also the sysadmin. In bigger shops, and those which currently run, say Oracle or Sybase, the two roles are separate. When they are separate, you don't have to walk the DBA through it; he just walks over to the sysadmin and says "I need X megabytes of space on a new Y filesystem." roland -- PGP Key ID: 66 BC 3B CD Roland B. Roberts, PhD RL Enterprises roland@rlenter.com 76-15 113th Street, Apt 3B rbroberts@acm.org Forest Hills, NY 11375
I got some information from Stephen Tweedie on this - please keep him "Cc:" as he's not on this list ************************************************************************ Bruce Momjian <pgman@candle.pha.pa.us> writes: > I was talking to a Linux user yesterday, and he said that performance > using the xfs file system is pretty bad. He believes it has to do with > the fact that fsync() on log-based file systems requires more writes. Performance doing what? XFS has known performance problems doing unlinks and truncates, but not synchronous IO. The user should be using fdatasync() for databases, btw, not fsync(). First, XFS, ext3 and reiserfs are *NOT* log-based filesystems. They are journaling filesystems. They have a log, but they are not log-based because they do not store data permanently in a log structure. Berkeley LFS, Sprite and Spiralog are log-based filesystems. > With a standard BSD/ext2 file system, WAL writes can stay on the same > cylinder to perform fsync. Is that true of log-based file systems? Not true on ext2 or BSD. Write-aheads are _usually_ close to the inode, but not always. For true log-based filesystems, writes are always completely sequential, so the issue just goes away. For journaling filesystems, depending on the setup there may be a seek to the journal involved, but some journaling filesystems can use a separate disk for the journal so no seek is required. > I know xfs and reiser are both log based. Do we need to be concerned > about PostgreSQL performance on these file systems? I use BSD FFS with > soft updates here, so it doesn't affect me. A database normally preallocates its data files and then performs most of its writes using update-in-place. In such cases, fsync() is almost always the wrong thing to be doing --- the data writes have changed nothing in the inode except for the timestamps, and there's no need to flush the timestamps to disk for every write. 
fdatasync() is designed for this --- if the only inode change is timestamps, fdatasync() will skip the seek to the inode and will only update the data. If any significant inode fields have been changed, then a full flush is done. Using fdatasync, most filesystems will incur no seeks for data flush, regardless of whether the filesystem is journaling or not. Cheers,Stephen ************************************************************************ -- Trond Eivind Glomsrød Red Hat, Inc.
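The update-in-place pattern Stephen describes can be sketched roughly as follows. This is only an illustration under stated assumptions, not PostgreSQL's actual WAL code: the block size, segment length and helper names (`preallocate`, `write_in_place`, `demo`) are made up for the example. The file is filled once (a full fsync() is appropriate there, since the size and block map change), and later overwrites only need fdatasync() because nothing in the inode changes except timestamps. Where fdatasync() is missing, the sketch falls back to fsync().

```python
import os
import tempfile

BLOCK = 8192          # assumed block size, for illustration only
SEGMENT_BLOCKS = 16   # hypothetical segment length

# Probe for fdatasync() and fall back to fsync() where it is missing.
sync_data = getattr(os, "fdatasync", os.fsync)


def preallocate(path, blocks=SEGMENT_BLOCKS):
    """Fill the file with zeros once so later writes never change its size."""
    with open(path, "wb") as f:
        f.write(b"\0" * (BLOCK * blocks))
        f.flush()
        os.fsync(f.fileno())  # full flush: file size and block map changed


def write_in_place(fd, block_no, payload):
    """Overwrite one preallocated block, then flush only the data."""
    assert len(payload) <= BLOCK
    os.pwrite(fd, payload.ljust(BLOCK, b"\0"), block_no * BLOCK)
    sync_data(fd)  # only timestamps changed in the inode


def demo():
    """Write one record in place and report file size before and after."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "segment")
        preallocate(path)
        size_before = os.path.getsize(path)
        fd = os.open(path, os.O_RDWR)
        try:
            write_in_place(fd, 3, b"commit record")
            data = os.pread(fd, 13, 3 * BLOCK)
        finally:
            os.close(fd)
        return size_before, os.path.getsize(path), data
```

The file size is the same before and after the write, which is the condition under which fdatasync() can skip the inode update.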
> Sure, we can do that now. What do we do when these are the default file > systems for Linux? We can tell them to create other types of file What is a 'default file system'? I know that until now, everybody has been using ext2. But that's only because there hasn't been anything comparable. Now we see ReiserFS, and my SuSE installation offers the choice. In the future, I believe that people can choose from ext2, ReiserFS, xfs, ext3 and maybe more. > systems, but that is a pretty big hurdle. I wonder if it would be > easier to get reiser/xfs to make some modifications. No, I don't think it's a big hurdle. If you just want to play with PostgreSQL, you won't care. If you're serious, you'll repartition. -- Kaare Rasmussen --Linux, spil,-- Tlf: 3816 2582 Kaki Data tshirts, merchandize Fax: 3816 2501 Howitzvej 75 Åben 14.00-18.00 Web: www.suse.dk 2000 Frederiksberg Lørdag 11.00-17.00 Email: kar@webline.dk
> Before we get too involved in speculating, shouldn't we actually measure the > performance of 7.1 on XFS and Reiserfs? Since it's easy to disable fsync, > we can test whether that's the problem. I don't think that logging file > systems must intrinsically give bad performance on fsync since they only log > metadata changes. > > I don't have a machine with XFS installed and it will be at least a week > before I could get around to a build. Any volunteers? There have been multiple reports of poor PostgreSQL performance on Reiser and xfs. I don't have numbers, though. Frankly, I think we need xfs and reiser experts involved to figure out our options here. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
> > Sure, we can do that now. What do we do when these are the default file > > systems for Linux? We can tell them to create other types of file > > What is a 'default file system'? I know that until now, everybody has been using > ext2. But that's only because there hasn't been anything comparable. Now we > see ReiserFS, and my SuSE installation offers the choice. In the future, I > believe that people can choose from ext2, ReiserFS, xfs, ext3 and maybe more. But some day the default will be a log-based file system, and people will have to hunt around to create a non-log based one. > > systems, but that is a pretty big hurdle. I wonder if it would be > > easier to get reiser/xfs to make some modifications. > > No, I don't think it's a big hurdle. If you just want to play with > PostgreSQL, you won't care. If you're serious, you'll repartition. Yes, but we could get a reputation for slowness on these log-based file systems. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
> On Fri, May 04, 2001 at 08:02:17AM -0400, mlw wrote: > > The way I understand it is that ReiserFS does not attempt to separate files at > > the block level. Multiple files can live in the same disk block. This is cool > > if you have many small files, but the extra overhead for large files such as > > those used by a database, is a bit much. > > It should be at least as fast as other filesystems for large files. I suspect > that it would be faster in fact. The only catch is that the performance of > reiserfs sucks when it gets past 85% or so full. (ext2 has similar problems) That is pretty standard for most modern file systems. They need that free space to optimize. > > You can read about all this stuff at http://www.namesys.com/ > > > I really think a simple low down dirty file system is just what the doctor > > ordered for postgres. > > Traditional BSD FFS or Solaris UFS is probably the best bet for postgres. That is my opinion. BSD FFS seems to be general enough to give good performance for a large scale of application needs. It is not as fast as XFS for streaming large files (media), and it doesn't optimize small files below the 1k size (fragments), and it does require fsck on reboot. However, looking at all those for PostgreSQL, the costs of the new Linux file systems seems pretty high, especially considering our need for fsync(). What I am really concerned about is when xfs/reiser become the default file systems for Linux, and people complain about PostgreSQL performance. And if we require special file systems, we lose some of our ability to easily grow. Because of ext2's problems with crash recovery, who is going to want to put other data on that file system when they have xfs/reiser available. And boots are going to have to fsck that ext2 file system. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
mlw wrote: > Bruce Momjian wrote: > > > Just put a note in the installation docs that the place where the database > > > is initialised to should be on a non-Reiser, non-XFS mount... > > Sure, we can do that now. What do we do when these are the default file > > systems for Linux? We can tell them to create other types of file > > systems, but that is a pretty big hurdle. I wonder if it would be > > easier to get reiser/xfs to make some modifications. > > I have looked at Reiser, and I don't think it is a file system suited for very > large files, or applications such as postgres. The Linux crowd should lobby > against any such trend. It is ok for many moderately small files. ReiserFS > would be great for a cddb server, but poor for a database box. > > XFS is a real big file system project, I'd bet that there are file properties > or management tools to tell it to leave directories and files alone. They > should have addressed that years ago. > > One last mention.. > > Having better control over WHERE various files in a database are located can > make it easier to deal with these things. I think it's worth noting that Oracle has been petitioning the kernel developers for better raw device support: in other words, the ability to write directly to the hard disk, bypassing the filesystem altogether. If the db is going to assume the responsibility of disk write verification it seems reasonable to assume you might want to investigate the raw disk i/o options. Telling your installers that a major performance gain is attainable by doing so might be a start in the opposite direction.
I've monitored a lot of discussions and from what I can gather, postgresql does its own set of journaling operations. I don't think that it's necessary for writes to be double journalled anyway. Again, just my two cents worth...
"Ken Hirsch" <kenhirsch@myself.com> writes: > I don't have a machine with XFS installed and it will be at least a week > before I could get around to a build. Any volunteers? I think I could do that... any useful benchmarks to run? -- Trond Eivind Glomsrød Red Hat, Inc.
> Hi, > > On Fri, May 04, 2001 at 01:49:54PM -0400, Bruce Momjian wrote: > > > > > > Performance doing what? XFS has known performance problems doing > > > unlinks and truncates, but not synchronous IO. The user should be > > > using fdatasync() for databases, btw, not fsync(). > > > > This is hugely helpful. In PostgreSQL 7.1, we do use fdatasync() by > > default it is available on a platform. > > Good --- fdatasync is defined in SingleUnix, so it's probably safe to > probe for it and use it by default if it is there. > > The 2.2 Linux kernel does not have fdatasync implemented, but glibc > will fall back to fsync if that's all that the kernel supports. 2.4 > implements both with the required semantics. OK, that is something we found too, that fdatasync() was there on some platforms, but was really just an fsync(). I believe some HPUX platforms had that. OK, so they need a 2.4 kernel to properly test performance of Reiser/xfs with fdatasync(). -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
Michael Samuel wrote: > > > Remember, general purpose file systems must do for files what Postgres is > > already doing for records. You will always have extra work. I am seriously > > thinking of trying a FAT32 as pg_xlog. I wonder if it will improve performance, > > or if there is just something fundamentally stupid about FAT32 that will make > > it worse? > > Well, for starters, file permissions... > > Ext2 would kick arse over FAT32 for performance. OK, I'll bite. In a database environment where file creation is not such an issue, why would ext2 be faster? The FAT file system has, AFAIK, very little overhead for file writes. It simply writes the two FAT tables on file extension, and data. Depending on cluster size, there is probably even less happening there. I don't think that anyone is saying that FAT is the answer in a production environment, but maybe we can do a comparison of various file systems and see if any performance issues show up. I mentioned FAT only because I was thinking about how postgres would perform on a very simple file system, one which bypasses most of the normal stuff a "good" general purpose file system would do. While I was thinking this, it occurred to me that FAT was about the cheesiest simple file system one could find, short of a ram disk, and maybe we could use it to test the assumptions about performance impact of the file system on postgres. Just a thought. If you know of some reason why ext2 would perform better in the postgres environment, I would love to hear why, I'm very curious.
> I got some information from Stephen Tweedie on this - please keep him > "Cc:" as he's not on this list > > ************************************************************************ > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > > I was talking to a Linux user yesterday, and he said that performance > > using the xfs file system is pretty bad. He believes it has to do with > > the fact that fsync() on log-based file systems requires more writes. > > > Performance doing what? XFS has known performance problems doing > unlinks and truncates, but not synchronous IO. The user should be > using fdatasync() for databases, btw, not fsync(). This is hugely helpful. In PostgreSQL 7.1, we do use fdatasync() by default if it is available on a platform. > First, XFS, ext3 and reiserfs are *NOT* log-based filesystems. They > are journaling filesystems. They have a log, but they are not > log-based because they do not store data permanently in a log > structure. Berkeley LFS, Sprite and Spiralog are log-based > filesystems. Sorry, I get those mixed up. > > With a standard BSD/ext2 file system, WAL writes can stay on the same > > cylinder to perform fsync. Is that true of log-based file systems? > > Not true on ext2 or BSD. Write-aheads are _usually_ close to the > inode, but not always. For true log-based filesystems, writes are > always completely sequential, so the issue just goes away. For > journaling filesystems, depending on the setup there may be a seek to > the journal involved, but some journaling filesystems can use a > separate disk for the journal so no seek is required. > > > I know xfs and reiser are both log based. Do we need to be concerned > > about PostgreSQL performance on these file systems? I use BSD FFS with > > soft updates here, so it doesn't affect me. > > A database normally preallocates its data files and then performs most > of its writes using update-in-place. 
In such cases, fsync() is almost > always the wrong thing to be doing --- the data writes have changed > nothing in the inode except for the timestamps, and there's no need to > flush the timestamps to disk for every write. fdatasync() is > designed for this --- if the only inode change is timestamps, > fdatasync() will skip the seek to the inode and will only update the > data. If any significant inode fields have been changed, then a full > flush is done. We do pre-allocate our log file space in chunks to avoid inode/block index writes. > Using fdatasync, most filesystems will incur no seeks for data flush, > regardless of whether the filesystem is journaling or not. Thanks. That is a big help. I wonder if people reporting performance problems were using 7.0.3. We only added fdatasync() in 7.1. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
> > There have been multiple reports of poor PostgreSQL performance on > > Reiser and xfs. I don't have numbers, though. Frankly, I think we need > > xfs and reiser experts involved to figure out our options here. > > I've done some testing to see how Reiserfs performs > vs ext2, and also for various values of wal_sync_method while on a > reiserfs partition. The attached graph shows the results. The y axis is > transactions per second and the x axis is the transaction number. It was > clear that, at least for my specific app, ext2 was significantly faster. > > The hardware I tested on has an Athlon 1 GHz cpu and 512 MB ram. The > harddrive is a 2 year old IDE drive. I'm running Red Hat 7 with all the > latest updates, and a freshly compiled 2.4.2 kernel with the latest Reiserfs > patch, and of course PostgreSQL 7.1. The transactions were run in a loop, > 700 times per test, to insert sample data into 4 tables. I used a PHP script > running on the same machine to do the inserts. > > I'd be happy to provide more detail or try a different variation if anyone > is interested. This is hugely helpful. Yikes, look at those lines. It shows a few things. First, under Reiser, nosync, fsync, and fdatasync are pretty much the same. The big surprise here is that fsync doesn't seem to have any effect. Second surprise is that open fsync, which syncs on every write rather than at end of transaction, was slower. I believe this should be slower if multiple WAL writes are being made in one transaction. fdatasync would sync just at end of transaction, while each WAL write would be synced by open fsync. And the largest surprise is that ext2 is faster, but not because of fsync, and almost double so. Keep in mind that WAL writes are not the only writes happening. Though in 7.1 we don't flush the data blocks to disk, we do write to disk as the buffer cache fills up with dirty buffers. 
-- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
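The difference Bruce points at, syncing on every write versus syncing once at end of transaction, can be tried out with a small timing sketch like the one below. It is an assumption-laden toy, not a substitute for Joe's benchmark or for pgbench: the record size, writes per transaction, transaction count and the strategy names are hypothetical, and the file here grows on every write, so fdatasync() still has to flush the new file size (a preallocated file would be a fairer test of it).

```python
import os
import tempfile
import time

RECORD = b"x" * 512     # hypothetical log record
WRITES_PER_TXN = 4      # several log writes per transaction
TXNS = 50


def run(path, method):
    """Time TXNS transactions using one of four sync strategies."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC
    if method == "open_sync":
        flags |= os.O_SYNC          # every write() waits for the disk
    fd = os.open(path, flags, 0o600)
    start = time.perf_counter()
    try:
        for _ in range(TXNS):
            for _ in range(WRITES_PER_TXN):
                os.write(fd, RECORD)
            if method == "fsync":
                os.fsync(fd)        # one flush at end of transaction
            elif method == "fdatasync":
                getattr(os, "fdatasync", os.fsync)(fd)
            # "nosync" and "open_sync" do nothing extra at commit
    finally:
        os.close(fd)
    return time.perf_counter() - start


def compare():
    """Return elapsed seconds per strategy, keyed by strategy name."""
    results = {}
    with tempfile.TemporaryDirectory() as d:
        for method in ("nosync", "fsync", "fdatasync", "open_sync"):
            results[method] = run(os.path.join(d, method), method)
    return results
```

Dividing `TXNS` by each value returned from `compare()` gives transactions per second for that strategy. The numbers depend entirely on the file system and disk under the temporary directory, so it would need to be pointed at the partition being tested.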
Joe Conway <joe@conway-family.com> wrote: > > I've done some testing to see how Reiserfs performs > vs ext2, and also various for various values of wal_sync_method while on a > reiserfs partition. The attached graph shows the results. The y axis is > transactions per second and the x axis is the transaction number. It was > clear that, at least for my specific app, ext2 was significantly faster. This is great, thanks a lot! Among other things it tells us, it appears that fsync() is not the problem on Reiserfs. I don't know the details of Reiserfs, but I think a lot of work has gone into optimizing it for very small files, so you can use the file system as a simple database for strings, a la Windows registry. I don't remember hearing about optimizing for large files and large block reads and writes. XFS, on the other hand, is used for very large files on SGI systems. I think the XFS and Reiserfs folks will be happy to look at the performance problem, but it would be very helpful for them to have a prepackaged benchmark (or two or three) to use. We should set up an FTP area to share them. Joe, can you contribute yours? Does anybody else have anything? Already, Trond Eivind Glomsrød teg@redhat.com has volunteered to test on XFS. The easier we make it, the more help we'll get. Ken Hirsch
> I think the XFS and Reiserfs folks will be happy to look at the performance > problem, but it would be very helpful for them to have a prepackaged > benchmark (or two or three) to use. We should set up an FTP area to share > them. Joe, can you contribute yours? Does anybody else have anything? > I don't mind contributing the script and schema that I used, but one thing I failed to mention in my first post is that the first thing the script does is open connections to 256 databases (all on this same machine), and the transactions are relatively evenly dispersed among the 256 connections. The test was originally written to try out an idea to allow scalability by partitioning the data into seperate databases (which could eventually each live on its own server). If you are interested I can modify the test to use only one database and rerun the same tests this weekend. Joe
At 02:09 AM 5/4/01 -0500, Thomas Swan wrote: > I think it's worth noting that Oracle has been petitioning the > kernel developers for better raw device support: in other words, > the ability to write directly to the hard disk and bypassing the > filesystem all together. But there could be other reasons why Oracle would want to do raw stuff. 1) They have more things to sell - management modules/software. More training courses. Certified blahblahblah. More features in brochure. 2) It just helps make things more proprietary. Think lock in. All that for maybe 10% performance increase? I think it's more advantageous for Postgresql to keep the filesystem layer of abstraction, than to do away with it, and later reinvent certain parts of it along with new bugs. What would be useful is if one can specify where the tables, indexes, WAL and other files go. That feature would probably help improve performance far more. For example: you could then stick the WAL on a battery backed up RAM disk. How much total space does a WAL log need? A battery backed RAM disk might even be cheaper than Brand X RDBMS Proprietary Feature #5. Cheerio, Link.
Lincoln Yeoh wrote: > > At 02:09 AM 5/4/01 -0500, Thomas Swan wrote: > > I think it's worth noting that Oracle has been petitioning the > > kernel developers for better raw device support: in other words, > > the ability to write directly to the hard disk and bypassing the > > filesystem all together. > > But there could be other reasons why Oracle would want to do raw stuff. > > 1) They have more things to sell - management modules/software. More > training courses. Certified blahblahblah. More features in brochure. > 2) It just helps make things more proprietary. Think lock in. > > All that for maybe 10% performance increase? > > I think it's more advantageous for Postgresql to keep the filesystem layer > of abstraction, than to do away with it, and later reinvent certain parts > of it along with new bugs. I just did a test of putting pg_xlog on a FAT file system, and my first rough tests (pgbench) show an approximate 20% performance increase over ext2 with fsync enabled. -- I'm not offering myself as an example; every life evolves by its own laws. ------------------------ http://www.mohawksoft.com
Bruce Momjian <pgman@candle.pha.pa.us> wrote: >> > Yes, this double-writing is a problem. Suppose you have your WAL on a >> > separate drive. You can fsync() WAL with zero head movement. With a >> > log based file system, you need two head movements, so you have gone >> > from zero movements to two. >> >> It may be worse depending on how the filesystem actually does >> journalling. I wonder if an fsync() may cause ALL pending >> meta-data to be updated (even metadata not related to the >> postgresql files). >> >> Do you know if reiser or xfs have this problem? > I don't know, but the Linux user reported xfs was really slow. i think this should be tested in more detail: i once tried this lightly (running pgbench against postgresql 7.1beta4) with different filesystems: ext2, reiserfs and XFS, and i reproducibly got about 15% better results running on XFS ... ok - it's not a very big test, but i think it might be worth really doing an a/b test before taking it as a fact that postgresql is slow on XFS (and maybe reiserfs too ... but reiserfs has had performance problems in certain situations anyway) XFS is a journaling fs, but it does all its work in a very clever way (delayed allocation etc.) - so under normal conditions you should usually get decent performance out of it - otherwise it might be worth sending a mail to the XFS mailing list (reiserfs maybe ditto) t -- thomas graichen <tgr@spoiled.org> ... perfection is reached, not when there is no longer anything to add, but when there is no longer anything to take away. --- antoine de saint-exupery
At 01:16 PM 5/5/01 -0400, mlw wrote: >Lincoln Yeoh wrote: >> >> All that for maybe 10% performance increase? >> >> I think it's more advantageous for Postgresql to keep the filesystem layer >> of abstraction, than to do away with it, and later reinvent certain parts >> of it along with new bugs. > >I just did a test of putting pg_xlog on a FAT file system, and my first rough >tests (pgbench) show an approximate 20% performance increase over ext2 with >fsync enabled. OK. I slouch corrected :). It's more than 10%. However in the same message I did also say: >What would be useful is if one can specify where the tables, indexes, WAL >and other files go. That feature would probably help improve performance >far more. > >For example: you could then stick the WAL on a battery backed up RAM disk. >How much total space does a WAL log need? > >A battery backed RAM disk might even be cheaper than Brand X RDBMS >Proprietary Feature #5. And your experiments do help show that it is useful to be able to specify where things go, that putting just the WAL somewhere else makes things 20% faster. So you don't have to put everything on a pgfs. Just the WAL on some other FS (even FAT32, ick ;) ). --- OK we can do that with symlinks, but is there a PGSQL Recommended or Standard way to do it, so as to reduce administrative errors, and at least help improve consistency with multiadmin pgsql installations? The WAL and DBs are in separate directories, so this makes things easy. But the object names are now all numbers so that makes things a bit harder - and what to do with temp tables? Would it be good to have tables in one directory and indexes in another? Or most people optimize on a specific table/index basis? Where does PGSQL do the on-disk sorts? How about naming the DB objects <object ID>.<object name>? e.g 121575.testtable 125575.testtableindex (or the other way round - name.OID - harder for DB, easier for admin?) They'll still be unique, but now they're admin readable. Slower? e.g. 
at that code point, pgsql no longer knows the object's name, and wants to refer to everything by just numbers? I apologize if there was already a long discussion on this. I seem to recall Bruce saying that the developers agonized over this. Cheerio, Link.
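The symlink approach Lincoln mentions can be sketched as below. This is a minimal illustration, assuming the database server is stopped while the directory is moved; the helper name `relocate` and the directory names in `demo` are stand-ins, not a recommended or standard PostgreSQL procedure (the thread itself notes there is none yet).

```python
import os
import shutil
import tempfile


def relocate(data_dir, subdir, target_root):
    """Move data_dir/subdir under target_root and leave a symlink behind.

    Assumes nothing has files open under data_dir/subdir while this runs.
    """
    src = os.path.join(data_dir, subdir)
    dst = os.path.join(target_root, subdir)
    if os.path.islink(src):
        raise RuntimeError(f"{src} is already a symlink")
    shutil.move(src, dst)
    os.symlink(dst, src)
    return dst


def demo():
    """Relocate a fake pg_xlog and read a file through the old path."""
    with tempfile.TemporaryDirectory() as root:
        data_dir = os.path.join(root, "pgdata")    # stand-in for the data directory
        wal_disk = os.path.join(root, "waldisk")   # stand-in for another mount point
        os.makedirs(os.path.join(data_dir, "pg_xlog"))
        os.makedirs(wal_disk)
        segment = os.path.join(data_dir, "pg_xlog", "segment")
        with open(segment, "wb") as f:
            f.write(b"wal")
        relocate(data_dir, "pg_xlog", wal_disk)
        with open(segment, "rb") as f:             # old path resolves via the symlink
            via_link = f.read()
        return os.path.islink(os.path.join(data_dir, "pg_xlog")), via_link
```

The old path keeps working after the move, which is why the server does not need to know the directory lives elsewhere. The administrative risk Lincoln raises remains: nothing records why the link exists or checks that its target is mounted.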
Lincoln Yeoh wrote: > > At 01:16 PM 5/5/01 -0400, mlw wrote: > >Lincoln Yeoh wrote: > >> > >> All that for maybe 10% performance increase? > >> > >> I think it's more advantageous for Postgresql to keep the filesystem layer > >> of abstraction, than to do away with it, and later reinvent certain parts > >> of it along with new bugs. > > > >I just did a test of putting pg_xlog on a FAT file system, and my first rough > >tests (pgbench) show an approximate 20% performance increase over ext2 with > >fsync enabled. > > OK. I slouch corrected :). It's more than 10%. > > However in the same message I did also say: > >What would be useful is if one can specify where the tables, indexes, WAL > >and other files go. That feature would probably help improve performance > >far more. > > > >For example: you could then stick the WAL on a battery backed up RAM disk. > >How much total space does a WAL log need? > > > >A battery backed RAM disk might even be cheaper than Brand X RDBMS > >Proprietary Feature #5. > > And your experiments do help show that it is useful to be able to specify > where things go, that putting just the WAL somewhere else makes things 20% > faster. So you don't have to put everything on a pgfs. Just the WAL on some > other FS (even FAT32, ick ;) ). So you propose pgwalfs ? ;) It may be much easier to implement than a full fs. How hard would it be to let wal reside on a (raw) device ? If we already pre-allocate a required number of fixed-size files would it be too hard to replace them with plain (raw) devices and test for possible performance gains ? > > How about naming the DB objects <object ID>.<object name>? > e.g > > 121575.testtable > 125575.testtableindex > This sure seems to be an elegant solution for the problem that seems to be impossible to solve with symlinks and such. 
Even the IMHO hardest to solve problem - RENAME - can probably be done in a transaction-safe manner by doing a link(oid.<newname>) in the beginning and selective unlink(oid.<newname/oldname>) at commit time. -------------------- Hannu
Hannu Krosing wrote: > > Lincoln Yeoh wrote: > > > > At 01:16 PM 5/5/01 -0400, mlw wrote: > > >Lincoln Yeoh wrote: > > >> > > >> All that for maybe 10% performance increase? > > >> > > >> I think it's more advantageous for Postgresql to keep the filesystem layer > > >> of abstraction, than to do away with it, and later reinvent certain parts > > >> of it along with new bugs. > > > > > >I just did a test of putting pg_xlog on a FAT file system, and my first rough > > >tests (pgbench) show an approximate 20% performance increase over ext2 with > > >fsync enabled. > > > > OK. I slouch corrected :). It's more than 10%. > > > > However in the same message I did also say: > > >What would be useful is if one can specify where the tables, indexes, WAL > > >and other files go. That feature would probably help improve performance > > >far more. > > > > > >For example: you could then stick the WAL on a battery backed up RAM disk. > > >How much total space does a WAL log need? > > > > > >A battery backed RAM disk might even be cheaper than Brand X RDBMS > > >Proprietary Feature #5. > > > > And your experiments do help show that it is useful to be able to specify > > where things go, that putting just the WAL somewhere else makes things 20% > > faster. So you don't have to put everything on a pgfs. Just the WAL on some > > other FS (even FAT32, ick ;) ). > > So you propose pgwalfs ? ;) I don't know about a "pgwalfs" too much work. I have had some time to grapple with my feelings about FAT, and you know what? I don't hate the idea. I would, of course, like to look through the driver code and see if there are any technical reasons why it should be excluded. FAT is almost perfect for WAL, and if I can figure out how to get the "base" directory to get the same performance, I'd think about putting it there as well. The ReiserFS issues touched on some vague suspicions I had about fsync. Maybe I'm over reacting, but there are reasons why the oracles manage their own table spaces. 
Back to FAT. FAT is probably the most simple file system I can think of. As long as it writes to disk when it gets synched, and doesn't lose things, it's perfect. Postgres handles most of the coherency issues itself, there is no real problem with permissions because it will be owned by the postgres super user, etc. I would never suggest FAT as a general purpose file system, but, geez, as a special purpose single user (postgres) file system it seems an ideal answer to what will be an increasingly hard problem of advanced file systems. Aside from a general, and well deserved, disdain for FAT, what are the technical "cons" of such a proposal? If we can get the Linux kernel (and other unices) to accept IOCTLs to direct space allocation, and/or write up a white paper on how to use this for postgres, why wouldn't it be a reasonable strategy? -- I'm not offering myself as an example; every life evolves by its own laws. ------------------------ http://www.mohawksoft.com
>Lincoln Yeoh wrote: >> >> >Lincoln Yeoh wrote: >> >For example: you could then stick the WAL on a battery backed up RAM disk. >> >How much total space does a WAL log need? >> > >> >A battery backed RAM disk might even be cheaper than Brand X RDBMS >> >Proprietary Feature #5. >> >> And your experiments do help show that it is useful to be able to specify >> where things go, that putting just the WAL somewhere else makes things 20% >> faster. So you don't have to put everything on a pgfs. Just the WAL on some >> other FS (even FAT32, ick ;) ). At 02:04 PM 5/6/01 +0200, Hannu Krosing wrote: >So you propose pgwalfs ? ;) Nah. I'm proposing the opposite in fact. I'm saying so far there appears to be no real need to come up with a special filesystem. Stick to using existing/future filesystems. Just make it easy and safe enough for DBAs to put the objects on whatever filesystem they choose. So long as the O/S kernel/driver people support the hardware or filesystem, postgresql will take advantage of it with little if any extra work. In fact as mlw's experiments show, you can put the WAL on FAT (FAT16?) for a 20% performance increase. How much better would a raw device be? Would it really be worth all that hassle? For instance if you need to resize the FAT partition, you could probably use fips, Partition Magic or some other cost effective solution - no need for pgsql developers or anybody to reinvent anything. My proposed but untested idea is that you could get a significant performance increase by putting the WAL on popular filesystems running on battery backed RAM drives (or other special hardware). 128MB RAM should be enough for small setups? Don't know how much these things cost, but I believe that when you need the speed, they'll be more worthwhile than a special proprietary filesystem. Ok, just found: http://www.expressdata.com.au/Products/ProductsList.asp?SUPPLIER_NAME=PLATYPUS+TECHNOLOGY&SUBCATEGORY_NAME=QikDrive2#PRODUCTTITLE AUD$1,624.70 = USD843.06. 
Not cheap but not way out of reach. Haven't found other competing products yet. Must be somewhere. Cheerio, Link.
Lincoln Yeoh <lyeoh@pop.jaring.my> writes:
> OK we can do that with symlinks, but is there a PGSQL Recommended or
> Standard way to do it, so as to reduce administrative errors, and at least
> help improve consistency with multiadmin pgsql installations?

Not yet. There should be support for this. See doc/TODO.detail/tablespaces.

regards, tom lane
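The symlink approach Lincoln mentions can be sketched roughly as below. This is only an illustration, not a PostgreSQL-recommended procedure: the helper name relocate_with_symlink and the temporary directories are made up for the example, pg_xlog is the WAL directory name in 7.1, and with a real data directory the postmaster would have to be stopped before anything is moved.

```python
import os
import shutil
import tempfile


def relocate_with_symlink(src_dir, dest_parent):
    """Move src_dir under dest_parent and leave a symlink at the old path."""
    dest = os.path.join(dest_parent, os.path.basename(src_dir))
    shutil.move(src_dir, dest)   # also works across filesystems (copy + delete)
    os.symlink(dest, src_dir)    # the old path now resolves to the new location
    return dest


if __name__ == "__main__":
    # Stand-ins for a data directory and a separate fast volume (hypothetical).
    data_dir = tempfile.mkdtemp(prefix="pgdata_")
    fast_vol = tempfile.mkdtemp(prefix="fastvol_")
    wal_dir = os.path.join(data_dir, "pg_xlog")
    os.mkdir(wal_dir)
    new_home = relocate_with_symlink(wal_dir, fast_vol)
    print(wal_dir, "->", new_home)
```

Doing this by hand on every installation is exactly the kind of administrative risk the question is about, which is why built-in support is the open item.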
Hannu Krosing <hannu@tm.ee> writes:
> Even the IMHO hardest to solve problem
> - RENAME - can
> probably be done in a transaction-safe manner by doing a
> link(oid.<newname>) in the
> beginning and selective unlink(oid.<newname/oldname>) at commit time.

Nope. Consider

	begin;
	rename a to b;
	rename b to a;
	end;

And don't tell me you'll solve this by ignoring failures from link(). That's a recipe for losing your data...

I would ask people who think they have a solution to please go back and reread the very long discussions we have had on this point in the past. Nobody particularly likes numeric filenames, but there really isn't any other workable answer.

regards, tom lane
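To make the failing case concrete, here is a small simulation of the proposed scheme (link the new name right away, defer every unlink to commit). It is a sketch under assumptions: the function name, the oid value and the file layout are invented for the example and have nothing to do with how the backend actually stores relations.

```python
import os
import tempfile


def link_based_renames(workdir, oid, start_name, new_names):
    """Simulate renames done by link()-ing the new name immediately.

    No unlink happens here, mimicking a transaction that has not committed yet.
    Returns the list of names for which link() failed because the file exists.
    """
    current = os.path.join(workdir, f"{oid}.{start_name}")
    with open(current, "w") as f:
        f.write("table data\n")

    failures = []
    for name in new_names:
        target = os.path.join(workdir, f"{oid}.{name}")
        try:
            os.link(current, target)      # hard link: oid.<newname>
        except FileExistsError:           # EEXIST: the old name is still there
            failures.append(name)
        current = target
    return failures


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        # begin; rename a to b; rename b to a; end;
        print(link_based_renames(d, 16384, "a", ["b", "a"]))
```

The second rename fails because oid.a has not been unlinked yet, so the scheme has to decide what a failed link() means, which is the point of contention.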
At 12:03 PM 5/6/01 -0400, Tom Lane wrote:
>Hannu Krosing <hannu@tm.ee> writes:
>> Even the IMHO hardest to solve problem
>> - RENAME - can
>> probably be done in a transaction-safe manner by doing a
>> link(oid.<newname>) in the
>> beginning and selective unlink(oid.<newname/oldname>) at commit time.
>
>Nope. Consider
>
>	begin;
>	rename a to b;
>	rename b to a;
>	end;
>
>And don't tell me you'll solve this by ignoring failures from link().
>That's a recipe for losing your data...
>
>I would ask people who think they have a solution to please go back and
>reread the very long discussions we have had on this point in the past.
>Nobody particularly likes numeric filenames, but there really isn't any
>other workable answer.

OK. Found one of the discussions at:
http://postgresql.readysetnet.com/mhonarc/pgsql-hackers/2000-03/threads.html#00088

Conclusion: calling stuff oid.relname doesn't really work. Sorry to have brought it up again.

Another idea that's probably more messy than it's worth:

Main object still called <oid> with a symlink called <oid.originalrelname>. DB really just uses <oid>.
Rename = adds symlink called <oid.newrelname>, doesn't remove symlinks (symlinks more for show!).
Committed drop table does what 7.1 does with the main oid entry.
Vacuum cleans up the symlinks, leaving just a single valid one, or zaps all if the table has been dropped.

For Windows, create empty files named oid.relname instead of symlinks. Windows will definitely like .verylongrelname extensions ;).

Kinda messy and kludgy. Throw in the performance reduction and Ick! I probably have to think harder :), maybe there's just no good way :(.

Ah well,
Link.
Re: TABLE RENAME/NUMERIC FILENAMES (Was: New Linux xfs/reiser file systems)
From: Hannu Krosing
Tom Lane wrote:
>
> Hannu Krosing <hannu@tm.ee> writes:
> > Even the IMHO hardest to solve problem
> > - RENAME - can
> > probably be done in a transaction-safe manner by doing a
> > link(oid.<newname>) in the
> > beginning and selective unlink(oid.<newname/oldname>) at commit time.
>
> Nope. Consider
>
>	begin;
>	rename a to b;
>	rename b to a;
>	end;
>
> And don't tell me you'll solve this by ignoring failures from link().
> That's a recipe for losing your data...

I guess link() failures can be safely ignored _as long as_ we check that we have the right link after doing it. I can't see how it will lose data.

> I would ask people who think they have a solution to please go back and
> reread the very long discussions we have had on this point in the past.

I think I have now (no way to guarantee I have read _everything_ about it, but I did hit about 10 messages on the oid_relname naming scheme).

The most serious objection seemed to be that we need to remember the postgres tablename, while it would be much easier to use only oids.

I guess we could hit some system limits here (running out of directory entries or reaching the maximum number of links to a file), but at least on linux I was able to make >10000 links to one file with no problems.

Now that I think of it, I have one concern - it would require extra work to use tablenames like "/etc/passwd" or others that use characters which are reserved in filenames but are ok to use in 7.1.

    hannu=# create table "/etc/passwd"(
    hannu(# login text,
    hannu(# uid int,
    hannu(# gid int
    hannu(# );
    CREATE
    hannu=# \dt
           List of relations
        Name     | Type  | Owner
    -------------+-------+-------
     /etc/passwd | table | hannu

So if people start using names like these it will not be easy to go back ;)

> Nobody particularly likes numeric filenames, but there really isn't any
> other workable answer.

At least we could put links on system relations, so it would be easier to find them. I guess one is not supposed to rename/drop system tables?

---------------------
Hannu
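The "extra work" for names such as "/etc/passwd" would amount to some reversible escaping of the relation name before it is used as part of a filename. One possible way to do that, purely as an illustration (the helper names and the choice of percent-encoding are assumptions for this sketch, not anything PostgreSQL does):

```python
from urllib.parse import quote, unquote


def relname_to_filename(oid, relname):
    """Build '<oid>.<escaped relname>' with every reserved character escaped."""
    # safe="" makes quote() escape "/" as well, so the result is a single
    # path component rather than a path.
    return f"{oid}.{quote(relname, safe='')}"


def filename_to_relname(filename):
    """Recover the original relation name from '<oid>.<escaped relname>'."""
    _oid, escaped = filename.split(".", 1)
    return unquote(escaped)


if __name__ == "__main__":
    name = relname_to_filename(16384, "/etc/passwd")
    print(name, "->", filename_to_relname(name))
```

Even with escaping, filename length limits would still apply to long relation names, so this only addresses the reserved-character part of the concern.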
> > Before we get too involved in speculating, shouldn't we actually measure the
> > performance of 7.1 on XFS and Reiserfs? Since it's easy to disable fsync,
> > we can test whether that's the problem. I don't think that logging file
> > systems must intrinsically give bad performance on fsync since they only log
> > metadata changes.
> >
> > I don't have a machine with XFS installed and it will be at least a week
> > before I could get around to a build. Any volunteers?
>
> There have been multiple reports of poor PostgreSQL performance on
> Reiser and xfs. I don't have numbers, though. Frankly, I think we need
> xfs and reiser experts involved to figure out our options here.

I've done some testing to see how Reiserfs performs vs ext2, and also for various values of wal_sync_method while on a reiserfs partition. The attached graph shows the results. The y axis is transactions per second and the x axis is the transaction number. It was clear that, at least for my specific app, ext2 was significantly faster.

The hardware I tested on has an Athlon 1 GHz cpu and 512 MB ram. The hard drive is a 2 year old IDE drive. I'm running Red Hat 7 with all the latest updates, and a freshly compiled 2.4.2 kernel with the latest Reiserfs patch, and of course PostgreSQL 7.1. The transactions were run in a loop, 700 times per test, to insert sample data into 4 tables. I used a PHP script running on the same machine to do the inserts.

I'd be happy to provide more detail or try a different variation if anyone is interested.

- Joe
Hi,

On Fri, May 04, 2001 at 01:49:54PM -0400, Bruce Momjian wrote:
> >
> > Performance doing what? XFS has known performance problems doing
> > unlinks and truncates, but not synchronous IO. The user should be
> > using fdatasync() for databases, btw, not fsync().
>
> This is hugely helpful. In PostgreSQL 7.1, we do use fdatasync() by
> default if it is available on a platform.

Good --- fdatasync is defined in the Single UNIX Specification, so it's probably safe to probe for it and use it by default if it is there.

The 2.2 Linux kernel does not have fdatasync implemented, but glibc will fall back to fsync if that's all that the kernel supports. 2.4 implements both with the required semantics.

--Stephen
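The "probe for it and use it if it is there" idea looks roughly like this at application level. It is a sketch only: sync_data is a made-up helper, and this is not how PostgreSQL or glibc implement the fallback; it just shows the pattern with Python's os module, where os.fdatasync is not present on every platform.

```python
import os
import tempfile


def sync_data(fd):
    """Flush file data to disk, preferring fdatasync() over fsync().

    fdatasync() may skip flushing metadata (such as mtime) that is not
    needed to read the data back, which can save a write per call.
    Returns the name of the call that was used.
    """
    if hasattr(os, "fdatasync"):
        os.fdatasync(fd)
        return "fdatasync"
    os.fsync(fd)
    return "fsync"


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile() as f:
        f.write(b"commit record\n")
        f.flush()                      # push Python's buffer to the kernel first
        print("synced with", sync_data(f.fileno()))
```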
teg@redhat.com (Trond Eivind Glomsrød) writes:
> "Ken Hirsch" <kenhirsch@myself.com> writes:
>
> > I don't have a machine with XFS installed and it will be at least a week
> > before I could get around to a build. Any volunteers?
>
> I think I could do that... any useful benchmarks to run?

Lacking bigger benchmarks, I tried postgresql 7.1 on a Red Hat Linux 7.1 system with the SGI XFS modifications. The differences were very small.

--
Trond Eivind Glomsrød
Red Hat, Inc.
> teg@redhat.com (Trond Eivind Glomsrød) writes:
>
> > "Ken Hirsch" <kenhirsch@myself.com> writes:
> >
> > > I don't have a machine with XFS installed and it will be at least a week
> > > before I could get around to a build. Any volunteers?
> >
> > I think I could do that... any useful benchmarks to run?
>
> In lack of bigger benchmarks, I tried postgresql 7.1 on a Red Hat
> Linux 7.1 system with the SGI XFS modifications. The differences were
> very small.

Thanks. That is very helpful. Seems XFS is fine. According to Joe Conway, reiser has some problems.

--
Bruce Momjian
> I don't mind contributing the script and schema that I used, but one thing I
> failed to mention in my first post is that the first thing the script does
> is open connections to 256 databases (all on this same machine), and the
> transactions are relatively evenly dispersed among the 256 connections. The
> test was originally written to try out an idea to allow scalability by
> partitioning the data into separate databases (which could eventually each
> live on its own server). If you are interested I can modify the test to use
> only one database and rerun the same tests this weekend.

I modified my test script to use just one (instead of 256) database to be more representative of a common installation. Then I ran more tests under both ext2 and reiserfs. The summary follows. Short answer is that the differences are much smaller than under the first test, but ext2 is still faster.

-- Joe

case              rfs_fdatasync  ext_fdatasync  rfs_fdatasync  ext_fdatasync  rfs_fdatasync  ext_fdatasync
fstab             sync,noatime   sync,noatime   noatime        noatime        defaults       defaults
starting # tup    70k            70k            70k            70k            70k            70k
total time (min)  12.10          11.77          11.83          11.43          11.88          11.42
cpu util %        90-94%         95-98%         90-95%         95-99%         90-95%         95-99%
ram - stable cpu  42M            42M            42M            42M            42M            42M
ram - final       52M            52M            52M            52M            52M            52M
avg trans/sec
  10000 tup       13.77          14.16          14.08          14.58          14.03          14.60
  5000 tup        13.70          14.08          13.97          14.71          13.93          14.75
  1000 tup        11.36          11.63          11.63          13.33          11.63          13.51

Notes:
1. rfs_fdatasync: data and wal on reiserfs with wal_sync_method = fdatasync
2. ext_fdatasync: data and wal on ext2 with wal_sync_method = fdatasync
3. starting # tup: the database was pre-seeded with 70k tuples. I made a tarball of the starting database and refreshed the pgsql/data file structure before each test to ensure a good comparison.
4. cpu utilization + ram - stable cpu + ram - final: I eyeballed top while the test was running. In general cpu % increased steadily through the first 1500 or so transactions, along with ram usage. At the point when cpu utilization stabilized, ram was pretty consistently at 42M. From there, cpu util % varied in the ranges noted, while ram usage slowly increased to 52M. It seemed pretty linear in that I could estimate the number of transactions already processed based on ram usage.
5. avg trans/sec: These represent the total transactions/total elapsed time at the given number of transactions (as opposed to some instantaneous value at that point in time).
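For a rough sense of scale, the cumulative averages in the table can be turned into relative differences. The snippet below is only arithmetic on the posted 10000-tuple figures; the function names are invented here, and treating each run as 10000 transactions in total is an assumption (it happens to reproduce the "total time" row).

```python
# avg trans/sec at 10000 tup, copied from the summary table above
RESULTS = {
    "sync,noatime": {"rfs": 13.77, "ext": 14.16},
    "noatime":      {"rfs": 14.08, "ext": 14.58},
    "defaults":     {"rfs": 14.03, "ext": 14.60},
}


def ext2_advantage_pct(rfs_tps, ext_tps):
    """How much faster ext2 is than reiserfs, in percent of the reiserfs rate."""
    return (ext_tps - rfs_tps) / rfs_tps * 100.0


def total_minutes(transactions, avg_tps):
    """Elapsed time implied by a cumulative average rate."""
    return transactions / avg_tps / 60.0


if __name__ == "__main__":
    for fstab, tps in RESULTS.items():
        pct = ext2_advantage_pct(tps["rfs"], tps["ext"])
        mins = total_minutes(10000, tps["rfs"])
        print(f"{fstab:13s} ext2 ahead by {pct:.1f}%, reiserfs run ~{mins:.2f} min")
```

On these numbers ext2 comes out a few percent ahead in every mount configuration, which matches the "much smaller, but still faster" summary.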
teg@redhat.com (Trond Eivind Glomsrød) writes:
> teg@redhat.com (Trond Eivind Glomsrød) writes:
>
> > "Ken Hirsch" <kenhirsch@myself.com> writes:
> >
> > > I don't have a machine with XFS installed and it will be at least a week
> > > before I could get around to a build. Any volunteers?
> >
> > I think I could do that... any useful benchmarks to run?
>
> In lack of bigger benchmarks, I tried postgresql 7.1 on a Red Hat
> Linux 7.1 system with the SGI XFS modifications. The differences were
> very small.

And here is the one for ReiserFS - same kernel, but recompiled to turn off debugging.

When compared to the earlier ones (including XFS), you'll note that ReiserFS performance is rather poor in some of the tests - it takes 37 vs. 13 seconds for 8192 inserts, when the inserts are different transactions.

--
Trond Eivind Glomsrød
Red Hat, Inc.
> When compared to the earlier ones (including XFS), you'll note that ReiserFS
> performance is rather poor in some of the tests - it takes 37 vs. 13
> seconds for 8192 inserts, when the inserts are different transactions.

That is all the fsync delay, probably, and it should be using fdatasync() on that kernel.

--
Bruce Momjian
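Since each insert is its own transaction, the totals translate directly into a per-commit cost. A quick back-of-the-envelope calculation on the quoted figures (the function name is just for this example):

```python
def per_txn_ms(total_seconds, transactions):
    """Average wall-clock cost of one transaction, in milliseconds."""
    return total_seconds * 1000.0 / transactions


if __name__ == "__main__":
    slow = per_txn_ms(37, 8192)   # ReiserFS figure from the benchmark
    fast = per_txn_ms(13, 8192)   # figure it is compared against
    print(f"{slow:.2f} ms vs {fast:.2f} ms per commit, "
          f"{slow - fast:.2f} ms extra each")
```

That works out to roughly 4.5 ms versus 1.6 ms per commit, so the gap is a few milliseconds per transaction, which is at least the right order of magnitude for extra disk activity on each sync.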
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > When compared to the earlier ones (including XFS), you'll note that ReiserFS
> > performance is rather poor in some of the tests - it takes 37 vs. 13
> > seconds for 8192 inserts, when the inserts are different transactions.
>
> That is all the fsync delay, probably, and it should be using fdatasync()
> on that kernel.

And it does seem to work that way with XFS...

--
Trond Eivind Glomsrød
Red Hat, Inc.
Quoting Trond Eivind Glomsrød <teg@redhat.com>:

> Bruce Momjian <pgman@candle.pha.pa.us> writes:
>
> > > When compared to the earlier ones (including XFS), you'll note that ReiserFS
> > > performance is rather poor in some of the tests - it takes 37 vs. 13
> > > seconds for 8192 inserts, when the inserts are different transactions.
> >
> > That is all the fsync delay, probably, and it should be using fdatasync()
> > on that kernel.
>
> And it does seem to work that way with XFS...

I'm concerned about this because we are going to switch our first server to a journaling FS (on Linux). Searching and asking, I found out that for our short term work we need ReiserFS (it's for a proxy server). But the interesting thing was that for large (very large) files, everybody recommends XFS. The drawback of XFS is that it's very, very sloooow when deleting files.

Regards... :-)

--
The best operating system is the one that feeds you. Watch your diet.
-----------------------------------------------------------------
Martin Marques                  |        mmarques@unl.edu.ar
Programador, Administrador      |       Centro de Telematica
                       Universidad Nacional del Litoral
-----------------------------------------------------------------
> I'm concerned about this because we are going to switch our
> first server to a Journaling FS (on Linux). Searching and asking
> I found out that for our short term work we need ReiserFS (it's
> for a proxy server). But the interesting thing was that for
> large (very large) files, everybody recommends XFS. The drawback
> of XFS is that it's very, very sloooow when deleting files.

Why do all these file systems seem to have one major negative?

--
Bruce Momjian
Quoting Bruce Momjian <pgman@candle.pha.pa.us>:

> > I'm concerned about this because we are going to switch our
> > first server to a Journaling FS (on Linux). Searching and asking
> > I found out that for our short term work we need ReiserFS (it's
> > for a proxy server). But the interesting thing was that for
> > large (very large) files, everybody recommends XFS. The drawback
> > of XFS is that it's very, very sloooow when deleting files.
>
> Why do all these file systems seem to have one major negative?

In the case of XFS they told me that it was slow at deleting, but I guess what they were trying to tell me was that reiser would do the job on a proxy cache better than XFS. Everybody gave their thumbs-up to XFS when talking about databases (because of the large file size).

Regards... :-)

--
The best operating system is the one that feeds you. Watch your diet.
-----------------------------------------------------------------
Martin Marques                  |        mmarques@unl.edu.ar
Programador, Administrador      |       Centro de Telematica
                       Universidad Nacional del Litoral
-----------------------------------------------------------------
Makes it more fun :) Kinda like a lottery ticket:

- reliable (cherry)
- fast (cherry)
- resource hog (lemon)

--
Rod Taylor
BarChord Entertainment Inc.

----- Original Message -----
From: "Bruce Momjian" <pgman@candle.pha.pa.us>
To: "Martín Marqués" <martin@bugs.unl.edu.ar>
Cc: "Trond Eivind Glomsrød" <teg@redhat.com>; <pgsql-hackers@postgresql.org>
Sent: Wednesday, May 09, 2001 1:24 PM
Subject: Re: [HACKERS] Re: New Linux xfs/reiser file systems

> > I'm concerned about this because we are going to switch our
> > first server to a Journaling FS (on Linux). Searching and asking
> > I found out that for our short term work we need ReiserFS (it's
> > for a proxy server). But the interesting thing was that for
> > large (very large) files, everybody recommends XFS. The drawback
> > of XFS is that it's very, very sloooow when deleting files.
>
> Why do all these file systems seem to have one major negative?
Hello !

I am forwarding the following from lkml. It seems that the only case where XFS is slow is 'rm -rf linux' [which can be considered a good sign for linux]. For all other operations XFS is the winner.

YAS

<MessageFromLKML>
From: Ricardo Galli (gallir@uib.es)
Date: Wed May 09 2001 - 20:45:46 EDT

> It would be great to see a table of ReiserFS/XFS/Ext2+index performance
> results. Well, to make it really fair it should be Ext3+index so I'd
> better add 'backport the patch to 2.2' or 'bug Stephen and friends to
> hurry up' to my to-do list.

You can find a simple benchmark (an average of three samples) among reiser, ext2, xfs and fat32 under Linux:

http://bulma.lug.net/body.phtml?nIdNoticia=626

Although it is in Spanish, the tables are easy to understand. The benchmark was carried out by Guillem Cantallops, a student at the University of the Balearic Islands and member of the local LUG...

BASIC WORDS ;-)

Escritura: Writing
Lectura: Reading
Borrado: Deletion
Copia: Copy
Extracción: Extraction

Regards,

--ricardo
http://m3d.uib.es/~gallir/
</MessageFromLKML>

Bruce Momjian wrote:

>>I'm concerned about this because we are going to switch our
>>first server to a Journaling FS (on Linux). Searching and asking
>>I found out that for our short term work we need ReiserFS (it's
>>for a proxy server). But the interesting thing was that for
>>large (very large) files, everybody recommends XFS. The drawback
>>of XFS is that it's very, very sloooow when deleting files.
>
> Why do all these file systems seem to have one major negative?