Thread: Filesystem
Hi all,

I have just a little question: which filesystem is preferred for PostgreSQL? I plan to use XFS (before that I used ReiserFS). The reason is the xfs_freeze tool for making filesystem snapshots.

Is the performance better than ReiserFS, and is it reliable?

best regards,
Martin
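P.S.: The snapshot sequence I have in mind is roughly the following - just a sketch; the volume group, logical volume and mount point names (vg0, pgdata, /var/lib/pgsql) are placeholders:

    # quiesce the filesystem so the snapshot is consistent
    xfs_freeze -f /var/lib/pgsql

    # take an LVM snapshot of the logical volume holding the data
    lvcreate --snapshot --size 1G --name pgsnap /dev/vg0/pgdata

    # thaw the filesystem again as quickly as possible
    xfs_freeze -u /var/lib/pgsql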
Martin Fandel wrote:
> Hi all,
>
> I have just a little question: which filesystem is preferred for
> PostgreSQL? I plan to use XFS (before that I used ReiserFS). The
> reason is the xfs_freeze tool for making filesystem snapshots.
>
> Is the performance better than ReiserFS, and is it reliable?

I used PostgreSQL with XFS on Mandrake 9.0/9.1 a while ago - reliability was great, and performance seemed better than ext3. I didn't compare with ReiserFS; the only time I have ever lost data from a Linux box was when I used ReiserFS, hence I am not a fan :-(

best wishes

Mark
We have been using XFS for about 6 months now and it has even tolerated a controller card crash. So far we have mostly good things to report about XFS. I benchmarked raw throughput at various stripe sizes, and XFS came out on top for us against reiser and ext3. I also used it because of its supposedly good support for large files, which was verified somewhat by the benchmarks.
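For anyone who wants to run a similar raw-throughput check, the commands are along these lines - a sketch only, not the exact benchmark we used; the mount point and sizes are placeholders:

    # sequential write throughput (~2GB so the file outruns the cache);
    # conv=fdatasync makes dd include the final flush in its timing
    dd if=/dev/zero of=/mnt/xfs-test/bigfile bs=8k count=250000 conv=fdatasync

    # sequential read throughput; remount first to drop the cached file
    umount /mnt/xfs-test && mount /mnt/xfs-test
    dd if=/mnt/xfs-test/bigfile of=/dev/null bs=8k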
I have noticed a problem, though: if you have 800,000 files in a directory, it seems that XFS chokes on simple operations like 'ls' or 'chmod -R ...' where ext3 doesn't (I don't know about reiser). I went straight back to the default after that problem (that partition is not on a DB server, though).
Alex Turner
netEconomist
On 6/3/05, Martin Fandel <martin.fandel@alphyra-evs.de> wrote:
> Hi all,
>
> I have just a little question: which filesystem is preferred for
> PostgreSQL? I plan to use XFS (before that I used ReiserFS). The
> reason is the xfs_freeze tool for making filesystem snapshots.
>
> Is the performance better than ReiserFS, and is it reliable?
>
> best regards,
> Martin
Hi,

I tested an XFS+LVM installation with the Scalix (HP OpenMail) mail server a while ago. At that time I had some problems using xfs_freeze: I used a script to freeze the filesystem and store the snapshots, and sometimes the complete server hung (no blinking cursor, no possible logins, no network). I don't know whether it was a hardware problem or the XFS software. I installed/compiled the newest kernel for this system (I think it was a 2.6.9) to check whether it was perhaps a kernel problem, but a few days later the system hung again. After that I went back to ReiserFS. I tested this on SuSE Linux Enterprise Server 8.

Has anyone heard of such problems? That is the only reason I am a bit afraid to use XFS for a critical database :/

Best regards,
Martin

On Friday, 2005-06-03 at 09:18 -0400, Alex Turner wrote:
> We have been using XFS for about 6 months now and it has even
> tolerated a controller card crash. So far we have mostly good
> things to report about XFS. [...]
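P.S.: One failure mode worth ruling out with such freeze scripts: if anything fails between the freeze and the thaw, the filesystem stays frozen and the machine can look exactly as if it had hung. A defensive sketch (paths and volume names are placeholders again):

    #!/bin/sh
    # make sure the thaw always runs, even if the snapshot step fails -
    # otherwise the filesystem stays frozen and the box appears dead
    trap 'xfs_freeze -u /var/lib/pgsql' EXIT

    xfs_freeze -f /var/lib/pgsql
    lvcreate --snapshot --size 1G --name pgsnap /dev/vg0/pgdata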
On Fri, 3 Jun 2005 09:06:41 +0200, "Martin Fandel" <martin.fandel@alphyra-evs.de> wrote:
> I have just a little question: which filesystem is preferred for
> PostgreSQL? I plan to use XFS (before that I used ReiserFS). The
> reason is the xfs_freeze tool for making filesystem snapshots.

XFS has worked great for us, and has been both reliable and fast. Zero problems, and it is currently our standard server filesystem. Reiser, on the other hand, has on rare occasion eaten itself on the few systems where someone was running a Reiser partition, though none were running Postgres at the time. We have deprecated the use of Reiser on all systems where it is not already running.

In terms of performance for Postgres, the rumor is that XFS and JFS are at the top of the heap, definitely better than ext3 and somewhat better than Reiser. I've never used JFS, but I've seen a few benchmarks that suggest it is at least as fast as XFS for Postgres.

Since XFS is more mature than JFS on Linux, I go with XFS by default. If some tragically bad problems develop with XFS I may reconsider that position, but we've been very happy with it so far. YMMV.

cheers,

J. Andrew Rogers
Hi,

I've set up the same installation as my ReiserFS postgres-8.0.1, but on XFS. Now my pgbench shows the following results:

postgres@ramses:~> pgbench -h 127.0.0.1 -p 5432 -c150 -t5
pgbench starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 1
number of clients: 150
number of transactions per client: 5
number of transactions actually processed: 750/750
tps = 133.719348 (including connections establishing)
tps = 151.670315 (excluding connections establishing)

With ReiserFS my pgbench results are between 230-280 tps (excluding connections establishing) and 199-230 tps (including connections establishing). I'm using SuSE Linux 9.3.

I can't see better performance with XFS. :/ Must I enable special fstab settings?

Best regards,
Martin

On Friday, 2005-06-03 at 10:18 -0700, J. Andrew Rogers wrote:
> XFS has worked great for us, and has been both reliable and fast.
> Zero problems, and it is currently our standard server
> filesystem. [...]
On Wed, Jun 08, 2005 at 09:36:31AM +0200, Martin Fandel wrote:
> I've set up the same installation as my ReiserFS postgres-8.0.1, but
> on XFS.

Do you have pg_xlog on a separate partition? I've noticed that ext2 seems to have better performance than XFS for the pg_xlog workload (with all the syncs).

Mike Stone
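For anyone who wants to try that, the usual approach is to stop the server, move pg_xlog to the other partition, and symlink it back - a sketch, assuming the cluster lives in /var/lib/pgsql/data and the ext2 partition is mounted at /mnt/pg_xlog (both paths are placeholders):

    pg_ctl stop -D /var/lib/pgsql/data

    # move the WAL directory onto the dedicated partition, then point a
    # symlink at the new location so the server finds it where it expects
    mv /var/lib/pgsql/data/pg_xlog /mnt/pg_xlog/pg_xlog
    ln -s /mnt/pg_xlog/pg_xlog /var/lib/pgsql/data/pg_xlog

    pg_ctl start -D /var/lib/pgsql/data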
Hi,

ah, you're right. :) I forgot to symlink the pg_xlog dir to another partition. Now it's a bit faster than before, but still not faster than the same installation on ReiserFS:

postgres@ramses:~> pgbench -h 127.0.0.1 -p 5432 -c150 -t5
pgbench starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 1
number of clients: 150
number of transactions per client: 5
number of transactions actually processed: 750/750
tps = 178.831543 (including connections establishing)
tps = 213.931383 (excluding connections establishing)

I've tested dumps and copies with the XFS installation. They are faster than before, but the transaction queries are still slower than on the ReiserFS installation.

Are any fstab/mount options recommended for XFS?

best regards,
Martin

On Wednesday, 2005-06-08 at 08:10 -0400, Michael Stone wrote:
> Do you have pg_xlog on a separate partition? [...]
Martin Fandel wrote:
|
| I've tested dumps and copies with the XFS installation. They are
| faster than before, but the transaction queries are still slower
| than on the ReiserFS installation.
|
| Are any fstab/mount options recommended for XFS?
|

Hello, Martin.

I'm afraid that, unless you planned for your typical workload and database cluster configuration at filesystem creation time, there is not much you can do solely by using different mount options. As you don't mention how you configured the filesystem, here are some thoughts on that (everybody is more than welcome to comment on this, of course).

Depending on the underlying array, the block size should be set as high as possible (page size) to get as close as possible to a single stripe unit size, provided that the stripe unit is a multiple of the block size. For now, x86 unfortunately doesn't allow blocks of multiple pages (yet). If possible, try to stick close to the PostgreSQL page size as well, which is 8kB, if I recall correctly. 4kB blocks may hence be a good choice here.

A higher allocation group count (agcount/agsize) allows for better parallelism when allocating blocks and inodes. From your perspective, this may not necessarily be needed (or desired), as allocation and block reorganization may be "implicitly forced" to be performed internally as often as possible (depending on how frequently you run VACUUM FULL; if you can afford it, run it as seldom as possible). What you do want here, though, is an allocation group count high enough to prevent one group from occupying too much of one single disk in the array, thus smothering other applicants trying to obtain an extent (this would imply agsize = desired_agsize - (sunit * n), where n < (swidth / sunit)).

If the stripe unit of the underlying RAID device is x kB, the "sunit" setting is (2 * x), as it is given in 512-byte blocks (do not be misled by the rather confusing manpage). If you have RAID10/RAID01 in place, "swidth" may be four times the size of "sunit", depending on how your RAID controller (or software driver) understands it (I'm not 100% sure on this - comments, anyone?).

"unwritten" (for unwritten extent markings) can be set to 0 if all of the files are predominantly preallocated - again, if you run VACUUM FULL extremely seldom and delete/update a lot, this may be useful, as it saves I/O and CPU time. YMMV.

Inode size can be set to the maximum using the "size" parameter, which is currently 2048 bytes on x86 if you're using page-sized blocks. As the filesystem will probably be rather big, as will the files that live on it, you probably won't be using much of it for inodes, so you can set "maxpct" to a safe minimum of 1%, which would yield approximately 200,000 file slots in a 40GB filesystem (with an inode size of 2kB).

The log can, of course, be either "internal", with a "sunit" that fits the logical configuration of the array, or any other option, if you want to move the book-keeping overhead away from your data. Do mind that a typical journal is rather small though, so you probably want to use one partitioned disk drive for a number of journals, especially since there usually isn't much journaling to be done on a typical database cluster filesystem (compared to, for example, a mail server).

The naming (a.k.a. directory) area of the filesystem is also rather poorly utilized, as there are few directories and they only contain small numbers of files, so you can try optimizing in this area too: "size" may be set to the maximum, 64k, although this probably doesn't buy you much besides a couple of kilobytes' worth of space.

Now, finally, the most common options you could play with at mount time. They would most probably include "noatime", as it is of course rather undesirable to update inodes upon each and every read access, attribute or directory lookup, etc. I would be surprised if you were running the filesystem without noatime and had a good reason to do so. :) Do mind that this is a generic option available for all filesystems that support the atime attribute and is not XFS-specific in any way.

As for XFS, biosize=n can be used, where n = log2(swidth * sunit), or a multiple thereof. This is - _if_ you planned for your workload by using an array configuration and stripe sizes befitting of biosize, as well as configuring the filesystem appropriately - the setting where you can gain by making the operating system cache in a slightly readahead manner.

Another useful option might be osyncisosync, which implements a true O_SYNC on files opened with that option, instead of the default Linux behaviour where O_SYNC, O_DSYNC and O_RSYNC are synonymous. It may hurt your performance, though, so beware.

If you decided to externalize the log journal to another disk drive, and you have several contenders for that storage unit, you may also want to relieve contention a bit by using larger logbufs and logbsize settings, to provide more slack in the others when a particular one needs to spill buffers to disk.

All of these ideas share one common thought: you can tune a filesystem so it helps in reducing the amount of iowait. The filesystem itself can help prevent unnecessary work performed by the disk and eliminate contention for the bandwidth of the transport subsystem. This can be achieved by improving the internal organization of the filesystem to better suit the requirements of a typical database workload, and by eliminating the (undesired part of the) book-keeping work in your filesystem.

Hope to have helped.

Kind regards,
--
Grega Bremec
gregab at p0f dot net
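To pull the mkfs/mount suggestions above together: here is roughly what they might translate to for a hypothetical 4-disk RAID10 with a 64kB stripe unit - a sketch under those assumptions, not a tested recommendation; the device name, sizes and mount point are placeholders:

    # 4kB blocks; sunit/swidth are in 512-byte sectors, so a 64kB stripe
    # unit gives sunit=128, and two data spindles in RAID10 give swidth=256;
    # large inodes but a small inode share, internal log aligned to the stripe
    mkfs.xfs -b size=4096 -d sunit=128,swidth=256 \
             -i size=2048,maxpct=1 -l internal,sunit=128 /dev/sdb1

    # /etc/fstab entry: skip atime updates, larger/more log buffers
    /dev/sdb1  /var/lib/pgsql  xfs  noatime,logbufs=8,logbsize=32768  0 0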