Thread: Filesystem

Filesystem

From
"Martin Fandel"
Date:
Hi @ all,

I have just a small question. Which filesystem is preferred for
PostgreSQL? I plan to use xfs (I used reiserfs before). The reason
is the xfs_freeze tool for making filesystem snapshots.

Is its performance better than reiserfs, and is it reliable?

best regards,
Martin


Re: Filesystem

From
Mark Kirkwood
Date:
Martin Fandel wrote:
> Hi @ all,
>
> I have just a small question. Which filesystem is preferred for
> PostgreSQL? I plan to use xfs (I used reiserfs before). The reason
> is the xfs_freeze tool for making filesystem snapshots.
>
> Is its performance better than reiserfs, and is it reliable?
>

I used postgresql with xfs on mandrake 9.0/9.1 a while ago -
reliability was great, performance seemed better than ext3. I didn't
compare with reiserfs - the only time I have ever lost data from a Linux
box has been when I used reiserfs, hence I am not a fan :-(

best wishes

Mark

Re: Filesystem

From
Alex Turner
Date:
We have been using XFS for about 6 months now and it has even tolerated a controller card crash.  So far we have mostly good things to report about XFS.  I benchmarked raw throughput at various stripe sizes, and XFS came out on top for us against reiser and ext3.  I also chose it because of its supposedly good support for large files, which the benchmarks verified to some extent.

I have noticed a problem though: with 800000 files in a directory, XFS seems to choke on simple operations like 'ls' or 'chmod -R ...' where ext3 doesn't. I don't know about reiser; I went straight back to the default after that problem (that partition is not on a DB server though).

Alex Turner
netEconomist


Re: Filesystem

From
"Martin Fandel"
Date:
Hi

I tested an xfs+LVM installation with the Scalix (HP OpenMail)
mail server some time ago. At that time I had some problems
using xfs_freeze. I used a script to freeze the fs and store
the snapshots. Sometimes the complete server hung (no blinking cursor,
no logins possible, no network). I don't know if it was a hardware
problem or the xfs software. I compiled and installed the newest
kernel for this system (I think it was 2.6.9) to check whether it was
a kernel problem. But over the following days the system hung
again. After that I went back to reiserfs.

I tested it with SUSE Linux Enterprise Server 8.

Has anyone heard of such problems? That is the only reason
I am a bit afraid to use xfs for a critical database :/.

Best regards,
Martin



Re: Filesystem

From
"J. Andrew Rogers"
Date:
On Fri, 3 Jun 2005 09:06:41 +0200
  "Martin Fandel" <martin.fandel@alphyra-evs.de> wrote:
> I have just a small question. Which filesystem is
> preferred for PostgreSQL? I plan to use xfs
> (I used reiserfs before). The reason
> is the xfs_freeze tool for making filesystem snapshots.


XFS has worked great for us, and has been both reliable
and fast.  Zero problems and currently our standard server
filesystem.  Reiser, on the other hand, has on rare
occasion eaten itself on the few systems where someone was
running a Reiser partition, though none were running
Postgres at the time.  We have deprecated the use of
Reiser on all systems where it is not already running.

In terms of performance for Postgres, the rumor is that
XFS and JFS are at the top of the heap, definitely better
than ext3 and somewhat better than Reiser.  I've never
used JFS, but I've seen a few benchmarks that suggest it
is at least as fast as XFS for Postgres.

Since XFS is more mature than JFS on Linux, I go with XFS
by default.  If some tragically bad problems develop with
XFS I may reconsider that position, but we've been very
happy with it so far.  YMMV.

cheers,

J. Andrew Rogers

Re: Filesystem

From
"Martin Fandel"
Date:
Hi,

I've set up the same PostgreSQL 8.0.1 installation that I had on
reiserfs, this time on xfs.

Now my pgbench shows the following results:

postgres@ramses:~> pgbench -h 127.0.0.1 -p 5432 -c150 -t5 pgbench
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 1
number of clients: 150
number of transactions per client: 5
number of transactions actually processed: 750/750
tps = 133.719348 (including connections establishing)
tps = 151.670315 (excluding connections establishing)

With reiserfs my pgbench results are 230-280 tps (excluding
connections establishing) and 199-230 tps (including connections
establishing). I'm using SUSE Linux 9.3.
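
To put the two sets of numbers side by side, here is a minimal Python
sketch that parses the tps lines from the run above and reports how far
the xfs result falls below the reiserfs range quoted in this message.
The function names parse_tps and percent_slower are made up for
illustration; the figures are copied from the post.

```python
import re

# tps lines copied from the xfs pgbench run shown above.
PGBENCH_OUTPUT = """\
tps = 133.719348 (including connections establishing)
tps = 151.670315 (excluding connections establishing)
"""

# reiserfs range (excluding connections establishing) quoted in the post.
REISERFS_EXCLUDING_TPS = (230.0, 280.0)


def parse_tps(output):
    """Return a dict like {'including': tps, 'excluding': tps}."""
    pattern = re.compile(
        r"tps = ([0-9.]+) \((including|excluding) connections establishing\)"
    )
    return {kind: float(value) for value, kind in pattern.findall(output)}


def percent_slower(measured, baseline):
    """How many percent 'measured' lies below 'baseline'."""
    return (baseline - measured) / baseline * 100.0


if __name__ == "__main__":
    xfs = parse_tps(PGBENCH_OUTPUT)
    for baseline in REISERFS_EXCLUDING_TPS:
        gap = percent_slower(xfs["excluding"], baseline)
        print(f"xfs is {gap:.1f}% below reiserfs at {baseline:.0f} tps")
```

With only 750 transactions and 150 clients against scaling factor 1,
runs this short tend to be noisy and dominated by update contention, so
the percentages are best read as rough indications.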

I can't see better performance with xfs. :/ Do I need to set special
fstab options?

Best regards,
Martin



Re: Filesystem

From
Michael Stone
Date:
On Wed, Jun 08, 2005 at 09:36:31AM +0200, Martin Fandel wrote:
>I've set up the same PostgreSQL 8.0.1 installation that I had on
>reiserfs, this time on xfs.

Do you have pg_xlog on a separate partition? I've noticed that ext2
seems to have better performance than xfs for the pg_xlog workload (with
all the syncs).
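
Putting pg_xlog on a separate partition is commonly done by moving the
directory and leaving a symlink in its place. The Python sketch below
shows that idea only; the function name relocate_pg_xlog and the example
paths are hypothetical, and it assumes the PostgreSQL server has been
stopped before the move.

```python
import os
import shutil
from pathlib import Path


def relocate_pg_xlog(data_dir, target_parent):
    """Move data_dir/pg_xlog under target_parent and symlink it back.

    Assumes the server is stopped and that target_parent is a directory
    on the partition that should hold the transaction log.
    """
    source = Path(data_dir) / "pg_xlog"
    destination = Path(target_parent) / "pg_xlog"

    if source.is_symlink():
        raise RuntimeError(f"{source} is already a symlink")
    if destination.exists():
        raise RuntimeError(f"{destination} already exists")

    # Move the directory (copies across filesystems if needed).
    shutil.move(str(source), str(destination))
    # Leave a symlink so the server still finds pg_xlog in the data dir.
    os.symlink(destination.resolve(), source, target_is_directory=True)
    return destination


# Hypothetical usage (paths are placeholders, not from the thread):
# relocate_pg_xlog("/var/lib/pgsql/data", "/mnt/xlog_disk")
```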

Mike Stone

Re: Filesystem

From
"Martin Fandel"
Date:
Hi,

Ah, you're right. :) I forgot to symlink the pg_xlog dir to another
partition. Now it's a bit faster than before, but not faster than
the same installation with reiserfs:

postgres@ramses:~> pgbench -h 127.0.0.1 -p 5432 -c150 -t5 pgbench
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 1
number of clients: 150
number of transactions per client: 5
number of transactions actually processed: 750/750
tps = 178.831543 (including connections establishing)
tps = 213.931383 (excluding connections establishing)

I've tested dumps and copies with the xfs installation. They are
faster than before. But the transactional queries are still slower
than on the reiserfs installation.

Are any fstab/mount options recommended for xfs?

best regards,
Martin



Re: Filesystem

From
Grega Bremec
Date:
Martin Fandel wrote:
|
| I've tested dumps and copies with the xfs installation. They are
| faster than before. But the transactional queries are still slower
| than on the reiserfs installation.
|
| Are any fstab/mount options recommended for xfs?
|

Hello, Martin.

I'm afraid that, unless you planned for your typical workload and
database cluster configuration at filesystem creation time, there is not
much you can do solely by using different mount options. As you don't
mention how you configured the filesystem though, here are some thoughts
on that (everybody is more than welcome to comment on this, of course).

Depending on the underlying array, the block size should be set as high
as possible (page size) to get as close as possible to single stripe
unit size, provided that the stripe unit is a multiple of block size.
Unfortunately, x86 doesn't allow blocks of multiple pages yet.
If possible, try to stick as close to the PostgreSQL page size as well,
which is 8kB, if I recall correctly. As the x86 page size is 4kB, 4k
blocks may hence be a good choice here.

Higher allocation group count (agcount/agsize) allows for better
parallelism when allocating blocks and inodes. From your perspective,
this may not necessarily be needed (or desired), as allocation and block
reorganization may be "implicitly forced" to being performed internally
as often as possible (depending on how frequently you run VACUUM FULL;
if you can afford it, try running it as seldom as possible). What you
do want here though, is a high enough allocation group count to
prevent one group from occupying too much of one single disk in the
array, thus smothering other applicants trying to obtain an extent (this
would imply $agsize = ($desired_agsize - ($sunit * n)), where n <
($swidth / $sunit)).

If stripe unit for the underlying RAID device is x kB, the "sunit"
setting is (2 * x), as it is in 512-byte blocks (do not be misled by
the rather confusing manpage). If you have RAID10/RAID01 in place,
"swidth" may be four times the size of "sunit", depending on how your
RAID controller (or software driver) understands it (I'm not 100% sure
on this; comments, anyone?).
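
As a worked example of the unit conversion above, this small Python
sketch turns a stripe unit in kB into "sunit" and "swidth" values in
512-byte blocks. It assumes that swidth is sunit times the number of
data-bearing disks, which is how mkfs.xfs usually expects it; the
function name and the example array (64 kB stripe unit, 4 data disks)
are illustrative and not taken from the thread.

```python
SECTOR_BYTES = 512  # sunit/swidth are expressed in 512-byte blocks


def xfs_stripe_geometry(stripe_unit_kb, data_disks):
    """Return (sunit, swidth) in 512-byte blocks.

    stripe_unit_kb: RAID stripe unit (chunk size) in kB.
    data_disks:     number of disks that carry data in one stripe
                    (assumption: for RAID10 this is typically half
                    of the member disks).
    """
    sunit = stripe_unit_kb * 1024 // SECTOR_BYTES  # same as 2 * stripe_unit_kb
    swidth = sunit * data_disks
    return sunit, swidth


if __name__ == "__main__":
    sunit, swidth = xfs_stripe_geometry(stripe_unit_kb=64, data_disks=4)
    # The values would go into something like: mkfs.xfs -d sunit=...,swidth=...
    print(f"sunit={sunit},swidth={swidth}")
```

With four data disks the result matches the "swidth is four times
sunit" case mentioned above; other array layouts give other multiples.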

"unwritten" (for unwritten extent markings) can be set to 0 if all of
the files are predominantly preallocated - again, if you VACUUM FULL
very seldom, and delete/update a lot, this may be useful as it
saves I/O and CPU time. YMMV.

Inode size can be set to the maximum using the "size" parameter; the
maximum is currently 2048 bytes on x86, if you're using page-sized
blocks. As the filesystem will probably be rather big, as well as the
files that live on it, you probably won't be using much of it for
inodes, so you can set "maxpct" to a safe minimum of 1%, which would
yield approx. 200,000 file slots in a 40GB filesystem (with inode size
of 2kB).
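
The inode estimate above is simple arithmetic, sketched here in Python.
The function name max_inodes is made up for illustration, and the
calculation ignores any other filesystem overhead.

```python
def max_inodes(fs_bytes, maxpct, inode_bytes):
    """Upper bound on inode count when at most maxpct percent of the
    filesystem may be used for inodes of inode_bytes each."""
    inode_space = fs_bytes * maxpct // 100
    return inode_space // inode_bytes


if __name__ == "__main__":
    # 40 GB read as binary (GiB) and as decimal gigabytes.
    print(max_inodes(40 * 1024**3, maxpct=1, inode_bytes=2048))
    print(max_inodes(40 * 1000**3, maxpct=1, inode_bytes=2048))
```

Either reading of "40GB" lands close to the 200,000 figure quoted above.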

Log can, of course, be either "internal", with a "sunit" that fits the
logical configuration of the array, or any other option, if you want to
move the book-keeping overhead away from your data. Do mind that typical
journal size is usually rather small though, so you probably want to be
using one partitioned disk drive for a number of journals, especially
since there usually isn't much journalism to be done on a typical
database cluster filesystem (compared to, for example, a mail server).

Naming (a.k.a. directory) area of the filesystem is also rather poorly
utilized, as there are few directories, and they only contain small
numbers of files, so you can try optimizing in this area too: "size" may
be set to maximum, 64k, although this probably doesn't buy you much
besides a couple of kilobytes' worth of space.

Now finally, the most common options you could play with at mount time.
They would most probably include "noatime", as it is of course rather
undesirable to update inodes upon each and every read access, attribute
or directory lookup, etc. I would be surprised if you were running the
filesystem without noatime and without a good reason to do that. :) Do
mind that this is a generic option available for all filesystems that
support the atime attribute and is not xfs-specific in any way.

As for XFS, biosize=n can be used, where n = log2(${swidth} * ${sunit}),
or a multiple thereof. This is the setting where you can gain by making
the operating system cache in a slightly readahead manner, but only
_if_ you planned for your workload by using an array configuration and
stripe sizes befitting biosize, and configured the filesystem
appropriately.
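
Since biosize is given as a base-2 logarithm, a small helper can make
the conversion explicit. The Python sketch below assumes the preferred
buffered I/O size is already known in bytes and is a power of two; how
that size should be derived from swidth and sunit is left to the
formula above, and the range of values a given kernel accepts should be
checked against its XFS mount documentation. The function name
biosize_for is hypothetical.

```python
def biosize_for(io_bytes):
    """Return log2(io_bytes) for use as biosize=n.

    Raises ValueError if io_bytes is not a positive power of two,
    because the mount option can only express power-of-two sizes.
    """
    if io_bytes <= 0 or io_bytes & (io_bytes - 1):
        raise ValueError("I/O size must be a positive power of two")
    return io_bytes.bit_length() - 1


if __name__ == "__main__":
    for size in (8 * 1024, 16 * 1024, 64 * 1024):
        print(f"{size} bytes -> biosize={biosize_for(size)}")
```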

Another useful option might be osyncisosync, which implements a true
O_SYNC on files opened with that option, instead of the default Linux
behaviour where O_SYNC, O_DSYNC and O_RSYNC are synonymous. It may hurt
your performance though, so beware.

If you decided to externalize the log journal to another disk drive, and
you have several contenders for that storage unit, you may also want to
relieve contention a bit by using larger logbufs and logbsize settings,
to provide more slack for the others when a particular one needs to
spill buffers to disk.

All of these ideas share one common thought: you can tune a filesystem
so it helps in reducing the amount of iowait. The filesystem itself can
help prevent unnecessary work performed by the disk and eliminate
contention for the bandwidth of the transport subsystem. This can be
achieved by improving internal organization of the filesystem to better
suit the requirements of a typical database workload, and eliminating
the (undesired part of the) book-keeping work in your filesystem.

Hope to have helped.

Kind regards,
--
Grega Bremec
gregab at p0f dot net