Re: Filesystem - Mailing list pgsql-performance

From Grega Bremec
Subject Re: Filesystem
Date
Msg-id 42A6F75C.50100@p0f.net
In response to Re: Filesystem  ("Martin Fandel" <martin.fandel@alphyra-evs.de>)
List pgsql-performance

Martin Fandel wrote:
|
| I've tested dump's and copy's with the xfs-installation. It's
| faster than before. But the transactions-query's are still slower
| than the reiserfs-installation.
|
| Are any fstab-/mount-options recommended for xfs?
|

Hello, Martin.

I'm afraid that, unless you planned for your typical workload and
database cluster configuration at filesystem creation time, there is not
much you can do solely by using different mount options. As you don't
mention how you configured the filesystem though, here are some thoughts
on that (everybody is more than welcome to comment on this, of course).

Depending on the underlying array, the block size should be set as high
as possible (page size) to get as close as possible to the size of a
single stripe unit, provided that the stripe unit is a multiple of the
block size. Unfortunately, x86 doesn't allow blocks of multiple pages
(yet). If possible, try to stay close to the PostgreSQL page size as
well, which is 8kB, if I recall correctly. 4k blocks, the largest that
x86 allows, divide that evenly and may hence be a good choice here.

A higher allocation group count (agcount/agsize) allows for better
parallelism when allocating blocks and inodes. From your perspective,
this may not necessarily be needed (or desired), as allocation and block
reorganization may be "implicitly forced" to be performed internally
as often as possible (depending on how frequently you run VACUUM FULL;
if you can afford it, try running it as seldom as possible). What you
do want here, though, is an allocation group count high enough to
prevent one group from occupying too much of one single disk in the
array, thus smothering other applicants trying to obtain an extent (this
would imply $agsize = ($desired_agsize - ($sunit * n)), where n <
($swidth / $sunit)).
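
To make the agsize formula above a bit more concrete, here is a minimal
Python sketch that lists the candidate values. The function name and the
example figures are hypothetical, and starting n at 1 is my assumption
(n = 0 would simply leave the desired size unchanged). All three values
must be given in the same unit.

```python
def staggered_agsizes(desired_agsize, sunit, swidth):
    """List desired_agsize - sunit * n for every n with 1 <= n < swidth / sunit.

    All three arguments must use the same unit (for example 512-byte sectors).
    Starting n at 1 is an assumption; n = 0 would leave the size unchanged.
    """
    if sunit <= 0 or swidth % sunit != 0:
        raise ValueError("swidth must be a positive multiple of sunit")
    units_per_stripe = swidth // sunit
    return [desired_agsize - sunit * n for n in range(1, units_per_stripe)]


# Hypothetical figures, all in 512-byte sectors:
# 4 GiB desired group size, sunit = 128, swidth = 512.
print(staggered_agsizes(8388608, 128, 512))
```

Any of the listed sizes keeps the groups from starting on the same stripe
unit every time; which one to pick is a judgment call.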

If stripe unit for the underlying RAID device is x kB, the "sunit"
setting is (2 * x), as it is in 512-byte blocks (do not be misled by
the rather confusing manpage). If you have RAID10/RAID01 in place,
"swidth" may be four times the size of "sunit", depending on how your
RAID controller (or software driver) understands it (I'm not 100% sure
on this, comments, anyone?).
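
As a rough illustration of the unit conversion above, here is a small
Python sketch. The function name and the example figures (64 kB stripe
unit, four data-bearing disks) are assumptions for the sake of the
example, and computing swidth as sunit times the number of data-bearing
disks is the usual convention rather than something your controller is
guaranteed to agree with.

```python
SECTOR_BYTES = 512  # sunit and swidth are expressed in 512-byte blocks


def stripe_geometry(stripe_unit_kb, data_disks):
    """Return (sunit, swidth) in 512-byte blocks.

    stripe_unit_kb: stripe unit of the RAID device, in kB.
    data_disks: number of data-bearing disks (an assumed value; for
                RAID10 this is typically half of the physical disks).
    """
    sunit = stripe_unit_kb * 1024 // SECTOR_BYTES  # same as 2 * stripe_unit_kb
    swidth = sunit * data_disks
    return sunit, swidth


sunit, swidth = stripe_geometry(64, 4)
print(sunit, swidth)
```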

"unwritten" (for unwritten extent markings) can be set to 0 if all of
the files are predominantly preallocated - again, if you VACUUM FULL
very seldom, and delete/update a lot, this may be useful as it saves
I/O and CPU time. YMMV.

Inode size can be set to the maximum using the "size" parameter; the
maximum is currently 2048 bytes on x86, if you're using page-sized
blocks. As the filesystem will probably be rather big, as well as the
files that live on it, you probably won't be using much of it for
inodes, so you can set "maxpct" to a safe minimum of 1%, which would
yield approx. 200,000 file slots in a 40GB filesystem (with an inode
size of 2kB).
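
The inode estimate above is simple arithmetic; the following Python
sketch reproduces it. The function name is hypothetical, and the result
is only an upper bound, since it ignores any other space the filesystem
reserves.

```python
def max_inodes(fs_bytes, maxpct, inode_bytes):
    """Upper bound on inodes when at most maxpct percent of the space holds them."""
    return fs_bytes * maxpct // (100 * inode_bytes)


GIB = 1024 ** 3
print(max_inodes(40 * GIB, 1, 2048))     # 40 GB counted in binary units
print(max_inodes(40 * 10**9, 1, 2048))   # 40 GB counted in decimal units
```

Either way of counting a gigabyte lands in the neighbourhood of 200,000.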

The log can, of course, be either "internal", with a "sunit" that fits
the logical configuration of the array, or placed on a separate device,
if you want to move the book-keeping overhead away from your data. Do
mind that a typical journal is usually rather small though, so you
probably want to use one partitioned disk drive for a number of
journals, especially since there usually isn't much journaling to be
done on a typical database cluster filesystem (compared to, for example,
a mail server).

The naming (a.k.a. directory) area of the filesystem is also rather
poorly utilized, as there are few directories, and they only contain
small numbers of files, so you can try optimizing in this area too:
"size" may be set to the maximum, 64k, although this probably doesn't
buy you much besides a couple of kilobytes' worth of space.
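
To pull the creation-time settings discussed so far into one place,
here is a hedged Python sketch that only assembles an mkfs.xfs command
line and does not run it. The function name, the device path and the
numeric values in the example call are hypothetical, and an internal
log is assumed. Check every suboption against the mkfs.xfs manpage of
your xfsprogs version before using it ("unwritten" in particular may
not be accepted by every version).

```python
import shlex


def mkfs_xfs_command(device, sunit, swidth, agcount):
    """Assemble (but do not execute) an mkfs.xfs command line."""
    args = [
        "mkfs.xfs",
        "-b", "size=4096",                 # page-sized blocks
        "-d", f"agcount={agcount},sunit={sunit},swidth={swidth},unwritten=0",
        "-i", "size=2048,maxpct=1",        # largest inodes, little space for them
        "-n", "size=65536",                # largest directory block size
        "-l", "internal",                  # assumption: log kept on the data device
        device,
    ]
    return shlex.join(args)


# Hypothetical device and geometry, for illustration only.
print(mkfs_xfs_command("/dev/sdX1", sunit=128, swidth=512, agcount=16))
```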

Now finally, the most common options you could play with at mount time.
They would most probably include "noatime", as it is of course rather
undesirable to update inodes upon each and every read access, attribute
or directory lookup, etc. I would be surprised if you were running the
filesystem without noatime and without a good reason to do that. :) Do
mind that this is a generic option available for all filesystems that
support the atime attribute and is not xfs-specific in any way.

As for XFS, biosize=n can be used, where n = log2(${swidth} * ${sunit}),
or a multiple thereof. This is the setting where you can gain by making
the operating system cache in a slightly readahead manner, but only _if_
you planned for your workload by using an array configuration and stripe
sizes befitting biosize, and configured the filesystem appropriately.

Another useful option might be osyncisosync, which implements a true
O_SYNC on files opened with that option, instead of the default Linux
behaviour where O_SYNC, O_DSYNC and O_RSYNC are synonymous. It may hurt
your performance though, so beware.

If you decided to move the log journal to another disk drive, and you
have several contenders for that storage unit, you may also want to
relieve contention a bit by using larger logbufs and logbsize settings,
to give the others more slack when one particular filesystem needs to
spill buffers to disk.
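
For the mount-time options discussed above, here is a small Python
sketch that composes an fstab entry. The function name, the device, the
mount point and the logbufs/logbsize values in the example call are all
hypothetical; biosize is left out on purpose, since its value depends on
your array geometry.

```python
def xfs_fstab_line(device, mount_point, logbufs=None, logbsize=None,
                   osyncisosync=False):
    """Compose an fstab entry for an XFS filesystem; noatime is always set."""
    options = ["noatime"]
    if logbufs is not None:
        options.append(f"logbufs={logbufs}")
    if logbsize is not None:
        options.append(f"logbsize={logbsize}")
    if osyncisosync:
        options.append("osyncisosync")  # may cost performance, as noted above
    return f"{device} {mount_point} xfs {','.join(options)} 0 0"


# Hypothetical device, mount point and buffer settings.
print(xfs_fstab_line("/dev/sdX1", "/var/lib/pgsql", logbufs=8, logbsize=32768))
```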

All of these ideas share one common thought: you can tune a filesystem
so it helps in reducing the amount of iowait. The filesystem itself can
help prevent unnecessary work performed by the disk and eliminate
contention for the bandwidth of the transport subsystem. This can be
achieved by improving the internal organization of the filesystem to
better suit the requirements of a typical database workload, and
eliminating the (undesired part of the) book-keeping work in your
filesystem.

Hope to have helped.

Kind regards,
- --
Grega Bremec
gregab at p0f dot net
