In our last installment, we saw that JFS provides higher pgbench
performance than either XFS or ext3. Using a direct-I/O patch stolen
from 8.1, JFS achieved 105 tps with 100 clients.
To refresh, the machine in question has five 7200RPM SATA disks, an
Areca RAID controller with 128MB of cache, and 1GB of main memory.
pgbench is
being run with a scale factor of 1000 and 100000 total transactions.
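For reference, a run at these settings looks roughly like the
following; the database name here is just a placeholder, and note that
pgbench's -t flag counts transactions per client, so 100 clients at
1000 transactions each gives the 100000 total:

    # build the pgbench tables at scale factor 1000
    pgbench -i -s 1000 bench
    # 100 clients x 1000 transactions apiece = 100000 transactions
    pgbench -c 100 -t 1000 bench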
At the suggestion of Andreas Dilger of clusterfs, I tried modulating the
size of the ext3 journal, and the mount options (data=journal,
writeback, and ordered). It turns out that you can achieve a substantial
improvement (almost 50%) by simply mounting the ext3 volume with
data=writeback instead of data=ordered (the default). Changing the
journal size did not seem to make a difference, except that 256MB is for
some reason pathological (9% slower than the best time). 128MB, the
default for a large volume, gave the same performance as 400MB (the max)
or 32MB.
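If you want to try this yourself, both knobs can be turned without
recreating the filesystem. A sketch, with the device and mount point
names as placeholders (the volume must be unmounted to rebuild the
journal):

    # drop the old journal and build a new 128MB one
    umount /mnt/pgdata
    tune2fs -O ^has_journal /dev/sda1
    tune2fs -j -J size=128 /dev/sda1
    # remount with metadata-only journaling
    mount -o noatime,data=writeback /dev/sda1 /mnt/pgdata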
In the end, the ext3 volume mounted with -o noatime,data=writeback
yielded 88 tps with 100 clients. This is about 16% off the performance
of JFS with default options.
Andreas pointed me to experimental patches to ext3's block allocation
code and writeback strategy. I will test these, but I doubt the
database community, which seems so attached to its data, will be very
interested in code that has not yet entered mainstream use.
Another frequent suggestion is to put the xlog on a separate device. I
tried this, and, for a given number of disks, it appears to be
counter-productive. A 5-disk RAID5 holding both logs and data is about
15% faster than a 3-disk RAID5 holding the data plus a 2-disk mirror
holding the xlog.
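For anyone reproducing the split configuration, the usual approach is
the symlink trick, something like this with the server stopped (the
paths are illustrative):

    # relocate the WAL directory onto the two-disk mirror
    mv /var/lib/pgsql/data/pg_xlog /mnt/xlog/pg_xlog
    ln -s /mnt/xlog/pg_xlog /var/lib/pgsql/data/pg_xlog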
Here are the pgbench results, in transactions per second, for each
permutation of ext3:

Journal Size (MB) | Journal Mode | 1 Client | 10 Clients | 100 Clients
----------------------------------------------------------------------
               32 | ordered      |       28 |         51 |          57
               32 | writeback    |       34 |         70 |          88
               64 | ordered      |       29 |         52 |          61
               64 | writeback    |       32 |         69 |          87
              128 | ordered      |       32 |         54 |          62
              128 | writeback    |       34 |         70 |          88
              256 | ordered      |       28 |         51 |          60
              256 | writeback    |       29 |         64 |          79
              400 | ordered      |       26 |         49 |          59
              400 | writeback    |       32 |         70 |          87
-jwb