Some pgbench results

From: "Just Someone"
I was doing some load testing on a server, and decided to test it with
different file systems to see how it reacts to load/speed. I tested
xfs, jfs and ext3. The machine runs FC4 with the latest 2.6.15 kernel
from Fedora.

Hardware: Dual Opteron 246, 4GB RAM, Adaptec 2230 with battery backup,
2 10K SCSI disks in RAID1 for OS and WAL (with its own partition on
ext3), 6 10K SCSI disks in RAID10 (RAID1 in hw, RAID0 on top of that
in sw). Postgres config tweaked as per the performance guide.

Initialized the data with: pgbench -i -s 100
Test runs: pgbench -s 100 -t 10000 -c 20
I did 20 runs, removed the first 3 runs from each sample to account
for stabilization. Here are the results in tps without connection
establishing:

FS:     JFS    XFS   EXT3
Avg:    462    425    319
Stdev:  104     74    106

Interestingly, the first 3 samples I removed had a MUCH higher tps
count, up to 900+.
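
For reference, a shell loop along these lines reproduces the
procedure; the "bench" database name and the awk parsing are
illustrative, not the exact script:

# 20 runs, appending the "excluding connections" tps figure to a log
for i in `seq 1 20`; do
  pgbench -s 100 -t 10000 -c 20 bench 2>&1 \
    | awk '/excluding connections/ {print $3}' >> tps.log
done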

Bye,

Guy.

--
Family management on rails: http://www.famundo.com - coming soon!
My development related blog: http://devblog.famundo.com

Re: Some pgbench results

From: Bernhard Weisshuhn
Just Someone wrote:

> 2 10K SCSI disks in RAID1 for OS and WAL (with its own partition on
> ext3),

You'll want the WAL on its own spindle. IIRC a separate partition on a
shared disc won't give you much benefit. The idea is to keep the disc's
head from moving away for other tasks. Or so they say.
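
A sketch of the usual way to do that, with hypothetical paths and
postgres stopped first:

# move the WAL directory to its own disc and symlink it back
mv /var/lib/pgsql/data/pg_xlog /mnt/waldisc/pg_xlog
ln -s /mnt/waldisc/pg_xlog /var/lib/pgsql/data/pg_xlog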

regards,
bkw

Re: Some pgbench results

From: Jim Nasby
On Mar 23, 2006, at 11:32 AM, Bernhard Weisshuhn wrote:

> Just Someone wrote:
>
>> 2 10K SCSI disks in RAID1 for OS and WAL (with its own partition on
>> ext3),
>
> You'll want the WAL on its own spindle. IIRC a separate partition
> on a shared disc won't give you much benefit. The idea is to keep
> the disc's head from moving away for other tasks. Or so they say.

Actually, the OS partitions are normally quiet enough that it won't
make a huge difference, unless you're really hammering the database
all the time.
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461



Re: Some pgbench results

From: Jim Nasby
On Mar 23, 2006, at 11:01 AM, Just Someone wrote:

> I was doing some load testing on a server, and decided to test it with
> different file systems to see how it reacts to load/speed. I tested
> xfs, jfs and ext3. The machine runs FC4 with the latest 2.6.15 kernel
> from Fedora.

You should also try testing ext3 with data=writeback, on both
partitions. People have found it makes a big difference in performance.
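
A hypothetical /etc/fstab entry for that; the device and mount point
here are just examples:

/dev/md0  /var/lib/pgsql  ext3  noatime,data=writeback  0 0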
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461



Re: Some pgbench results

From: "Just Someone"
Jim,

I did another test with ext3 using data=writeback, and indeed it's much better:

Avg:    429.87
Stdev:  77

A bit (a very tiny bit) faster than xfs and a bit slower than jfs.
Still, very much improved.

Bye,

Guy.


On 3/23/06, Jim Nasby <jnasby@pervasive.com> wrote:
> On Mar 23, 2006, at 11:32 AM, Bernhard Weisshuhn wrote:
>
> > Just Someone wrote:
> >
> >> 2 10K SCSI disks in RAID1 for OS and WAL (with its own partition on
> >> ext3),
> >
> > You'll want the WAL on its own spindle. IIRC a separate partition
> > on a shared disc won't give you much benefit. The idea is to keep
> > the disc's head from moving away for other tasks. Or so they say.
>
> Actually, the OS partitions are normally quiet enough that it won't
> make a huge difference, unless you're really hammering the database
> all the time.
> --
> Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
> Pervasive Software      http://pervasive.com    work: 512-231-6117
> vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


--
Family management on rails: http://www.famundo.com - coming soon!
My development related blog: http://devblog.famundo.com

Re: Some pgbench results

From: "Just Someone"
Hi,

> Did you re-initialize the test pgbench database between runs?
> I get weird results otherwise since some integers overflow in the
> test (it doesn't complete the full 10000 transactions after the first run).

No, I didn't. The reason is that I noticed the first run is always
MUCH faster: if I reinit pgbench and run again, the initial runs
always hover around 900-970 tps for xfs. And I didn't need this as a
real performance test; it was a side effect of a load test I was
doing on the server. Also, pgbench isn't close to the load I'll see
on my server (a web application that will be mostly reads).

> Could you please tell me what stripe size you have on the raid system?
> Could you also share the mkfs and mount options on each filesystem you
> tried?

RAID stripe size of 256K.

File system creation:
xfs: mkfs -t xfs -l size=64m /dev/md0
jfs: mkfs -t jfs /dev/md0

Mount options:
xfs: -o noatime,nodiratime,logbufs=8
jfs: -o noatime,nodiratime

> A hint on using a raided ext3 system is to use the whole block device
> instead of partitions to align the data better, and to use data=journal
> with a big journal. This might seem counter-productive at first (it did
> to me) but I increased my throughput a lot when using this.

Thanks for the advice! Actually, the RAID 10 I have is mounted as
/var/lib/pgsql, so it's ONLY for postgres data, and the pg_xlog
directory is mounted on another disk.

> My filesystem parameters are calculated like this:
> stripe=256 # <- 256k raid stripe size
> bsize=4 # 4k blocksize
> bsizeb=$(( $bsize * 1024 )) # in bytes
> stride=$(( $stripe / $bsize ))
>
> mke2fs -b $bsizeb -j -J size=400 -m 1 -O sparse_super \
>   -T largefile4 -E stride=$stride /dev/sdb
>
> Mounted with: mount -t ext3 -o data=journal,noatime /dev/sdb /mnt/test8

That's an interesting thing to try, though because of other things I
want, I prefer xfs or jfs anyway. I will have an extreme number of
schemas and files, which make high demands on the directory structure.
My tests showed me that ext3 doesn't cope with many files in
directories very well. With xfs and jfs I can create 500K files in one
directory in no time (about 250 seconds); with ext3 it starts to crawl
after about 30K files.
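
The file-creation test was essentially of this shape; the directory
and count shown are illustrative:

# time the creation of 500K empty files in a single directory
mkdir /tmp/filetest && cd /tmp/filetest
time ( i=0; while [ $i -lt 500000 ]; do : > f$i; i=$((i+1)); done )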

> I'm a little surprised that I can get more pgbench performance out of my
> system since you're using 10K scsi disks. Please try the above settings
> and see if it helps you...
>
> I've not run so many tests yet, I'll do some more after the weekend...

Please share the results. It's very interesting...

Bye,

Guy.

BTW, one thing I also tested is a software RAID0 over two RAID5 SATA
arrays. Total disk count in this is 15. The read performance was
really good; the write performance, as expected, was not so great. But
that was just a test to get a feeling for the speed. This RAID5 system
is only used for file storage, not database.

--
Family management on rails: http://www.famundo.com - coming soon!
My development related blog: http://devblog.famundo.com

Re: Some pgbench results

From: "Just Someone"
I played a bit with kernel versions as I was getting a kernel panic
on my Adaptec card. I downgraded to 2.6.11 (the original that came
with Fedora Core 4) and the panic went away, but more than that, the
performance on XFS went considerably higher. With the exact same
settings as before, I now got an average of 813.65 tps with a
standard deviation of 130.33.

I hope this kernel doesn't panic on me. But I'll know by tomorrow, as
I'm pounding on the machine now.

Bye,

Guy.


On 3/23/06, Magnus Naeslund(f) <mag@fbab.net> wrote:
> Just Someone wrote:
> >
> > Initialized the data with: pgbench -i -s 100
> > Test runs: pgbench -s 100 -t 10000 -c 20
> > I did 20 runs, removed the first 3 runs from each sample to account
> > for stabilization.
>
> Did you re-initialize the test pgbench database between runs?
> I get weird results otherwise since some integers overflow in the
> test (it doesn't complete the full 10000 transactions after the first run).
>
> > Here are the results in tps without connection
> > establishing:
> >
> > FS:       JFS     XFS     EXT3
> > Avg:     462      425       319
> > Stdev:  104        74       106
> >
>
> Could you please tell me what stripe size you have on the raid system?
> Could you also share the mkfs and mount options on each filesystem you
> tried?
>
> I ran some tests on a somewhat similar system:
> A Supermicro H8SSL-i-B motherboard with one dual-core Opteron 165 and
> 4GB of memory, Debian sarge amd64 (current stable) but with a pristine
> kernel.org 2.6.16 kernel (there are no Debian patches or packages yet).
>
> It has a 3ware 9550 + BBU SATA RAID card with 6 disks in a RAID 10
> configuration with 256KB stripe size. I think this results in about
> 200MB/s raw read performance and about 155MB/s raw write performance
> (as tested by dd'ing a 10GB file back and forth).
> I had no separate WAL device/partition, only tweaked postgresql.conf.
>
> I get about 520-530 tps with your pgbench parameters on ext3 but very
> poor (an order of magnitude worse) performance on xfs (that's why I
> asked about your mkfs parameters).
>
> A hint on using a raided ext3 system is to use the whole block device
> instead of partitions to align the data better, and to use data=journal
> with a big journal. This might seem counter-productive at first (it did
> to me) but I increased my throughput a lot when using this.
>
> My filesystem parameters are calculated like this:
> stripe=256 # <- 256k raid stripe size
> bsize=4 # 4k blocksize
> bsizeb=$(( $bsize * 1024 )) # in bytes
> stride=$(( $stripe / $bsize ))
>
> mke2fs -b $bsizeb -j -J size=400 -m 1 -O sparse_super \
>   -T largefile4 -E stride=$stride /dev/sdb
>
> Mounted with: mount -t ext3 -o data=journal,noatime /dev/sdb /mnt/test8
>
> I'm a little surprised that I can get more pgbench performance out of my
> system since you're using 10K scsi disks. Please try the above settings
> and see if it helps you...
>
> I've not run so many tests yet, I'll do some more after the weekend...
>
> Regards,
> Magnus


--
Family management on rails: http://www.famundo.com - coming soon!
My development related blog: http://devblog.famundo.com

Re: Some pgbench results

From: "Magnus Naeslund(f)"
Just Someone wrote:
[snip]
>>
>> mke2fs -b $bsizeb -j -J size=400 -m 1 -O sparse_super \
>>   -T largefile4 -E stride=$stride /dev/sdb
>>
>> Mounted with: mount -t ext3 -o data=journal,noatime /dev/sdb /mnt/test8
>
> That's an interesting thing to try, though because of other things I
> want, I prefer xfs or jfs anyway. I will have an extreme number of
> schemas and files, which make high demands on the directory structure.
> My tests showed me that ext3 doesn't cope with many files in
> directories very well. With xfs and jfs I can create 500K files in one
> directory in no time (about 250 seconds), with ext3 it start to crawl
> after about 30K files.
>

It might seem that I'm selling ext3 or something :) but it's the Linux
filesystem I know best.
If you want ext3 to perform well with large directories, there is an mkfs
option that enables directory hashing that you can try: -O dir_index.
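
That is, something like this at mkfs time; the device name is just an
example:

mke2fs -j -O dir_index /dev/sdb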

Regards,
Magnus


Re: Some pgbench results

From: "Just Someone"
Hi Magnus,

> It might seem that I'm selling ext3 or something :) but it's the Linux
> filesystem I know best.
> If you want ext3 to perform well with large directories, there is an mkfs
> option that enables directory hashing that you can try: -O dir_index.

Not at all (selling ext3 ;-) ). It's great to get this kind of info! I'd
rather use ext3 as it's VERY stable, and the default in Fedora
anyway. So thanks for the tip!

Bye,

Guy.

--
Family management on rails: http://www.famundo.com - coming soon!
My development related blog: http://devblog.famundo.com

Re: Some pgbench results

From: Douglas McNaught
"Magnus Naeslund(f)" <mag@fbab.net> writes:

> It might seem that I'm selling ext3 or something :) but it's the Linux
> filesystem I know best.
> If you want ext3 to perform well with large directories, there is an mkfs
> option that enables directory hashing that you can try: -O dir_index.

You can also turn it on for an existing filesystem using 'tune2fs' and
a remount, but it won't hash already-existing large directories--those
will have to be recreated to take advantage of hashing.
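
Roughly like so; the device and mount point are examples:

tune2fs -O dir_index /dev/sdb
umount /mnt/test8 && mount -t ext3 /dev/sdb /mnt/test8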

-Doug

Re: Some pgbench results

From: "Magnus Naeslund(k)"
Just Someone wrote:
>
> Initialized the data with: pgbench -i -s 100
> Test runs: pgbench -s 100 -t 10000 -c 20
> I did 20 runs, removed the first 3 runs from each sample to account
> for stabilization.

Did you re-initialize the test pgbench database between runs?
I get weird results otherwise since some integers overflow in the
test (it doesn't complete the full 10000 transactions after the first run).

> Here are the results in tps without connection
> establishing:
>
> FS:       JFS     XFS     EXT3
> Avg:     462      425       319
> Stdev:  104        74       106
>

Could you please tell me what stripe size you have on the raid system?
Could you also share the mkfs and mount options on each filesystem you
tried?

I ran some tests on a somewhat similar system:
A Supermicro H8SSL-i-B motherboard with one dual-core Opteron 165 and
4GB of memory, Debian sarge amd64 (current stable) but with a pristine
kernel.org 2.6.16 kernel (there are no Debian patches or packages yet).

It has a 3ware 9550 + BBU SATA RAID card with 6 disks in a RAID 10
configuration with 256KB stripe size. I think this results in about
200MB/s raw read performance and about 155MB/s raw write performance
(as tested by dd'ing a 10GB file back and forth).
I had no separate WAL device/partition, only tweaked postgresql.conf.
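
The dd test was roughly of this form; the file name and block size
are from memory rather than the exact invocation:

# sequential write, then read back, of a 10GB test file
dd if=/dev/zero of=/mnt/test8/ddtest bs=1M count=10000
dd if=/mnt/test8/ddtest of=/dev/null bs=1M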

I get about 520-530 tps with your pgbench parameters on ext3 but very
poor (an order of magnitude worse) performance on xfs (that's why I
asked about your mkfs parameters).

A hint on using a raided ext3 system is to use the whole block device
instead of partitions to align the data better, and to use data=journal
with a big journal. This might seem counter-productive at first (it did
to me) but I increased my throughput a lot when using this.

My filesystem parameters are calculated like this:
stripe=256 # <- 256k raid stripe size
bsize=4 # 4k blocksize
bsizeb=$(( $bsize * 1024 )) # in bytes
stride=$(( $stripe / $bsize ))

mke2fs -b $bsizeb -j -J size=400 -m 1 -O sparse_super \
  -T largefile4 -E stride=$stride /dev/sdb

Mounted with: mount -t ext3 -o data=journal,noatime /dev/sdb /mnt/test8

I'm a little surprised that I can get more pgbench performance out of my
system since you're using 10K scsi disks. Please try the above settings
and see if it helps you...

I've not run so many tests yet, I'll do some more after the weekend...

Regards,
Magnus