Thread: Can anyone explain these pgbench results?

Can anyone explain these pgbench results?

From
"Joost Kraaijeveld"
Date:
Hi,

Below are some results of running pgbench on a machine that is doing nothing else than running PostgreSQL with
pgbench. The strange thing is that the results are *constantly alternating* between high (750-850 transactions) and low
(50-80 transactions), no matter how many tests I run. If I wait a long time (> 5 minutes) after running the test, I
always get a high score, followed by a low one, followed by a high one, a low one, etc.

I was expecting a low(ish) score on the first run (because the tables are not loaded in the cache yet), followed by
continuous high(ish) scores, but not an alternating pattern. I also did not expect so much difference, given the hardware
I have (dual Opteron, 4 GB memory, 3Ware SATA RAID5 with 5 disks, separate swap and pg_xlog disks).

Anyone any idea?

Results of pgbench:

postgres@panoramix:/usr/lib/postgresql/8.1/bin$ ./pgbench  -c 10 -t 150 test
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 100
number of clients: 10
number of transactions per client: 150
number of transactions actually processed: 1500/1500
tps = 50.651705 (including connections establishing)
tps = 50.736338 (excluding connections establishing)
postgres@panoramix:/usr/lib/postgresql/8.1/bin$ ./pgbench  -c 10 -t 150 test
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 100
number of clients: 10
number of transactions per client: 150
number of transactions actually processed: 1500/1500
tps = 816.972995 (including connections establishing)
tps = 836.951755 (excluding connections establishing)
postgres@panoramix:/usr/lib/postgresql/8.1/bin$ ./pgbench  -c 10 -t 150 test
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 100
number of clients: 10
number of transactions per client: 150
number of transactions actually processed: 1500/1500
tps = 42.924294 (including connections establishing)
tps = 42.986747 (excluding connections establishing)
postgres@panoramix:/usr/lib/postgresql/8.1/bin$ ./pgbench  -c 10 -t 150 test
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 100
number of clients: 10
number of transactions per client: 150
number of transactions actually processed: 1500/1500
tps = 730.651970 (including connections establishing)
tps = 748.538852 (excluding connections establishing)


TIA

Groeten,

Joost Kraaijeveld
Askesis B.V.
Molukkenstraat 14
6524NB Nijmegen
tel: 024-3888063 / 06-51855277
fax: 024-3608416
e-mail: J.Kraaijeveld@Askesis.nl
web: www.askesis.nl

Re: Can anyone explain these pgbench results?

From
Michael Fuhr
Date:
On Mon, Mar 06, 2006 at 04:29:49PM +0100, Joost Kraaijeveld wrote:
> Below are some results of running pgbench on a machine that
> is doing nothing else than running PostgreSQL with pgbench. The
> strange thing is that the results are *constantly alternating*
> between high (750-850 transactions) and low (50-80 transactions),
> no matter how many tests I run. If I wait a long time (> 5 minutes)
> after running the test, I always get a high score, followed by a
> low one, followed by a high one, a low one, etc.

The default checkpoint_timeout is 300 seconds (5 minutes).  Is it
coincidence that the "long time" between fast results is about the
same?  What's your setting?  Are your test results more consistent
if you execute CHECKPOINT between them?

--
Michael Fuhr

Re: Can anyone explain these pgbench results?

From
"Joost Kraaijeveld"
Date:
Hi Michael,

Thanks for your response.

Michael Fuhr wrote:
> On Mon, Mar 06, 2006 at 04:29:49PM +0100, Joost Kraaijeveld wrote:
>> Below are some results of running pgbench on a machine that
>> is doing nothing else than running PostgreSQL with pgbench. The
>> strange thing is that the results are *constantly alternating*
>> between high (750-850 transactions) and low (50-80 transactions),
>> no matter how many tests I run. If I wait a long time (> 5 minutes)
>> after running the test, I always get a high score, followed by a
>> low one, followed by a high one, a low one, etc.
>
> The default checkpoint_timeout is 300 seconds (5 minutes).  Is it
> coincidence that the "long time" between fast results is about the
> same?
I have not measured the "long wait time". But I can run multiple tests in 3 minutes: the fast test lasts 3 secs, the slow
one 40 secs (see below). During the tests there is not much activity on the partition where the logfiles are (other
controller and disk than the database and swap).

postgres@panoramix:/usr/lib/postgresql/8.1/bin$ time ./pgbench  -c 10 -t 150 test
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 100
number of clients: 10
number of transactions per client: 150
number of transactions actually processed: 1500/1500
tps = 531.067258 (including connections establishing)
tps = 541.694790 (excluding connections establishing)

real    0m2.892s
user    0m0.105s
sys     0m0.145s


postgres@panoramix:/usr/lib/postgresql/8.1/bin$ time ./pgbench  -c 10 -t 150 test
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 100
number of clients: 10
number of transactions per client: 150
number of transactions actually processed: 1500/1500
tps = 37.064000 (including connections establishing)
tps = 37.114023 (excluding connections establishing)

real    0m40.531s
user    0m0.088s
sys     0m0.132s

> What's your setting?
Default.

> Are your test results more consistent
> if you execute CHECKPOINT between them?
Could you tell me how I could do that?



Groeten,

Joost Kraaijeveld

Re: Can anyone explain these pgbench results?

From
Michael Fuhr
Date:
On Mon, Mar 06, 2006 at 07:46:05PM +0100, Joost Kraaijeveld wrote:
> Michael Fuhr wrote:
> > What's your setting?
>
> Default.

Have you tweaked postgresql.conf at all?  If so, what non-default
settings are you using?

> > Are your test results more consistent
> > if you execute CHECKPOINT between them?
>
> Could you tell me how I could do that?

Connect to the database as a superuser and execute a CHECKPOINT
statement.

http://www.postgresql.org/docs/8.1/interactive/sql-checkpoint.html

From the command line you could do something like

psql -c checkpoint
pgbench -c 10 -t 150 test
psql -c checkpoint
pgbench -c 10 -t 150 test
psql -c checkpoint
pgbench -c 10 -t 150 test

--
Michael Fuhr

Re: Can anyone explain these pgbench results?

From
Michael Fuhr
Date:
[Please copy the mailing list on replies.]

On Mon, Mar 06, 2006 at 09:38:20PM +0100, Joost Kraaijeveld wrote:
> Michael Fuhr wrote:
> > Have you tweaked postgresql.conf at all?  If so, what non-default
> > settings are you using?
>
> Yes, I have tweaked the following settings:
>
> shared_buffers = 40000
> work_mem = 512000
> maintenance_work_mem = 512000
> max_fsm_pages = 40000
> effective_cache_size = 131072

Are you sure you need work_mem that high?  How did you decide on
that value?  Are all other settings at their defaults?  No changes
to the write ahead log (WAL) or background writer (bgwriter) settings?
What version of PostgreSQL are you running?  The paths in your
original message suggest 8.1.x.

> >>> Are your test results more consistent
> > psql -c checkpoint
> > pgbench -c 10 -t 150 test
> > psql -c checkpoint
> > pgbench -c 10 -t 150 test
> > psql -c checkpoint
> > pgbench -c 10 -t 150 test
>
> OK, that leads to a consistently high score. I also noticed that
> "psql -c checkpoint" results in I/O on the database partition but
> not on the partition that has the logfiles (pg_xlog directory). Do
> you know if that is how it should be?

A checkpoint updates the database files with the data from the
write-ahead log; you're seeing those writes to the database partition.
The postmaster does checkpoints every checkpoint_timeout seconds
(default 300) or every checkpoint_segments log segments (default 3);
it also uses a background writer to trickle pages to the database
files between checkpoints so the checkpoints don't have as much
work to do.  I've been wondering if your pgbench runs are being
affected by that background activity; the fact that you get
consistently good performance after forcing a checkpoint suggests
that that might be the case.

If you run pgbench several times without intervening checkpoints,
do your postmaster logs have any messages like "checkpoints are
occurring too frequently"?  It might be useful to increase
checkpoint_warning up to the value of checkpoint_timeout and then
see if you get any such messages during pgbench runs.  If checkpoints
are happening a lot more often than every checkpoint_timeout seconds
then try increasing checkpoint_segments (assuming you have the disk
space).  After doing so, restart the database and run pgbench several
times without intervening checkpoints and see if performance is
more consistent.
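The arithmetic behind that suggestion can be sketched roughly as follows. The WAL-volume-per-transaction figure below is a made-up assumption for illustration (real TPC-B transactions vary); the 16 MB segment size is the PostgreSQL default.

```python
# Rough estimate of how often WAL-segment-triggered checkpoints fire
# during a pgbench run. The WAL-per-transaction figure is an assumed
# placeholder, not a measured value.

WAL_SEGMENT_BYTES = 16 * 1024 * 1024   # default WAL segment size: 16 MB

def checkpoint_interval_tx(checkpoint_segments, wal_bytes_per_tx):
    """Transactions processed between segment-triggered checkpoints."""
    return (checkpoint_segments * WAL_SEGMENT_BYTES) // wal_bytes_per_tx

# Assume ~1 KB of WAL per TPC-B-style transaction (hypothetical figure).
tx_default = checkpoint_interval_tx(3, 1024)    # default: 3 segments
tx_tuned   = checkpoint_interval_tx(10, 1024)   # tuned: 10 segments

print(tx_default)  # 49152 transactions before a forced checkpoint
print(tx_tuned)    # 163840 transactions
```

The point of the sketch: with only 3 segments, a sustained write-heavy benchmark can consume the WAL budget quickly and force checkpoints far more often than every 300 seconds, which is exactly what the "checkpoints are occurring too frequently" warning detects.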

Note that tuning PostgreSQL for pgbench performance might be
irrelevant for your actual needs unless your usage patterns happen
to resemble what pgbench does.

--
Michael Fuhr

Re: Can anyone explain these pgbench results?

From
"Joost Kraaijeveld"
Date:
Hi Michael,

Michael Fuhr wrote:
>>> Have you tweaked postgresql.conf at all?  If so, what non-default
>>> settings are you using?
>>
>> Yes, I have tweaked the following settings:
>>
>> shared_buffers = 40000
>> work_mem = 512000
>> maintenance_work_mem = 512000
>> max_fsm_pages = 40000
>> effective_cache_size = 131072
>
> Are you sure you need work_mem that high?  How did you decide on
> that value?
I have used http://www.powerpostgresql.com/Downloads/annotated_conf_80.html, expecting that the differences between
8.0 and 8.1 do not invalidate the recommendations. I have checked with (some) of my (large) queries and adjusted upward
until I had no temp files in PGDATA/base/DB_OID/pgsql_tmp. (The warning about

> Are all other settings at their defaults?
Yep.

> No changes to the write ahead log (WAL) or background writer (bgwriter) settings?
No, because the aforementioned document explicitly states that it has recommendations on these subjects.

> What version of PostgreSQL are you running?  The paths in your
> original message suggest 8.1.x.
Debian Etch's 8.1.0-3

> A checkpoint updates the database files with the data from the
> write-ahead log; you're seeing those writes to the database partition.
> The postmaster does checkpoints every checkpoint_timeout seconds
> (default 300) or every checkpoint_segment log segments (default 3);
> it also uses a background writer to trickle pages to the database
> files between checkpoints so the checkpoints don't have as much
> work to do.  I've been wondering if your pgbench runs are being
> affected by that background activity; the fact that you get
> consistently good performance after forcing a checkpoint suggests
> that that might be the case.
OK, thanks.

To be sure I understand it correctly:

1. Every update/insert is first written to a WAL log file, which is in the PGDATA/pg_xlog directory.
2. The background writer then routinely writes the changes to the PGDATA/base/DB_OID/ directory.
3. After 300 secs, or if the log segments are full (whichever comes first?), the postmaster forces a checkpoint so that
the WAL log files are emptied (I assume those are the changes the background writer has not yet written since the last
forced checkpoint?).

> If you run pgbench several times without intervening checkpoints,
> do your postmaster logs have any messages like "checkpoints are
> occurring too frequently"?  It might be useful to increase
> checkpoint_warning up to the value of checkpoint_timeout and then
> see if you get any such messages during pgbench runs.  If checkpoints
> are happening a lot more often than every checkpoint_timeout seconds
> then try increasing checkpoint_segments (assuming you have the disk
> space).  After doing so, restart the database and run pgbench several
> times without intervening checkpoints and see if performance is
> more consistent.
I will try that today.

> Note that tuning PostgreSQL for pgbench performance might be
> irrelevant for your actual needs unless your usage patterns happen
> to resemble what pgbench does.

The advantage of using pgbench is a repeatable, short command that leads to something that shows up in actual real-world
usage.

My problem is with the raw performance of my disk array (3Ware 9500S-8 SATA RAID5 controller with 5 disks). I am having
*very* serious performance problems if I do large updates on my databases. E.g. an update of 1 (boolean) column in a
table (update prototype.customers set deleted = false) that has 368915 records lasts forever (> 3500 secs). The only
noticeable disk activity during such an update is on the disk/partition that has the PGDATA/base/DB_OID/ directory
(/dev/sdc, the 3Ware 9500S-8 RAID5 array). There is *no* noticeable disk activity on the disk/partition that has the
PGDATA/pg_xlog directory (/dev/sdb, on a Sil 3114 on-board SATA controller). The throughput during the update is ~ 2
MB/sec. The throughput during a large file copy or running bonnie (a benchmark) is > 40 MB/sec. My primary goal is to
understand the difference (and then sue the guilty ones ;-)), and then maybe either learn to live with it or find a
solution. The number of write operations/sec during the update is ~ 2000/sec. I suspect that the RAID card cannot
handle a lot of small write operations (with fsync?) in a short time without a performance penalty (and yes, the write
cache on the controller is enabled).
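A back-of-envelope check of the numbers reported above is consistent with that suspicion: ~2000 writes/sec at ~2 MB/sec works out to roughly 1 KB per write, i.e. many small scattered writes, which is the worst case for RAID5's read-modify-write penalty (each small write costs about 4 disk I/Os: read old data, read old parity, write new data, write new parity). The 4x factor is the textbook RAID5 penalty, not a measured value for this controller.

```python
# Sanity-check the figures from the update described above.
rows = 368_915
seconds = 3_500
rows_per_sec = rows / seconds                   # ~105 rows/s

writes_per_sec = 2_000
throughput = 2 * 1024 * 1024                    # ~2 MB/s observed
bytes_per_write = throughput / writes_per_sec   # ~1 KB per write

# Textbook RAID5 read-modify-write: ~4 back-end disk I/Os per small write.
ios_per_sec = writes_per_sec * 4                # ~8000 disk I/Os/s

print(round(rows_per_sec))     # 105
print(round(bytes_per_write))  # 1049
print(ios_per_sec)             # 8000
```

Sustaining thousands of tiny back-end I/Os per second across a 5-disk array would plausibly saturate it even though sequential throughput (bonnie, dd) looks fine.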


Groeten,

Joost Kraaijeveld

Re: Can anyone explain these pgbench results?

From
"Joost Kraaijeveld"
Date:
Hi Michael,

Michael Fuhr wrote:
> If you run pgbench several times without intervening checkpoints,
> do your postmaster logs have any messages like "checkpoints are
> occurring too frequently"?  It might be useful to increase
> checkpoint_warning up to the value of checkpoint_timeout and then
> see if you get any such messages during pgbench runs.  If checkpoints
> are happening a lot more often than every checkpoint_timeout seconds
> then try increasing checkpoint_segments (assuming you have the disk
> space).  After doing so, restart the database and run pgbench several
> times without intervening checkpoints and see if performance is
> more consistent.
I got the "checkpoints are occurring too frequently" message. Increasing the number of checkpoint_segments from the
default 3 to 10 resulted in more tests without a performance penalty (~ 5-6 tests). The performance penalty is also a
little smaller. It takes several minutes for the background writer to catch up.

This will solve my problem at the customer's site (they do not run so many sales transactions per second), but not my
own problem while converting the old database to a new database :-(. Maybe I should invest in other hardware or
re-arrange my RAID5 into a RAID10 (or 50???).

Groeten,

Joost Kraaijeveld

Re: Can anyone explain these pgbench results?

From
"Jim C. Nasby"
Date:
On Mon, Mar 06, 2006 at 04:29:49PM +0100, Joost Kraaijeveld wrote:
> I was expecting a low(ish) score on the first run (because the tables are not loaded in the cache yet), followed by
> continuous high(ish) scores, but not an alternating pattern. I also did not expect so much difference, given the
> hardware I have (dual Opteron, 4 GB memory, 3Ware SATA RAID5 with 5 disks, separate swap and pg_xlog disks).

On a side-note:
RAID5 and databases generally don't mix well.

Most people find that pg_xlog will live happily with the OS; it's the
data files that need the most bandwidth.

If you start swapping, performance will tank to the point that it's
unlikely that swap being on separate disks will help at all. You're
better off just keeping it with the OS and using the disks for the
database tables.

Speaking of 'disks', what's your exact layout? Do you have a 5 drive
raid5 for the OS and the database, 1 drive for swap and 1 drive for
pg_xlog?
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461

Re: Can anyone explain these pgbench results?

From
"Joost Kraaijeveld"
Date:
Jim C. Nasby wrote:

> Speaking of 'disks', what's your exact layout? Do you have a 5 drive
> raid5 for the OS and the database, 1 drive for swap and 1 drive for
> pg_xlog?

On a Sil SATA 3114 controller:
/dev/sda OS + Swap
/dev/sdb /var with pg_xlog

On the 3Ware 9500S-8, 5 disk array:
/dev/sdc with the database (and very safe, my MP3 collection ;-))

As I wrote in one of my posts to Michael, I suspect that the card is not handling the amount of write operations as
well as I expected. I wonder if anyone else sees the same characteristics with this kind of card.


Groeten,

Joost Kraaijeveld

Re: Can anyone explain these pgbench results?

From
"Jim C. Nasby"
Date:
On Tue, Mar 07, 2006 at 08:49:30PM +0100, Joost Kraaijeveld wrote:
>
> Jim C. Nasby wrote:
>
> > Speaking of 'disks', what's your exact layout? Do you have a 5 drive
> > raid5 for the OS and the database, 1 drive for swap and 1 drive for
> > pg_xlog?
>
> On a Sil SATA 3114 controller:
> /dev/sda OS + Swap
> /dev/sdb /var with pg_xlog
>
> On the 3Ware 9500S-8, 5 disk array:
> /dev/sdc with the database (and very safe, my MP3 collection ;-))
>
> As I wrote in one of my posts to Michael, I suspect that the card is not handling the amount of write operations as
> well as I expected. I wonder if anyone else sees the same characteristics with this kind of card.

Well, the problem is that you're using RAID5, which has a huge write
overhead. You're unlikely to get good performance with it.

Also, it sounds like sda and sdb are not mirrored. If that's the case,
you have no protection from a drive failure taking out your entire
database, because you'd lose pg_xlog.

If you want better performance your best bets are to either setup RAID10
or if you don't care about the data, just go to RAID0.
--
Jim C. Nasby

Re: Can anyone explain these pgbench results?

From
"Joost Kraaijeveld"
Date:
Jim C. Nasby wrote:
> Well, the problem is that you're using RAID5, which has a huge write
> overhead. You're unlikely to get good performance with it.
Apparently. But I had no idea that the performance hit would be that big.

Running bonnie or copying a large file with dd shows that the card can do 30-50 MB/sec. Running a large update on my
PostgreSQL database, however, shows a throughput of ~ 2 MB/sec, doing between ~ 2300-2500 writes/second (average), with
a utilisation of almost always 100%, large await times (almost always > 700), and large io-wait percentages (> 50%),
all measured with iostat.

> Also, it sounds like sda and sdb are not mirrored. If that's the case,
> you have no protection from a drive failure taking out your entire
> database, because you'd lose pg_xlog.
>
> If you want better performance your best bets are to either
> setup RAID10 or if you don't care about the data, just go to RAID0.
Because it is just my development machine I think I will opt for the latter option. More disk space left.


Groeten,

Joost Kraaijeveld

Re: Can anyone explain these pgbench results?

From
"Jim C. Nasby"
Date:
On Tue, Mar 07, 2006 at 09:15:37PM +0100, Joost Kraaijeveld wrote:
> Jim C. Nasby wrote:
> > Well, the problem is that you're using RAID5, which has a huge write
> > overhead. You're unlikely to get good performance with it.
> Apparently. But I had no idea that the performance hit would be that big.
>
> Running bonnie or copying a large file with dd shows that the card can do 30-50 MB/sec. Running a large update on my
> PostgreSQL database, however, shows a throughput of ~ 2 MB/sec, doing between ~ 2300-2500 writes/second (average),
> with a utilisation of almost always 100%, large await times (almost always > 700), and large io-wait percentages
> (> 50%), all measured with iostat.

While there are reasons PostgreSQL won't get as close to the
theoretical maximum as a dd bs=8k does (you did match the block size
to PostgreSQL's page size, right? :) ), a bigger issue in this case is
that better cards are able to remove much or all of the RAID5 write
penalty when you're doing a large sequential write, because they will
just blow entire stripes down to disk. This is very different from
doing more random I/O. And it's also very possible that if you use a
block size that's smaller than the stripe size, the controller won't
be able to pick up on that.
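The full-stripe point above can be sketched with simple I/O counting. This is the textbook model, not a description of any specific controller; the stripe-aligned case also assumes the controller actually detects the full-stripe write.

```python
def rmw_ios(data_blocks):
    """Read-modify-write path: ~4 disk I/Os per data block written
    (read old data, read old parity, write new data, write new parity)."""
    return 4 * data_blocks

def full_stripe_ios(n_disks, data_blocks):
    """Full-stripe path: parity is computed in cache, so each stripe of
    (n_disks - 1) data blocks costs n_disks writes and no reads."""
    stripe = n_disks - 1
    assert data_blocks % stripe == 0, "assume stripe-aligned writes"
    return (data_blocks // stripe) * n_disks

# 5-disk RAID5 (like the 3Ware array here), writing 64 data blocks:
print(rmw_ios(64))             # 256 I/Os via read-modify-write
print(full_stripe_ios(5, 64))  # 80 I/Os via full-stripe writes
```

Roughly a 3x difference in back-end I/O for the same payload, which is why the same array can look fast under dd/bonnie and slow under scattered 8 KB database writes.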

In any case, RAID0 will absolutely be the fastest performance you can
get.
--
Jim C. Nasby