Thread: Opteron scaling with PostgreSQL

Opteron scaling with PostgreSQL

From: Steve Wolfe
   Some time ago, I asked about how well PostgreSQL scales with the
number of processors in an Opteron system.  To my surprise, no one
seemed to know!  Well, a couple of days ago, a shiny, new Celestica
A8440 showed up at my office, so I decided to put it through its paces.
  Hopefully, this will be useful to someone else as well!

Hardware info
-------------
Celestica A8440
4xOpteron 848
8 gigs PC3200 reg/ECC memory

Software info
-------------
Fedora Core 2 x86-64
PostgreSQL 7.4.2
Added compile options: -O3 -m64
Startup options:  256 MB shared buffer, fsync OFF to eliminate the disk
system as a variable, 128 megs sort memory
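
For reference, a sketch of what those options look like in a 7.4
postgresql.conf (shared_buffers is counted in 8 kB pages and sort_mem in
kB, so these numbers are just unit conversions of the figures above):

    shared_buffers = 32768     # 32768 x 8 kB pages = 256 MB
    sort_mem = 131072          # per-sort memory in kB = 128 MB
    fsync = false              # takes the disk subsystem out of the picture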


Testing method
--------------
    I logged 10,000 queries from our production DB server, and wrote a Perl
program to issue them via an arbitrary number of "workers".  Before each
run, the database was "warmed up" by going through two preliminary runs
to ensure that caches and buffers were populated.
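
    For anyone curious, a minimal sketch of that kind of forked-worker
driver using DBI/DBD::Pg is below.  It is not the actual program; the
query file, database name, and connection details are placeholders.

    # sketch only -- queries.sql, the dbname, and the user are hypothetical;
    # each line of queries.sql is one logged query
    use strict;
    use warnings;
    use DBI;
    use Time::HiRes qw(time);

    my $workers = shift || 8;           # number of concurrent connections
    open my $fh, '<', 'queries.sql' or die $!;
    my @queries = <$fh>;
    close $fh;

    my $start = time;
    for my $w (0 .. $workers - 1) {
        my $pid = fork;
        die "fork: $!" unless defined $pid;
        next if $pid;                   # parent keeps launching workers
        my $dbh = DBI->connect('dbi:Pg:dbname=test', 'postgres', '',
                               { RaiseError => 1, AutoCommit => 1 });
        # each child replays its round-robin share of the logged queries
        for my $i (grep { $_ % $workers == $w } 0 .. $#queries) {
            $dbh->do($queries[$i]);
        }
        $dbh->disconnect;
        exit 0;
    }
    wait() for 1 .. $workers;           # parent collects the children
    printf "%d connections: %.0f queries/sec\n",
           $workers, scalar(@queries) / (time() - $start);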

    Instead of removing processors (which would have also reduced the
memory), I used the boot argument "maxcpus" to limit the number of CPUs
that Linux would use.
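
    maxcpus is just a kernel boot parameter; an illustrative grub.conf
kernel line, with the kernel version and root device made up, looks like:

    kernel /vmlinuz-2.6.5 ro root=/dev/sda2 maxcpus=2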

Preliminary thoughts
--------------------
    After playing around, I found that the optimal size for the shared
buffer was 256 megs.  Contrary to my expectations, using a larger shared
buffer resulted in lower throughput.

Results!
--------

maxcpus        max queries per second
-------        ----------------------
1        378 qps @ 32 connections (baseline)
2        609 qps @ 96 connections (161% of baseline)
3        853 qps @ 48 connections (225% of baseline)
4        1033 qps @ 64 connections (273% of baseline)


   A graph of the throughputs for various numbers of CPUs and
connections can be found at  http://www.codon.com/PG-scaling.gif

steve

Re: Opteron scaling with PostgreSQL

From: "Dann Corbit"
> -----Original Message-----
> From: pgsql-general-owner@postgresql.org
> [mailto:pgsql-general-owner@postgresql.org] On Behalf Of Steve Wolfe
> Sent: Thursday, June 10, 2004 2:09 PM
> To: pgsql-general
> Subject: [GENERAL] Opteron scaling with PostgreSQL
>
>
>
>    Some time ago, I asked about how well PostgreSQL scales with the
> number of processors in an Opteron system.  To my surprise, no one
> seemed to know!  Well, a couple of days ago, a shiny, new Celestica
> A8440 showed up at my office, so I decided to run it through
> the paces.
>   Hopefully, this will be useful to someone else as well!
>
> Hardware info
> -------------
> Celestica A8440
> 4xOpteron 848
> 8 gigs PC3200 reg/ECC memory
>
> Software info
> -------------
> Fedora core 2 x86-64
> PostgreSQL 7.4.2
> Added compile options: -O3 -m64
> Startup options:  256 MB shared buffer, fsync OFF to
> eliminate the disk
> system as a variable, 128 megs sort memory

I would very much like to see the same test with fsync on.
A test that reflects real-world use has more value than one that just
shows how fast the system can go.

For a read-only database, fsync could be turned off.  For any other
system it would be hare-brained, and nobody in their right mind would do
it.

> [snip]
>
> Results!
> --------
>
> maxcpus        max queries per second
> -------        ----------------------
> 1        378 qps @ 32 connections (baseline)
> 2        609 qps @ 96 connections (161% of baseline)
> 3        853 qps @ 48 connections (225% of baseline)
> 4        1033 qps @ 64 connections (273% of baseline)
>
>
>    A graph of the throughputs for various numbers of CPUs and
> connections can be found at  http://www.codon.com/PG-scaling.gif

It is very impressive how well the system scales.  I would like to see a
PostgreSQL system run against these guys:
http://www.tpc.org/

It might prove interesting to see how it stacks up against commercial
systems.  Certainly when it comes to dollars per TPS, PostgreSQL would
have a stupendous leg up to start with!
;-)

Re: Opteron scaling with PostgreSQL

From: Steve Wolfe
> I would very much like to see the same test with fsync on.
> A test that reflects real-world use has more value than one that just
> shows how fast the system can go.
>
> For a read-only database, fsync could be turned off.  For any other
> system it would be hare-brained, and nobody in their right mind would
> do it.

   Then I must not be in my right mind. : )

   Before I explain why *I* run with fsync turned off: the main reason
the tests were done without fsync was to test the scalability of the
Opteron platform, not the scalability of my disk subsystem. = )

   I've run with fsync off on my production servers for years.  Power
never goes off, and RAID 5 protects me from disk failures.  Sooner or
later, it may bite me in the butt.  We make backups sufficiently often
that the small amount of data we'll lose will be far offset by the
tremendous performance boost that we've enjoyed.  In fact, we even have
a backup server sitting there doing nothing, which can take over the
duties of the main DB server within a VERY short amount of time.
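
   As an illustration only, not our actual schedule, a crontab entry
along these lines is all it takes to dump the database every few hours;
the database name and backup path are made up:

    # compressed pg_dump of the database every four hours
    0 */4 * * * pg_dump -Fc production > /backups/production_$(date +\%H).dump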

steve

Re: Opteron scaling with PostgreSQL

From: Greg Stark
Steve Wolfe <nw@codon.com> writes:

>    I've run with fsync off on my production servers for years.  Power never
> goes off, and RAID 5 protects me from disk failures.  Sooner or later, it may
> bite me in the butt.  We make backups sufficiently often that the small amount
> of data we'll lose will be far offset by the tremendous performance boost that
> we've enjoyed.  In fact, we even have a backup server sitting there doing
> nothing, which can take over the duties of the main DB server within a VERY
> short amount of time.

That's good, because you'll eventually need it.

All it will take is a Linux crash for the database files on disk to
become corrupted. No amount of UPS or RAID protection will protect you from that.

--
greg

Re: Opteron scaling with PostgreSQL

From: "Dann Corbit"
> -----Original Message-----
> From: pgsql-general-owner@postgresql.org
> [mailto:pgsql-general-owner@postgresql.org] On Behalf Of Greg Stark
> Sent: Saturday, June 12, 2004 12:18 AM
> To: pgsql-general@postgresql.org
> Subject: Re: [GENERAL] Opteron scaling with PostgreSQL
>
>
>
> Steve Wolfe <nw@codon.com> writes:
>
> >    I've run with fsync off on my production servers for
> years.  Power
> > never goes off, and RAID 5 protects me from disk failures.
> Sooner or
> > later, it may bite me in the butt.  We make backups
> sufficiently often
> > that the small amount of data we'll lose will be far offset by the
> > tremendous performance boost that we've enjoyed.  In fact, we even
> > have a backup server sitting there doing nothing, which can
> take over
> > the duties of the main DB server within a VERY short amount of time.
>
> That's good, because you'll eventually need it.
>
> All it will take will be a Linux crash for the database files
> on disk to become corrupted. No amount of UPS or RAID
> protection will protect from that.

Another important point is that the data in an organization is always
more valuable than the hardware and the software.

Hose up the hardware and the software, and insurance gets new stuff.

Hose up the data and you are really hosed for good.

Re: Opteron scaling with PostgreSQL

From: jseymour@linxnet.com (Jim Seymour)
"Dann Corbit" <DCorbit@connx.com> wrote:
[snip]
>
> Another important point is that the data in an organization is always
> more valuable than the hardware and the software.
>
> Hose up the hardware and the software, and insurance gets new stuff.
>
> Hose up the data and you are really hosed for good.

It's amazing how many people don't seem to get that.

Jim


Re: Opteron scaling with PostgreSQL

From: Tom Lane
Greg Stark <gsstark@mit.edu> writes:
> Steve Wolfe <nw@codon.com> writes:
>> I've run with fsync off on my production servers for years.

> All it will take will be a Linux crash for the database files on disk to
> become corrupted. No amount of UPS or RAID protection will protect from that.

And neither will fsync'ing, so I'm not sure what your point is.  Steve
clearly understands the need for backups, so I think he's prepared as
well as he can for worst-case scenarios.  He's determined that the
particular scenarios fsync can protect him against are not big enough
risks *in his environment* to justify the cost.  I can't say that I see
any flaws in his reasoning.

            regards, tom lane

Re: Opteron scaling with PostgreSQL

From: Steve Atkins
On Sat, Jun 12, 2004 at 07:19:05AM -0400, Jim Seymour wrote:
> "Dann Corbit" <DCorbit@connx.com> wrote:
> [snip]
> >
> > Another important point is that the data in an organization is always
> > more valuable than the hardware and the software.
> >
> > Hose up the hardware and the software, and insurance gets new stuff.
> >
> > Hose up the data and you are really hosed for good.
>
> It's amazing, how many people don't seem to get that.

It's often not true.

I use PostgreSQL for massive data mining of a bunch of high-update-rate
data sources. The value of the data decreases rapidly as it
ages. Data over a month old is worthless. Data over a week old has
very little value.

If I lose all the data and can't recover it from backups, then I can be
back up and running within two days' worth of new data handling, and
back to business as usual within a week of new data.

If I lose a router or a controller and have to fault-find, order a
replacement and get it overnighted, reload the OS, and restore the
database and analysis software, it'll take me offline for at least a
couple of days, during which I can't even handle new incoming data,
so it'd still take me two to three days after that before I was back
up and running with usable data.

So, for that particular case the data really isn't as valuable as
the infrastructure, despite that segment of the business being
primarily data analysis.

In other words, different people have different needs. There are
perfectly valid cases where you just don't care too much about the
data, but need a decent SQL engine, others where you care about
data-integrity (no silent corruption) but don't care about data
loss, others where recovery from the previous day's backup is fine
if the system crashes, and others where loss of a single transaction
is a serious problem.

PostgreSQL handles all those cases quite nicely, and provides some
good performance/reliability trade-off configuration options.
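
For instance, a non-exhaustive sketch of the 7.4-era postgresql.conf
knobs involved in that trade-off (the values shown are illustrative, not
recommendations):

    fsync = true                  # full crash safety; false trades safety for speed
    wal_sync_method = fdatasync   # how WAL writes reach disk
    checkpoint_segments = 3       # higher values mean less frequent checkpoints
    commit_delay = 0              # microseconds to wait, allowing group commit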

Cheers,
  Steve



Re: Opteron scaling with PostgreSQL

From: jseymour@linxnet.com (Jim Seymour)
Steve Atkins <steve@blighty.com> wrote:
>
> On Sat, Jun 12, 2004 at 07:19:05AM -0400, Jim Seymour wrote:
> > "Dann Corbit" <DCorbit@connx.com> wrote:
> > [snip]
> > >
> > > Another important point is that the data in an organization is always
> > > more valuable than the hardware and the software.
> > >
> > > Hose up the hardware and the software, and insurance gets new stuff.
> > >
> > > Hose up the data and you are really hosed for good.
> >
> > It's amazing, how many people don't seem to get that.
>
> It's often not true.
>
> I use postgresql for massive data-mining of a bunch of high-update
> rate data sources. The value of the data decreases rapidly as it
> ages. Data over a month old is worthless. Data over a week old has
> very little value.
[snip]
>

Good argument, and well made. So s/always/frequently/ in Dann Corbit's
comments.  Perhaps even "most often."  The point is: many people, some
even so-called "SysAdmins," will compromise on hardware and software,
apparently without thought to the fact that the unique, original data
that hardware and software is handling is valuable and (possibly)
irreplaceable.

Jim

Re: Opteron scaling with PostgreSQL

From: Greg Stark
Tom Lane <tgl@sss.pgh.pa.us> writes:

> Greg Stark <gsstark@mit.edu> writes:
> > Steve Wolfe <nw@codon.com> writes:
> >> I've run with fsync off on my production servers for years.
>
> > All it will take will be a Linux crash for the database files on disk to
> > become corrupted. No amount of UPS or RAID protection will protect from that.
>
> And neither will fsync'ing, so I'm not sure what your point is.

Uhm, well, a typical panic causes the machine to halt. It's possible that
it causes the OS to scribble all over the disk, if that's what you mean,
but that's pretty rare. Usually I just get random reboots or halts when
things are going wrong. In that case you have a consistent database if
you use fsync, but not if you don't.

> Steve clearly understands the need for backups, so I think he's prepared as
> well as he can for worst-case scenarios. He's determined that the particular
> scenarios fsync can protect him against are not big enough risks *in his
> environment* to justify the cost. I can't say that I see any flaws in his
> reasoning.

I wasn't disagreeing with that. I was just trying to make clear what the
risk is. Without fsync, anything that causes the OS to stop flushing
blocks before they are synced (power loss, but also a panic of any kind)
could, and probably would, corrupt the DB.

--
greg

Re: Opteron scaling with PostgreSQL

From: "Dann Corbit"
> -----Original Message-----
> From: pgsql-general-owner@postgresql.org
> [mailto:pgsql-general-owner@postgresql.org] On Behalf Of Jim Seymour
> Sent: Saturday, June 12, 2004 12:27 PM
> To: pgsql-general@postgresql.org
> Subject: Re: [GENERAL] Opteron scaling with PostgreSQL
>
>
> Steve Atkins <steve@blighty.com> wrote:
> >
> > On Sat, Jun 12, 2004 at 07:19:05AM -0400, Jim Seymour wrote:
> > > "Dann Corbit" <DCorbit@connx.com> wrote:
> > > [snip]
> > > >
> > > > Another important point is that the data in an organization is
> > > > always more valuable than the hardware and the software.
> > > >
> > > > Hose up the hardware and the software, and insurance gets new
> > > > stuff.
> > > >
> > > > Hose up the data and you are really hosed for good.
> > >
> > > It's amazing, how many people don't seem to get that.
> >
> > It's often not true.
> >
> > I use postgresql for massive data-mining of a bunch of high-update
> > rate data sources. The value of the data decreases rapidly
> as it ages.
> > Data over a month old is worthless. Data over a week old has very
> > little value.
> [snip]
> >
>
> Good argument and well-made. So s/always/frequently/ in Dann
> Corbit's comments.  Perhaps even "most often."  The point is:
> Many people, some even so-called "SysAdmins," will compromise
> on hardware and software, apparently w/o thought to the fact
> that the unique, original, irreplaceable data that hardware
> and software is handling is indeed valuable and (possibly)
> irreplaceable.

In addition, a data warehouse is a special case, since the source data
remains untouched.

With a data warehouse, you intentionally destroy and recreate it on a
frequent basis.

My statement remains true.  The data is more valuable than the hardware.
But in the case of a data warehouse, if the warehouse "burns to the
ground" you can create another one on-demand.  Since the original data
is not destroyed, the data is not destroyed.  If the original data from
which the warehouse is derived were to be destroyed, then we see the
value of the data.

Of course, there are special cases where you don't care if you lose
data.  But it is not unusual for DBAs and Sysadmins to underestimate the
value of the data, even in these special cases.

For example, if a data warehouse that is used to compute month-end
closing information goes down, a delay of three days to redo everything
can be a tremendous expense.

There is an exception to every rule, of course.  But I raise my hand and
shout about this issue just so that people will think about it:
what will the real cost be if my database fails?  Will it really be
cheaper to take data-integrity shortcuts than to buy faster hardware?

I will tend to err on the side of data integrity, for sure.