Re: PostgreSQL, OLAP, and Large Clusters - Mailing list pgsql-general

From Ryan Kelly
Subject Re: PostgreSQL, OLAP, and Large Clusters
Date
Msg-id 20120927185003.GD25296@llserver.lakeliving.com
Whole thread Raw
In response to Re: PostgreSQL, OLAP, and Large Clusters  (Scott Marlowe <scott.marlowe@gmail.com>)
Responses Re: PostgreSQL, OLAP, and Large Clusters  (Rob Sargent <robjsargent@gmail.com>)
Re: PostgreSQL, OLAP, and Large Clusters  (Scott Marlowe <scott.marlowe@gmail.com>)
List pgsql-general
On Wed, Sep 26, 2012 at 03:18:16PM -0600, Scott Marlowe wrote:
> On Wed, Sep 26, 2012 at 5:50 AM, Ryan Kelly <rpkelly22@gmail.com> wrote:
> > Hi:
> >
> > The size of our database is growing rather rapidly. We're concerned
> > about how well Postgres will scale for OLAP-style queries over terabytes
> > of data. Googling around doesn't yield great results for vanilla
> > Postgres in this application, but generally links to other software like
> > Greenplum, Netezza, and Aster Data (some of which are based off of
> > Postgres). Too, there are solutions like Stado. But I'm concerned about
> > the amount of effort to use such solutions and what we would have to
> > give up feature-wise.
>
> If you want fastish OLAP on postgres you need to do several things.
>
> 1: Throw very fast disk arrays at it.  Lots of spinners in a linux SW
> RAID-10 or RAID-0 if your data is easily replaceable work wonders
> here.
> 2: Throw lots of memory at it.  Memory is pretty cheap.  256G is not
> unusual for OLAP machines
> 3: Throw fast CPUs at it.  Faster CPUs, especially fewer faster cores,
> are often helpful.
What do you mean by "fewer faster cores"? Wouldn't "more faster cores"
be better?

> Applied in that order you can get some pretty impressive results.
>
> A lot of OLAP stuff needs to read hundreds of gigs at a time from the
> drive array.  An array of 32 15kRPM drives, each reading at ~100MB/s
> or better can flood your  PCI bus at 3200MB/s for reads.  Note that
> most RAID controllers aren't as fast for sequential reads on large
> drive arrays.  Although a battery backed cache can GREATLY improved
> parallel write speed, it doesn't really make a big difference for big
> sequential stuff and usually gets in the way here.
>
> Memory to cache as much as possible and allow all your queries to do
> hash joins etc in memory (crank up work_mem as needed, but be careful
> not to use all your memory up.)
>
> Lastly once you've thrown lots of IO and memory at it, a faster CPU
> can make a bit of a difference too.  But honestly I'd rather have a
> dual core 2GHz CPU on top of 32 spinners with 256G than a 3.6GHz 8
> core CPU on top of 4 drives and 32G of RAM.
All of this seems like great advice.

Thanks,
-Ryan Kelly


pgsql-general by date:

Previous
From: Ryan Kelly
Date:
Subject: Re: PostgreSQL, OLAP, and Large Clusters
Next
From: Ryan Kelly
Date:
Subject: Re: PostgreSQL, OLAP, and Large Clusters