Re: PostgreSQL, OLAP, and Large Clusters - Mailing list pgsql-general

From Scott Marlowe
Subject Re: PostgreSQL, OLAP, and Large Clusters
Date
Msg-id CAOR=d=2Q8bDbABoV5R6T1YZ_7-H7Lzf3nqQUnfa4m2SEiXhWMw@mail.gmail.com
In response to PostgreSQL, OLAP, and Large Clusters  (Ryan Kelly <rpkelly22@gmail.com>)
Responses Re: PostgreSQL, OLAP, and Large Clusters  (Stephen Frost <sfrost@snowman.net>)
Re: PostgreSQL, OLAP, and Large Clusters  (Ryan Kelly <rpkelly22@gmail.com>)
List pgsql-general
On Wed, Sep 26, 2012 at 5:50 AM, Ryan Kelly <rpkelly22@gmail.com> wrote:
> Hi:
>
> The size of our database is growing rather rapidly. We're concerned
> about how well Postgres will scale for OLAP-style queries over terabytes
> of data. Googling around doesn't yield great results for vanilla
> Postgres in this application, but generally links to other software like
> Greenplum, Netezza, and Aster Data (some of which are based off of
> Postgres). Too, there are solutions like Stado. But I'm concerned about
> the amount of effort to use such solutions and what we would have to
> give up feature-wise.

If you want fastish OLAP on postgres you need to do several things.

1: Throw very fast disk arrays at it.  Lots of spinners in a Linux SW
RAID-10 (or RAID-0, if your data is easily replaceable) work wonders
here.
2: Throw lots of memory at it.  Memory is pretty cheap; 256G is not
unusual for OLAP machines.
3: Throw fast CPUs at it.  Fewer, faster cores are often more helpful
than many slower ones.

Applied in that order you can get some pretty impressive results.

A lot of OLAP work needs to read hundreds of gigs at a time from the
drive array.  An array of 32 15kRPM drives, each reading at ~100MB/s
or better, can flood your PCI bus at 3200MB/s for reads.  Note that
most RAID controllers aren't as fast for sequential reads on large
drive arrays.  While a battery-backed cache can GREATLY improve
parallel write speed, it doesn't make much difference for big
sequential workloads and usually just gets in the way here.
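The arithmetic above can be sanity-checked with a short sketch.  The
drive count and per-drive rate come from the paragraph; the 500GB
table size is just an illustrative assumption:

```python
# Back-of-the-envelope sequential-scan math for a striped array.
# Assumed: 32 spinners at ~100 MB/s each, scanning a 500 GB table.

def array_read_mb_s(n_drives: int, per_drive_mb_s: float) -> float:
    """Aggregate sequential read rate when all spindles stream at once."""
    return n_drives * per_drive_mb_s

def scan_seconds(table_gb: float, throughput_mb_s: float) -> float:
    """Time to read a table of table_gb gigabytes at the given rate."""
    return table_gb * 1024 / throughput_mb_s

rate = array_read_mb_s(32, 100.0)                     # 3200 MB/s, as above
print(f"array rate: {rate:.0f} MB/s")
print(f"500GB scan: {scan_seconds(500, rate):.0f} s")  # ~160 s
```

Which is why step 1 comes first: no amount of CPU helps if the scan
itself takes hours.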

Use memory to cache as much as possible and let all your queries do
hash joins etc. in memory (crank up work_mem as needed, but be careful
not to use up all your memory.)
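One rough way to budget work_mem so concurrent sorts and hashes can't
run the box out of RAM; the formula and the numbers are an
illustrative rule of thumb, not an official PostgreSQL recommendation:

```python
# Rough work_mem budget: leave room for shared_buffers and the OS,
# then divide what's left by a worst-case count of concurrent
# sort/hash operations (a single backend can use several work_mem
# allocations at once).  All figures here are assumptions.

def work_mem_mb(ram_gb: float, shared_buffers_gb: float,
                max_connections: int, ops_per_query: int = 3,
                os_reserve_gb: float = 4.0) -> int:
    """Conservative per-operation work_mem in MB."""
    free_mb = (ram_gb - shared_buffers_gb - os_reserve_gb) * 1024
    return max(1, int(free_mb / (max_connections * ops_per_query)))

# e.g. a 256G OLAP box with 64G shared_buffers and 40 connections
print(work_mem_mb(256, 64, 40))  # -> 1604 MB per sort/hash node
```

The point of the divisor is the "be careful" caveat above: work_mem is
per operation, not per server, so a generous global setting multiplied
by many concurrent queries is how you end up swapping.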

Lastly, once you've thrown lots of I/O and memory at it, a faster CPU
can make a bit of a difference too.  But honestly, I'd rather have a
dual-core 2GHz CPU on top of 32 spinners with 256G than a 3.6GHz
8-core CPU on top of 4 drives and 32G of RAM.

