Re: PostgreSQL, OLAP, and Large Clusters - Mailing list pgsql-general

From Scott Marlowe
Subject Re: PostgreSQL, OLAP, and Large Clusters
Date
Msg-id CAOR=d=2Q8bDbABoV5R6T1YZ_7-H7Lzf3nqQUnfa4m2SEiXhWMw@mail.gmail.com
In response to PostgreSQL, OLAP, and Large Clusters  (Ryan Kelly <rpkelly22@gmail.com>)
Responses Re: PostgreSQL, OLAP, and Large Clusters  (Stephen Frost <sfrost@snowman.net>)
Re: PostgreSQL, OLAP, and Large Clusters  (Ryan Kelly <rpkelly22@gmail.com>)
List pgsql-general
On Wed, Sep 26, 2012 at 5:50 AM, Ryan Kelly <rpkelly22@gmail.com> wrote:
> Hi:
>
> The size of our database is growing rather rapidly. We're concerned
> about how well Postgres will scale for OLAP-style queries over terabytes
> of data. Googling around doesn't yield great results for vanilla
> Postgres in this application, but generally links to other software like
> Greenplum, Netezza, and Aster Data (some of which are based off of
> Postgres). Too, there are solutions like Stado. But I'm concerned about
> the amount of effort to use such solutions and what we would have to
> give up feature-wise.

If you want fastish OLAP on postgres you need to do several things.

1: Throw very fast disk arrays at it.  Lots of spinners in a Linux SW
RAID-10 (or RAID-0, if your data is easily replaceable) work wonders
here.
2: Throw lots of memory at it.  Memory is pretty cheap; 256G is not
unusual for OLAP machines.
3: Throw fast CPUs at it.  Fewer, faster cores are often more helpful
than many slower ones.

Applied in that order you can get some pretty impressive results.

A lot of OLAP work needs to read hundreds of gigs at a time from the
drive array.  An array of 32 15kRPM drives, each reading at ~100MB/s
or better, can flood your PCI bus at 3200MB/s for reads.  Note that
most RAID controllers aren't as fast for sequential reads on large
drive arrays.  While a battery-backed cache can GREATLY improve
parallel write speed, it doesn't make much difference for big
sequential workloads and usually just gets in the way here.
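The arithmetic above can be sanity-checked with a short sketch.  The
drive count and per-drive rate come from the paragraph; the 500GB
table size is just an illustrative assumption:

```python
# Back-of-the-envelope sequential-scan math for a striped array.
# Assumed: 32 spinners at ~100 MB/s each, scanning a 500 GB table.

def array_read_mb_s(n_drives: int, per_drive_mb_s: float) -> float:
    """Aggregate sequential read rate when all spindles stream at once."""
    return n_drives * per_drive_mb_s

def scan_seconds(table_gb: float, throughput_mb_s: float) -> float:
    """Time to read a table of table_gb gigabytes at the given rate."""
    return table_gb * 1024 / throughput_mb_s

rate = array_read_mb_s(32, 100.0)                     # 3200 MB/s, as above
print(f"array rate: {rate:.0f} MB/s")
print(f"500GB scan: {scan_seconds(500, rate):.0f} s")  # ~160 s
```

Which is why step 1 comes first: no amount of CPU helps if the scan
itself takes hours.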

Use memory to cache as much as possible and let all your queries do
hash joins etc. in memory (crank up work_mem as needed, but be careful
not to use up all your memory.)
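One rough way to budget work_mem so concurrent sorts and hashes can't
run the box out of RAM; the formula and the numbers are an
illustrative rule of thumb, not an official PostgreSQL recommendation:

```python
# Rough work_mem budget: leave room for shared_buffers and the OS,
# then divide what's left by a worst-case count of concurrent
# sort/hash operations (a single backend can use several work_mem
# allocations at once).  All figures here are assumptions.

def work_mem_mb(ram_gb: float, shared_buffers_gb: float,
                max_connections: int, ops_per_query: int = 3,
                os_reserve_gb: float = 4.0) -> int:
    """Conservative per-operation work_mem in MB."""
    free_mb = (ram_gb - shared_buffers_gb - os_reserve_gb) * 1024
    return max(1, int(free_mb / (max_connections * ops_per_query)))

# e.g. a 256G OLAP box with 64G shared_buffers and 40 connections
print(work_mem_mb(256, 64, 40))  # -> 1604 MB per sort/hash node
```

The point of the divisor is the "be careful" caveat above: work_mem is
per operation, not per server, so a generous global setting multiplied
by many concurrent queries is how you end up swapping.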

Lastly, once you've thrown lots of I/O and memory at it, a faster CPU
can make a bit of a difference too.  But honestly, I'd rather have a
dual-core 2GHz CPU on top of 32 spinners with 256G than a 3.6GHz
8-core CPU on top of 4 drives and 32G of RAM.

