Thread: Postgresql 9.6 and Big Data

Postgresql 9.6 and Big Data

From: Job
Hello,

we are planning to store historical data in a PostgreSQL 9.6 table.

We see from the PostgreSQL limits that it can handle "big data".
In fact, the limit for a single table is 32 TB.

We need to archive this data to generate reports and analysis views.

Has anyone used PostgreSQL 9.6 with databases and tables bigger than four or five terabytes?
What hardware was required?

Were there any problems or bottlenecks with data that big?

We are evaluating databases designed for big data, such as Cassandra or MongoDB, but I have
personally been a devoted PostgreSQL user for about 10 years and I would like to know what
performance it can deliver with "big data".

Thank you!

Francesco

Re: Postgresql 9.6 and Big Data

From: Stephen Frost
Job,

* Job (Job@colliniconsulting.it) wrote:
> we are planning to store historical data in a PostgreSQL 9.6 table.

The question is less about what you're storing in PG and more about what
you're going to be doing with that data.

> We see from the PostgreSQL limits that it can handle "big data".
> In fact, the limit for a single table is 32 TB.

Very few people would advocate storing a single, non-partitioned, 32TB
table in any database system.  As with most other database technologies,
you'll want to partition up your data.  The general rule of thumb is to
partition at the hundreds-of-millions-of-tuples level, which tends to
lead to tables somewhere between 5GB and 100GB.  At 100GB per table, you
would end up with a few hundred tables to get up to 32TB, which is quite
manageable in PG.  Indeed, PG can support many more tables than that, but
putting more than a few hundred into a single inheritance structure
using constraint exclusion for the partitioning can lead to long
planning times.
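
As a rough sketch (the table and column names here are purely
illustrative), inheritance-based partitioning in 9.6 looks something
like this:

    -- Parent table; 9.6 has no declarative partitioning, so each child
    -- inherits from it and carries a CHECK constraint on the partition key.
    CREATE TABLE measurements (
        id        bigserial,
        recorded  timestamptz NOT NULL,
        value     numeric
    );

    -- One child per month, sized to stay in the 5GB-100GB range.
    CREATE TABLE measurements_2016_12 (
        CHECK (recorded >= '2016-12-01' AND recorded < '2017-01-01')
    ) INHERITS (measurements);

    CREATE TABLE measurements_2017_01 (
        CHECK (recorded >= '2017-01-01' AND recorded < '2017-02-01')
    ) INHERITS (measurements);

    -- With constraint_exclusion = partition (the default), a query that
    -- filters on "recorded" only scans the children whose CHECK matches.
    SELECT count(*) FROM measurements
    WHERE recorded >= '2017-01-01' AND recorded < '2017-02-01';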

> We need to archive this data to generate reports and analysis views.

Is the historical data changing?  Are the queries changing?  Will you
primarily be performing queries which look at all of the data?

> Has anyone used PostgreSQL 9.6 with databases and tables bigger than four or five terabytes?

Yes.

> What hardware was required?

This depends a great deal on what you're actually doing.

> Were there any problems or bottlenecks with data that big?

If you're looking to run queries against 4 or 5 TBs at a time where all
of the tuples have to be considered (no index-based lookups, etc), then
you'll want to be doing parallel work.  With 9.6, you could try using
the parallel query capability, but even with 9.5 and earlier you could
pretty trivially write your own parallel query system by using multiple
connections to the database and it works very well.  The only downside
to using multiple queries is if you have ongoing changes happening and
you need a consistent view of the data; that's still possible to do with
multiple processes, but you have to do some prep work ahead of time to
extract out the keys for all of the records you want to process in a
given run and store them in a static side-table that the parallel
processes then use.  You have to avoid removing records that are
included in that set, of course, and depending on your requirements you
might wish to avoid updating them also.
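
To illustrate both options (reusing the hypothetical "measurements"
table from above; the worker count of four is arbitrary):

    -- 9.6 parallel query: allow a single connection to use several workers.
    SET max_parallel_workers_per_gather = 4;
    SELECT date_trunc('day', recorded) AS day, count(*)
    FROM measurements
    GROUP BY 1;

    -- Hand-rolled parallelism: snapshot the keys to process into a static
    -- side-table, then have each of N connections take a disjoint slice.
    CREATE TABLE run_keys AS
        SELECT id FROM measurements WHERE recorded < '2017-01-01';

    -- Worker 0 of 4 (the other workers use 1, 2 and 3):
    SELECT m.*
    FROM measurements m
    JOIN run_keys k ON k.id = m.id
    WHERE k.id % 4 = 0;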

Thanks!

Stephen


Re: Postgresql 9.6 and Big Data

From: John R Pierce
On 12/2/2016 12:19 AM, Job wrote:
> We need to archive this data to generate reports and analysis views.

on top of what Stephen said, I'd do as much preprocessing and
pre-aggregation as possible to simplify and speed up those
report/analysis views.  For instance, storing min/max/mean rollups
per hour, per day and per month could mean the difference between a
billion-row seq scan that takes hours and a few-second query to graph
a given time range.
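
something along these lines, reusing the hypothetical table from
Stephen's sketch (all names made up):

    -- hourly rollup kept alongside the raw data
    CREATE TABLE measurements_hourly (
        hour        timestamptz PRIMARY KEY,
        min_value   numeric,
        max_value   numeric,
        mean_value  numeric
    );

    INSERT INTO measurements_hourly
    SELECT date_trunc('hour', recorded), min(value), max(value), avg(value)
    FROM measurements
    GROUP BY 1;

    -- graphing a month then touches ~720 rollup rows instead of
    -- seq-scanning billions of raw ones
    SELECT * FROM measurements_hourly
    WHERE hour >= '2016-12-01' AND hour < '2017-01-01';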


--
john r pierce, recycling bits in santa cruz