Thread: 10 TB database

10 TB database

From
Artur
Date:
Hi!

We are thinking of creating a stocks-related search engine.
It is an experimental project, just for fun.

The problem is that we expect to have more than 250 GB of data every month.
This data would go into two tables, about 50,000,000 new rows every month.

We want to have access to all the data, mostly for generating
user-requested reports (aggregation).
We would have about 10 TB of data in three years.

Do you think it is possible to build this with PostgreSQL, and do you have
any idea how to start? :)


Thanks in advance,
Artur




Re: 10 TB database

From
Grzegorz Jaśkiewicz
Date:
On Mon, Jun 15, 2009 at 1:00 PM, Artur<a_wronski@gazeta.pl> wrote:
> Hi!
>
> We are thinking of creating a stocks-related search engine.
> It is an experimental project, just for fun.
>
> The problem is that we expect to have more than 250 GB of data every month.
> This data would go into two tables, about 50,000,000 new rows every month.

Well, obviously you need to decrease its size by doing some
normalization then.
If some information is repeated across a table, stick it into a separate
table and assign an id to it.

If you can send me a sample of that data, I could tell you where to cut the size.
I have databases that big under my wing, and that's where
normalization starts to make sense, to save space (and hence speed
things up).

> We want to have access to all the data, mostly for generating
> user-requested reports (aggregation).
> We would have about 10 TB of data in three years.

For that sort of database you will need partitioning for sure.
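
A minimal sketch of what that could look like with the inheritance-based
partitioning available in 8.3 (the table, column and index names here are
just made up for illustration):

CREATE TABLE ticks (
    symbol      text        NOT NULL,
    traded_at   timestamptz NOT NULL,
    price       numeric     NOT NULL,
    volume      bigint      NOT NULL
);

-- one child table per month; the CHECK constraint lets the planner skip
-- partitions that cannot match the query
CREATE TABLE ticks_2009_06 (
    CHECK ( traded_at >= '2009-06-01' AND traded_at < '2009-07-01' )
) INHERITS (ticks);

CREATE INDEX ticks_2009_06_traded_at_idx ON ticks_2009_06 (traded_at);

-- repeat for every month, load new data directly into the current child
-- table, and query with constraint_exclusion = on so only the relevant
-- months get scanned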


Write to me, I can help privately, maybe for a small fee ;)

--
GJ

Re: 10 TB database

From
Whit Armstrong
Date:
I have a 300GB database, and I would like to look at partitioning as a
possible way to speed it up a bit.

I see the partitioning examples from the documentation:
http://www.postgresql.org/docs/8.3/static/ddl-partitioning.html

Is anyone aware of additional examples or tutorials on partitioning?

Thanks,
Whit



Re: 10 TB database

From
Alvaro Herrera
Date:
Artur wrote:
> Hi!
>
> We are thinking of creating a stocks-related search engine.
> It is an experimental project, just for fun.
>
> The problem is that we expect to have more than 250 GB of data every month.
> This data would go into two tables, about 50,000,000 new rows every month.

Sounds a bit like what Truviso does ...

--
Alvaro Herrera

Re: 10 TB database

From
"Dann Corbit"
Date:
> -----Original Message-----
> From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-
> owner@postgresql.org] On Behalf Of Artur
> Sent: Monday, June 15, 2009 5:00 AM
> To: pgsql-general@postgresql.org
> Subject: [GENERAL] 10 TB database
>
> Hi!
>
> We are thinking of creating a stocks-related search engine.
> It is an experimental project, just for fun.
>
> The problem is that we expect to have more than 250 GB of data every
> month.
> This data would go into two tables, about 50,000,000 new rows every
> month.
>
> We want to have access to all the data, mostly for generating
> user-requested reports (aggregation).
> We would have about 10 TB of data in three years.
>
> Do you think it is possible to build this with PostgreSQL, and do you
> have any idea how to start? :)

Consider summarizing this data into a data warehouse.
Most of the data will be historical, and therefore the vast majority of
it will be read-mostly (with the rare write operations probably
consisting mostly of corrections).
You won't want to scan the whole 10 TB every time you make a
summarization query.

I have an idea that might make an interesting experiment:
Create tables that are a combination of year and month.
Create views that combine all 12 months into one yearly table.
Create a view that combines all the yearly views into one global view.
The reason that I think this suggestion may have some merit is that the
historical trends will not need to be recalculated on a daily basis (but
it would be nice if you could perform calculations against the whole
pile at will on rare occasions).  Maintaining separate tables by month
will reduce the average depth of the b-trees.  I guess that for the
most part, the active calculations will be only against recent
data (e.g. the past 6 months to one year or so).  It could also be
interesting to create a view that combines the N most recent months of
data, where N is supplied on the fly (I do not know how difficult it
would be to create this view or even if it is possible).
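
To make that concrete, a rough sketch of the layering (the table and view
names are invented, and only two months are written out):

CREATE TABLE trades_2009_05 (symbol text, traded_at timestamp,
                             price numeric, volume bigint);
CREATE TABLE trades_2009_06 (symbol text, traded_at timestamp,
                             price numeric, volume bigint);

-- yearly view over the monthly tables
CREATE VIEW trades_2009 AS
    SELECT * FROM trades_2009_05
    UNION ALL
    SELECT * FROM trades_2009_06;  -- ...and so on for the other months

-- global view over the yearly views
CREATE VIEW trades_all AS
    SELECT * FROM trades_2009;     -- ...UNION ALL the other yearly views here

The "N most recent months" case probably cannot be a plain view, since views
take no parameters; a set-returning function that assembles the UNION on the
fly would be one way to approximate it.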

If you are going to collect a terrific volume of data like this, I
suggest coupling a mathematics package like SAS, R, Octave, Scilab, or
Maxima with the data, so that you can effectively support decisions
derived from it.

You are also going to need high-end hardware to support a database like
this.  Just some ideas you might like to test when you start fooling
around with this data.

IMO-YMMV

Re: 10 TB database

From
"Brent Wood"
Date:
Hi Artur,

Some general comments:

I'd look at partitioning and tablespaces to better manage the files where the data is stored, but also look at some
efficiently parallelised disks behind the filesystems. You might also look at optimising the filesystem & OS parameters
to increase efficiency as well, so it is a mix of hardware/OS/filesystem & db setup to optimise for such a situation.

For data retrieval, clustered indexes may help, but as this requires a physical reordering of the data on disk, it may
be impractical.
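
As a very rough sketch of the tablespace and clustering side (the paths and
object names below are purely hypothetical):

-- put a monthly partition on its own disk array
CREATE TABLESPACE fastdisk LOCATION '/mnt/array1/pgdata';
ALTER TABLE ticks_2009_06 SET TABLESPACE fastdisk;

-- physically reorder a partition by one of its indexes; this takes an
-- exclusive lock and rewrites the whole table, which is why it may be
-- impractical on tables this large
CLUSTER ticks_2009_06 USING ticks_2009_06_traded_at_idx;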


Cheers,

  Brent Wood



Brent Wood
DBA/GIS consultant
NIWA, Wellington
New Zealand

Re: 10 TB database

From
Michelle Konzack
Date:
Hi Artur,

I am the owner of a database about war, war crimes and terrorism with more
than 1.6 TB, and I am already fscked...

On 2009-06-15 14:00:05, Artur wrote:
> Hi!
>
> We are thinking of creating a stocks-related search engine.
> It is an experimental project, just for fun.
>
> The problem is that we expect to have more than 250 GB of data every month.

I have only 500 MB per month...

> This data would go into two tables, about 50,000,000 new rows every month.

around 123,000 new rows per month

> We want to have access to all the data, mostly for generating
> user-requested reports (aggregation).
> We would have about 10 TB of data in three years.
>
> Do you think it is possible to build this with PostgreSQL, and do you
> have any idea how to start? :)

You have to use a physical cluster like I do. Searches in a database of
more than 1 TB, even using tablespaces and table partitioning, will run
into performance issues...

I have now split my database into chunks of 250 GB using a cluster of
1U servers from Sun Microsystems.  Currently I run 8 servers with one
proxy.  Each server cost me 2,300 euros.

Note:   On Friday I have a meeting with a Sun partner in
        Germany because of a bigger project... where I have
        to increase the performance of my database servers.
        I have to plan for 150,000 customers.

Thanks, Greetings and nice Day/Evening
    Michelle Konzack
    Systemadministrator
    Tamay Dogan Network
    Debian GNU/Linux Consultant

--
Linux-User #280138 with the Linux Counter, http://counter.li.org/
##################### Debian GNU/Linux Consultant #####################
<http://www.tamay-dogan.net/>                 Michelle Konzack
<http://www.can4linux.org/>                   c/o Vertriebsp. KabelBW
<http://www.flexray4linux.org/>               Blumenstrasse 2
Jabber linux4michelle@jabber.ccc.de           77694 Kehl/Germany
IRC #Debian (irc.icq.com)                     Tel. DE: +49 177 9351947
ICQ #328449886                                Tel. FR: +33  6  61925193


Re: 10 TB database

From
Pedro Doria Meunier
Date:

Hello Artur,

We have a database with a table growing by ~1.5M rows each month.
The overall growth for the db is ~1 GB/month.
PostgreSQL 8.2.9 on x86_64 - a very modest Dell R200 with 4 GB of RAM.

Although planning was poor in the beginning (i.e. no clustering,
no partitioning...) - we weren't expecting the boom :] - that
particular server runs like clockwork, serving hundreds of queries per
minute without any noticeable speed loss.

We're, of course, planning for load balancing at the beginning of next
year ... :)

BR,

Pedro Doria Meunier
GSM: +351 96 17 20 188
Skype: pdoriam






Re: 10 TB database

From
"Todd Lieberman"
Date:

> The problem is that we expect to have more than 250 GB of data every month.

Sounds like Teradata or Netezza territory.

Re: 10 TB database

From
Martin Gainty
Date:
I would suggest Oracle 11 for a DB of 10 TB or greater:
http://www.oracle.com/solutions/performance_scalability/tpc-h-10tb-11g.html

Martin Gainty
______________________________________________







Re: 10 TB database

From
Greg Smith
Date:
On Mon, 15 Jun 2009, Artur wrote:

> We are thinking of creating a stocks-related search engine.
> It is an experimental project, just for fun.

I hope your idea of fun involves spending a bunch of money on hardware and
endless tweaking to get data loading every day with appropriate
corrections, because that's just the first round of "fun" on a job like
this.

> The problem is that we expect to have more than 250 GB of data every
> month. This data would go into two tables, about 50,000,000 new rows every
> month. We want to have access to all the data, mostly for generating
> user-requested reports (aggregation). We would have about 10 TB of data in
> three years. Do you think it is possible to build this with PostgreSQL,
> and do you have any idea how to start? :)

You start by figuring out what sort of business model is going to justify
this very expensive adventure in today's market where buyers of financial
market products are pretty rare, but that's probably not the question you
wanted an answer to.

You can certainly build a server capable of handling this job with
PostgreSQL here in 2009.  Get 8 cores, a stack of 24 1TB disks and a RAID
card with a write cache, and you'll have a big enough system to handle the
job.  Basic database design isn't too terribly difficult either.  Stock
data is trivial to partition up into tiny pieces at the database level
(each day can be its own 250GB partition), and any system capable of
holding that much data is going to have a giant stack of drives spreading
out the disk I/O too.

The first level of problems you'll run into are how to keep up with
loading data every day.  The main way to get bulk data in PostgreSQL,
COPY, isn't particularly fast, and you'll be hard pressed to keep up with
250GB/day unless you write a custom data loader that keeps multiple cores
going with that load.  Commercial databases have some better solutions to
solve this problem in the base product, or easily available from third
party sources.
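
The do-it-yourself loader usually amounts to splitting each day's file into
chunks and running one COPY per chunk from separate sessions, so several
cores and disks stay busy (file and table names here are invented; note that
server-side COPY requires superuser and reads from the server's filesystem):

COPY ticks_2009_06 FROM '/data/incoming/20090616.part1' WITH CSV;
-- meanwhile, in a second connection:
COPY ticks_2009_06 FROM '/data/incoming/20090616.part2' WITH CSV;

All of those sessions still funnel into one WAL stream, which is where the
next bottleneck tends to show up.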

The much, much bigger problem here is how exactly you're going to provide
a user interface to this data.  You can't just give people access to the
whole thing and let them run queries; the first person who executes
something like "select symbol,avg(price) from tickdata group by symbol"
because they want to see the average price of some stock over its lifetime
is going to kill the whole server.  You really need to generate the
aggregated reports ahead of time, using an approach like materialized
views, and then only let people grab those.  It's possible to manually
create materialized views in PostgreSQL, but that will be yet another bit
of custom development here.
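
A hand-rolled version is essentially a summary table that gets refreshed
after each load; something along these lines (the table and column names are
invented, and the tick table is assumed to look like the sketches earlier in
the thread):

CREATE TABLE daily_symbol_summary (
    symbol     text    NOT NULL,
    trade_day  date    NOT NULL,
    avg_price  numeric,
    total_vol  bigint,
    PRIMARY KEY (symbol, trade_day)
);

-- run once per day, right after that day's data has been loaded
INSERT INTO daily_symbol_summary
SELECT symbol, traded_at::date, avg(price), sum(volume)
FROM ticks_2009_06
WHERE traded_at >= '2009-06-16' AND traded_at < '2009-06-17'
GROUP BY symbol, traded_at::date;

User-facing reports then query daily_symbol_summary (or further rollups of
it) instead of the raw tick tables.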

The third level of issue is how you scale the app up if you're actually
successful.  It's hard enough to get 250GB of daily data loaded into a
single database and storing 10TB of data somewhere; doing the job across a
replicated set of servers, so you can spread the queries out, is even more
"fun" than that.

P.S. If you're not already familiar with how to aggregate properly over a
trading calendar that includes holidays and spots where the market is only
open part of the day, give up now; that's the hardest issue specific to
this particular type of application to get right, and a lot of people
don't realize that early enough in the design process to properly plan for
it.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: 10 TB database

From
Greg Smith
Date:
On Mon, 15 Jun 2009, Whit Armstrong wrote:

> Is anyone aware of additional examples or tutorials on partitioning?

http://www.pgcon.org/2007/schedule/events/41.en.html
http://blog.mozilla.com/webdev/2007/05/15/partitioning-fun-in-postgresql/
http://benjamin.smedbergs.us/blog/2007-05-12/when-partitioning-database-tables-explain-your-queries/

In that order really; those go from general commentary down to focusing on
specific issues people tend to run into.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: 10 TB database

From
Greg Smith
Date:
On Tue, 16 Jun 2009, Brent Wood wrote:

> For data retrieval, clustered indexes may help, but as this requires a
> physical reordering of the data on disk, it may be impractical.

This tends to be irrelevant for this sort of data because it's normally
inserted in a fairly clustered way in the first place.  The usual way
tables get unclustered involves random insertion and deletion, and that
just doesn't happen for data that's being imported daily and never deleted
afterwards; it's naturally clustered quite well.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: 10 TB database

From
Michelle Konzack
Date:
Hi Greg,

On 2009-06-16 12:13:20, Greg Smith wrote:
> The first level of problems you'll run into are how to keep up with
> loading data every day.  The main way to get bulk data in PostgreSQL,
> COPY, isn't particularly fast, and you'll be hard pressed to keep up with
> 250GB/day unless you write a custom data loader that keeps multiple cores

AFAIK he was talking about 250 GB/month, which is around 8 GB a
day or 300 MB per hour.

Thanks, Greetings and nice Day/Evening
    Michelle Konzack
    Systemadministrator
    Tamay Dogan Network
    Debian GNU/Linux Consultant



Re: 10 TB database

From
Greg Smith
Date:
On Tue, 16 Jun 2009, Michelle Konzack wrote:

> On 2009-06-16 12:13:20, Greg Smith wrote:
>> you'll be hard pressed to keep up with 250GB/day unless you write a
>> custom data loader that keeps multiple cores
>
> AFAIK he was talking about 250 GB/month, which is around 8 GB a
> day or 300 MB per hour.

Right, that was just a typo in my response; the comments reflected what he
meant.  Note that your averages here presume you can spread that out over
a full 24 hour period--which you often can't, as this type of data tends
to come in a big clump after market close and needs to be loaded ASAP for
it to be useful.

It's harder than most people would guess to sustain that sort of rate
against real-world data (which even fails to import some days) in
PostgreSQL without running into a bottleneck in COPY, WAL traffic, or
database disk I/O (particularly if there's any random access stuff going
on concurrently with the load).  Just because your RAID array can write at
hundreds of MB/s does not mean you'll be able to sustain anywhere close to
that during your loading.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: 10 TB database

From
Grzegorz Jaśkiewicz
Date:
2009/6/16 Martin Gainty <mgainty@hotmail.com>:
> I would suggest Oracle 11 for a DB of 10 TB or greater:
> http://www.oracle.com/solutions/performance_scalability/tpc-h-10tb-11g.html

You are joking, right?
Better invest that money in paying someone from -hackers to add any
features required, if there are any!
Or buy a heftier RAID with more disks...


--
GJ

Re: 10 TB database

From
Scott Marlowe
Date:
2009/6/16 Grzegorz Jaśkiewicz <gryzman@gmail.com>:
> 2009/6/16 Martin Gainty <mgainty@hotmail.com>:
>> I would suggest Oracle 11 for a DB of 10 TB or greater:
>> http://www.oracle.com/solutions/performance_scalability/tpc-h-10tb-11g.html
>
> You are joking, right?
> Better invest that money in paying someone from -hackers to add any
> features required, if there are any!
> Or buy a heftier RAID with more disks...

You can throw a metric ton of hardware and development at a problem
for the cost of an Oracle license.