Thread: 10 TB database
Hi!

We are thinking of creating a stock-related search engine. It is an experimental project, just for fun.

The problem is that we expect to have more than 250 GB of data every month. This data would be in two tables, with about 50,000,000 new rows every month.

We want to have access to all the data, mostly for generating user-requested reports (aggregation). We would have about 10 TB of data in three years.

Do you think it is possible to build this with PostgreSQL, and any idea how to start? :)

Thanks in advance,
Artur
On Mon, Jun 15, 2009 at 1:00 PM, Artur <a_wronski@gazeta.pl> wrote:
> Hi!
>
> We are thinking of creating a stock-related search engine.
> It is an experimental project, just for fun.
>
> The problem is that we expect to have more than 250 GB of data every month.
> This data would be in two tables, with about 50,000,000 new rows every month.

Well, obviously you need to decrease its size by doing some normalization then. If some information is the same across the table, stick it into a separate table and assign an id to it.

If you can send me a sample of that data, I could tell you where to cut its size. I have databases that big under my wing, and that's where normalization starts to make sense, to save space (and hence speed things up).

> We want to have access to all the data, mostly for generating user-requested
> reports (aggregation).
> We would have about 10 TB of data in three years.

For that sort of database you will need partitioning, for sure.

Write to me and I can help privately, perhaps for a small tribute ;)

--
GJ
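[Editor's note: a minimal sketch of the kind of normalization being suggested here, using hypothetical instrument/trade tables; the column names are illustrative, not taken from Artur's actual data.]

    -- Repeated symbol/exchange strings move into a small lookup table...
    CREATE TABLE instrument (
        instrument_id serial PRIMARY KEY,
        symbol        text NOT NULL,
        exchange      text NOT NULL,
        UNIQUE (symbol, exchange)
    );

    -- ...so the big table stores a 4-byte id instead of the repeated text,
    -- which adds up over ~50,000,000 new rows a month.
    CREATE TABLE trade (
        instrument_id integer NOT NULL REFERENCES instrument,
        trade_time    timestamptz NOT NULL,
        price         numeric(12,4) NOT NULL,
        volume        bigint NOT NULL
    );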
I have a 300 GB database, and I would like to look at partitioning as a possible way to speed it up a bit.

I see the partitioning examples from the documentation:
http://www.postgresql.org/docs/8.3/static/ddl-partitioning.html

Is anyone aware of additional examples or tutorials on partitioning?

Thanks,
Whit

2009/6/15 Grzegorz Jaśkiewicz <gryzman@gmail.com>:
> On Mon, Jun 15, 2009 at 1:00 PM, Artur <a_wronski@gazeta.pl> wrote:
>> Hi!
>>
>> We are thinking of creating a stock-related search engine.
>> It is an experimental project, just for fun.
>>
>> The problem is that we expect to have more than 250 GB of data every month.
>> This data would be in two tables, with about 50,000,000 new rows every month.
>
> Well, obviously you need to decrease its size by doing some normalization then.
> If some information is the same across the table, stick it into a separate
> table and assign an id to it.
>
> If you can send me a sample of that data, I could tell you where to cut its size.
> I have databases that big under my wing, and that's where normalization
> starts to make sense, to save space (and hence speed things up).
>
>> We want to have access to all the data, mostly for generating user-requested
>> reports (aggregation).
>> We would have about 10 TB of data in three years.
>
> For that sort of database you will need partitioning, for sure.
>
> Write to me and I can help privately, perhaps for a small tribute ;)
>
> --
> GJ
Artur wrote:
> Hi!
>
> We are thinking of creating a stock-related search engine.
> It is an experimental project, just for fun.
>
> The problem is that we expect to have more than 250 GB of data every month.
> This data would be in two tables, with about 50,000,000 new rows every month.

Sounds a bit like what Truviso does ...

--
Alvaro Herrera
> -----Original Message-----
> From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org] On Behalf Of Artur
> Sent: Monday, June 15, 2009 5:00 AM
> To: pgsql-general@postgresql.org
> Subject: [GENERAL] 10 TB database
>
> Hi!
>
> We are thinking of creating a stock-related search engine.
> It is an experimental project, just for fun.
>
> The problem is that we expect to have more than 250 GB of data every month.
> This data would be in two tables, with about 50,000,000 new rows every month.
>
> We want to have access to all the data, mostly for generating user-requested
> reports (aggregation).
> We would have about 10 TB of data in three years.
>
> Do you think it is possible to build this with PostgreSQL, and any idea how to start? :)

Consider summarizing this data into a data warehouse. Most of the data will be historical, and therefore the vast majority of it will be read-mostly (with the rare write operations probably consisting mostly of corrections). You won't want to scan the whole 10 TB every time you make a summarization query.

I have an idea that might make an interesting experiment: create tables that are a combination of year and month. Create views that combine all 12 months into one yearly table. Create a view that combines all the yearly views into one global view (see the sketch after this message).

The reason I think this suggestion may have some merit is that the historical trends will not need to be recalculated on a daily basis (but it would be nice if you could perform calculations against the whole pile at will on rare occasions). Maintaining separate tables by month will also reduce the average depth of the b-trees.

I guess that for the most part, the active calculations will be only against recent data (e.g. the past 6 months to one year or so). It could also be interesting to create a view that combines the N most recent months of data, where N is supplied on the fly (I do not know how difficult it would be to create this view, or even if it is possible).

If you are going to collect a terrific volume of data like this, I suggest coupling a mathematics package with the data, such as SAS, R, Octave, Scilab, or Maxima, so that you can support decisions derived from the data effectively.

You are also going to need high-end hardware to support a database like this.

Just some ideas you might like to test when you start fooling around with this data.

IMO-YMMV
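[Editor's note: a minimal sketch of the month-table-plus-views layout proposed above, with illustrative table and column names that are not from the original post.]

    -- One table per year/month combination.
    CREATE TABLE trade_2009_05 (
        symbol     text NOT NULL,
        trade_time timestamptz NOT NULL,
        price      numeric(12,4) NOT NULL,
        volume     bigint NOT NULL
    );
    CREATE TABLE trade_2009_06 (LIKE trade_2009_05);

    -- A yearly view combining that year's months.
    CREATE VIEW trade_2009 AS
        SELECT * FROM trade_2009_05
        UNION ALL
        SELECT * FROM trade_2009_06;   -- ...and so on for the remaining months

    -- A global view combining the yearly views.
    CREATE VIEW trade_all AS
        SELECT * FROM trade_2009;      -- UNION ALL further years as they arrive

One caveat: a plain view cannot take "N" as a parameter, so the "N most recent months" report would more likely end up as a set-returning function or a query assembled by the application.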
Hi Artur,

Some general comments:

I'd look at partitioning and tablespaces to better manage the files where the data is stored, but also look at some efficiently parallelised disks behind the filesystems. You might also look at optimising the filesystem & OS parameters to increase efficiency as well, so it is a mix of hardware/OS/filesystem & db setup to optimise for such a situation.

For data retrieval, clustered indexes may help, but as this requires a physical reordering of the data on disk, it may be impractical.

Cheers,

Brent Wood

Brent Wood
DBA/GIS consultant
NIWA, Wellington
New Zealand

>>> Artur <a_wronski@gazeta.pl> 06/16/09 3:30 AM >>>
Hi!

We are thinking of creating a stock-related search engine.
It is an experimental project, just for fun.

The problem is that we expect to have more than 250 GB of data every month.
This data would be in two tables, with about 50,000,000 new rows every month.

We want to have access to all the data, mostly for generating user-requested
reports (aggregation).
We would have about 10 TB of data in three years.

Do you think it is possible to build this with PostgreSQL, and any idea how to start? :)

Thanks in advance,
Artur
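[Editor's note: a minimal sketch of the tablespace and clustering commands Brent mentions, reusing the illustrative monthly table from the previous sketch; the tablespace path and index name are assumptions.]

    -- Dedicate a separate, well-parallelised disk array to the bulkiest data.
    CREATE TABLESPACE fastdisk LOCATION '/mnt/array1/pgdata';

    -- Moving an existing table rewrites its files into the new tablespace.
    ALTER TABLE trade_2009_06 SET TABLESPACE fastdisk;

    -- Physically reorder the table by a time index (PostgreSQL 8.3 syntax).
    -- CLUSTER rewrites the whole table under an exclusive lock, which is the
    -- impracticality Brent refers to.
    CREATE INDEX trade_2009_06_time_idx ON trade_2009_06 (trade_time);
    CLUSTER trade_2009_06 USING trade_2009_06_time_idx;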
Hi Artur,

I am the owner of a database about war, war crimes and terrorism with more than 1.6 TB, and I am already fscked...

On 2009-06-15 14:00:05, Artur wrote:
> Hi!
>
> We are thinking of creating a stock-related search engine.
> It is an experimental project, just for fun.
>
> The problem is that we expect to have more than 250 GB of data every month.

I have only 500 MByte per month...

> This data would be in two tables, with about 50,000,000 new rows every month.

Around 123,000 new rows per month here.

> We want to have access to all the data, mostly for generating user-requested
> reports (aggregation).
> We would have about 10 TB of data in three years.
>
> Do you think it is possible to build this with PostgreSQL, and any idea how to start? :)

You have to use a physical cluster like me. Searches in a database of more than 1 TB, even using "tablespaces" and "table partitioning", will run into performance issues...

I have now split my database into chunks of 250 GB using a cluster of 1U servers from Sun Microsystems. Currently I run 8 servers with one proxy. Each server cost me 2,300 Euro.

Note: On Friday I have a meeting with a Sun partner in Germany about a bigger project, where I have to increase the performance of my database servers. I have to plan for 150,000 customers.

Thanks, Greetings and nice Day/Evening
Michelle Konzack
Systemadministrator
Tamay Dogan Network
Debian GNU/Linux Consultant

--
Linux-User #280138 with the Linux Counter, http://counter.li.org/
##################### Debian GNU/Linux Consultant #####################
<http://www.tamay-dogan.net/>  Michelle Konzack
<http://www.can4linux.org/>  c/o Vertriebsp. KabelBW
<http://www.flexray4linux.org/>  Blumenstrasse 2
Jabber linux4michelle@jabber.ccc.de  77694 Kehl/Germany
IRC #Debian (irc.icq.com)  Tel. DE: +49 177 9351947  ICQ #328449886
Tel. FR: +33 6 61925193
Hello Artur,

We have a database that has a table growing by ~1.5M rows each month. The overall growth for the db is ~1 GB/month. PostgreSQL 8.2.9 on x86_64 - a very modest Dell R200 with 4 GB of RAM.

Although poor planning was made in the beginning (i.e. no clustering, no partitioning...) - we weren't expecting the boom :] - that particular server runs like clockwork with hundreds of queries per minute, and still does so without any noticeable speed loss.

We're, of course, planning for load balancing in the beginning of next year ... :)

BR,
Pedro Doria Meunier
GSM: +351 96 17 20 188
Skype: pdoriam

Michelle Konzack wrote:
> Hi Artur,
>
> I am the owner of a database about war, war crimes and terrorism with more
> than 1.6 TB, and I am already fscked...
>
> On 2009-06-15 14:00:05, Artur wrote:
>> Hi!
>>
>> We are thinking of creating a stock-related search engine. It
>> is an experimental project, just for fun.
>>
>> The problem is that we expect to have more than 250 GB of data
>> every month.
>
> I have only 500 MByte per month...
>
>> This data would be in two tables, with about 50,000,000 new rows every
>> month.
>
> Around 123,000 new rows per month here.
>
>> We want to have access to all the data, mostly for generating user-requested
>> reports (aggregation). We would have about 10 TB of
>> data in three years.
>>
>> Do you think it is possible to build this with PostgreSQL, and
>> any idea how to start? :)
>
> You have to use a physical cluster like me. Searches in a
> database of more than 1 TB, even using "tablespaces" and
> "table partitioning", will run into performance issues...
>
> I have now split my database into chunks of 250 GB using a
> cluster of 1U servers from Sun Microsystems. Currently I run 8
> servers with one proxy. Each server cost me 2,300 Euro.
>
> Note: On Friday I have a meeting with a Sun partner in Germany
> about a bigger project, where I have to increase the
> performance of my database servers. I have to plan for
> 150,000 customers.
>
> Thanks, Greetings and nice Day/Evening
> Michelle Konzack
> Systemadministrator
> Tamay Dogan Network
> Debian GNU/Linux Consultant
> The problem is that we expect to have more than 250 GB of data every month.
Sounds like Teradata or Netezza territory
would suggest Oracle 11 for DB of 10TB or greater
http://www.oracle.com/solutions/performance_scalability/tpc-h-10tb-11g.html
Martin Gainty
______________________________________________
Subject: Re: [GENERAL] 10 TB database
Date: Tue, 16 Jun 2009 08:24:07 -0700
From: tlieberman@marchex.com
To: pgsql-general@postgresql.org; pgsql-general@postgresql.org
> The problem is that we expect to have more than 250 GB of data every month.
Sounds like Teradata or Netezza territory
On Mon, 15 Jun 2009, Artur wrote:

> We are thinking of creating a stock-related search engine.
> It is an experimental project, just for fun.

I hope your idea of fun involves spending a bunch of money on hardware and endless tweaking to get data loading every day with appropriate corrections, because that's just the first round of "fun" on a job like this.

> The problem is that we expect to have more than 250 GB of data every
> month. This data would be in two tables, with about 50,000,000 new rows every
> month. We want to have access to all the data, mostly for generating
> user-requested reports (aggregation). We would have about 10 TB of data in
> three years. Do you think it is possible to build this with PostgreSQL,
> and any idea how to start? :)

You start by figuring out what sort of business model is going to justify this very expensive adventure in today's market, where buyers of financial market products are pretty rare, but that's probably not the question you wanted an answer to.

You can certainly build a server capable of handling this job with PostgreSQL here in 2009. Get 8 cores, a stack of 24 1TB disks and a RAID card with a write cache, and you'll have a big enough system to handle the job.

Basic database design isn't too terribly difficult either. Stock data is trivial to partition up into tiny pieces at the database level (each day can be its own 250GB partition), and any system capable of holding that much data is going to have a giant stack of drives spreading out the disk I/O too.

The first level of problems you'll run into is how to keep up with loading data every day. The main way to get bulk data into PostgreSQL, COPY, isn't particularly fast, and you'll be hard pressed to keep up with 250GB/day unless you write a custom data loader that keeps multiple cores going with that load. Commercial databases have some better solutions to this problem in the base product, or easily available from third-party sources.

The much, much bigger problem here is how exactly you're going to provide a user interface to this data. You can't just give people access to the whole thing and let them run queries; the first person who executes something like "select symbol,avg(price) from tickdata group by symbol" because they want to see the average price of some stock over its lifetime is going to kill the whole server. You really need to generate the aggregated reports ahead of time, using an approach like materialized views, and then only let people grab those. It's possible to manually create materialized views in PostgreSQL (a sketch follows this message), but that will be yet another bit of custom development here.

The third level of issue is how you scale the app up if you're actually successful. It's hard enough to get 250GB of daily data loaded into a single database and storing 10TB of data somewhere; doing the job across a replicated set of servers, so you can spread the queries out, is even more "fun" than that.

P.S. If you're not already familiar with how to aggregate properly over a trading calendar that includes holidays and spots where the market is only open part of the day, give up now; that's the hardest issue specific to this particular type of application to get right, and a lot of people don't realize that early enough in the design process to properly plan for it.

--
* Greg Smith  gsmith@gregsmith.com  http://www.gregsmith.com  Baltimore, MD
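[Editor's note: a minimal sketch of the manual materialized-view idea Greg describes, assuming the hypothetical tickdata table from his example query; the trade_time and volume columns, the summary table, and the refresh function are all illustrative assumptions.]

    -- Pre-aggregated summary table; report queries hit this, never the raw ticks.
    CREATE TABLE daily_symbol_summary (
        trade_day date    NOT NULL,
        symbol    text    NOT NULL,
        avg_price numeric NOT NULL,
        hi_price  numeric NOT NULL,
        lo_price  numeric NOT NULL,
        volume    bigint  NOT NULL,
        PRIMARY KEY (trade_day, symbol)
    );

    -- On 8.x you may first need: CREATE LANGUAGE plpgsql;
    CREATE OR REPLACE FUNCTION refresh_daily_summary(p_day date) RETURNS void AS $$
    BEGIN
        -- Re-aggregate one day right after its data has been loaded.
        DELETE FROM daily_symbol_summary WHERE trade_day = p_day;
        INSERT INTO daily_symbol_summary
            SELECT p_day, symbol, avg(price), max(price), min(price), sum(volume)
            FROM tickdata
            WHERE trade_time >= p_day AND trade_time < p_day + 1
            GROUP BY symbol;
    END;
    $$ LANGUAGE plpgsql;

    -- Usage, from the nightly load job:
    --   SELECT refresh_daily_summary(DATE '2009-06-15');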
On Mon, 15 Jun 2009, Whit Armstrong wrote:

> Is anyone aware of additional examples or tutorials on partitioning?

http://www.pgcon.org/2007/schedule/events/41.en.html
http://blog.mozilla.com/webdev/2007/05/15/partitioning-fun-in-postgresql/
http://benjamin.smedbergs.us/blog/2007-05-12/when-partitioning-database-tables-explain-your-queries/

In that order, really; those go from general commentary down to focusing on specific issues people tend to run into.

--
* Greg Smith  gsmith@gregsmith.com  http://www.gregsmith.com  Baltimore, MD
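[Editor's note: for readers following those links, a minimal sketch of the inheritance-based scheme the 8.3 documentation describes, with illustrative names (parent "ticks", one child per period) rather than anything from the original posts.]

    -- Parent table holds no rows itself; the children hold the data.
    CREATE TABLE ticks (
        symbol     text NOT NULL,
        trade_time timestamptz NOT NULL,
        price      numeric(12,4) NOT NULL,
        volume     bigint NOT NULL
    );

    -- One child per period, with a CHECK constraint describing its range.
    CREATE TABLE ticks_2009_06 (
        CHECK (trade_time >= '2009-06-01' AND trade_time < '2009-07-01')
    ) INHERITS (ticks);

    CREATE INDEX ticks_2009_06_time_idx ON ticks_2009_06 (trade_time);

    -- With constraint_exclusion = on (8.3), a query against the parent only
    -- scans the children whose CHECK ranges match the WHERE clause:
    --   SELECT count(*) FROM ticks
    --    WHERE trade_time >= '2009-06-15' AND trade_time < '2009-06-16';

    -- Bulk loads can go straight into the matching child table, which avoids
    -- needing an INSERT-routing trigger on the parent.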
On Tue, 16 Jun 2009, Brent Wood wrote:

> For data retrieval, clustered indexes may help, but as this requires a
> physical reordering of the data on disk, it may be impractical.

This tends to be irrelevant for this sort of data, because it's normally inserted in a fairly clustered way in the first place. The usual way tables get unclustered involves random insertion and deletion, and that just doesn't happen for data that's being imported daily and never deleted afterwards; it's naturally clustered quite well.

--
* Greg Smith  gsmith@gregsmith.com  http://www.gregsmith.com  Baltimore, MD
Hi Greg,

On 2009-06-16 12:13:20, Greg Smith wrote:
> The first level of problems you'll run into is how to keep up with
> loading data every day. The main way to get bulk data into PostgreSQL,
> COPY, isn't particularly fast, and you'll be hard pressed to keep up with
> 250GB/day unless you write a custom data loader that keeps multiple cores

AFAIK he was talking about 250 GByte per month, which is around 8 GByte a day, or roughly 300 MByte per hour.

Thanks, Greetings and nice Day/Evening
Michelle Konzack
Systemadministrator
Tamay Dogan Network
Debian GNU/Linux Consultant

--
Linux-User #280138 with the Linux Counter, http://counter.li.org/
##################### Debian GNU/Linux Consultant #####################
<http://www.tamay-dogan.net/>  Michelle Konzack
<http://www.can4linux.org/>  c/o Vertriebsp. KabelBW
<http://www.flexray4linux.org/>  Blumenstrasse 2
Jabber linux4michelle@jabber.ccc.de  77694 Kehl/Germany
IRC #Debian (irc.icq.com)  Tel. DE: +49 177 9351947  ICQ #328449886
Tel. FR: +33 6 61925193
On Tue, 16 Jun 2009, Michelle Konzack wrote:

> On 2009-06-16 12:13:20, Greg Smith wrote:
>> you'll be hard pressed to keep up with 250GB/day unless you write a
>> custom data loader that keeps multiple cores
>
> AFAIK he was talking about 250 GByte per month, which is around 8 GByte a
> day, or roughly 300 MByte per hour.

Right, that was just a typo in my response; the comments reflected what he meant.

Note that your averages here presume you can spread the load out over a full 24-hour period--which you often can't, as this type of data tends to come in a big clump after market close and needs to be loaded ASAP for it to be useful.

It's harder than most people would guess to sustain that sort of rate against real-world data (which even fails to import some days) in PostgreSQL without running into a bottleneck in COPY, WAL traffic, or database disk I/O (particularly if there's any random access stuff going on concurrently with the load). Just because your RAID array can write at hundreds of MB/s does not mean you'll be able to sustain anywhere close to that during your loading.

--
* Greg Smith  gsmith@gregsmith.com  http://www.gregsmith.com  Baltimore, MD
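[Editor's note: a minimal sketch of the COPY-based daily load being discussed, switching to a per-day child table in the style of the earlier partitioning example; the file path, CSV format, and table names are assumptions.]

    BEGIN;

    -- Load the day's file directly into its own partition; going straight to
    -- the child table avoids any INSERT-routing trigger overhead on the parent.
    CREATE TABLE ticks_2009_06_16 (
        CHECK (trade_time >= '2009-06-16' AND trade_time < '2009-06-17')
    ) INHERITS (ticks);

    COPY ticks_2009_06_16 (symbol, trade_time, price, volume)
        FROM '/data/feeds/ticks_2009-06-16.csv' WITH CSV;

    -- Index after loading rather than before; COPY into an already-indexed
    -- table is one of the bottlenecks mentioned above.
    CREATE INDEX ticks_2009_06_16_time_idx ON ticks_2009_06_16 (trade_time);

    ANALYZE ticks_2009_06_16;

    COMMIT;

A custom parallel loader of the kind Greg describes would essentially split the incoming file and run several such COPY sessions over separate connections.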
2009/6/16 Martin Gainty <mgainty@hotmail.com>:
> would suggest Oracle 11 for DB of 10TB or greater
> http://www.oracle.com/solutions/performance_scalability/tpc-h-10tb-11g.html

You are joking, right?

Better invest that money in paying someone from -hackers to add the features required, if there are any! Or buy a heftier RAID with more disks...

--
GJ
2009/6/16 Grzegorz Jaśkiewicz <gryzman@gmail.com>:
> 2009/6/16 Martin Gainty <mgainty@hotmail.com>:
>> would suggest Oracle 11 for DB of 10TB or greater
>> http://www.oracle.com/solutions/performance_scalability/tpc-h-10tb-11g.html
>
> You are joking, right?
> Better invest that money in paying someone from -hackers to add the
> features required, if there are any! Or buy a heftier RAID with more disks...

You can throw a metric ton of hardware and development at a problem for the cost of an Oracle license.