Re: 10 TB database - Mailing list pgsql-general
From: Greg Smith
Subject: Re: 10 TB database
Date:
Msg-id: alpine.GSO.2.01.0906161142140.17014@westnet.com
In response to: 10 TB database (Artur <a_wronski@gazeta.pl>)
Responses: Re: 10 TB database
List: pgsql-general
On Mon, 15 Jun 2009, Artur wrote:

> We are thinking to create some stocks related search engine.
> It is experimental project just for fun.

I hope your idea of fun involves spending a bunch of money on hardware and endless tweaking to get data loading every day with appropriate corrections, because that's just the first round of "fun" on a job like this.

> The problem is that we expect to have more than 250 GB of data every
> month. This data would be in two tables. About 50.000.000 new rows every
> month. We want to have access to all the data mostly for generating user
> requesting reports (aggregating). We would have about 10TB of data in
> three years. Do you think is it possible to build this with postgresql
> and have any idea how to start? :)

You start by figuring out what sort of business model is going to justify this very expensive adventure in today's market, where buyers of financial market products are pretty rare, but that's probably not the question you wanted an answer to.

You can certainly build a server capable of handling this job with PostgreSQL here in 2009. Get 8 cores, a stack of 24 1TB disks and a RAID card with a write cache, and you'll have a big enough system to handle the job.

Basic database design isn't too terribly difficult either. Stock data is trivial to partition up into tiny pieces at the database level (each day can be its own 250GB partition), and any system capable of holding that much data is going to have a giant stack of drives spreading out the disk I/O too.

The first level of problems you'll run into is how to keep up with loading data every day. The main way to get bulk data into PostgreSQL, COPY, isn't particularly fast, and you'll be hard pressed to keep up with 250GB/day unless you write a custom data loader that keeps multiple cores going with that load. Commercial databases have some better solutions to this problem in the base product, or easily available from third-party sources.

The much, much bigger problem here is how exactly you're going to provide a user interface to this data. You can't just give people access to the whole thing and let them run queries; the first person who executes something like "select symbol,avg(price) from tickdata group by symbol" because they want to see the average price of some stock over its lifetime is going to kill the whole server. You really need to generate the aggregated reports ahead of time, using an approach like materialized views, and then only let people grab those. It's possible to manually create materialized views in PostgreSQL, but that will be yet another bit of custom development here.

The third level of issue is how you scale the app up if you're actually successful. It's hard enough to get 250GB of daily data loaded into a single database and storing 10TB of data somewhere; doing the job across a replicated set of servers, so you can spread the queries out, is even more "fun" than that.

P.S. If you're not already familiar with how to aggregate properly over a trading calendar that includes holidays and spots where the market is only open part of the day, give up now; that's the hardest issue specific to this particular type of application to get right, and a lot of people don't realize that early enough in the design process to properly plan for it.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
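[A few sketches to make the points above concrete. First, the per-day partitioning: in the PostgreSQL 8.x releases current in 2009, partitioning means inheritance plus CHECK constraints rather than any built-in syntax. This is a minimal sketch only; the table and column names (tickdata, trade_time, etc.) are hypothetical, not taken from the thread.]

-- Parent table; report queries go against this name.
CREATE TABLE tickdata (
    symbol      text        NOT NULL,
    trade_time  timestamptz NOT NULL,
    price       numeric     NOT NULL,
    volume      bigint      NOT NULL
);

-- One child table per trading day, with a CHECK constraint so that
-- constraint exclusion can skip partitions a query's WHERE clause rules out.
CREATE TABLE tickdata_20090615 (
    CHECK (trade_time >= '2009-06-15' AND trade_time < '2009-06-16')
) INHERITS (tickdata);

CREATE INDEX tickdata_20090615_symbol_idx
    ON tickdata_20090615 (symbol, trade_time);

-- Needed for the planner to actually prune partitions on 8.x.
SET constraint_exclusion = on;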
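[Second, the daily load. A hedged sketch of feeding COPY one partition at a time; the file path and column layout are made up. In practice the custom loader mentioned above would run several sessions like this in parallel, one input file per core, to keep the CPUs busy.]

-- Load one day's file straight into that day's partition.
COPY tickdata_20090615 (symbol, trade_time, price, volume)
    FROM '/data/ticks/2009-06-15.csv' WITH CSV;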
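[Third, a rough sketch of the hand-rolled materialized view idea, assuming the same hypothetical tables: aggregate into a summary table once per load and point the reporting UI at that, never at the raw ticks.]

-- Summary table standing in for a materialized view.
CREATE TABLE tickdata_daily_summary AS
    SELECT symbol,
           date_trunc('day', trade_time) AS trade_day,
           avg(price)  AS avg_price,
           sum(volume) AS total_volume
    FROM tickdata
    GROUP BY symbol, date_trunc('day', trade_time);

-- After each day's COPY, append just that day's aggregates instead of
-- rebuilding the whole summary.
INSERT INTO tickdata_daily_summary
    SELECT symbol,
           date_trunc('day', trade_time),
           avg(price),
           sum(volume)
    FROM tickdata_20090615
    GROUP BY symbol, date_trunc('day', trade_time);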