Re: 10 TB database - Mailing list pgsql-general

From: Greg Smith
Subject: Re: 10 TB database
Date:
Msg-id: alpine.GSO.2.01.0906161142140.17014@westnet.com
In response to: 10 TB database (Artur <a_wronski@gazeta.pl>)
Responses: Re: 10 TB database
List: pgsql-general
On Mon, 15 Jun 2009, Artur wrote:

> We are thinking of creating a stocks-related search engine.
> It is an experimental project, just for fun.

I hope your idea of fun involves spending a bunch of money on hardware and
endless tweaking to get data loading every day with appropriate
corrections, because that's just the first round of "fun" on a job like
this.

> The problem is that we expect to have more than 250 GB of data every
> month. This data would be in two tables, about 50.000.000 new rows every
> month. We want to have access to all the data, mostly for generating
> user-requested reports (aggregating). We would have about 10TB of data in
> three years. Do you think it is possible to build this with postgresql,
> and do you have any idea how to start? :)

You start by figuring out what sort of business model is going to justify
this very expensive adventure in today's market, where buyers of financial
market products are pretty rare, but that's probably not the question you
wanted an answer to.

You can certainly build a server capable of handling this job with
PostgreSQL here in 2009.  Get 8 cores, a stack of 24 1TB disks and a RAID
card with a write cache, and you'll have a big enough system to handle the
job.  Basic database design isn't too terribly difficult either.  Stock
data is trivial to partition up into tiny pieces at the database level
(each trading day can be its own partition), and any system capable of
holding that much data is going to have a giant stack of drives spreading
out the disk I/O too.
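
Just as a rough sketch of the partitioning side (the table and column names
here are made up, so adjust them to your actual schema), the usual approach
on 8.x is inheritance plus CHECK constraints, so constraint exclusion can
skip partitions a query's date range can't touch:

  -- Hypothetical schema, for illustration only
  CREATE TABLE tickdata (
      symbol     text      NOT NULL,
      trade_time timestamp NOT NULL,
      price      numeric   NOT NULL,
      volume     integer   NOT NULL
  );

  -- One child table per trading day; the CHECK constraint is what lets
  -- the planner exclude partitions a query can't possibly need.
  CREATE TABLE tickdata_20090615 (
      CHECK (trade_time >= '2009-06-15' AND trade_time < '2009-06-16')
  ) INHERITS (tickdata);

  CREATE INDEX tickdata_20090615_idx
      ON tickdata_20090615 (symbol, trade_time);

  -- Queries go against the parent; only the matching children get scanned.
  SET constraint_exclusion = on;
  SELECT avg(price) FROM tickdata
   WHERE symbol = 'XYZ'
     AND trade_time >= '2009-06-15' AND trade_time < '2009-06-16';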

The first level of problems you'll run into is how to keep up with
loading data every day.  The main way to get bulk data into PostgreSQL,
COPY, isn't particularly fast, and you'll be hard pressed to keep up with
that kind of volume unless you write a custom data loader that keeps
multiple cores busy with the load.  Commercial databases have better
solutions to this problem, either in the base product or easily available
from third-party sources.
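
To give you an idea of what such a loader ends up doing (the file names and
paths here are hypothetical), it mostly comes down to splitting each
incoming file and running one COPY per chunk, each in its own connection,
because a single COPY only ever uses one core:

  -- Split the day's file outside the database first, something like
  --   split -l 500000 ticks_20090615.csv /data/incoming/chunk_
  -- then run one COPY per chunk, each from a separate session, so each
  -- gets its own backend process and therefore its own CPU core.
  -- (Server-side COPY needs the files on the database server; \copy or
  -- a client program doing the same thing works from anywhere.)
  COPY tickdata_20090615 FROM '/data/incoming/chunk_aa' WITH CSV;  -- session 1
  COPY tickdata_20090615 FROM '/data/incoming/chunk_ab' WITH CSV;  -- session 2
  COPY tickdata_20090615 FROM '/data/incoming/chunk_ac' WITH CSV;  -- session 3
  COPY tickdata_20090615 FROM '/data/incoming/chunk_ad' WITH CSV;  -- session 4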

The much, much bigger problem here is how exactly you're going to provide
a user interface to this data.  You can't just give people access to the
whole thing and let them run queries; the first person who executes
something like "select symbol,avg(price) from tickdata group by symbol"
because they want to see the average price of some stock over its lifetime
is going to kill the whole server.  You really need to generate the
aggregated reports ahead of time, using an approach like materialized
views, and then only let people grab those.  It's possible to manually
create materialized views in PostgreSQL, but that will be yet another bit
of custom development here.
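
A manual materialized view here doesn't have to be anything fancier than a
summary table you refresh after each day's load; the names below are again
made up, but it looks something like this, and the user-facing reports only
ever read the summary:

  -- Hypothetical summary table, maintained by the nightly load job
  CREATE TABLE daily_symbol_summary (
      trade_date date,
      symbol     text,
      avg_price  numeric,
      volume     bigint,
      PRIMARY KEY (trade_date, symbol)
  );

  -- Refresh step, run once right after that day's partition is loaded:
  INSERT INTO daily_symbol_summary
  SELECT trade_time::date, symbol, avg(price), sum(volume)
    FROM tickdata_20090615
   GROUP BY 1, 2;

  -- Reports hit the small summary table instead of the raw tick data:
  SELECT symbol, avg(avg_price) FROM daily_symbol_summary GROUP BY symbol;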

The third level of issue is how you scale the app up if you're actually
successful.  It's hard enough to get that much data loaded into a single
database every day and to store 10TB of it somewhere; doing the job across
a replicated set of servers, so you can spread the queries out, is even
more "fun" than that.

P.S. If you're not already familiar with how to aggregate properly over a
trading calendar that includes holidays and days when the market is only
open part of the time, give up now; that's the hardest application-specific
issue to get right here, and a lot of people don't realize that early
enough in the design process to plan for it properly.
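
If you do press on anyway, the standard fix is a calendar table you join
against, so aggregates only ever see the sessions the exchange was actually
open for; the layout below is just a sketch, and the holiday and half-day
data has to come from the exchange itself:

  -- Sketch only; populate this from the exchange's published calendar
  CREATE TABLE trading_calendar (
      trade_date    date PRIMARY KEY,
      is_open       boolean NOT NULL,
      session_close time    NOT NULL  -- e.g. 13:00 on half days
  );

  -- "Average over the last 20 days" then means the last 20 *trading*
  -- days, not the last 20 calendar days:
  SELECT s.symbol, avg(s.avg_price)
    FROM daily_symbol_summary s
    JOIN (SELECT trade_date FROM trading_calendar
           WHERE is_open AND trade_date <= current_date
           ORDER BY trade_date DESC LIMIT 20) last20 USING (trade_date)
   GROUP BY s.symbol;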

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
