Re: 110,000,000 rows - Mailing list pgsql-general

From Nikolas Everett
Subject Re: 110,000,000 rows
Date
Msg-id AANLkTiko6IWLWZI1skIXk8ikWAEDtwqEXbHg3NFfVF-i@mail.gmail.com
In response to Re: 110,000,000 rows  ("Massa, Harald Armin" <chef@ghum.de>)
Responses Re: 110,000,000 rows  (Thom Brown <thombrown@gmail.com>)
List pgsql-general
I've had a reporting database with just about a billion rows.  Each row was horribly large because the legacy schema had problems.  We partitioned it out by month and it ran about 30 million rows a month.  With a reasonably large box you can get that kind of data into memory, and indexes are almost unnecessary.  So long as you have constraint exclusion enabled and a good partition scheme you should be fine.  Throw in a well-designed schema and you'll be cooking well into the tens of billions of rows.
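
To give a rough idea (table and column names here are guessed from the query below, so treat this as a sketch rather than our actual DDL), one month of an inheritance-based partition scheme looks something like this:

CREATE TABLE bigtable (
    id        bigint      NOT NULL,
    timestamp timestamptz NOT NULL,
    foo       text,
    bar       text
);

-- One child table per month, with a CHECK constraint the planner can use
-- to rule the partition out when a query's date range can't overlap it.
CREATE TABLE bigtable_2010_05 (
    CHECK (timestamp >= '2010-05-01' AND timestamp < '2010-06-01')
) INHERITS (bigtable);

CREATE INDEX bigtable_2010_05_timestamp_idx ON bigtable_2010_05 (timestamp);

-- Constraint exclusion is what lets the planner skip the other months.
-- (Routing inserts to the right child via a trigger or the load job is omitted here.)
SET constraint_exclusion = partition;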

We ran self-joins of that table reasonably consistently, by the way:
SELECT lhs.id, rhs.id
FROM bigtable lhs, bigtable rhs
WHERE lhs.id > rhs.id
     AND '' > lhs.timestamp AND lhs.timestamp >= ''   -- the '' are placeholders for the window bounds
     AND '' > rhs.timestamp AND rhs.timestamp >= ''
     AND lhs.timestamp = rhs.timestamp
     AND lhs.foo = rhs.foo
     AND lhs.bar = rhs.bar

This query really liked the timestamp index, and we had to be careful to only run it for a few days at a time.  It took a few minutes each go, but it was definitely doable.
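
If you go that route, it's worth sanity-checking that constraint exclusion and the timestamp index actually kick in before launching the big self-join; something along these lines (dates made up):

EXPLAIN
SELECT count(*)
FROM bigtable
WHERE timestamp >= '2010-05-03' AND timestamp < '2010-05-05';
-- With constraint exclusion working, the plan scans only the (empty) parent
-- and bigtable_2010_05, not every monthly child.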

Once you get this large you do have to be careful with a few things though:
*It's somewhat easy to write super long queries or updates.  These can leave lots of dead rows in your tables.  Limit your longest-running queries to a day or so.  Note that queries are unlikely to take that long, but updates with massive date ranges could.  SELECT COUNT(*) FROM bigtable took about 30 minutes when the server wasn't under heavy load.
*You sometimes get bad plans because:
**You don't or can't get enough statistics about a column (the usual knob for this is sketched just after this list).
**PostgreSQL doesn't capture statistics about two columns together.  PostgreSQL has no way of knowing that columnA = 'foo' implies columnB = 'bar' about 30% of the time.
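
For the single-column case the usual knob is the per-column statistics target; a sketch (names hypothetical, and 1000 is just an example value):

-- Collect a bigger sample for the column the planner keeps misjudging,
-- then re-analyze so the larger histogram is actually used.
ALTER TABLE bigtable ALTER COLUMN foo SET STATISTICS 1000;
ANALYZE bigtable;
-- There is no equivalent knob for correlations between two columns, so the
-- columnA/columnB case above stays guesswork for the planner.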

Nik

On Thu, May 27, 2010 at 5:58 AM, Massa, Harald Armin <chef@ghum.de> wrote:
Dann,

There really are domains that big, where no amount of normalization or other processing will mitigate the problem.

Examples:
Microsoft's registered customers database (all MS products bought by any customer, including operating systems)
Tolls taken on the New Jersey road system for FY 2009
DNA data from the Human Genome Project

.....

Please also think of our most risk-exposed users, the ones using poker / roulette simulation and analysis software with a PostgreSQL database underneath. There are so many rounds of poker to play .... :)

Harald
 

--
GHUM Harald Massa
persuadere et programmare
Harald Armin Massa
Spielberger Straße 49
70435 Stuttgart
0173/9409607
no fx, no carrier pigeon
-
Using PostgreSQL is mostly about sleeping well at night.
