Thread: 110,000,000 rows

110,000,000 rows

From: John Gage
Please forgive this intrusion, and please ignore it, but how many
applications out there have 110,000,000 row tables?  I recently
multiplied 85,000 by 1,400 and said no way, Jose.

Thanks,

John Gage

Re: 110,000,000 rows

From: Alex Thurlow
I've had many times that size before, and things were very slow.  That's
when I partitioned it out.  Luckily that table was just for reporting and
could afford to be slow.  Are you thinking you'll need that many rows and
you just don't know how to handle it?  I would recommend partitioning if
at all possible.
http://www.postgresql.org/docs/current/static/ddl-partitioning.html
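
A minimal sketch of the inheritance-based scheme those docs describe, with
hypothetical table, column, and partition names:

CREATE TABLE measurement (
    id      bigserial,
    logdate date NOT NULL,
    value   numeric
);

-- One child per month; the CHECK constraint lets the planner skip
-- children that cannot match when constraint_exclusion is enabled.
CREATE TABLE measurement_2010_05 (
    CHECK (logdate >= DATE '2010-05-01' AND logdate < DATE '2010-06-01')
) INHERITS (measurement);

CREATE TABLE measurement_2010_06 (
    CHECK (logdate >= DATE '2010-06-01' AND logdate < DATE '2010-07-01')
) INHERITS (measurement);

-- Route inserts on the parent to the right child.
CREATE OR REPLACE FUNCTION measurement_insert_trigger() RETURNS trigger AS $$
BEGIN
    IF NEW.logdate >= DATE '2010-06-01' THEN
        INSERT INTO measurement_2010_06 VALUES (NEW.*);
    ELSE
        INSERT INTO measurement_2010_05 VALUES (NEW.*);
    END IF;
    RETURN NULL;  -- the parent itself keeps no rows
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER measurement_insert
    BEFORE INSERT ON measurement
    FOR EACH ROW EXECUTE PROCEDURE measurement_insert_trigger();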

     -Alex

On 5/26/2010 3:29 PM, John Gage wrote:
> Please forgive this intrusion, and please ignore it, but how many
> applications out there have 110,000,000 row tables?  I recently
> multiplied 85,000 by 1,400 and said no way, Jose.
>
> Thanks,
>
> John Gage
>

Re: 110,000,000 rows

From: Vick Khera
On Wed, May 26, 2010 at 4:29 PM, John Gage <jsmgage@numericable.fr> wrote:
> Please forgive this intrusion, and please ignore it, but how many
> applications out there have 110,000,000 row tables?  I recently multiplied
> 85,000 by 1,400 and said no way, Jose.

/me stands and waves hand.  I have two tables in my primary app that
are of this magnitude.  Each is served by 100 partitions, split on mod
100 of the primary key, as that worked out well for my use case.
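
A minimal sketch of that kind of mod-based split, using the same
inheritance mechanism as the partitioning docs (table and column names are
hypothetical, and only two of the hundred children are shown):

CREATE TABLE events (
    id      bigint NOT NULL,
    payload text
);

CREATE TABLE events_p00 (CHECK (id % 100 = 0)) INHERITS (events);
CREATE TABLE events_p01 (CHECK (id % 100 = 1)) INHERITS (events);
-- ... and so on through events_p99

-- For constraint exclusion to prune children, the query needs to repeat
-- the same modulus expression the CHECK constraints use:
SELECT * FROM events WHERE id = 12345 AND id % 100 = 12345 % 100;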

Re: 110,000,000 rows

From: Thom Brown
On 26 May 2010 21:29, John Gage <jsmgage@numericable.fr> wrote:
> Please forgive this intrusion, and please ignore it, but how many
> applications out there have 110,000,000 row tables?  I recently multiplied
> 85,000 by 1,400 and said no way, Jose.
>
> Thanks,
>
> John Gage
>

There's no reason why a table can't have that many rows.  There's no
limit on the number of rows in a table, only on table size, which is
capped at 32 TB.  If you ever have a table that big, though, you
probably need to rethink your schema.
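
For reference, a quick way to check how large an existing table is (the
table name is hypothetical; pg_size_pretty and pg_total_relation_size are
standard built-ins):

SELECT pg_size_pretty(pg_total_relation_size('my_big_table'));  -- heap + indexes + TOAST
SELECT pg_size_pretty(pg_relation_size('my_big_table'));        -- main heap only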

Thom

Re: 110,000,000 rows

From: "Dann Corbit"
> -----Original Message-----
> From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-
> owner@postgresql.org] On Behalf Of Thom Brown
> Sent: Wednesday, May 26, 2010 2:14 PM
> To: John Gage
> Cc: PostgreSQL - General
> Subject: Re: [GENERAL] 110,000,000 rows
>
> On 26 May 2010 21:29, John Gage <jsmgage@numericable.fr> wrote:
> > Please forgive this intrusion, and please ignore it, but how many
> > applications out there have 110,000,000 row tables?  I recently
> > multiplied 85,000 by 1,400 and said no way, Jose.
> >
> > Thanks,
> >
> > John Gage
> >
>
> There's no reason why a table can't have that many rows.  There's no
> limit on the number of rows in a table, only on table size, which is
> capped at 32 TB.  If you ever have a table that big, though, you
> probably need to rethink your schema.

There really are domains that big, where no amount of normalization or other processing can mitigate the problem.

Examples:
Microsoft's registered customers database (all MS products bought by any customer, including operating systems)
Tolls taken on the New Jersey road system for FY 2009
DNA data from the Human Genome Project
Protein data from the Protein Folding Project
The US Census Bureau's TIGER/Line data
Online orders processed by Amazon.com
Cellular phone calls for T-Mobile for 2008
FedEx shipments worldwide in 2008

We work with mainframe data on a regular basis, and files of that size are not really very unusual.  Expansion of data
is exponential over time.  We need to prepare for it.  Database systems that cannot handle the volume will be supplanted
by those that can.

Re: 110,000,000 rows

From: David Wilson


On Wed, May 26, 2010 at 4:29 PM, John Gage <jsmgage@numericable.fr> wrote:
Please forgive this intrusion, and please ignore it, but how many applications out there have 110,000,000 row tables?  I recently multiplied 85,000 by 1,400 and said no way, Jose.


I've got a DB with two 500,000,000+ row tables. Performance is quite good even on very limited hardware if you set up the tables correctly (cluster, indices, etc.) and have a reasonable usage pattern.
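
A rough sketch of that kind of setup, assuming a timestamp-driven access
pattern (the table, column, and index names are hypothetical):

CREATE INDEX samples_recorded_at_idx ON samples (recorded_at);

-- Rewrite the table in index order so range scans touch far fewer pages.
-- CLUSTER is a one-time sort, so repeat it periodically as data grows.
CLUSTER samples USING samples_recorded_at_idx;
ANALYZE samples;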
 
--
- David T. Wilson
david.t.wilson@gmail.com

Re: 110,000,000 rows

From: david@gardnerit.net
At work I have one table with 32 million rows, not quite the size you
are talking about, but to give you an idea of the performance, the
following query returns 14,659 rows in 405ms:

SELECT * FROM farm.frame
WHERE process_start > '2010-05-26';

process_start is a timestamp without time zone column, and is covered by
an index. Rows are relatively evenly distributed over time, so the index
performs quite well.

A between select also performs well:
SELECT * FROM farm.frame
WHERE process_start
  BETWEEN '2010-05-26 08:00:00'
    AND '2010-05-26 09:00:00';

fetches 1,350 rows in 25ms.

I also have a summary table that is maintained by triggers, which is a
bit of denormalization, but speeds up common reporting queries.
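
A minimal sketch of what such a trigger-maintained summary might look
like, reusing the farm.frame names from above; the summary table, function,
and trigger names are hypothetical, and concurrent inserts are ignored:

CREATE TABLE farm.frame_daily_summary (
    process_day date PRIMARY KEY,
    frame_count bigint NOT NULL
);

CREATE OR REPLACE FUNCTION farm.frame_summary_trigger() RETURNS trigger AS $$
BEGIN
    -- Bump the per-day count; create the day's row on first sight.
    UPDATE farm.frame_daily_summary
       SET frame_count = frame_count + 1
     WHERE process_day = NEW.process_start::date;
    IF NOT FOUND THEN
        INSERT INTO farm.frame_daily_summary (process_day, frame_count)
        VALUES (NEW.process_start::date, 1);
    END IF;
    RETURN NULL;  -- AFTER trigger, so the return value is ignored
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER frame_summary
    AFTER INSERT ON farm.frame
    FOR EACH ROW EXECUTE PROCEDURE farm.frame_summary_trigger();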

On 22:29 Wed 26 May, John Gage wrote:
> Please forgive this intrusion, and please ignore it, but how many
> applications out there have 110,000,000 row tables?  I recently
> multiplied 85,000 by 1,400 and said no way, Jose.
>
> Thanks,
>
> John Gage
>

Re: 110,000,000 rows

From: John Gage
Herbert Simon must be spinning in his grave...or smiling wisely.  What
does a human do with a petabyte of data?

But when a desktop machine for $1700 retail has a terabyte of storage,
a Unix operating system, 4 gigs of memory, and an amazing 27-inch
display, I guess hardware isn't the problem (and I know one could put
together the same machine on Linux etc. for much less).

I sort of understood that the Amazons of the world had this amount of
data, but it looks like the phenomenon is much, much more widespread.

Thanks for the instruction.  It will come in handy.

John



On May 27, 2010, at 12:18 AM, david@gardnerit.net wrote:

> At work I have one table with 32 million rows, not quite the size you
> are talking about, but to give you an idea of the performance, the
> following query returns 14,659 rows in 405ms:
>
> SELECT * FROM farm.frame
> WHERE process_start > '2010-05-26';
>
> process_start is a timestamp without time zone column, and is covered by
> an index. Rows are relatively evenly distributed over time, so the index
> performs quite well.
>
> A between select also performs well:
> SELECT * FROM farm.frame
> WHERE process_start
>  BETWEEN '2010-05-26 08:00:00'
>    AND '2010-05-26 09:00:00';
>
> fetches 1,350 rows in 25ms.
>
> I also have a summary table that is maintained by triggers, which is a
> bit of denormalization, but speeds up common reporting queries.
>
> On 22:29 Wed 26 May, John Gage wrote:
>> Please forgive this intrusion, and please ignore it, but how many
>> applications out there have 110,000,000 row tables?  I recently
>> multiplied 85,000 by 1,400 and said no way, Jose.
>>
>> Thanks,
>>
>> John Gage
>>


Re: 110,000,000 rows

From: "Massa, Harald Armin"
Dann,

There really are domains that big, where no amount of normalization or other processing can mitigate the problem.

Examples:
Microsoft's registered customers database (all MS products bought by any customer, including operating systems)
Tolls taken on the New Jersey road system for FY 2009
DNA data from the Human Genome Project

.....

Please also think of our most risk-exposed users, the ones using poker / roulette simulation and analysis software with a PostgreSQL database underneath. There are so many rounds of poker to play .... :)

Harald
 

--
GHUM Harald Massa
persuadere et programmare
Harald Armin Massa
Spielberger Straße 49
70435 Stuttgart
0173/9409607
no fx, no carrier pigeon
-
Using PostgreSQL is mostly about sleeping well at night.

Re: 110,000,000 rows

From: Nikolas Everett
I've had a reporting database with just about a billion rows.  Each row was horribly large because the legacy schema had problems.  We partitioned it out by month, and it ran to about 30 million rows a month.  With a reasonably large box you can get that kind of data into memory, and indexes are almost unnecessary.  So long as you have constraint exclusion and a good partition scheme you should be fine.  Throw in a well-designed schema and you'll be cooking well into the tens of billions of rows.
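
As a rough illustration of the constraint-exclusion part (hypothetical
names; bigtable stands in for the monthly-partitioned parent, and
'partition' is the default setting from 8.4 on):

SET constraint_exclusion = partition;

EXPLAIN
SELECT count(*)
FROM bigtable
WHERE "timestamp" >= DATE '2010-04-01'
  AND "timestamp" <  DATE '2010-05-01';
-- With per-month CHECK constraints on the children, the plan should show
-- only the April partition being scanned.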

We ran self joins of that table reasonably consistently by the way:
SELECT lhs.id, rhs.id
FROM bigtable lhs, bigtable rhs
WHERE lhs.id > rhs.id 
     AND '' > lhs.timestamp AND lhs.timestamp >= ''
     AND '' > rhs.timestamp AND rhs.timestamp >= ''
     AND lhs.timestamp = rhs.timestamp
     AND lhs.foo = rhs.foo
     AND lhs.bar = rhs.bar

This query really liked the timestamp index, and we had to be careful to only run it for a few days at a time.  It took a few minutes each go, but it was definitely doable.

Once you get this large, you do have to be careful with a few things, though:
*It's somewhat easy to write super-long queries or updates.  This can leave lots of dead rows in your tables.  Limit your longest-running queries to a day or so.  Note that queries are unlikely to take that long, but updates with massive date ranges could.  SELECT COUNT(*) FROM bigtable took about 30 minutes when the server wasn't under heavy load.
*You sometimes get bad plans because:
**You don't or can't get enough statistics about a column.
**PostgreSQL doesn't capture statistics about two columns together.  PostgreSQL has no way of knowing that columnA = 'foo' implies columnB = 'bar' about 30% of the time (one partial mitigation is sketched below).
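
A small sketch of the single-column mitigation hinted at just above (table
and column names are hypothetical; raising the per-column statistics target
was the usual first step at the time):

ALTER TABLE bigtable ALTER COLUMN foo SET STATISTICS 1000;  -- higher than the default target
ANALYZE bigtable;
-- Cross-column correlation still isn't captured; common workarounds were
-- rewriting the query or partitioning on the correlated key.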

Nik

On Thu, May 27, 2010 at 5:58 AM, Massa, Harald Armin <chef@ghum.de> wrote:
Dann,

There really are domains that big, where no amount of normalization or other processing can mitigate the problem.

Examples:
Microsoft's registered customers database (all MS products bought by any customer, including operating systems)
Tolls taken on the New Jersey road system for FY 2009
DNA data from the Human Genome Project

.....

Please also think of our most risk-exposed users, the ones using poker / roulette simulation and analysis software with a PostgreSQL database underneath. There are so many rounds of poker to play .... :)

Harald
 

--
GHUM Harald Massa
persuadere et programmare
Harald Armin Massa
Spielberger Straße 49
70435 Stuttgart
0173/9409607
no fx, no carrier pigeon
-
Using PostgreSQL is mostly about sleeping well at night.

Re: 110,000,000 rows

From: Thom Brown
On 27 May 2010 14:48, Nikolas Everett <nik9000@gmail.com> wrote:
> I've had a reporting database with just about a billion rows.  Each row
> was horribly large because the legacy schema had problems.  We partitioned
> it out by month and it ran about 30 million rows a month.  With a reasonably
> large box you can get that kind of data into memory and indexes are
> almost unnecessary.  So long as you have constraint exclusion and a good
> partition scheme you should be fine.  Throw in a well designed schema and
> you'll be cooking well into the tens of billions of rows.
> We ran self joins of that table reasonably consistently by the way:
> SELECT lhs.id, rhs.id
> FROM bigtable lhs, bigtable rhs
> WHERE lhs.id > rhs.id
>      AND '' > lhs.timestamp AND lhs.timestamp >= ''
>      AND '' > rhs.timestamp AND rhs.timestamp >= ''
>      AND lhs.timestamp = rhs.timestamp
>      AND lhs.foo = rhs.foo
>      AND lhs.bar = rhs.bar
> This really liked the timestamp index and we had to be careful to only do it
> for a few days at a time.  It took a few minutes each go but it was
> definitely doable.
> Once you get this large you do have to be careful with a few things though:
> *It's somewhat easy to write super-long queries or updates.  This can leave
> lots of dead rows in your tables.  Limit your longest-running queries to a day
> or so.  Note that queries are unlikely to take that long, but updates with
> massive date ranges could.  SELECT COUNT(*) FROM bigtable took about 30
> minutes when the server wasn't under heavy load.
> *You sometimes get bad plans because:
> **You don't or can't get enough statistics about a column.
> **PostgreSQL doesn't capture statistics about two columns together.
>  PostgreSQL has no way of knowing that columnA = 'foo' implies columnB =
> 'bar' about 30% of the time.
> Nik

What's that middle bit about?

>      AND '' > lhs.timestamp AND lhs.timestamp >= ''
>      AND '' > rhs.timestamp AND rhs.timestamp >= ''

An empty string greater than the timestamp?  What is that doing, out of curiosity?

Thom

Re: 110,000,000 rows

From: Torsten Zühlsdorff
John Gage schrieb:

> Please forgive this intrusion, and please ignore it, but how many
> applications out there have 110,000,000 row tables?  I recently
> multiplied 85,000 by 1,400 and said no way, Jose.

I have two private applications with about 250,000,000 rows per table. I
could cluster them, but the performance is good enough.

Greetings from Germany,
Torsten

--
http://www.dddbl.de - a database layer that abstracts working with 8
different database systems, separates queries from applications, and can
automatically evaluate query results.

Re: 110,000,000 rows

From: Torsten Zühlsdorff
John Gage schrieb:

> Herbert Simon must be spinning in his grave...or smiling wisely.  What
> does a human do with a petabyte of data?

For example, I have a private search engine for my most often used sites.
Google and the other ones always know just a part of the whole site; my
own one knows all of it. It's a good research tool (and mirror) and supports
a lot more filter possibilities than Google. There are many great internet
sites out there which have no search. After waiting for the crawling, this
is no longer a problem for me.

Another big example in my private use is a neural network for figuring
out relations between news and stock prices. Or statistical data of
website usage. Oh - analysing the behavior of Google is also great fun
with a lot of data. Or a database for typical games like chess or poker
or something like this. I also have some databases with geo-data or freely
available data like statistics about birth numbers in Germany, a list of
all German cities with their inhabitants (grouped by gender), and so on.

Or calculating a list of prime numbers on your own just to do some
implementation tests. Sometimes these databases just grow because you want
to see how long it takes to get x results and forget to disable the
test after reaching the limit :D

> But when a desktop machine for $1700 retail has a terabyte of storage,
> the unix operating system, 4 gigs of memory, and an amazing 27 inch
> display, I guess hardware isn't the problem (and I know one could put
> together the same machine on Linux etc. for much less).

Yes, and for private use you can use such a desktop machine as a
database server. It can work while you're at work ;)

Greetings from Germany,
Torsten
--
http://www.dddbl.de - a database layer that abstracts working with 8
different database systems, separates queries from applications, and can
automatically evaluate query results.

Re: 110,000,000 rows

From: Jasen Betts
On 2010-05-26, John Gage <jsmgage@numericable.fr> wrote:
> Please forgive this intrusion, and please ignore it, but how many
> applications out there have 110,000,000 row tables?  I recently
> multiplied 85,000 by 1,400 and said no way, Jose.

Census data would be one.
USA phone white pages.
Transaction records at a medium-sized bank.

You're probably going to want to partition it somehow.


Re: 110,000,000 rows

From: Steve Crawford
On 05/31/2010 03:48 AM, Jasen Betts wrote:
> On 2010-05-26, John Gage<jsmgage@numericable.fr>  wrote:
>
>> Please forgive this intrusion, and please ignore it, but how many
>> applications out there have 110,000,000 row tables?  I recently
>> multiplied 85,000 by 1,400 and said no way, Jose.
>>
> Census data would be one.
> USA phone white pages.
> Transaction records at a medium-sized bank
Text messages, phone bills, tweets, etc. I have single tables of
market-research-related data that exceed 80 million rows.

But a question to the OP: Setting aside for the moment that
85,000 * 1,400 = 119,000,000, not 110,000,000, what is the significance to
you of these numbers?

Cheers,
Steve



Re: 110,000,000 rows

From: John Gage
I was aware that there are, in fact, many applications such as census
data or cell phone calls that would easily surpass this number.
However, these applications exist in very large companies/organizations
that can throw essentially unlimited resources at the problem.  One
thinks of the NSA's computers, for example.

What I was surprised to learn is the extremely common occurrence of
table sizes like this even in databases run by single individuals for
individual needs.

Putting aside the cognitive challenge posed by data like this, what
made me glad to have asked the question were the solutions provided in
the responses.

John



On Jun 1, 2010, at 10:27 PM, Steve Crawford wrote:

>> Transaction records at a medium-sized bank
> Text messages, phone-bills, tweets, etc. I have single tables of
> market-research-related data that exceed 80-million rows.
>
> But a question to the OP: Setting aside for the moment that
> 85000*1400=119,000,000, not 110,000,000; what is the significance to
> you of these numbers?
>
> Cheers,
> Steve
>
>
>