Thread: Very large tables

Very large tables

From
"William Temperley"
Date:
Hi all

Has anyone any experience with very large tables?

I've been asked to store a grid of 1.5 million geographical locations,
fine. However, associated with each point are 288 months, and
associated with each month are 500 float values (a distribution
curve), i.e. 1,500,000 * 288 * 500 = 216 billion values :).

So a 216 billion row table is probably out of the question. I was
considering storing the 500 floats as bytea.

This means I'll need a table something like this:

 grid_point_id | month_id | distribution_curve
 (int4)        | (int2)   | (bytea?)
---------------+----------+--------------------
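
Or, roughly as DDL (the table name and the primary key are just a first
guess on my part):

CREATE TABLE distribution (
    grid_point_id      integer  NOT NULL,   -- 1.5 million grid points
    month_id           smallint NOT NULL,   -- 288 months per point
    distribution_curve bytea,               -- the 500 floats, packed
    PRIMARY KEY (grid_point_id, month_id)
);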

Any advice would be appreciated, especially on the storage of the 500 floats.

Another (somewhat far-fetched) possibility was a custom data type which
delegated its data access to HDF5 somehow, perhaps by storing a reference
to a value location. The reason for this is that the data will be written
using PyTables and HDF5. It is produced in 500 runs, each providing one
value of the distribution curve for every point and month (i.e. 500
updates of a 432-million-row table... no thanks). Querying is the
opposite: we want the whole chunk of 500 values at a time. Is this a
fantasy?


Cheers

Will T

Re: Very large tables

From
Alvaro Herrera
Date:
William Temperley wrote:

> I've been asked to store a grid of 1.5 million geographical locations,
> fine. However, associated with each point are 288 months, and
> associated with each month are 500 float values (a distribution
> curve), i.e. 1,500,000 * 288 * 500 = 216 billion values :).
>
> So a 216 billion row table is probably out of the question. I was
> considering storing the 500 floats as bytea.

What about a float array, float[]?
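
Something like this, just as a sketch (names are placeholders):

CREATE TABLE distribution (
    grid_point_id      integer  NOT NULL,
    month_id           smallint NOT NULL,
    distribution_curve float4[]             -- 500 values, about 2 kB per row
);

-- a whole curve comes back as one value:
SELECT distribution_curve
  FROM distribution
 WHERE grid_point_id = 42 AND month_id = 7;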

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Re: Very large tables

From
"Grzegorz Jaśkiewicz"
Date:


On Fri, Nov 28, 2008 at 3:48 PM, Alvaro Herrera <alvherre@commandprompt.com> wrote:
William Temperley wrote:

> I've been asked to store a grid of 1.5 million geographical locations,
> fine. However, associated with each point are 288 months, and
> associated with each month are 500 float values (a distribution
> curve), i.e. 1,500,000 * 288 * 500 = 216 billion values :).
>
> So a 216 billion row table is probably out of the question. I was
> considering storing the 500 floats as bytea.

What about a float array, float[]?
You seriously don't want to use bytea to store anything, especially when a
matching datatype exists in your database of choice.
Also, consider partitioning it :)

Try to follow the rules of normalization; with that sort of data, the less
storage space used, the better :)
And I would look for a machine with rather fast (and spacious) RAID storage :)
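
For the partitioning, a rough sketch of the inheritance-based approach
(the ranges and names below are only placeholders):

-- the parent holds no data itself
CREATE TABLE distribution (
    grid_point_id      integer  NOT NULL,
    month_id           smallint NOT NULL,
    distribution_curve float4[]
);

-- one child per range of grid points, routed by CHECK constraints
CREATE TABLE distribution_p00 (
    CHECK (grid_point_id >= 0      AND grid_point_id < 100000)
) INHERITS (distribution);

CREATE TABLE distribution_p01 (
    CHECK (grid_point_id >= 100000 AND grid_point_id < 200000)
) INHERITS (distribution);

-- lets the planner skip children whose CHECK rules them out
SET constraint_exclusion = on;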



--
GJ

Re: Very large tables

From
"William Temperley"
Date:
On Fri, Nov 28, 2008 at 3:48 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
> William Temperley wrote:
>> So a 216 billion row table is probably out of the question. I was
>> considering storing the 500 floats as bytea.
>
> What about a float array, float[]?

I guess that would be the obvious choice... Just a lot of storage
space required, I imagine.

On Fri, Nov 28, 2008 at 4:03 PM, Grzegorz Jaśkiewicz <gryzman@gmail.com> wrote:
>
>
> You seriously don't want to use bytea to store anything, especially when a
> matching datatype exists in your database of choice.
> Also, consider partitioning it :)
>
> Try to follow the rules of normalization; with that sort of data, the less
> storage space used, the better :)

Any more normalized and I'd have 216 billion rows! Add an index and
I'd have - well, a far bigger table than 432 million rows each
containing a float array - I think?

Really I'm worried about reducing storage space and network overhead
- therefore a nicely compressed chunk of binary would be perfect for
the 500 values - wouldn't it?
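
For scale, the raw array data alone comes to something like this (ignoring
row headers, indexes and TOAST bookkeeping):

-- 1.5M points * 288 months * 500 values * 4 bytes per float4
SELECT 1500000::bigint * 288 * 500 * 4 / 1024 / 1024 / 1024 AS raw_gb;
-- => about 800 GB before any compression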


Will

Re: Very large tables

From
Scara Maccai
Date:
> Really I'm worried about reducing storage space and network overhead
> - therefore a nicely compressed chunk of binary would be perfect for
> the 500 values - wouldn't it?
>

For storage space you might want to look at ZFS with compression enabled,
if you are using FreeBSD or Solaris.
That might speed up query time as well, since you would be writing/reading
less, I guess...




Re: Very large tables

From
Alvaro Herrera
Date:
William Temperley wrote:

> Really I'm worried about reducing storage space and network overhead
> - therefore a nicely compressed chunk of binary would be perfect for
> the 500 values - wouldn't it?

An array that large would likely be compressed by Postgres internally
for storage; see
http://www.postgresql.org/docs/8.3/static/storage-toast.html
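
For what it's worth, the strategy can also be set explicitly per column,
e.g. (assuming a table like the ones sketched earlier in the thread):

-- EXTENDED (the default for bytea and arrays) allows compression plus
-- out-of-line storage; EXTERNAL stores out of line without compressing
ALTER TABLE distribution
    ALTER COLUMN distribution_curve SET STORAGE EXTENDED;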

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: Very large tables

From
"Grzegorz Jaśkiewicz"
Date:


2008/11/28 William Temperley <willtemperley@gmail.com>

Any more normalized and I'd have 216 billion rows! Add an index and
I'd have - well, a far bigger table than 432 million rows each
containing a float array - I think?

Really I'm worried about reducing storage space and network overhead
- therefore a nicely compressed chunk of binary would be perfect for
the 500 values - wouldn't it?

True, if you don't want to search on the values too much, or at all, use float[]. But otherwise, keep the values in tables as such.
It might be humongous in size, but at the end of the day the prime concern when designing a db is the speed of queries.

Still, I wouldn't go too far down the 'compress and stick it in as bytea' road, because it is quite slippery, even though it might look shiny at first.

You can also consider vertical partitioning (separate machines). Honestly, I would try different approaches first, on a scaled-down data set, focusing on retrieval/update (well, whatever your applications are going to use it for).



--
GJ

Re: Very large tables

From
Simon Riggs
Date:
On Fri, 2008-11-28 at 15:40 +0000, William Temperley wrote:
> Hi all
>
> Has anyone any experience with very large tables?
>
> I've been asked to store a grid of 1.5 million geographical locations,
> fine. However, associated with each point are 288 months, and
> associated with each month are 500 float values (a distribution
> curve), i.e. 1,500,000 * 288 * 500 = 216 billion values :).
>
> So a 216 billion row table is probably out of the question. I was
> considering storing the 500 floats as bytea.
>
> This means I'll need a table something like this:
>
>  grid_point_id | month_id | distribution_curve
>  (int4)        | (int2)   | (bytea?)
> ---------------+----------+--------------------

I would look carefully at the number of bits required for each float
value. 4 bytes is the default, but you may be able to use fewer bits than
that rather than rely upon the default compression scheme working in
your favour. Custom datatypes are often good for this kind of thing.

Not sure it matters what the month_id datatype is.

Everything else depends upon the usage characteristics. You may want to
consider using table or server partitioning also.
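
For example, if the curve only needs limited precision, quantizing each
value to a 2-byte integer halves the raw size even before compression (the
scale factor here is just an assumption):

CREATE TABLE distribution (
    grid_point_id integer  NOT NULL,
    month_id      smallint NOT NULL,
    curve         int2[]               -- each value stored as round(v * 10000)
);

-- reconstruct a single float on the way out
SELECT curve[1] / 10000.0 AS first_value
  FROM distribution
 WHERE grid_point_id = 42 AND month_id = 7;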

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


Re: Very large tables

From
"William Temperley"
Date:
On Fri, Nov 28, 2008 at 5:46 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>
> I would look carefully at the number of bits required for each float
> value. 4 bytes is the default, but you may be able to use fewer bits than
> that rather than rely upon the default compression scheme working in
> your favour. Custom datatypes are often good for this kind of thing.
>

I might try that, thanks.

On Fri, Nov 28, 2008 at 4:58 PM, Grzegorz Jaśkiewicz <gryzman@gmail.com> wrote:
>
> Still, I wouldn't go too far down the 'compress and stick it in as bytea' road,
> because it is quite slippery, even though it might look shiny at first.

Thanks for the warning!


Thanks for everyone's help and ideas on this. Off to experiment.

Cheers

Will

Re: Very large tables

From
Ron Mayer
Date:
Grzegorz Jaśkiewicz wrote:
>
> True, if you don't want to search on the values too much, or at all, use
> float[]. But otherwise, keep the values in tables as such.
> It might be humongous in size, but at the end of the day the prime concern
> when designing a db is the speed of queries.

If he's worried about speed, he might look into porting contrib/intarray
to arrays of floats.   I've had good luck improving performance by
converting big tables of ints into intarray with indexes.
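
For reference, this is the sort of thing contrib/intarray gives you for
integer arrays today, and roughly what a float port would aim to match
(the table here is hypothetical):

-- requires the contrib/intarray module
CREATE TABLE int_table (id serial PRIMARY KEY, vals int[]);
CREATE INDEX int_table_vals_idx ON int_table USING gist (vals gist__int_ops);

-- indexed containment / overlap searches on the array column
SELECT * FROM int_table WHERE vals @> ARRAY[42];
SELECT * FROM int_table WHERE vals && ARRAY[1, 2, 3];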