Re: Two billion records ok? - Mailing list pgsql-general

From: Brent Wood
Subject: Re: Two billion records ok?
Msg-id: 44FDE084.4020108@niwa.co.nz
In response to: Two billion records ok? (Nick Bower <nick@nickbower.com>)
List: pgsql-general
Nick Bower wrote:
> We're considering using Postgresql for storing gridded metadata - each point
> of our grids has a variety of metadata attached to it (including lat/lon,
> measurements, etc) and would constitute a record in Postgresql+Postgis.
>
> Size-wise, grids are about 4000x700 and are collected twice daily over say 10
> years.  As mentioned, each record would have up to 50 metadata attributes
> (columns) including geom, floats, varchars etc.
>
> So given 4000x700x2x365x10 > 2 billion, is this going to be a problem if we
> will be wanting to query on datetimes, Postgis lat/lon, and integer-based
> metadata flags?
>

Hmmmm... 2 billion looks optimistic: 4000 x 700 x 2 x 365 x 10 = 2.044e+10, which is about 20 billion.

I'm currently working with a "table" of over 200,000,000 records. With a
clustered index and underlying partitioning, response times are more than
acceptable: a 25-column-wide self relation (left outer join) over 3 months
of data (records arrive at around 40/minute) takes about 2 minutes, and a
simple query with a WHERE clause on timestamp takes a few seconds at most.
This is on a 32-bit Intel system with only 2 GB of memory and mirrored
7200 RPM SATA hard drives.

I'd suggest partitioning on timestamp, per year at least, with a clustered
index on timestamp within each partition; a sketch follows below. It should
be viable if your hardware is up to it.

I'd also strongly recommend a suitable platform: 64-bit Linux on AMD64
(e.g. Opteron) with as much memory and the fastest RAID setup you can
afford. Make sure you use a fully 64-bit build of Postgres/PostGIS on
this platform as well.

If the same grid is being resampled, then a separate table defining the
grid cells, joined on grid ID to the main (partitioned) table, may
improve performance and reduce data volume; see the sketch below.
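Reworking the hypothetical tables above, the static per-cell geometry is
stored once rather than once per observation, and queries filter on the
small grid table first. (ST_MakeEnvelope needs PostGIS 1.5 or later, and
a foreign key on a partitioned table needs PostgreSQL 11 or later.)

  CREATE TABLE grid_cell (
      grid_id  integer PRIMARY KEY,
      geom     geometry NOT NULL    -- lat/lon stored once per cell
  );

  CREATE TABLE grid_obs (
      obs_time  timestamptz NOT NULL,
      grid_id   integer NOT NULL REFERENCES grid_cell (grid_id),
      flag      integer,
      val       float8
  ) PARTITION BY RANGE (obs_time);

  -- Find January 2006 observations inside a bounding box.
  SELECT o.obs_time, o.val
  FROM   grid_cell c
  JOIN   grid_obs  o USING (grid_id)
  WHERE  c.geom && ST_MakeEnvelope(165, -48, 180, -34, 4326)
  AND    o.obs_time >= '2006-01-01'
  AND    o.obs_time <  '2006-02-01';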

I wouldn't expect instantaneous answers from it, but I don't know of any
reason it won't work. It depends very much on what level of performance
is considered acceptable.

I'm also using PostGIS grids with up to a few million cells and spatial
joins to a million or so tracklines to generate gridded models for
analysis in R. You seem to be scaling this up from where I'm at, so I'd
like to know how you get on.
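That kind of cell-to-trackline spatial join, as a rough sketch with
made-up names (ST_Intersects plus GiST indexes on both geometry columns
do the heavy lifting):

  -- GiST indexes make the spatial join feasible at this scale.
  CREATE INDEX grid_cell_geom_idx ON grid_cell USING gist (geom);
  CREATE INDEX trackline_geom_idx ON trackline USING gist (geom);

  -- Count tracklines crossing each grid cell.
  SELECT c.grid_id, count(*) AS n_tracks
  FROM   grid_cell c
  JOIN   trackline t ON ST_Intersects(c.geom, t.geom)
  GROUP  BY c.grid_id;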


Cheers,

   Brent Wood

