Thread: Two billion records ok?
We're considering using PostgreSQL for storing gridded metadata - each point of our grids has a variety of metadata attached to it (including lat/lon, measurements, etc.) and would constitute a record in PostgreSQL+PostGIS.

Size-wise, grids are about 4000x700 and are collected twice daily over, say, 10 years. As mentioned, each record would have up to 50 metadata attributes (columns) including geom, floats, varchars, etc.

So given 4000x700x2x365x10 > 2 billion, is this going to be a problem if we will be wanting to query on datetimes, PostGIS lat/lon, and integer-based metadata flags?

If however I'm forced to sub-sample the grid, what rule of thumb should I be looking to be constrained by?

Thanks for any pointers,
Nick

PS - Feel free to throw in any other ideas of grid-suitable databases :)
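[Editor's note: a hypothetical sketch of the kind of table being described - column names and types are illustrative assumptions, not taken from the thread, and the typed-geometry syntax is from later PostGIS releases:]

```sql
-- One row per grid point per acquisition; names are illustrative only.
CREATE TABLE grid_point (
    id        bigserial PRIMARY KEY,   -- bigserial: row count exceeds int4 range
    obs_time  timestamptz NOT NULL,    -- acquisition datetime
    geom      geometry(Point, 4326),   -- PostGIS lat/lon point
    qa_flag   integer,                 -- an integer-based metadata flag
    value1    real,                    -- an example float measurement
    label     varchar(32)              -- an example varchar attribute
    -- ... up to ~50 attributes in total
);
CREATE INDEX grid_point_time_idx ON grid_point (obs_time);
CREATE INDEX grid_point_geom_idx ON grid_point USING gist (geom);
```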
On Tue, Sep 05, 2006 at 09:26:59AM +0800, Nick Bower wrote:
> So given 4000x700x2x365x10 > 2 billion, is this going to be a problem if we
> will be wanting to query on datetimes, Postgis lat/lon, and integer-based
> metadata flags?

That figure is about 20 billion, which is indeed > 2 billion :-). If you plan to use integer row IDs then you'll need to use 64-bit bigint/bigserial instead of 32-bit integer/serial. I haven't worked with a database that large; maybe somebody else can give additional advice.

--
Michael Fuhr
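[Editor's note: the arithmetic is easy to check in PostgreSQL itself - the product overflows a 32-bit integer, which is exactly why bigint/bigserial is needed for row IDs:]

```sql
-- 4000 * 700 * 2 * 365 * 10 exceeds the int4 maximum (2,147,483,647),
-- so cast to bigint before multiplying to avoid an "integer out of range" error.
SELECT 4000::bigint * 700 * 2 * 365 * 10 AS row_count;
-- row_count = 20440000000, i.e. ~20.4 billion
```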
Nick,

if you need very fast spatial queries (spherical) you may use our Q3C module for PostgreSQL (q3c.sf.net). We use it for providing access to very big astronomical catalogs.

Oleg

On Tue, 5 Sep 2006, Nick Bower wrote:
> So given 4000x700x2x365x10 > 2 billion, is this going to be a problem if we
> will be wanting to query on datetimes, Postgis lat/lon, and integer-based
> metadata flags?

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
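[Editor's note: for readers unfamiliar with Q3C, a cone search typically looks something like the sketch below - `q3c_ang2ipix` and `q3c_radial_query` are Q3C's documented functions, but the table and column names here are assumptions:]

```sql
-- Hypothetical catalog table with ra/dec columns in degrees.
-- Index on the Q3C pixelization so radial queries can use it:
CREATE INDEX mycatalog_q3c_idx ON mycatalog (q3c_ang2ipix(ra, dec));

-- All objects within 0.1 degrees of (ra, dec) = (11.0, 41.0):
SELECT *
FROM mycatalog
WHERE q3c_radial_query(ra, dec, 11.0, 41.0, 0.1);
```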
Thanks - but what do you call big? My application is satellite data btw, so the reference could be useful.

On Tue, 5 Sep 2006 01:40 pm, Oleg Bartunov wrote:
> Nick,
>
> if you need very fast spatial queries (spherical) you may use our
> Q3C module for PostgreSQL (q3c.sf.net). We use it for providing access
> to very big astronomical catalogs.
>
> Oleg
Nick Bower wrote:
> Thanks - but what do you call big?

How many stars do you think there are? :-)

Tim

--
-----------------------------------------------
Tim Allen          tim@proximity.com.au
Proximity Pty Ltd  http://www.proximity.com.au/
On Tue, 5 Sep 2006, Nick Bower wrote:
> Thanks - but what do you call big?

Several billions of stars. You can try our Cone Search service at
http://vo.astronet.ru/cas/conesearch.php

Oleg

> My application is satellite data btw so the reference could be useful.

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
Nick Bower wrote:
> So given 4000x700x2x365x10 > 2 billion, is this going to be a problem if we
> will be wanting to query on datetimes, Postgis lat/lon, and integer-based
> metadata flags?

Tablespaces and table partitioning will be crucial to your needs. I'm not sure if you can partition indexes, though.

And too bad that compressed bitmap indexes have not been implemented yet. For indexes with high key cardinality, they save a *lot* of space, and queries can run a lot faster.

--
Ron Johnson, Jr.
Jefferson LA USA
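[Editor's note: in the PostgreSQL of that era, table partitioning meant inheritance plus CHECK constraints and constraint exclusion; a sketch, with assumed table and column names:]

```sql
-- Inheritance-based partitioning by year (pre-declarative-partitioning
-- PostgreSQL); names are illustrative assumptions.
CREATE TABLE grid_point (
    obs_time timestamptz NOT NULL
    -- ... other attribute columns ...
);

-- One child table per year, constrained to its time range:
CREATE TABLE grid_point_2006 (
    CHECK (obs_time >= '2006-01-01' AND obs_time < '2007-01-01')
) INHERITS (grid_point);

-- Indexes are per child table, so each partition carries its own:
CREATE INDEX grid_point_2006_time_idx ON grid_point_2006 (obs_time);

-- Let the planner skip partitions whose CHECK constraint can't match:
SET constraint_exclusion = on;
```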
Nick Bower wrote:
> So given 4000x700x2x365x10 > 2 billion, is this going to be a problem if we
> will be wanting to query on datetimes, Postgis lat/lon, and integer-based
> metadata flags?

Hmmmm... 2 billion looks optimistic: I get 2.044e+10, which is 20 billion.

I'm currently working with a "table" of over 200,000,000 records. With a clustered index and underlying partitioning, response times are more than acceptable: a 25-wide self relation (left outer join) over 3 months of data (records arrive at around 40/minute) takes about 2 minutes, and a simple query with a WHERE clause on timestamp takes a few seconds at most. This is on a 32-bit Intel system with only 2 GB of memory and mirrored 7200 RPM SATA hard drives.

I'd suggest partitioning on timestamp, per year at least, and using a clustered index on timestamp. It should be viable if your hardware is up to it. I'd also strongly recommend a suitable platform: 64-bit Linux on AMD64/Opteron with as much memory and the fastest RAID setup you can afford. Make sure you use a fully 64-bit build of PostgreSQL/PostGIS on this platform as well.

If the same grid is being resampled, then a separate table defining the grid, and a join on grid ID to the main (partitioned) table, may improve performance (and reduce data volume).

I wouldn't expect instantaneous answers from it, but I don't know of any reason it won't work. It depends very much on what level of performance is considered acceptable.
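[Editor's note: the "separate table defining the grid" idea could be sketched as follows - the schema is an assumption built from the advice, not from the thread, and CLUSTER syntax varies across PostgreSQL versions:]

```sql
-- Fixed grid cells stored once; observations reference them by id,
-- so per-cell attributes (like geom) are not repeated 20 billion times.
CREATE TABLE grid_cell (
    cell_id integer PRIMARY KEY,   -- 4000 x 700 = 2.8M cells fits int4
    geom    geometry               -- PostGIS point for the cell
);

CREATE TABLE observation (
    cell_id  integer NOT NULL REFERENCES grid_cell (cell_id),
    obs_time timestamptz NOT NULL
    -- ... measurement columns ...
);

-- Physically order the big table by timestamp to speed range scans:
CREATE INDEX observation_time_idx ON observation (obs_time);
CLUSTER observation USING observation_time_idx;
```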
I'm also using PostGIS grids with up to a few million cells and spatial joins to a million or so tracklines to generate gridded models for analysis in R. You seem to be scaling this up from where I'm at, so I'd like to know how you get on.

Cheers,

Brent Wood