Thread: Indexing large table of coordinates with GiST
Hi, I'm trying to create an index on coordinates (geography type) over a
large table (4.5 billion records) using GiST...

CREATE INDEX nodes_geom_idx ON nodes USING gist (geom);

The command had been running for 5 days when my computer stopped because of a power outage!
Before restarting the index creation, I am asking the community if there are
ways to shorten the time it took the first time :-)

Any idea?

Daniel
On 1/15/2015 6:44 AM, Daniel Begin wrote:
> I'm trying to create an index on coordinates (geography type) over a
> large table (4.5 billion records) using GiST...
> CREATE INDEX nodes_geom_idx ON nodes USING gist (geom);
> The command ran for 5 days until my computer stopped because of a power outage!

Set maintenance_work_mem as large as you can.

-Andy
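For example, a minimal sketch of that idea (the 1GB value is only illustrative; it has to fit in available RAM, and on Windows builds of that era maintenance_work_mem is typically capped just under 2GB):

-- raise the memory available to the index build for this session only
SET maintenance_work_mem = '1GB';

-- then restart the build
CREATE INDEX nodes_geom_idx ON nodes USING gist (geom);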
Daniel, could you please supply the server hardware (CPU and storage) you're using for this data? I have a similar number of records and would like to know what it takes to handle such a load.

TIA
If you really want to keep that much geometry,
you may want to partition your data on a regular grid.
Cheers,
Rémi-C
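A rough sketch of what a regular-grid split could look like on PostgreSQL 9.3 (inheritance-based partitioning; the grid_cell column, the 10-degree cell size, and the child-table names are illustrative assumptions, not anything from the thread). Each child table gets its own, much smaller, GiST index:

-- derive a coarse grid-cell id from each node's position
ALTER TABLE nodes ADD COLUMN grid_cell int;

UPDATE nodes
SET grid_cell = (floor((ST_X(geom::geometry) + 180) / 10)
               + 36 * floor((ST_Y(geom::geometry) + 90) / 10))::int;

-- one child table per cell, each with its own GiST index
CREATE TABLE nodes_cell_0 (CHECK (grid_cell = 0)) INHERITS (nodes);
CREATE INDEX nodes_cell_0_geom_idx ON nodes_cell_0 USING gist (geom);
-- ...repeat (or generate) for the remaining cells, then move the rows
-- into their child tables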
Paul Ramsey
http://cleverelephant.ca
On January 15, 2015 at 8:44:03 AM, Rémi Cura (remi.cura@gmail.com) wrote:
> If you really want to keep that much geometry,
> you may want to partition your data on a regular grid.
> Cheers,
> Rémi-C
Thanks, there are a lot of potential ways to resolve this problem!

For Rob, here is a bit of context concerning my IT environment…

Windows 7 64-bit desktop, running an Intel i7 with 16 GB RAM. The PostgreSQL 9.3 database is stored on a 3TB external drive (USB-3 connection with write cache enabled and backup battery) and temp_tablespaces points to a 1TB internal drive.

Now, let me answer/comment on the proposals in the order I received them…

1- Andy, I will set maintenance_work_mem as large as I can unless someone points to an important caveat.

2- Vick, partitioning the table could have been very interesting. However, I will have to query the table using both the node ID (which could have provided a nice partition criterion) and/or the node location (find nodes within a polygon). I am not familiar with table partitioning, but I suspect I can't create a spatial index on a table that has been partitioned (split into multiple tables that inherit from the "master" table). Am I right?

3- Rémi, so many rows does not necessarily mean either a raster or a point cloud (but it's worth asking!-). As I mentioned previously, I must be able to query the table not only by node location (coordinates) but also by the few other fields the table contains (mainly node IDs). So I don't think it could work, unless you tell me otherwise?

4- Paul, the node distribution is all over the world but mainly over inhabited areas. However, if I had to define a limit of some sort, I would use the dateline. Concerning spatial queries, I will want to find nodes that are within the boundary of irregular polygons (stored in another table). Is querying on irregular polygons compatible with geohashing?
Regards,
Daniel
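For the spatial queries in item 4, a minimal sketch of a point-in-polygon lookup against the geography column, assuming a polygons table with a GiST-indexed geography column named poly and the nodes GiST index in place (all table and column names other than nodes.geom are illustrative):

-- find all nodes inside one stored irregular polygon;
-- ST_Covers on geography can use the GiST indexes on both sides
select n.node_id
from nodes n
join polygons p on ST_Covers(p.poly, n.geom)
where p.polygon_id = 12345;   -- 12345 is a placeholder id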
On January 15, 2015 at 12:36:29 PM, Daniel Begin (jfd553@hotmail.com) wrote:
> Paul, the nodes distribution is all over the world but mainly over inhabited areas. However, if I had to define a limit of some sort, I would use the dateline. Concerning spatial queries, I will want to find nodes that are within the boundary of irregular polygons (stored in another table). Is querying on irregular polygons compatible with geohashing?

Well… yes you can, although the relative efficiency compared to an r-tree will depend a bit on how the query polygons interact with the geohash split points. Also, if you're planning to slam pretty large polygons through this process, expect it to be kind of slow. You'll want to do some sharding, to spread the problem out over multiple nodes.

--
Paul Ramsey
http://cleverelephant.ca
http://postgis.net
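A minimal sketch of the kind of geohash bucketing Paul describes, using PostGIS's ST_GeoHash on the existing geom column (the prefix length and the geohash4 column name are illustrative assumptions):

-- rows that share a geohash prefix fall in the same grid cell, so the prefix
-- can serve as a clustering / shard key to spread work across machines
ALTER TABLE nodes ADD COLUMN geohash4 text;

UPDATE nodes
SET geohash4 = ST_GeoHash(geom::geometry, 4);   -- 4 characters ~ 39 x 20 km cells

CREATE INDEX nodes_geohash4_idx ON nodes (geohash4);

On a 4.5-billion-row table the UPDATE itself would be expensive, so in practice the column would more likely be populated during the initial load.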
- first get group of points of potential interest
Rémi-C
On 1/15/2015 12:36 PM, Daniel Begin wrote:
> Thanks, there are a lot of potential ways to resolve this problem! [...]

Provided you have an integer primary key on both your node tables and polygon tables, would it make sense to preprocess the overlaps and have a many-to-many table with the node-id and polygon-id? Depending on the speed at which data is ingested, you could easily build triggers to run after inserts/updates to keep the table updated, or you could create a globally unique autoincrement field that tracks revisions and update everything after a given high-water mark.
Lookups and joins would use integers and should give you much better performance than searching through the polygons.

For the many-to-many table, something like the following (you can obviously break the insert into smaller batches if you need to, so you don't blow up your memory usage; if needed, you can keep two copies partitioned on either the node-id or the polygon-id to speed up lookups, as this table has the potential to carry many times the records in either table - worst case would be a cartesian join if all nodes fall within all polygons):
create table node_polygon (
    node_id bigint not null,
    polygon_id bigint not null,
    primary key (node_id, polygon_id)   -- also covers lookups by node_id
);

-- note: <@ compares bounding boxes, so for exact point-in-polygon results
-- you may want ST_Within(node.point, polygon.shape) instead
insert into node_polygon (node_id, polygon_id)
select
    node.node_id,
    polygon.polygon_id
from
    node
    inner join polygon
        on node.point <@ polygon.shape;

-- reverse lookup: all nodes linked to a given polygon
create index ix_node_polygon_polygon on node_polygon (polygon_id);
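A sketch of the trigger idea from the first paragraph, reusing the hypothetical node.point and polygon.shape columns from the SQL above (nothing here comes from the thread beyond those names):

-- whenever a new polygon is inserted, link it to the nodes it contains
CREATE OR REPLACE FUNCTION link_new_polygon() RETURNS trigger AS $$
BEGIN
    INSERT INTO node_polygon (node_id, polygon_id)
    SELECT n.node_id, NEW.polygon_id
    FROM node n
    WHERE n.point <@ NEW.shape;   -- same containment test as the bulk load
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER polygon_ai_link_nodes
    AFTER INSERT ON polygon
    FOR EACH ROW EXECUTE PROCEDURE link_new_polygon();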
Nathan, I have to verify a few things first, but it might be possible to proceed as you suggest. I will also dig a bit into Paul's suggestion on geohashing. I will get back to you once it is in place.

Thanks all,
Daniel
Nathan, and all others, I already have links between each node and each polygon at an initial state. Creating a many-to-many table (nodes_polygons) and indexing it should be easy at this point. However, while nodes are expected to remain static, new irregular polygons will be added without being able to preprocess the nodes.

Instead of querying the nodes table with these new polygons, I might rather query the polygons table (which is much smaller and already has a GiST index) to find polygons that intersect a new one. From there, I can easily get the related node subset, which will be many orders of magnitude smaller - in which case spatial indexing shouldn't be necessary. Once done for a new polygon, I could then update the polygon and nodes_polygons tables.

Is there something I'm missing?

Daniel
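A minimal sketch of that two-step lookup, assuming the node_polygon table from Nathan's message, the hypothetical node.point and polygon.shape columns from his SQL, and a new polygon already inserted (the :new_id parameter and all names are illustrative):

-- step 1: use the small, GiST-indexed polygon table to find existing polygons
--         that intersect the new one
-- step 2: run the exact containment test only against the nodes already linked
--         to those polygons, instead of scanning 4.5 billion rows
insert into node_polygon (node_id, polygon_id)
select distinct n.node_id, new_p.polygon_id
from polygon new_p
join polygon old_p
    on ST_Intersects(old_p.shape, new_p.shape)
   and old_p.polygon_id <> new_p.polygon_id
join node_polygon np on np.polygon_id = old_p.polygon_id
join node n on n.node_id = np.node_id
where new_p.polygon_id = :new_id          -- :new_id is a placeholder for the new polygon's id
  and ST_Within(n.point, new_p.shape);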