Thread: Simple query, 10 million records...MySQL ten times faster
I have this table:

CREATE TABLE test_zip_assoc (
    id serial NOT NULL,
    f_id integer DEFAULT 0 NOT NULL,
    lat_radians numeric(6,5) DEFAULT 0.00000 NOT NULL,
    long_radians numeric(6,5) DEFAULT 0.00000 NOT NULL
);
CREATE INDEX lat_radians ON test_zip_assoc USING btree (lat_radians);
CREATE INDEX long_radians ON test_zip_assoc USING btree (long_radians);

It's basically a table that associates some foreign key (for an event, for instance) with a particular location using longitude and latitude. I'm basically doing a simple proximity search. I have populated the database with *10 million* records. I then test performance by picking 50 zip codes at random and finding the records within 50 miles with a query like this:

SELECT id
FROM test_zip_assoc
WHERE
    lat_radians > 0.69014816041
    AND lat_radians < 0.71538026567
    AND long_radians > -1.35446228028
    AND long_radians < -1.32923017502

On my development server (dual proc/dual core Opteron 2.8 GHz with 4GB RAM) this query averages 1.5 seconds in MySQL each time it runs after a brief warmup period. In PostgreSQL it averages about 15 seconds.

Both of those times are too slow. I need the query to run in under a second with as many as a billion records. I don't know if this is possible, but I'm really hoping someone can help me restructure my indexes (multicolumn? multiple indexes with a 'where' clause?) so that I can get this running as fast as possible.

If I need to consider some non-database data structure in RAM I will do that too. Any help or tips would be greatly appreciated. I'm willing to go to great lengths to test this if someone can make a good suggestion that sounds like it has a reasonable chance of improving the speed of this search. There's an extensive thread on my efforts already here:

http://phpbuilder.com/board/showthread.php?t=10331619&page=10
On 24 Apr 2007 14:26:46 -0700, zardozrocks <zardozrocks@gmail.com> wrote:
> I have this table:
>
> CREATE TABLE test_zip_assoc (
>     id serial NOT NULL,
>     f_id integer DEFAULT 0 NOT NULL,
>     lat_radians numeric(6,5) DEFAULT 0.00000 NOT NULL,
>     long_radians numeric(6,5) DEFAULT 0.00000 NOT NULL
> );
> CREATE INDEX lat_radians ON test_zip_assoc USING btree (lat_radians);
> CREATE INDEX long_radians ON test_zip_assoc USING btree (long_radians);
> [...]

You can always go the earthdistance route. The index takes longer to build (roughly 5x longer than a btree), but it will optimize exactly this operation.

merlin
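For reference, a minimal sketch of the earthdistance approach Merlin is pointing at, assuming the cube and earthdistance contrib modules are installed. ll_to_earth() expects degrees rather than radians, so the sketch first adds hypothetical degree columns (the column and index names are illustrative only):

ALTER TABLE test_zip_assoc ADD COLUMN lat_deg float8;
ALTER TABLE test_zip_assoc ADD COLUMN long_deg float8;

UPDATE test_zip_assoc
   SET lat_deg  = degrees(lat_radians::float8),
       long_deg = degrees(long_radians::float8);

-- functional GiST index over the cube-based "earth" representation
CREATE INDEX test_zip_assoc_earth_idx
    ON test_zip_assoc
 USING gist (ll_to_earth(lat_deg, long_deg));

-- everything within ~50 miles (80467 m) of a center point
SELECT id
  FROM test_zip_assoc
 WHERE earth_box(ll_to_earth(39.54, -77.60), 80467)
       @> ll_to_earth(lat_deg, long_deg);

Note that earth_box() is a bounding-box test and can admit a few points slightly outside the radius, so an exact earth_distance() filter is usually added on top of it.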
In response to zardozrocks <zardozrocks@gmail.com>:
> [...]
> There's an extensive thread on my efforts already here:
>
> http://phpbuilder.com/board/showthread.php?t=10331619&page=10

Why didn't you investigate/respond to the last posts there? The advice to bump shared_buffers is good advice. work_mem might also need to be bumped. Figure out which postgresql.conf your system is using and get it dialed in for your hardware. You can make all the indexes you want, but if you've told Postgres that it only has 8M of RAM to work with, performance is going to suck. I don't see hardware specs on that thread (but I didn't read the whole thing).

If the system you're using is a dedicated DB system, set shared_buffers to 1/3 - 1/2 of the physical RAM on the machine for starters.

-- 
Bill Moran
Collaborative Fusion Inc.
http://people.collaborativefusion.com/~wmoran/
wmoran@collaborativefusion.com
Phone: 412-422-3463x4023
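As a rough illustration of the settings Bill is describing, a hedged postgresql.conf excerpt for a dedicated 4GB box. The values are illustrative, not tuned; effective_cache_size is not mentioned in the thread but is commonly set alongside these, and older 8.x releases want shared_buffers expressed in 8kB pages rather than memory units:

shared_buffers = 1GB            # roughly 1/3 - 1/2 of physical RAM on a dedicated DB box
work_mem = 32MB                 # per sort/hash operation, per backend
effective_cache_size = 3GB      # planner hint: RAM the OS can use for caching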
zardozrocks wrote:
> lat_radians numeric(6,5) DEFAULT 0.00000 NOT NULL,
> long_radians numeric(6,5) DEFAULT 0.00000 NOT NULL

Native data types such as integer or real are much faster than numeric. If you need 6 digits, it's better to multiply your coordinates by 10^6 and store as INTEGER.

> On my development server (dual proc/dual core Opteron 2.8 Ghz with 4GB
> ram) this query averages 1.5 seconds each time it runs after a brief
> warmup period. In PostGreSQL it averages about 15 seconds.

What hard drive(s) and controller(s) do you have?

Please post EXPLAIN ANALYZE output of the problem query and your postgresql.conf also.

-- 
Benjamin Minshall <minshall@intellicon.biz>
Senior Developer -- Intellicon, Inc.
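A minimal sketch of what Benjamin's suggestion might look like, scaling the radian values by 10^6 and storing them as plain integers; the *_e6 column names and the rounded bounds are illustrative only:

ALTER TABLE test_zip_assoc ADD COLUMN lat_rad_e6 integer;
ALTER TABLE test_zip_assoc ADD COLUMN long_rad_e6 integer;

UPDATE test_zip_assoc
   SET lat_rad_e6  = round(lat_radians  * 1000000)::integer,
       long_rad_e6 = round(long_radians * 1000000)::integer;

CREATE INDEX test_zip_assoc_lat_e6_idx  ON test_zip_assoc (lat_rad_e6);
CREATE INDEX test_zip_assoc_long_e6_idx ON test_zip_assoc (long_rad_e6);

-- the original range query, restated against the integer columns
SELECT id
  FROM test_zip_assoc
 WHERE lat_rad_e6  BETWEEN 690148   AND 715380
   AND long_rad_e6 BETWEEN -1354462 AND -1329230;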
On Tue, 2007-04-24 at 16:26, zardozrocks wrote:
> I have this table:
>
> CREATE TABLE test_zip_assoc (
>     id serial NOT NULL,
>     f_id integer DEFAULT 0 NOT NULL,
>     lat_radians numeric(6,5) DEFAULT 0.00000 NOT NULL,
>     long_radians numeric(6,5) DEFAULT 0.00000 NOT NULL
> );

Like someone else mentioned, numeric types are SLOW. See if you can use integers, or at least floats.

I also wonder if you might be better served with geometric types and GiST indexes on them than using your lat / long grid. With a geometric type, you could define the lat / long as a point and use geometric operations with it. See the pgsql manual:

http://www.postgresql.org/docs/8.1/static/datatype-geometric.html
http://www.postgresql.org/docs/8.1/static/functions-geometry.html

> It's basically a table that associates some foreign_key (for an event,
> for instance) with a particular location using longitude and latitude.
> I'm basically doing a simple proximity search. I have populated the
> database with *10 million* records. I then test performance by picking
> 50 zip codes at random and finding the records within 50 miles.

I assume that there aren't 10 million zip codes, right?

Are you storing the lat / long of the individual venues? Or the zip codes? If you're storing the lat / long of the zips, then I can't imagine there are 10 million zip codes. If you could use the lat / long numbers to find the zip codes that are in your range, then join that to a venue table that fks off of the zip code table, I would think it would be much faster, as you'd have a smaller data set to trundle through.

> SELECT id
> FROM test_zip_assoc
> WHERE
>     lat_radians > 0.69014816041
>     AND lat_radians < 0.71538026567
>     AND long_radians > -1.35446228028
>     AND long_radians < -1.32923017502
>
> On my development server (dual proc/dual core Opteron 2.8 Ghz with 4GB
> ram) this query averages 1.5 seconds each time it runs after a brief
> warmup period. In PostGreSQL it averages about 15 seconds.

I wonder how well it would run if you had 10, 20, 30, 40 etc... users running it at the same time. My guess is that you'll be very lucky to get anything close to linear scaling in any database. That's because this is CPU / memory bandwidth intensive, so it's gonna kill your DB. OTOH, if it was I/O bound you could throw more hardware at it (battery-backed cache RAID controller, etc.).

> Both of those times are too slow. I need the query to run in under a
> second with as many as a billion records. I don't know if this is
> possible but I'm really hoping someone can help me restructure my
> indexes (multicolumn?, multiple indexes with a 'where' clause?) so
> that I can get this running as fast as possible.

You're trying to do a whole lot of processing in a little time. You're either gonna have to accept a less exact answer (i.e. base it on zip codes) or come up with some way of mining the data for the answers ahead of time, kind of like a full text search for lat and long.

So, have you tried what I suggested about increasing shared_buffers and work_mem yet?
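A hedged sketch of the built-in geometric route Scott is describing, storing each location as a degenerate (zero-area) box so the overlap operator can use a spatial index. The column and index names are hypothetical; on 8.2 and later a plain gist index on box works out of the box, while on 8.1 you would use the rtree access method (or contrib/rtree_gist) instead:

ALTER TABLE test_zip_assoc ADD COLUMN loc box;

-- a degenerate box whose two corners are the same (long, lat) point, in radians
UPDATE test_zip_assoc
   SET loc = box(point(long_radians::float8, lat_radians::float8),
                 point(long_radians::float8, lat_radians::float8));

CREATE INDEX test_zip_assoc_loc_idx ON test_zip_assoc USING gist (loc);

-- the same bounding-box search, now answered from the spatial index
SELECT id
  FROM test_zip_assoc
 WHERE loc && box(point(-1.35446228028, 0.69014816041),
                  point(-1.32923017502, 0.71538026567));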
zardozrocks wrote:
> I have this table:
>
> CREATE TABLE test_zip_assoc (
>     id serial NOT NULL,
>     f_id integer DEFAULT 0 NOT NULL,
>     lat_radians numeric(6,5) DEFAULT 0.00000 NOT NULL,
>     long_radians numeric(6,5) DEFAULT 0.00000 NOT NULL
> );
> CREATE INDEX lat_radians ON test_zip_assoc USING btree (lat_radians);
> CREATE INDEX long_radians ON test_zip_assoc USING btree (long_radians);

Maybe I'm missing something, but wouldn't it be easier to just use PostGIS? Or, failing that, use the vanilla built-in point type and an r-tree index? That's what r-tree indexes are made for.

-- 
Jeff Hoffmann
Head Plate Spinner
PropertyKey.com
On 24 Apr 2007 14:26:46 -0700, zardozrocks <zardozrocks@gmail.com> wrote:
> I have this table:
>
> CREATE TABLE test_zip_assoc (
>     id serial NOT NULL,
>     f_id integer DEFAULT 0 NOT NULL,
>     lat_radians numeric(6,5) DEFAULT 0.00000 NOT NULL,
>     long_radians numeric(6,5) DEFAULT 0.00000 NOT NULL
> );
> CREATE INDEX lat_radians ON test_zip_assoc USING btree (lat_radians);
> CREATE INDEX long_radians ON test_zip_assoc USING btree (long_radians);

This is a spatial search -- B-tree indexes are much less efficient at this than certain other data structures. The R-tree and its many variants are based on subdividing the space into regions, allowing you to do efficient checks on containment, intersection, etc., based on points or bounding boxes.

PostgreSQL implements R-trees natively as well as through a mechanism called GiST, a framework for implementing pluggable tree-like indexes. It also provides some geometric data types. However, as far as I know, PostgreSQL's R-tree/GiST indexes do *not* provide the operators to do bounding box searches. For this you need PostGIS.

PostGIS implements the whole GIS stack, and it's so good at this that it's practically the de facto tool among GIS analysts. Installing PostGIS into a database is simple, and once you have done this, you can augment your table with a geometry (*):

alter table test_zip_assoc add column lonlat geometry;

update test_zip_assoc set lonlat = makepoint(
    long_radians / (3.14159265358979 / 180),
    lat_radians / (3.14159265358979 / 180));

The division is to convert your radians into degrees; PostGIS works with degrees, at least out of the box.

Now you can query on a bounding box (although, are you sure you got your lons and lats in order? That's Antarctica, isn't it?):

select * from test_zip_assoc
where lonlat && makebox2d(
    makepoint(-77.6049721697096, 39.5425768302107),
    makepoint(-76.1592790300818, 40.9882699698386))

This is bound to be blazingly fast. Next you can order by geographic distance if you like:

order by distance_sphere(lonlat,
    makepoint(-77.6049721697096, 39.5425768302107))

Nobody has mentioned PostGIS so far, so I hope I'm not missing some crucial detail, like "no spatial indexes allowed!".

(*) I cheated here. The PostGIS manual recommends that you use a function to create the geometry column, because it will set up some auxiliary data structures for you that are needed for certain operations. The recommended syntax is:

select AddGeometryColumn('', 'test_zip_assoc', 'geom', -1, 'POINT', 2);

Alexander.
On 4/27/07, Alexander Staubo <alex@purefiction.net> wrote:
[snip]
> PostGIS implements the whole GIS stack, and it's so good at this that
> it's practically the de facto tool among GIS analysts. Installing
> PostGIS into a database is simple, and once you have done this, you
> can augment your table with a geometry (*):
>
> alter table test_zip_assoc add column lonlat geometry;

I forgot to include the crucial step, of course:

create index test_zip_assoc_lonlat_index on test_zip_assoc
    using gist (lonlat gist_geometry_ops);

analyze test_zip_assoc;

This creates a GiST index on the geometry and (significantly) updates the table statistics.

Alexander.
From: joe@omc-international.com.au
Is there a reason you are not using PostGIS? The R-tree indexes are designed for exactly this type of query and should be able to do it very quickly.

Hope that helps,

Joe

> I have this table:
>
> CREATE TABLE test_zip_assoc (
>     id serial NOT NULL,
>     f_id integer DEFAULT 0 NOT NULL,
>     lat_radians numeric(6,5) DEFAULT 0.00000 NOT NULL,
>     long_radians numeric(6,5) DEFAULT 0.00000 NOT NULL
> );
> CREATE INDEX lat_radians ON test_zip_assoc USING btree (lat_radians);
> CREATE INDEX long_radians ON test_zip_assoc USING btree (long_radians);
> [...]
Folks, we in astronomy permanently work with billions of objects with spherical attributes and have several sky-indexing schemes. See my page for links:

http://www.sai.msu.su/~megera/wiki/SkyPixelization

We have the q3c package for PostgreSQL, available from q3c.sf.net, which we use in production with a terabytes-sized database.

Oleg

On Thu, 26 Apr 2007, Scott Marlowe wrote:
> On Tue, 2007-04-24 at 16:26, zardozrocks wrote:
>> I have this table:
>>
>> CREATE TABLE test_zip_assoc (
>>     id serial NOT NULL,
>>     f_id integer DEFAULT 0 NOT NULL,
>>     lat_radians numeric(6,5) DEFAULT 0.00000 NOT NULL,
>>     long_radians numeric(6,5) DEFAULT 0.00000 NOT NULL
>> );
>
> Like someone else mentioned numeric types are SLOW. See if you can use
> integers, or at least floats.
>
> I also wonder if you might be better served with geometric types and
> GiST indexes on them than using your lat / long grid.
> [...]
>
> So, have you tried what I suggested about increasing shared_buffers and
> work_mem yet?

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
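For readers who want to follow up on q3c, a hedged sketch of how it is typically used, assuming coordinates stored in degrees (the hypothetical lat_deg/long_deg columns from the earlier earthdistance sketch); the exact function signatures should be checked against the q3c documentation for the installed version:

-- index on the Q3C pixel number computed from the two coordinates
CREATE INDEX test_zip_assoc_q3c_idx
    ON test_zip_assoc (q3c_ang2ipix(long_deg, lat_deg));
CLUSTER test_zip_assoc_q3c_idx ON test_zip_assoc;
ANALYZE test_zip_assoc;

-- cone search: everything within ~0.72 degrees (about 50 miles) of a point
SELECT id
  FROM test_zip_assoc
 WHERE q3c_radial_query(long_deg, lat_deg, -77.60, 39.54, 0.72);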