Thread: Simple query, 10 million records...MySQL ten times faster

Simple query, 10 million records...MySQL ten times faster

From
zardozrocks
Date:
I have this table:

CREATE TABLE test_zip_assoc (
    id serial NOT NULL,
    f_id integer DEFAULT 0 NOT NULL,
    lat_radians numeric(6,5) DEFAULT 0.00000 NOT NULL,
    long_radians numeric(6,5) DEFAULT 0.00000 NOT NULL
);
CREATE INDEX lat_radians ON test_zip_assoc USING btree (lat_radians);
CREATE INDEX long_radians ON test_zip_assoc USING btree
(long_radians);



It's basically a table that associates a foreign key (for an event,
for instance) with a particular location by longitude and
latitude.  I'm doing a simple proximity search.  I have
populated the database with *10 million* records.  I then test
performance by picking 50 zip codes at random and finding the records
within 50 miles with a query like this:

SELECT id
    FROM test_zip_assoc
    WHERE
        lat_radians > 0.69014816041
        AND lat_radians < 0.71538026567
        AND long_radians > -1.35446228028
        AND long_radians < -1.32923017502


On my development server (dual-processor, dual-core Opteron 2.8 GHz with 4GB
of RAM) this query averages 1.5 seconds in MySQL each time it runs after a
brief warmup period.  In PostgreSQL it averages about 15 seconds.

Both of those times are too slow.  I need the query to run in under a
second with as many as a billion records.  I don't know if this is
possible but I'm really hoping someone can help me restructure my
indexes (multicolumn?, multiple indexes with a 'where' clause?) so
that I can get this running as fast as possible.

If I need to consider some non-database data structure in RAM I will
do that too.  Any help or tips would be greatly appreciated.  I'm
willing to go to great lengths to test this if someone can make a
good suggestion that sounds like it has a reasonable chance of
improving the speed of this search.  There's an extensive thread on my
efforts already here:

http://phpbuilder.com/board/showthread.php?t=10331619&page=10


Re: Simple query, 10 million records...MySQL ten times faster

From
"Merlin Moncure"
Date:
On 24 Apr 2007 14:26:46 -0700, zardozrocks <zardozrocks@gmail.com> wrote:
> I have this table:
>
> CREATE TABLE test_zip_assoc (
>     id serial NOT NULL,
>     f_id integer DEFAULT 0 NOT NULL,
>     lat_radians numeric(6,5) DEFAULT 0.00000 NOT NULL,
>     long_radians numeric(6,5) DEFAULT 0.00000 NOT NULL
> );
> CREATE INDEX lat_radians ON test_zip_assoc USING btree (lat_radians);
> CREATE INDEX long_radians ON test_zip_assoc USING btree
> (long_radians);
>
>
>
> It's basically a table that associates some foreign_key (for an event,
> for instance) with a particular location using longitude and
> latitude.  I'm basically doing a simple proximity search.  I have
> populated the database with *10 million* records.  I then test
> performance by picking 50 zip codes at random and finding the records
> within 50 miles with a query like this:
>
> SELECT id
>         FROM test_zip_assoc
>         WHERE
>                 lat_radians > 0.69014816041
>                 AND lat_radians < 0.71538026567
>                 AND long_radians > -1.35446228028
>                 AND long_radians < -1.32923017502
>
>
> On my development server (dual proc/dual core Opteron 2.8 Ghz with 4GB
> ram) this query averages 1.5 seconds each time it runs after a brief
> warmup period.  In PostGreSQL it averages about 15 seconds.
>
> Both of those times are too slow.  I need the query to run in under a
> second with as many as a billion records.  I don't know if this is
> possible but I'm really hoping someone can help me restructure my
> indexes (multicolumn?, multiple indexes with a 'where' clause?) so
> that I can get this running as fast as possible.
>
> If I need to consider some non-database data structure in RAM I will
> do that too.  Any help or tips would be greatly appreciated.  I'm
> willing to go to greath lengths to test this if someone can make a
> good suggestion that sounds like it has a reasonable chance of
> improving the speed of this search.  There's an extensive thread on my
> efforts already here:

You can always go the earthdistance route.  The index takes longer to
build (roughly 5x longer than a btree), but it will optimize that exact
operation.

merlin
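
A minimal sketch of the earthdistance approach, assuming the cube and
earthdistance contrib modules are installed and that the coordinates are first
converted to degrees into hypothetical columns lat_deg and long_deg:

  -- GiST index over the "earth" representation of each row's position
  CREATE INDEX test_zip_assoc_earth_idx ON test_zip_assoc
    USING gist (ll_to_earth(lat_deg, long_deg));

  -- everything within 50 miles (about 80467 meters) of a point;
  -- earth_box() gives the indexable bounding-box test, earth_distance()
  -- trims the corners of that box down to a true radius
  SELECT id
    FROM test_zip_assoc
    WHERE earth_box(ll_to_earth(39.54, -77.60), 80467)
          @> ll_to_earth(lat_deg, long_deg)
      AND earth_distance(ll_to_earth(39.54, -77.60),
                         ll_to_earth(lat_deg, long_deg)) < 80467;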

Re: Simple query, 10 million records...MySQL ten times faster

From
Bill Moran
Date:
In response to zardozrocks <zardozrocks@gmail.com>:

> I have this table:
>
> CREATE TABLE test_zip_assoc (
>     id serial NOT NULL,
>     f_id integer DEFAULT 0 NOT NULL,
>     lat_radians numeric(6,5) DEFAULT 0.00000 NOT NULL,
>     long_radians numeric(6,5) DEFAULT 0.00000 NOT NULL
> );
> CREATE INDEX lat_radians ON test_zip_assoc USING btree (lat_radians);
> CREATE INDEX long_radians ON test_zip_assoc USING btree
> (long_radians);
>
>
>
> It's basically a table that associates some foreign_key (for an event,
> for instance) with a particular location using longitude and
> latitude.  I'm basically doing a simple proximity search.  I have
> populated the database with *10 million* records.  I then test
> performance by picking 50 zip codes at random and finding the records
> within 50 miles with a query like this:
>
> SELECT id
>     FROM test_zip_assoc
>     WHERE
>         lat_radians > 0.69014816041
>         AND lat_radians < 0.71538026567
>         AND long_radians > -1.35446228028
>         AND long_radians < -1.32923017502
>
>
> On my development server (dual proc/dual core Opteron 2.8 Ghz with 4GB
> ram) this query averages 1.5 seconds each time it runs after a brief
> warmup period.  In PostGreSQL it averages about 15 seconds.
>
> Both of those times are too slow.  I need the query to run in under a
> second with as many as a billion records.  I don't know if this is
> possible but I'm really hoping someone can help me restructure my
> indexes (multicolumn?, multiple indexes with a 'where' clause?) so
> that I can get this running as fast as possible.
>
> If I need to consider some non-database data structure in RAM I will
> do that too.  Any help or tips would be greatly appreciated.  I'm
> willing to go to greath lengths to test this if someone can make a
> good suggestion that sounds like it has a reasonable chance of
> improving the speed of this search.  There's an extensive thread on my
> efforts already here:
>
> http://phpbuilder.com/board/showthread.php?t=10331619&page=10

Why didn't you investigate/respond to the last posts there?  The advice
to bump shared_buffers is good advice.  work_mem might also need to be
bumped.

Figure out which postgresql.conf your system is using and get it dialed
in for your hardware.  You can make all the indexes you want, but if
you've told Postgres that it only has 8MB of RAM to work with, performance
is going to suck.  I don't see hardware specs on that thread (but I
didn't read the whole thing).  If the system you're using is a dedicated
DB system, set shared_buffers to 1/3 - 1/2 of the physical RAM on the
machine for starters.

--
Bill Moran
Collaborative Fusion Inc.
http://people.collaborativefusion.com/~wmoran/

wmoran@collaborativefusion.com
Phone: 412-422-3463x4023
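
As a rough illustration of that advice, the values below are illustrative
starting points for a dedicated 4GB machine, not tuned numbers; note that
older 8.x releases take shared_buffers as a count of 8kB pages rather than in
memory units:

  # postgresql.conf -- illustrative starting point for a dedicated 4GB box
  shared_buffers = 1536MB        # roughly 1/3 of physical RAM, per the advice above
  work_mem = 32MB                # per sort/hash operation, so keep it modest
  effective_cache_size = 3GB     # planner hint about the OS cache, not an allocation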

Re: Simple query, 10 million records...MySQL ten times faster

From
Benjamin Minshall
Date:
zardozrocks wrote:
>     lat_radians numeric(6,5) DEFAULT 0.00000 NOT NULL,
>     long_radians numeric(6,5) DEFAULT 0.00000 NOT NULL

Native data types such as integer or real are much faster than numeric.
If you need 6 digits, it's better to multiply your coordinates by 10^6
and store them as INTEGER.

> On my development server (dual proc/dual core Opteron 2.8 Ghz with 4GB
> ram) this query averages 1.5 seconds each time it runs after a brief
> warmup period.  In PostGreSQL it averages about 15 seconds.

What hard drive(s) and controller(s) do you have?  Please post EXPLAIN
ANALYZE output of the problem query and your postgresql.conf also.

--
Benjamin Minshall <minshall@intellicon.biz>
Senior Developer -- Intellicon, Inc.
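
A minimal sketch of that integer-scaling idea, with hypothetical column names
(the bounds are the original radian values multiplied by 10^6 and rounded):

  CREATE TABLE test_zip_assoc_int (
      id serial NOT NULL,
      f_id integer DEFAULT 0 NOT NULL,
      lat_rad_e6 integer NOT NULL,    -- e.g. 0.69014816 stored as 690148
      long_rad_e6 integer NOT NULL
  );
  CREATE INDEX test_zip_assoc_int_lat ON test_zip_assoc_int (lat_rad_e6);
  CREATE INDEX test_zip_assoc_int_long ON test_zip_assoc_int (long_rad_e6);

  SELECT id
      FROM test_zip_assoc_int
      WHERE lat_rad_e6  BETWEEN  690148 AND  715380
        AND long_rad_e6 BETWEEN -1354462 AND -1329230;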


Re: Simple query, 10 million records...MySQL ten times faster

From
Scott Marlowe
Date:
On Tue, 2007-04-24 at 16:26, zardozrocks wrote:
> I have this table:
>
> CREATE TABLE test_zip_assoc (
>     id serial NOT NULL,
>     f_id integer DEFAULT 0 NOT NULL,
>     lat_radians numeric(6,5) DEFAULT 0.00000 NOT NULL,
>     long_radians numeric(6,5) DEFAULT 0.00000 NOT NULL
> );

Like someone else mentioned, numeric types are SLOW.  See if you can use
integers, or at least floats.

I also wonder if you might be better served with geometric types and
GiST indexes on them than using your lat / long grid.  With a geometric
type, you could define the lat / long as a point and use geometric
operations with it.

See the pgsql manual:

http://www.postgresql.org/docs/8.1/static/datatype-geometric.html
http://www.postgresql.org/docs/8.1/static/functions-geometry.html

> It's basically a table that associates some foreign_key (for an event,
> for instance) with a particular location using longitude and
> latitude.  I'm basically doing a simple proximity search.  I have
> populated the database with *10 million* records.  I then test
> performance by picking 50 zip codes at random and finding the records
> within 50 miles with a query like this:

I assume that there aren't 10 million zip codes, right?

Are you storing the lat / long of the individual venues?  Or the zip
codes?  If you're storing the lat / long of the zips, then I can't
imagine there are 10 million zip codes.  If you could use the lat / long
numbers to find the zip codes that are in your range, then join that to
a venue table that fks off of the zip code table, I would think it would
be much faster, as you'd have a smaller data set to trundle through.

> SELECT id
>     FROM test_zip_assoc
>     WHERE
>         lat_radians > 0.69014816041
>         AND lat_radians < 0.71538026567
>         AND long_radians > -1.35446228028
>         AND long_radians < -1.32923017502
>
>
> On my development server (dual proc/dual core Opteron 2.8 Ghz with 4GB
> ram) this query averages 1.5 seconds each time it runs after a brief
> warmup period.  In PostGreSQL it averages about 15 seconds.

I wonder how well it would run if you had 10, 20, 30, 40 etc... users
running it at the same time.  My guess is that you'll be very lucky to
get anything close to linear scaling in any database.  That's because
this is CPU / Memory bandwidth intensive, so it's gonna kill your DB.
OTOH, if it were I/O bound you could throw more hardware at it
(battery-backed cache RAID controller, etc.).

> Both of those times are too slow.  I need the query to run in under a
> second with as many as a billion records.  I don't know if this is
> possible but I'm really hoping someone can help me restructure my
> indexes (multicolumn?, multiple indexes with a 'where' clause?) so
> that I can get this running as fast as possible.

You're trying to do a whole lot of processing in a little time.  You're
either gonna have to accept a less exact answer (i.e. base it on zip
codes) or come up with some way of mining the data for the answers ahead
of time, kind of like a full text search for lat and long.

So, have you tried what I suggested about increasing shared_buffers and
work_mem yet?
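
A minimal sketch of the geometric-type idea above, using the built-in point
type (recent PostgreSQL releases can put a GiST index directly on a point
column; on older releases an r-tree index or a box column may be needed
instead):

  ALTER TABLE test_zip_assoc ADD COLUMN loc point;
  UPDATE test_zip_assoc SET loc = point(long_radians, lat_radians);
  CREATE INDEX test_zip_assoc_loc_idx ON test_zip_assoc USING gist (loc);

  -- bounding-box search: "point contained in box" is indexable
  SELECT id
      FROM test_zip_assoc
      WHERE loc <@ box(point(-1.35446228028, 0.69014816041),
                       point(-1.32923017502, 0.71538026567));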

Re: Simple query, 10 million records...MySQL ten times faster

From
Jeff Hoffmann
Date:
zardozrocks wrote:
> I have this table:
>
> CREATE TABLE test_zip_assoc (
>     id serial NOT NULL,
>     f_id integer DEFAULT 0 NOT NULL,
>     lat_radians numeric(6,5) DEFAULT 0.00000 NOT NULL,
>     long_radians numeric(6,5) DEFAULT 0.00000 NOT NULL
> );
> CREATE INDEX lat_radians ON test_zip_assoc USING btree (lat_radians);
> CREATE INDEX long_radians ON test_zip_assoc USING btree
> (long_radians);

Maybe I'm missing something, but wouldn't it be easier to just use
PostGIS?  Or failing that, to use the vanilla built-in point type and an
r-tree index?  That's what r-tree indexes are made for.

--
Jeff Hoffmann
Head Plate Spinner
PropertyKey.com

Re: Simple query, 10 million records...MySQL ten times faster

From
"Alexander Staubo"
Date:
On 24 Apr 2007 14:26:46 -0700, zardozrocks <zardozrocks@gmail.com> wrote:
> I have this table:
>
> CREATE TABLE test_zip_assoc (
>     id serial NOT NULL,
>     f_id integer DEFAULT 0 NOT NULL,
>     lat_radians numeric(6,5) DEFAULT 0.00000 NOT NULL,
>     long_radians numeric(6,5) DEFAULT 0.00000 NOT NULL
> );
> CREATE INDEX lat_radians ON test_zip_assoc USING btree (lat_radians);
> CREATE INDEX long_radians ON test_zip_assoc USING btree
> (long_radians);

This is a spatial search -- B-tree indexes are much less efficient
at this than certain other data structures. The R-tree and its many
variants are based on subdividing the space into regions, allowing you
to do efficient checks on containment, intersection, etc., based on
points or bounding boxes.

PostgreSQL implements R-trees natively as well as through a mechanism
called GiST, a framework for implementing pluggable tree-like indexes.
It also provides some geometric data types. However, as far as I know,
PostgreSQL's R-tree/GiST indexes do *not* provide the operators to do
bounding box searches. For this you need PostGIS.

PostGIS implements the whole GIS stack, and it's so good at this that
it's practically the de facto tool among GIS analysts. Installing
PostGIS into a database is simple, and once you have done this, you
can augment your table with a geometry (*):

  alter table test_zip_assoc add column lonlat geometry;
  update test_zip_assoc set lonlat = makepoint(
    long_radians / (3.14159265358979 / 180),
    lat_radians / (3.14159265358979 / 180));

The division is to convert your radians into degrees; PostGIS works
with degrees, at least out of the box.

Now you can query on a bounding box (although, are you sure you got
your lons and lats in order? That's Antarctica, isn't it?):

  select * from test_zip_assoc
  where lonlat && makebox2d(
    makepoint(-77.6049721697096, 39.5425768302107),
    makepoint(-76.1592790300818,  40.9882699698386))

This is bound to be blazingly fast. Next you can order by geographic
distance if you like:

  order by distance_sphere(lonlat,
    makepoint(-77.6049721697096, 39.5425768302107))

Nobody has mentioned PostGIS so far, so I hope I'm not missing some
crucial detail, like "no spatial indexes allowed!".

(*) I cheated here. The PostGIS manual recommends that you use a
function to create the geometry column, because it will set up some
auxiliary data structures for you that are needed for certain
operations. The recommended syntax is:

  select AddGeometryColumn('', 'test_zip_assoc', 'geom', -1, 'POINT', 2);

Alexander.

Re: Simple query, 10 million records...MySQL ten times faster

From
"Alexander Staubo"
Date:
On 4/27/07, Alexander Staubo <alex@purefiction.net> wrote:
[snip]
> PostGIS implements the whole GIS stack, and it's so good at this that
> it's practically the de facto tool among GIS analysts. Installing
> PostGIS into a database is simple, and once you have done this, you
> can augment your table with a geometry (*):
>
>   alter table test_zip_assoc add column lonlat geometry;

I forgot to include the crucial step, of course:

  create index test_zip_assoc_lonlat_index on test_zip_assoc
    using gist (lonlat gist_geometry_ops);
  analyze test_zip_assoc;  -- ANALYZE targets the table, not the index

This creates a GiST index on the geometry and (significantly) updates
the table statistics.

Alexander.

Re: Simple query, 10 million records...MySQL ten times faster

From
joe@omc-international.com.au
Date:

Is there a reason you are not using PostGIS?  The R-tree indexes are
designed for exactly this type of query and should be able to do it very
quickly.

Hope that helps,

Joe

> I have this table:
>
> CREATE TABLE test_zip_assoc (
>     id serial NOT NULL,
>     f_id integer DEFAULT 0 NOT NULL,
>     lat_radians numeric(6,5) DEFAULT 0.00000 NOT NULL,
>     long_radians numeric(6,5) DEFAULT 0.00000 NOT NULL
> );
> CREATE INDEX lat_radians ON test_zip_assoc USING btree (lat_radians);
> CREATE INDEX long_radians ON test_zip_assoc USING btree
> (long_radians);
>
>
>
> It's basically a table that associates some foreign_key (for an event,
> for instance) with a particular location using longitude and
> latitude.  I'm basically doing a simple proximity search.  I have
> populated the database with *10 million* records.  I then test
> performance by picking 50 zip codes at random and finding the records
> within 50 miles with a query like this:
>
> SELECT id
>     FROM test_zip_assoc
>     WHERE
>         lat_radians > 0.69014816041
>         AND lat_radians < 0.71538026567
>         AND long_radians > -1.35446228028
>         AND long_radians < -1.32923017502
>
>
> On my development server (dual proc/dual core Opteron 2.8 Ghz with 4GB
> ram) this query averages 1.5 seconds each time it runs after a brief
> warmup period.  In PostGreSQL it averages about 15 seconds.
>
> Both of those times are too slow.  I need the query to run in under a
> second with as many as a billion records.  I don't know if this is
> possible but I'm really hoping someone can help me restructure my
> indexes (multicolumn?, multiple indexes with a 'where' clause?) so
> that I can get this running as fast as possible.
>
> If I need to consider some non-database data structure in RAM I will
> do that too.  Any help or tips would be greatly appreciated.  I'm
> willing to go to greath lengths to test this if someone can make a
> good suggestion that sounds like it has a reasonable chance of
> improving the speed of this search.  There's an extensive thread on my
> efforts already here:
>
> http://phpbuilder.com/board/showthread.php?t=10331619&page=10
>
>



Re: Simple query, 10 million records...MySQL ten times faster

From
Oleg Bartunov
Date:
Folks,

we in astronomy routinely work with billions of objects with spherical
attributes and have several sky-indexing schemes.  See my page
for links: http://www.sai.msu.su/~megera/wiki/SkyPixelization

We have the q3c package for PostgreSQL, available from q3c.sf.net, which
we use in production with terabyte-sized databases.

Oleg
On Thu, 26 Apr 2007, Scott Marlowe wrote:

> On Tue, 2007-04-24 at 16:26, zardozrocks wrote:
>> I have this table:
>>
>> CREATE TABLE test_zip_assoc (
>>     id serial NOT NULL,
>>     f_id integer DEFAULT 0 NOT NULL,
>>     lat_radians numeric(6,5) DEFAULT 0.00000 NOT NULL,
>>     long_radians numeric(6,5) DEFAULT 0.00000 NOT NULL
>> );
>
> Like someone else mentioned numeric types are SLOW.  See if you can use
> integers, or at least floats.
>
> I also wonder if you might be better served with geometric types and
> GiST indexes on them than using your lat / long grid.  With a geometric
> type, you could define the lat / long as a point and use geometric
> operations with it.
>
> See the pgsql manual:
>
> http://www.postgresql.org/docs/8.1/static/datatype-geometric.html
> http://www.postgresql.org/docs/8.1/static/functions-geometry.html
>
>> It's basically a table that associates some foreign_key (for an event,
>> for instance) with a particular location using longitude and
>> latitude.  I'm basically doing a simple proximity search.  I have
>> populated the database with *10 million* records.  I then test
>> performance by picking 50 zip codes at random and finding the records
>> within 50 miles with a query like this:
>
> I assume that there aren't 10 million zip codes, right?
>
> Are you storing the lat / long of the individual venues?  Or the zip
> codes?  If you're storing the lat / long of the zips, then I can't
> imagine there are 10 million zip codes.  If you could use the lat / long
> numbers to find the zip codes that are in your range, then join that to
> a venue table that fks off of the zip code table, I would think it would
> be much faster, as you'd have a smaller data set to trundle through.
>
>> SELECT id
>>     FROM test_zip_assoc
>>     WHERE
>>         lat_radians > 0.69014816041
>>         AND lat_radians < 0.71538026567
>>         AND long_radians > -1.35446228028
>>         AND long_radians < -1.32923017502
>>
>>
>> On my development server (dual proc/dual core Opteron 2.8 Ghz with 4GB
>> ram) this query averages 1.5 seconds each time it runs after a brief
>> warmup period.  In PostGreSQL it averages about 15 seconds.
>
> I wonder how well it would run if you had 10, 20, 30, 40 etc... users
> running it at the same time.  My guess is that you'll be very lucky to
> get anything close to linear scaling in any database.  That's because
> this is CPU / Memory bandwidth intensive, so it's gonna kill your DB.
> OTOH, if it was I/O bound you could throw more hardware at it (bb cache
> RAID controller, etc)
>
>> Both of those times are too slow.  I need the query to run in under a
>> second with as many as a billion records.  I don't know if this is
>> possible but I'm really hoping someone can help me restructure my
>> indexes (multicolumn?, multiple indexes with a 'where' clause?) so
>> that I can get this running as fast as possible.
>
> You're trying to do a whole lot of processing in a little time.  You're
> either gonna have to accept a less exact answer (i.e. base it on zip
> codes) or come up with some way of mining the data for the answers ahead
> of time, kind of like a full text search for lat and long.
>
> So, have you tried what I suggested about increasing shared_buffers and
> work_mem yet?
>

     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
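
For reference, a hedged sketch of the usual q3c workflow (function names are
from the q3c package; exact signatures and the expected longitude range should
be checked against the q3c documentation, and the degree columns lon_deg and
lat_deg are hypothetical):

  -- index the table on the Q3C pixel number of each position (in degrees)
  CREATE INDEX test_zip_assoc_q3c_idx
      ON test_zip_assoc (q3c_ang2ipix(lon_deg, lat_deg));
  CLUSTER test_zip_assoc USING test_zip_assoc_q3c_idx;  -- optional but helps locality
  ANALYZE test_zip_assoc;

  -- cone search: everything within ~0.72 degrees (about 50 miles) of a point
  SELECT id
      FROM test_zip_assoc
      WHERE q3c_radial_query(lon_deg, lat_deg, -77.60, 39.54, 0.72);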