Thread: POSTGRES DB 3 800 000 rows table, speed up?

From
Eugene
Date:
Hello!
I've managed to import 3,800,000 rows of data into a PostgreSQL database
(500 MB of raw CSV, roughly 2 GB as an SQL database).
It looks like this:

"69110784","69111807","US","UNITED
STATES","ILLINOIS","BLOOMINGTON","40.4758","-88.9894","61701","LEVEL 3
COMMUNICATIONS INC","DSL-VERIZON.NET"
"69111808","69112831","US","UNITED
STATES","TEXAS","GRAPEVINE","32.9309","-97.0755","76051","LEVEL 3
COMMUNICATIONS INC","DSL-VERIZON.NET"
"69112832","69113087","US","UNITED
STATES","TEXAS","DENTON","33.2108","-97.1231","76201","LEVEL 3
COMMUNICATIONS INC","DSL-VERIZON.NET"


CREATE TABLE ipdb2
(
    ipFROM int4 NOT NULL,
    ipTO int4 NOT NULL,
    countrySHORT CHARACTER(2) NOT NULL,
    countryLONG VARCHAR(64) NOT NULL,
    ipREGION VARCHAR(128) NOT NULL,
    ipCITY VARCHAR(128) NOT NULL,
    ipLATITUDE DOUBLE PRECISION,
    ipLONGITUDE DOUBLE PRECISION,
    ipZIPCODE VARCHAR(5),
    ipISP VARCHAR(255) NOT NULL,
    ipDOMAIN VARCHAR(128) NOT NULL
);


I've indexed the first two columns, ipFROM and ipTO; both indexes are btree.
The PostgreSQL version is 7.4.8, on a hosted server.
I query the DB like this:  SELECT * FROM ipdb2 WHERE '3229285376' BETWEEN ipfrom
AND ipto;

and get an answer after 3-10 seconds. Is there a way to speed it up somehow?
Are any tweaks and tune-ups possible?

thanks!
----------------
eugene

Re: POSTGRES DB 3 800 000 rows table, speed up?

From
"Jim C. Nasby"
Date:
On Tue, Dec 27, 2005 at 11:25:37PM +0200, Eugene wrote:
> I've indexed first two columns they are IPfrom, IPto also table is btree
> version of postgre is 7.4.8, on hosting

You should ask them to upgrade; 7.4 is getting pretty old.

> I ask db like this  SELECT * FROM ipdb2 WHERE '3229285376' BETWEEN ipfrom
> AND ipto;

I'm pretty sure PostgreSQL won't be able to use any indexes for this
(EXPLAIN ANALYZE would verify that). Instead, expand the between out:

WHERE ipfrom >= '...' AND ipto <= '...'

Also, generally speaking, databases and CamelCase don't mix too well;
you'll probably be happier doing something like ip_from and ip_to.
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461

Re: POSTGRES DB 3 800 000 rows table, speed up?

From
"Jonel Rienton"
Date:
I have a question about this: shouldn't the query look like

SELECT *
FROM ipdb2
WHERE 3229285376 BETWEEN ipfrom AND ipto


Note that this query doesn't quote the filter value, since ipfrom and ipto are
both integer types?



Re: POSTGRES DB 3 800 000 rows table, speed up?

From
hubert depesz lubaczewski
Date:
On 12/27/05, Eugene <evgenius@hot.ee> wrote:
[snip: quoted original message - sample data and CREATE TABLE ipdb2]

1st of all: change the ipfrom and ipto column types to int8.
Integer types in PostgreSQL are signed, so the effective max of int4 is 2147483647, which makes your example with a value over 3000000000 technically not work.
2nd: run VACUUM ANALYZE.
3rd: show EXPLAIN ANALYZE output. How can we tell you how to speed it up when we don't know what PostgreSQL is doing with the query?
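A minimal sketch of these three steps. Note that the ALTER COLUMN ... TYPE syntax arrived in PostgreSQL 8.0; on the 7.4.8 server in question the table would have to be recreated with int8 columns and reloaded instead.

```sql
-- Widen the range columns so unsigned 32-bit IP values fit
-- (int4 tops out at 2147483647; 4294967295 needs int8).
ALTER TABLE ipdb2 ALTER COLUMN ipfrom TYPE int8;
ALTER TABLE ipdb2 ALTER COLUMN ipto TYPE int8;

-- Reclaim dead tuples and refresh planner statistics.
VACUUM ANALYZE ipdb2;

-- Show what the planner actually does with the query.
EXPLAIN ANALYZE SELECT * FROM ipdb2 WHERE 3229285376 BETWEEN ipfrom AND ipto;
```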

depesz

Re: POSTGRES DB 3 800 000 rows table, speed up?

From
Tom Lane
Date:
"Jim C. Nasby" <jnasby@pervasive.com> writes:
> On Tue, Dec 27, 2005 at 11:25:37PM +0200, Eugene wrote:
>> I ask db like this  SELECT * FROM ipdb2 WHERE '3229285376' BETWEEN ipfrom
>> AND ipto;

> I'm pretty sure PostgreSQL won't be able to use any indexes for this
> (EXPLAIN ANALYZE would verify that). Instead, expand the between out:

> WHERE ipfrom >= '...' AND ipto <= '...'

That won't help (it is in fact exactly the same query, because BETWEEN
is just rewritten into that).  The real problem is that btree indexes
are ill-suited to this type of condition.  If the typical row has only
a small distance between ipfrom and ipto then the query is actually
pretty selective, but there is no way to capture that selectivity in
a btree search, because neither of the single-column comparisons are
selective at all.  The planner realizes this and doesn't bother with
the index, instead it just does a seqscan.

You could probably get somewhere by casting the problem as an rtree
or GIST overlap/containment query, but with the currently available
tools it would be a pretty unnatural-looking query ... probably
something like
    box(point(ipfrom,ipfrom),point(ipto,ipto)) ~
    box(point(3229285376,3229285376),point(3229285376,3229285376))
after creating an rtree or GIST index on
    box(point(ipfrom,ipfrom),point(ipto,ipto))
(haven't tried this but there is a solution lurking somewhere in this
general vicinity).
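Assembled into complete statements, the sketch might look like the following. This is untested, as Tom says; the index method would be rtree on 7.4 (GiST on later versions), and the index name is illustrative.

```sql
-- Untested sketch: represent each [ipfrom, ipto] range as a degenerate
-- box on the diagonal, so range containment becomes 2-D box containment.
CREATE INDEX ipdb2_box_idx ON ipdb2
    USING rtree (box(point(ipfrom, ipfrom), point(ipto, ipto)));

-- "~" is the old spelling of box-contains-box.
SELECT * FROM ipdb2
WHERE box(point(ipfrom, ipfrom), point(ipto, ipto)) ~
      box(point(3229285376, 3229285376), point(3229285376, 3229285376));
```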

Is there a good reason why the data is stored this way, and not as
say a single "cidr" column containing subnet addresses?  Querying
    WHERE '192.122.252.0' << cidrcolumn
would be a much more transparent way of expressing your problem.
We don't currently have an easy indexing solution for that one either,
but we might in the future.

            regards, tom lane

Re: POSTGRES DB 3 800 000 rows table, speed up?

From
Klein Balázs
Date:
Could you explain this a little more?
What are the conditions that make a b-tree ineffective in this situation?

Thanks
SWK



Re: POSTGRES DB 3 800 000 rows table, speed up?

From
Tom Lane
Date:
Klein Balázs <Balazs.Klein@axelero.hu> writes:
> Could you explain this a little bit more?
> What are the conditions of this situation that makes b-tree ineffective?

Well, what he's trying to do is (abstracting a little)

    WHERE low_bound_col <= probe_value AND probe_value <= high_bound_col

Given a btree index on low_bound_col, the best you could do with this is
scan all the index entries from the start of the index up to probe_value
... or about half the table, on average, which makes the index pretty
much useless.  On the assumption that low_bound_col and high_bound_col
are usually close together, all of the useful hits will occur near the
end of that scan, or the beginning if you scan backwards --- but there's
no way to know when it's OK to stop looking.

Making a double-column index on (low_bound_col, high_bound_col) does
not improve the situation much, because the additional condition
high_bound_col >= probe_value doesn't let you avoid scanning small
values of low_bound_col.  You might save some trips to the table proper
but you're still scanning half the index.

And of course indexing (high_bound_col, low_bound_col) isn't any better.

If you are willing to impose a hard-wired assumption about the possible
size of the low-bound-to-high-bound distance, you can extend the query
to something like

    WHERE low_bound_col <= probe_value AND probe_value <= high_bound_col
    AND low_bound_col >= (probe_value - max_distance)

which creates an efficiently indexable range limitation on
low_bound_col.  Of course this is a very sucky kluge.
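As a concrete instance of that kluge: suppose no range in the table spans more than 65536 addresses (this bound is an assumption for illustration, not a fact from the thread). The extra lower bound turns the index scan into a narrow range:

```sql
-- Pick the true maximum range width of your data as the bound; a value
-- that is too small will silently miss matching rows.
SELECT * FROM ipdb2
WHERE ipfrom <= 3229285376
  AND ipto   >= 3229285376
  AND ipfrom >= 3229285376 - 65536;
```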

You can do a lot better with index types that are designed for
two-dimensional data instead of one-dimensional data.  Btree is
a great data structure for one-dimensional searches, but that
doesn't make it the answer to everything.

            regards, tom lane

Re: POSTGRES DB 3 800 000 rows table, speed up?

From
Eugene
Date:
Thanks for the quick reply.

This is what I get now:

  Index Scan using ipt on ipdb2  (cost=0.00..74265.76 rows=989636
width=118) (actual time=0.216..2379.608 rows=1 loops=1)
    Index Cond: (3229285376::bigint <= ipto)
    Filter: (3229285376::bigint >= ipfrom)
  Total runtime: 2379.666 ms
(4 rows)

detectlo_db=> EXPLAIN ANALYZE SELECT * FROM ipdb2 WHERE '999998376'
BETWEEN ipfrom AND ipto;
                                                        QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------
  Index Scan using ipf on ipdb2  (cost=0.00..6796.64 rows=154129 width=118)
(actual time=211.298..211.301 rows=1 loops=1)
    Index Cond: (999998376::bigint >= ipfrom)
    Filter: (999998376::bigint <= ipto)
  Total runtime: 211.371 ms
(4 rows)

detectlo_db=> drop index ipt
detectlo_db-> ;
DROP INDEX
detectlo_db=> analyze ipdb2;
ANALYZE
detectlo_db=> EXPLAIN SELECT * FROM ipdb2 WHERE '3229285376' BETWEEN
ipfrom AND ipto;
                                  QUERY PLAN
----------------------------------------------------------------------------
  Index Scan using ipf on ipdb2  (cost=0.00..95179.56 rows=989181 width=117)
    Index Cond: (3229285376::bigint >= ipfrom)
    Filter: (3229285376::bigint <= ipto)
(3 rows)

detectlo_db=> EXPLAIN ANALYZE SELECT * FROM ipdb2 WHERE '3229285376'
BETWEEN ipfrom AND ipto;
                                                          QUERY PLAN


----------------------------------------------------------------------------------------------------------------------------
  Index Scan using ipf on ipdb2  (cost=0.00..95179.56 rows=989181
width=117) (actual time=3223.344..3223.347 rows=1 loops=1)
    Index Cond: (3229285376::bigint >= ipfrom)
    Filter: (3229285376::bigint <= ipto)
  Total runtime: 3223.410 ms
(4 rows)

detectlo_db=> SELECT * FROM ipdb2 WHERE '3229285376' BETWEEN ipfrom AND
ipto;
    ipfrom   |    ipto    | countryshort |  countrylong  | ipregion | ipcity | iplatitude | iplongitude | ipzipcode |               ipisp               | ipdomain
 ------------+------------+--------------+---------------+----------+--------+------------+-------------+-----------+-----------------------------------+----------
  3229285376 | 3229285631 | US           | UNITED STATES | TEXAS    | TYLER  | 32.3511    | -95.2922    | 75701     | HOWE-BAKER ENGINEERS INCORPORATED | -
 (1 row)

detectlo_db=>



On Wed, 28 Dec 2005 23:33:41 +0200, James Robinson
<jlrobins@socialserve.com> wrote:

>
> On Dec 28, 2005, at 4:24 PM, Eugene wrote:
>
>> THIS is what I get
>
> [snip ]
>
> Sorry -- use 'explain select ...', not 'analyze select ...' my bad.
>
> But anyway, looks like your query is still being served by your
> preexisting 'ipt' index. Try
>
> drop index ipt;
>
> then
>
> analyze ipdb2
>
> then
>     explain analyze select * FROM ipdb2 WHERE '3229285376' BETWEEN ipfrom
> AND ipto;
>
> [ Assuming this is not a currently running production server ]
>
> ----
> James Robinson
> Socialserve.com
>



--

----------------
eugene

Re: POSTGRES DB 3 800 000 rows table, speed up?

From
Manfred Koizar
Date:
On Tue, 27 Dec 2005 23:25:37 +0200, Eugene <evgenius@hot.ee> wrote:
>SELECT * FROM ipdb2 WHERE '3229285376' BETWEEN ipfrom AND ipto;

If your [ipfrom, ipto] ranges are non-overlapping, you might be able
to exploit that fact by adding something like

    ... ORDER BY ipfrom DESC LIMIT 1
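Spelled out against the original query, the idea is a sketch that only works if the ranges never overlap, so at most one row can match:

```sql
-- With an index on ipfrom, a backward index scan can stop at the first
-- candidate instead of walking the rest of the index.
SELECT * FROM ipdb2
WHERE 3229285376 BETWEEN ipfrom AND ipto
ORDER BY ipfrom DESC
LIMIT 1;
```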

Servus
 Manfred

Re: POSTGRES DB 3 800 000 rows table, speed up?

From
Andrew - Supernews
Date:
On 2005-12-27, Eugene <evgenius@hot.ee> wrote:
> I've indexed first two columns they are IPfrom, IPto also table is btree
> version of postgre is 7.4.8, on hosting
> I ask db like this  SELECT * FROM ipdb2 WHERE '3229285376' BETWEEN ipfrom
> AND ipto;
>
> and get answer after 3-10 seconds, is there a way to speed it up somehow?
> any tweaks and tuneups possible with it?

http://pgfoundry.org/projects/ip4r

Designed specifically for storing IP ranges as in your example, it will
(if correctly used) reduce that query time down to a few milliseconds.
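A rough sketch of what an ip4r layout could look like (the table, column, and index names here are illustrative, not from the thread):

```sql
-- One ip4r value holds the whole [from, to] range and is indexable
-- as a single unit, so a range-containment probe can use the index.
CREATE TABLE ipdb3 (
    iprange ip4r NOT NULL,
    ipcity  varchar(128) NOT NULL
);
CREATE INDEX ipdb3_range_idx ON ipdb3 USING gist (iprange);

-- ">>=" tests "range contains address".
SELECT * FROM ipdb3 WHERE iprange >>= '192.122.252.10';
```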

--
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services