Thread: Slow join using network address function

Slow join using network address function

From
"Eric Jain"
Date:
I'm trying to join two tables on an inet column, where one of the
columns may contain a subnet rather than a single host. Somehow the
operation isn't completing quite fast enough, even though neither table
is very large:

       table        |  rows
--------------------+--------
 clients            | 115472
 clients_commercial |  11670


First attempt, cancelled after running for half an hour:

SELECT
  c.address AS address,
  cc.address AS network
FROM
  clients c
  JOIN clients_commercial cc ON (c.address <<= cc.address)
;

 Nested Loop
 (cost=189.00..27359887.76 rows=607947200 width=22)
   Join Filter: ("outer".address <<= "inner".address)
   ->  Seq Scan on clients c
       (cost=0.00..2074.76 rows=102176 width=11)
   ->  Materialize
       (cost=189.00..308.00 rows=11900 width=11)
         ->  Seq Scan on clients_commercial cc
             (cost=0.00..189.00 rows=11900 width=11)



Second attempt, completes within 10 min:

SELECT
  c.address AS address,
  cc.address AS network
FROM
  clients c,
  clients_commercial cc
WHERE
  c.commercial IS NULL
  AND c.address <<= cc.address
;

 Nested Loop
 (cost=189.00..139084.01 rows=3040450 width=22)
   Join Filter: ("outer".address <<= "inner".address)
   ->  Seq Scan on clients c
       (cost=0.00..2074.76 rows=511 width=11)
         Filter: (commercial IS NULL)
   ->  Materialize
       (cost=189.00..308.00 rows=11900 width=11)
         ->  Seq Scan on clients_commercial cc
             (cost=0.00..189.00 rows=11900 width=11)


Third attempt; provided some indexes, which unfortunately don't get
used, making the query twice as slow as the previous one:

SELECT
  c.address AS address,
  cc.address AS network
FROM
  clients c,
  clients_commercial cc
WHERE
  c.commercial IS NULL
  AND set_masklen(c.address, masklen(cc.address)) = cc.address
;

CREATE INDEX clients_commercial_masklen_idx
ON clients_commercial((masklen(address)));

CREATE INDEX clients_32_idx
ON clients((set_masklen(address, 32)));

CREATE INDEX clients_24_idx
ON clients((set_masklen(address, 24)));

CREATE INDEX clients_16_idx
ON clients((set_masklen(address, 16)));

 Nested Loop
 (cost=189.00..169488.51 rows=479 width=22)
   Join Filter: (set_masklen("outer".address, masklen("inner".address))
= "inner".address)
   ->  Seq Scan on clients c
       (cost=0.00..2074.76 rows=511 width=11)
         Filter: (commercial IS NULL)
   ->  Materialize
       (cost=189.00..308.00 rows=11900 width=11)
         ->  Seq Scan on clients_commercial cc
             (cost=0.00..189.00 rows=11900 width=11)


Anything else I could try? BTREE indexes don't seem to work with the <<=
operator; is this not possible in principal, or simply something that
has not been implmented yet?


Re: Slow join using network address function

From
Steve Atkins
Date:
On Mon, Feb 23, 2004 at 12:48:02PM +0100, Eric Jain wrote:
> I'm trying to join two tables on an inet column, where one of the
> columns may contain a subnet rather than a single host. Somehow the
> operation isn't completing quite fast enough, even though neither table
> is very large:
>
>        table        |  rows
> --------------------+--------
>  clients            | 115472
>  clients_commercial |  11670

[snip]

> Anything else I could try? BTREE indexes don't seem to work with the <<=
> operator; is this not possible in principal, or simply something that
> has not been implmented yet?

I've been looking at a similar problem for a while. I found that the inet
type didn't really give me the flexibility I needed, and indexing it in
a way that worked with CIDR blocks didn't seem easy (and maybe not possible).

So I rolled my own, based on the seg sample.

<http://word-to-the-wise.com/ipr.tgz> is a datatype that contains a range
of IPv4 addresses, and which has the various operators to make it GIST
indexable. Untar it into contrib and make as usual.

Input is of the form '10.11.12.13' or '10.11.12.13.0/25' or
'10.11.12.13-10.11.12.13.127'. The function display() takes an
ipr type and returns it formatted for display (as a dotted-quad if
a /32, as CIDR format if possible, as a range of dotted-quads otherwise).

A bunch of operators are included, but '&&' returns true if two ipr
fields intersect.

Bugs include:

  0.0.0.0/0 doesn't do what it should on input.
  No documentation.
  No cast operators between ipr and inet types.
  No documentation.

I was planning on doing some docs before releasing it, but here it
is anyway.

Cheers,
  Steve
--
-- Steve Atkins -- steve@blighty.com

Re: Slow join using network address function

From
Josh Berkus
Date:
Eric,

>  Nested Loop
>  (cost=189.00..27359887.76 rows=607947200 width=22)
>    Join Filter: ("outer".address <<= "inner".address)
>    ->  Seq Scan on clients c
>        (cost=0.00..2074.76 rows=102176 width=11)
>    ->  Materialize
>        (cost=189.00..308.00 rows=11900 width=11)
>          ->  Seq Scan on clients_commercial cc
>              (cost=0.00..189.00 rows=11900 width=11)

To help you, we need EXPLAIN ANALYZE, not just EXPLAIN.   Thanks!

--
-Josh Berkus
 Aglio Database Solutions
 San Francisco


Re: Slow join using network address function

From
Markus Bertheau
Date:
On Пнд, 2004-02-23 at 12:04 -0800, Josh Berkus wrote:
> Eric,
>
> >  Nested Loop
> >  (cost=189.00..27359887.76 rows=607947200 width=22)
> >    Join Filter: ("outer".address <<= "inner".address)
> >    ->  Seq Scan on clients c
> >        (cost=0.00..2074.76 rows=102176 width=11)
> >    ->  Materialize
> >        (cost=189.00..308.00 rows=11900 width=11)
> >          ->  Seq Scan on clients_commercial cc
> >              (cost=0.00..189.00 rows=11900 width=11)
>
> To help you, we need EXPLAIN ANALYZE, not just EXPLAIN.   Thanks!

He said he cancelled the query.

--
Markus Bertheau <twanger@bluetwanger.de>


Re: Slow join using network address function

From
"Eric Jain"
Date:
> <http://word-to-the-wise.com/ipr.tgz> is a datatype that contains
> a range of IPv4 addresses, and which has the various operators to
> make it GIST indexable.

Great, this looks very promising.


>   No cast operators between ipr and inet types.

Any way to work around this, short of dumping and reloading tables?

SELECT ipr '1.2.3.4'; -- Okay
SELECT ipr text(inet '1.2.3.4'); -- Syntax error, of course
SELECT ipr(text(inet '1.2.3.4')); -- Function does not exist, of course
...


Re: Slow join using network address function

From
Tom Lane
Date:
"Eric Jain" <Eric.Jain@isb-sib.ch> writes:
>> <http://word-to-the-wise.com/ipr.tgz> is a datatype that contains
>> a range of IPv4 addresses, and which has the various operators to
>> make it GIST indexable.

> Great, this looks very promising.

>> No cast operators between ipr and inet types.

> Any way to work around this, short of dumping and reloading tables?

Wouldn't it be better to implement the GIST indexing operators of that
package on the standard datatypes?  It wasn't apparent to me what "range
of IP addresses" does for you that isn't covered by "CIDR subnet" for
real-world cases.

            regards, tom lane

Re: Slow join using network address function

From
Nick Barr
Date:
Tom Lane wrote:

>"Eric Jain" <Eric.Jain@isb-sib.ch> writes:
>
>
>>><http://word-to-the-wise.com/ipr.tgz> is a datatype that contains
>>>a range of IPv4 addresses, and which has the various operators to
>>>make it GIST indexable.
>>>
>>>
>
>
>
>>Great, this looks very promising.
>>
>>
>
>
>
>>>No cast operators between ipr and inet types.
>>>
>>>
>
>
>
>>Any way to work around this, short of dumping and reloading tables?
>>
>>
>
>Wouldn't it be better to implement the GIST indexing operators of that
>package on the standard datatypes?  It wasn't apparent to me what "range
>of IP addresses" does for you that isn't covered by "CIDR subnet" for
>real-world cases.
>
>            regards, tom lane
>
>---------------------------(end of broadcast)---------------------------
>TIP 8: explain analyze is your friend
>
>
We currently only allow access to one of our apps based on IP address.
These IPs are stored one per row in a single table, but often represent
a contiguous piece of IP space, but does not represent a full subnet.
The current CIDR subnet has the limitation that it will only allow full
subnets, i.e. every IP address in 192.168.1.0/24. For example:

192.168.1.15 -> 192.168.1.31

This range cannot be represented by a CIDR subnet, or it might be able
to but I really dont want to figure it out each time. However this new
type allows us to store this range as one row. It allows an arbitrary
range of IP addresses, not just those in a specific subnet. I would see
this as a useful inclusion whether in the main src tree or in contrib
and we will probably be using it when we get to "mess" with the database
schema for this app in the next few months, in fact I have already
inserted it into our PG source tree ;-).

Nick

P.S. We are not responsible for the IP address ranges, we just get told
what they are.





Re: Slow join using network address function

From
Steve Atkins
Date:
On Tue, Feb 24, 2004 at 10:23:22AM -0500, Tom Lane wrote:
> "Eric Jain" <Eric.Jain@isb-sib.ch> writes:
> >> <http://word-to-the-wise.com/ipr.tgz> is a datatype that contains
> >> a range of IPv4 addresses, and which has the various operators to
> >> make it GIST indexable.
>
> > Great, this looks very promising.
>
> >> No cast operators between ipr and inet types.
>
> > Any way to work around this, short of dumping and reloading tables?
>
> Wouldn't it be better to implement the GIST indexing operators of that
> package on the standard datatypes?  It wasn't apparent to me what "range
> of IP addresses" does for you that isn't covered by "CIDR subnet" for
> real-world cases.

Well, maybe.

However, many of the cases where people want to use this sort of
functionality (address range ownership, email blacklists etc) an
entity is likely to associated with one or a small number of ranges
of contiguous addresses. Those ranges are often not simple CIDR
blocks, and deaggregating them into a sequence of CIDR blocks
doesn't buy anything and complicates the problem.

I also managed to convince myself that it wasn't possible to do
a useful GIST index of a CIDR datatype - as the union between two
adjacent CIDR blocks as a CIDR block is often far, far larger than
the actual range involved - consider 63.255.255.255/32 and 64.0.0.0/32.
That seemed to break the indexing algorithms. I'd like to be proven
wrong on that, but would still find ipr a more useful datatype than
inet for my applications.

Cheers,
  Steve


Re: Slow join using network address function

From
Steve Atkins
Date:
On Tue, Feb 24, 2004 at 01:07:10PM +0100, Eric Jain wrote:
> > <http://word-to-the-wise.com/ipr.tgz> is a datatype that contains
> > a range of IPv4 addresses, and which has the various operators to
> > make it GIST indexable.
>
> Great, this looks very promising.
>
> >   No cast operators between ipr and inet types.
>
> Any way to work around this, short of dumping and reloading tables?
>
> SELECT ipr '1.2.3.4'; -- Okay
> SELECT ipr text(inet '1.2.3.4'); -- Syntax error, of course
> SELECT ipr(text(inet '1.2.3.4')); -- Function does not exist, of course

There's probably some horrible SQL hack that would let you do it, but
I should add some casting code anyway. Shouldn't be too painful to do -
I'll try and get that, and some minimal documentation out today.

Cheers,
  Steve

Re: Slow join using network address function

From
Steve Atkins
Date:
On Tue, Feb 24, 2004 at 09:14:42AM -0800, Steve Atkins wrote:
> On Tue, Feb 24, 2004 at 01:07:10PM +0100, Eric Jain wrote:
> > > <http://word-to-the-wise.com/ipr.tgz> is a datatype that contains
> > > a range of IPv4 addresses, and which has the various operators to
> > > make it GIST indexable.
> >
> > Great, this looks very promising.
> >
> > >   No cast operators between ipr and inet types.
> >
> > Any way to work around this, short of dumping and reloading tables?
>
> There's probably some horrible SQL hack that would let you do it, but
> I should add some casting code anyway. Shouldn't be too painful to do -
> I'll try and get that, and some minimal documentation out today.

Done. <http://word-to-the-wise.com/ipr/>

This really isn't pgsql-performance content, so this is the last time
I'll mention it here.

Cheers,
  Steve