Thread: performance of implicit join vs. explicit conditions on inet queries?

performance of implicit join vs. explicit conditions on inet queries?

From
Robert Edmonds
Date:
The preliminaries:

    - PostgreSQL 8.1 beta 3, Debian experimental
    - database has been VACUUMed FULL ANALYZE.
    - a pg_dump -Fc exists at http://199.77.129.48/inet_test.db
    - ia32 hardware with 2 GB physical memory and the following settings:

shared_buffers = 40960
temp_buffers = 16384
work_mem = 131072
maintenance_work_mem = 262144
effective_cache_size = 65536


    I've populated a table evenly with about 2 million rows of RFC 1918
    addresses:


Table "public.inet_addresses"
 Column | Type | Modifiers
--------+------+-----------
 addr   | inet | not null
Indexes:
    "inet_addresses_pkey" PRIMARY KEY, btree (addr)


    The following query is very fast:


EXPLAIN ANALYZE
SELECT *
  FROM inet_addresses
 WHERE addr << inet('10.2.0.0/24')
    OR addr << inet('10.4.0.0/24')
    OR addr << inet('10.8.0.0/24');

 Bitmap Heap Scan on inet_addresses  (cost=6.51..324.48 rows=1792335 width=11) (actual time=0.350..1.104 rows=381
loops=1)
   Recheck Cond: ((addr << '10.2.0.0/24'::inet) OR (addr << '10.4.0.0/24'::inet) OR (addr << '10.8.0.0/24'::inet))
   Filter: ((addr << '10.2.0.0/24'::inet) OR (addr << '10.4.0.0/24'::inet) OR (addr << '10.8.0.0/24'::inet))
   ->  BitmapOr  (cost=6.51..6.51 rows=85 width=0) (actual time=0.336..0.336 rows=0 loops=1)
         ->  Bitmap Index Scan on inet_addresses_pkey  (cost=0.00..2.17 rows=28 width=0) (actual time=0.127..0.127
rows=127loops=1) 
               Index Cond: ((addr > '10.2.0.0/24'::inet) AND (addr <= '10.2.0.255'::inet))
         ->  Bitmap Index Scan on inet_addresses_pkey  (cost=0.00..2.17 rows=28 width=0) (actual time=0.109..0.109
rows=127loops=1) 
               Index Cond: ((addr > '10.4.0.0/24'::inet) AND (addr <= '10.4.0.255'::inet))
         ->  Bitmap Index Scan on inet_addresses_pkey  (cost=0.00..2.17 rows=28 width=0) (actual time=0.096..0.096
rows=127loops=1) 
               Index Cond: ((addr > '10.8.0.0/24'::inet) AND (addr <= '10.8.0.255'::inet))
 Total runtime: 1.613 ms


    Instead of specifying explicit address ranges in the query, I'd like
    to store the ranges in a table:


inet_test_db=# \d inet_ranges
   Table "public.inet_ranges"
  Column  |  Type   | Modifiers
----------+---------+-----------
 range    | inet    | not null
 range_id | integer |
Indexes:
    "inet_ranges_pkey" PRIMARY KEY, btree (range)
    "inet_ranges_range_id_idx" btree (range_id)

inet_test_db=# SELECT * FROM inet_ranges;
    range     | range_id
--------------+----------
 10.2.0.0/24  |        1
 10.4.0.0/24  |        1
 10.8.0.0/24  |        1
 10.16.0.0/24 |        2
 10.32.0.0/24 |        2
 10.64.0.0/24 |        2
(6 rows)


    This query is far slower, even though it generates the same result:


EXPLAIN ANALYZE
SELECT *
  FROM inet_addresses as ia, inet_ranges as ir
 WHERE ia.addr << ir.range
   AND ir.range_id=1;

 Nested Loop  (cost=0.00..171485.93 rows=3072574 width=26) (actual time=1465.803..16922.979 rows=381 loops=1)
   Join Filter: ("inner".addr << "outer".range)
   ->  Seq Scan on inet_ranges ir  (cost=0.00..1.07 rows=3 width=15) (actual time=0.008..0.021 rows=3 loops=1)
         Filter: (range_id = 1)
   ->  Seq Scan on inet_addresses ia  (cost=0.00..31556.83 rows=2048383 width=11) (actual time=0.003..2919.405
rows=2048383loops=3) 
 Total runtime: 16923.457 ms


    Even when disabling sequential scans, the query planner is unable to
    make use of the inet_addresses_pkey index:


 Nested Loop  (cost=100033605.21..100171874.11 rows=3072574 width=26) (actual time=2796.928..23453.585 rows=381
loops=1)
   Join Filter: ("inner".addr << "outer".range)
   ->  Index Scan using inet_ranges_range_id_idx on inet_ranges ir  (cost=0.00..3.04 rows=3 width=15) (actual
time=0.069..0.095rows=3 loops=1) 
         Index Cond: (range_id = 1)
   ->  Materialize  (cost=100033605.21..100054089.04 rows=2048383 width=11) (actual time=0.016..5133.349 rows=2048383
loops=3)
         ->  Seq Scan on inet_addresses ia  (cost=100000000.00..100031556.83 rows=2048383 width=11) (actual
time=0.005..2938.012rows=2048383 loops=1) 
 Total runtime: 23521.418 ms


    Is it possible to attain the speed of the first query and the
    flexibility of the second?  Or will I have to resort to generating
    queries of the first form with the range table in the application
    layer?

--
Robert Edmonds
edmonds42@bellsouth.net

Re: performance of implicit join vs. explicit conditions on inet queries?

From
"Qingqing Zhou"
Date:
"Robert Edmonds" <edmonds42@bellsouth.net> wrote
>
> EXPLAIN ANALYZE
> SELECT *
>  FROM inet_addresses
> WHERE addr << inet('10.2.0.0/24')
>    OR addr << inet('10.4.0.0/24')
>    OR addr << inet('10.8.0.0/24');
>
> Bitmap Heap Scan on inet_addresses  (cost=6.51..324.48 rows=1792335
> width=11) (actual time=0.350..1.104 rows=381 loops=1)
>   Recheck Cond: ((addr << '10.2.0.0/24'::inet) OR (addr <<
> '10.4.0.0/24'::inet) OR (addr << '10.8.0.0/24'::inet))
>   Filter: ((addr << '10.2.0.0/24'::inet) OR (addr << '10.4.0.0/24'::inet)
> OR (addr << '10.8.0.0/24'::inet))
>   ->  BitmapOr  (cost=6.51..6.51 rows=85 width=0) (actual
> time=0.336..0.336 rows=0 loops=1)
>         ->  Bitmap Index Scan on inet_addresses_pkey  (cost=0.00..2.17
> rows=28 width=0) (actual time=0.127..0.127 rows=127 loops=1)
>               Index Cond: ((addr > '10.2.0.0/24'::inet) AND (addr <=
> '10.2.0.255'::inet))
>         ->  Bitmap Index Scan on inet_addresses_pkey  (cost=0.00..2.17
> rows=28 width=0) (actual time=0.109..0.109 rows=127 loops=1)
>               Index Cond: ((addr > '10.4.0.0/24'::inet) AND (addr <=
> '10.4.0.255'::inet))
>         ->  Bitmap Index Scan on inet_addresses_pkey  (cost=0.00..2.17
> rows=28 width=0) (actual time=0.096..0.096 rows=127 loops=1)
>               Index Cond: ((addr > '10.8.0.0/24'::inet) AND (addr <=
> '10.8.0.255'::inet))
> Total runtime: 1.613 ms
>
>
>    Instead of specifying explicit address ranges in the query, I'd like
>    to store the ranges in a table:
>
>
> inet_test_db=# \d inet_ranges
>   Table "public.inet_ranges"
>  Column  |  Type   | Modifiers
> ----------+---------+-----------
> range    | inet    | not null
> range_id | integer |
> Indexes:
>    "inet_ranges_pkey" PRIMARY KEY, btree (range)
>    "inet_ranges_range_id_idx" btree (range_id)
>
> inet_test_db=# SELECT * FROM inet_ranges;
>    range     | range_id
> --------------+----------
> 10.2.0.0/24  |        1
> 10.4.0.0/24  |        1
> 10.8.0.0/24  |        1
> 10.16.0.0/24 |        2
> 10.32.0.0/24 |        2
> 10.64.0.0/24 |        2
> (6 rows)
>
>
>
> EXPLAIN ANALYZE
> SELECT *
>  FROM inet_addresses as ia, inet_ranges as ir
> WHERE ia.addr << ir.range
>   AND ir.range_id=1;
>
> Nested Loop  (cost=0.00..171485.93 rows=3072574 width=26) (actual
> time=1465.803..16922.979 rows=381 loops=1)
>   Join Filter: ("inner".addr << "outer".range)
>   ->  Seq Scan on inet_ranges ir  (cost=0.00..1.07 rows=3 width=15)
> (actual time=0.008..0.021 rows=3 loops=1)
>         Filter: (range_id = 1)
>   ->  Seq Scan on inet_addresses ia  (cost=0.00..31556.83 rows=2048383
> width=11) (actual time=0.003..2919.405 rows=2048383 loops=3)
> Total runtime: 16923.457 ms
>

Good illustration. I guess we have a problem of the historgram statistical
information. That is, the historgrams we used can effectively record the
linear space ranges(like ordinary <, >, =), but failed to do it for
nonlinear ranges like inet data type. So the Nested Loop node make an error
in estmating number of rows (est: 3072574, real: 381), thus a sequential
scan is obviously better under this estimation.

I am thinking the historgram problem is not easy to fix, but is there a way
to change Inet type a little bit to make it linear for your range operators?
(for example, align the length to 000.000.000.000/00?)

Regards,
Qingqing




Re: performance of implicit join vs. explicit conditions on inet queries?

From
Tom Lane
Date:
"Qingqing Zhou" <zhouqq@cs.toronto.edu> writes:
> "Robert Edmonds" <edmonds42@bellsouth.net> wrote
>> Instead of specifying explicit address ranges in the query, I'd like
>> to store the ranges in a table:

> Good illustration. I guess we have a problem of the historgram statistical
> information.

No, that's completely irrelevant to his problem.  The reason we can't do
this is that the transformation from "x << const" to a range check on x
is a plan-time transformation; there's no mechanism in place to do it
at runtime.  This is not easy to fix, because the mechanism that's doing
it is primarily intended for LIKE/regex index optimization, and in that
case a runtime pattern might well not be optimizable at all.

            regards, tom lane

Re: performance of implicit join vs. explicit conditions on inet queries?

From
"Qingqing Zhou"
Date:
"Tom Lane" <tgl@sss.pgh.pa.us> wrote
>
> No, that's completely irrelevant to his problem.  The reason we can't do
> this is that the transformation from "x << const" to a range check on x
> is a plan-time transformation; there's no mechanism in place to do it
> at runtime.  This is not easy to fix, because the mechanism that's doing
> it is primarily intended for LIKE/regex index optimization, and in that
> case a runtime pattern might well not be optimizable at all.
>

Not quite understand, sorry ...

(1) For this query (in an as-is PG syntax, which find out all rectangles lie
in a given rectangle) :

SELECT r FROM all_rectangles
  WHERE r << rectangle('(1,9),(9,1)');

If there is a GiST/Rtree index associated with all_rectangles.r, how do
optimizer estimate the cost to decide that we should use this index or
not(then by a seqscan)?

(2) Does your above explaination mean that we can't use GiST for a spatial
join operation?

Regards,
Qingqing



Re: performance of implicit join vs. explicit conditions on inet queries?

From
Tom Lane
Date:
"Qingqing Zhou" <zhouqq@cs.toronto.edu> writes:
> "Tom Lane" <tgl@sss.pgh.pa.us> wrote
>> No, that's completely irrelevant to his problem.  The reason we can't do
>> this is that the transformation from "x << const" to a range check on x
>> is a plan-time transformation; there's no mechanism in place to do it
>> at runtime.

> Not quite understand, sorry ...

> (1) For this query (in an as-is PG syntax, which find out all rectangles lie
> in a given rectangle) :

> SELECT r FROM all_rectangles
>   WHERE r << rectangle('(1,9),(9,1)');

No, you're thinking of the wrong << operator.  The question was about
the inet network inclusion operator.  We have a special case in
indxpath.c to transform "inetcol << inetconstant" into a range check
on the inet variable, much like we can transform a left-anchored LIKE
pattern into a range check on the text variable.

            regards, tom lane