Thread: performance of implicit join vs. explicit conditions on inet queries?
The preliminaries: - PostgreSQL 8.1 beta 3, Debian experimental - database has been VACUUMed FULL ANALYZE. - a pg_dump -Fc exists at http://199.77.129.48/inet_test.db - ia32 hardware with 2 GB physical memory and the following settings: shared_buffers = 40960 temp_buffers = 16384 work_mem = 131072 maintenance_work_mem = 262144 effective_cache_size = 65536 I've populated a table evenly with about 2 million rows of RFC 1918 addresses: Table "public.inet_addresses" Column | Type | Modifiers --------+------+----------- addr | inet | not null Indexes: "inet_addresses_pkey" PRIMARY KEY, btree (addr) The following query is very fast: EXPLAIN ANALYZE SELECT * FROM inet_addresses WHERE addr << inet('10.2.0.0/24') OR addr << inet('10.4.0.0/24') OR addr << inet('10.8.0.0/24'); Bitmap Heap Scan on inet_addresses (cost=6.51..324.48 rows=1792335 width=11) (actual time=0.350..1.104 rows=381 loops=1) Recheck Cond: ((addr << '10.2.0.0/24'::inet) OR (addr << '10.4.0.0/24'::inet) OR (addr << '10.8.0.0/24'::inet)) Filter: ((addr << '10.2.0.0/24'::inet) OR (addr << '10.4.0.0/24'::inet) OR (addr << '10.8.0.0/24'::inet)) -> BitmapOr (cost=6.51..6.51 rows=85 width=0) (actual time=0.336..0.336 rows=0 loops=1) -> Bitmap Index Scan on inet_addresses_pkey (cost=0.00..2.17 rows=28 width=0) (actual time=0.127..0.127 rows=127loops=1) Index Cond: ((addr > '10.2.0.0/24'::inet) AND (addr <= '10.2.0.255'::inet)) -> Bitmap Index Scan on inet_addresses_pkey (cost=0.00..2.17 rows=28 width=0) (actual time=0.109..0.109 rows=127loops=1) Index Cond: ((addr > '10.4.0.0/24'::inet) AND (addr <= '10.4.0.255'::inet)) -> Bitmap Index Scan on inet_addresses_pkey (cost=0.00..2.17 rows=28 width=0) (actual time=0.096..0.096 rows=127loops=1) Index Cond: ((addr > '10.8.0.0/24'::inet) AND (addr <= '10.8.0.255'::inet)) Total runtime: 1.613 ms Instead of specifying explicit address ranges in the query, I'd like to store the ranges in a table: inet_test_db=# \d inet_ranges Table "public.inet_ranges" Column | Type | Modifiers ----------+---------+----------- range | inet | not null range_id | integer | Indexes: "inet_ranges_pkey" PRIMARY KEY, btree (range) "inet_ranges_range_id_idx" btree (range_id) inet_test_db=# SELECT * FROM inet_ranges; range | range_id --------------+---------- 10.2.0.0/24 | 1 10.4.0.0/24 | 1 10.8.0.0/24 | 1 10.16.0.0/24 | 2 10.32.0.0/24 | 2 10.64.0.0/24 | 2 (6 rows) This query is far slower, even though it generates the same result: EXPLAIN ANALYZE SELECT * FROM inet_addresses as ia, inet_ranges as ir WHERE ia.addr << ir.range AND ir.range_id=1; Nested Loop (cost=0.00..171485.93 rows=3072574 width=26) (actual time=1465.803..16922.979 rows=381 loops=1) Join Filter: ("inner".addr << "outer".range) -> Seq Scan on inet_ranges ir (cost=0.00..1.07 rows=3 width=15) (actual time=0.008..0.021 rows=3 loops=1) Filter: (range_id = 1) -> Seq Scan on inet_addresses ia (cost=0.00..31556.83 rows=2048383 width=11) (actual time=0.003..2919.405 rows=2048383loops=3) Total runtime: 16923.457 ms Even when disabling sequential scans, the query planner is unable to make use of the inet_addresses_pkey index: Nested Loop (cost=100033605.21..100171874.11 rows=3072574 width=26) (actual time=2796.928..23453.585 rows=381 loops=1) Join Filter: ("inner".addr << "outer".range) -> Index Scan using inet_ranges_range_id_idx on inet_ranges ir (cost=0.00..3.04 rows=3 width=15) (actual time=0.069..0.095rows=3 loops=1) Index Cond: (range_id = 1) -> Materialize (cost=100033605.21..100054089.04 rows=2048383 width=11) (actual time=0.016..5133.349 rows=2048383 loops=3) -> Seq Scan on inet_addresses ia (cost=100000000.00..100031556.83 rows=2048383 width=11) (actual time=0.005..2938.012rows=2048383 loops=1) Total runtime: 23521.418 ms Is it possible to attain the speed of the first query and the flexibility of the second? Or will I have to resort to generating queries of the first form with the range table in the application layer? -- Robert Edmonds edmonds42@bellsouth.net
Re: performance of implicit join vs. explicit conditions on inet queries?
From
"Qingqing Zhou"
Date:
"Robert Edmonds" <edmonds42@bellsouth.net> wrote > > EXPLAIN ANALYZE > SELECT * > FROM inet_addresses > WHERE addr << inet('10.2.0.0/24') > OR addr << inet('10.4.0.0/24') > OR addr << inet('10.8.0.0/24'); > > Bitmap Heap Scan on inet_addresses (cost=6.51..324.48 rows=1792335 > width=11) (actual time=0.350..1.104 rows=381 loops=1) > Recheck Cond: ((addr << '10.2.0.0/24'::inet) OR (addr << > '10.4.0.0/24'::inet) OR (addr << '10.8.0.0/24'::inet)) > Filter: ((addr << '10.2.0.0/24'::inet) OR (addr << '10.4.0.0/24'::inet) > OR (addr << '10.8.0.0/24'::inet)) > -> BitmapOr (cost=6.51..6.51 rows=85 width=0) (actual > time=0.336..0.336 rows=0 loops=1) > -> Bitmap Index Scan on inet_addresses_pkey (cost=0.00..2.17 > rows=28 width=0) (actual time=0.127..0.127 rows=127 loops=1) > Index Cond: ((addr > '10.2.0.0/24'::inet) AND (addr <= > '10.2.0.255'::inet)) > -> Bitmap Index Scan on inet_addresses_pkey (cost=0.00..2.17 > rows=28 width=0) (actual time=0.109..0.109 rows=127 loops=1) > Index Cond: ((addr > '10.4.0.0/24'::inet) AND (addr <= > '10.4.0.255'::inet)) > -> Bitmap Index Scan on inet_addresses_pkey (cost=0.00..2.17 > rows=28 width=0) (actual time=0.096..0.096 rows=127 loops=1) > Index Cond: ((addr > '10.8.0.0/24'::inet) AND (addr <= > '10.8.0.255'::inet)) > Total runtime: 1.613 ms > > > Instead of specifying explicit address ranges in the query, I'd like > to store the ranges in a table: > > > inet_test_db=# \d inet_ranges > Table "public.inet_ranges" > Column | Type | Modifiers > ----------+---------+----------- > range | inet | not null > range_id | integer | > Indexes: > "inet_ranges_pkey" PRIMARY KEY, btree (range) > "inet_ranges_range_id_idx" btree (range_id) > > inet_test_db=# SELECT * FROM inet_ranges; > range | range_id > --------------+---------- > 10.2.0.0/24 | 1 > 10.4.0.0/24 | 1 > 10.8.0.0/24 | 1 > 10.16.0.0/24 | 2 > 10.32.0.0/24 | 2 > 10.64.0.0/24 | 2 > (6 rows) > > > > EXPLAIN ANALYZE > SELECT * > FROM inet_addresses as ia, inet_ranges as ir > WHERE ia.addr << ir.range > AND ir.range_id=1; > > Nested Loop (cost=0.00..171485.93 rows=3072574 width=26) (actual > time=1465.803..16922.979 rows=381 loops=1) > Join Filter: ("inner".addr << "outer".range) > -> Seq Scan on inet_ranges ir (cost=0.00..1.07 rows=3 width=15) > (actual time=0.008..0.021 rows=3 loops=1) > Filter: (range_id = 1) > -> Seq Scan on inet_addresses ia (cost=0.00..31556.83 rows=2048383 > width=11) (actual time=0.003..2919.405 rows=2048383 loops=3) > Total runtime: 16923.457 ms > Good illustration. I guess we have a problem of the historgram statistical information. That is, the historgrams we used can effectively record the linear space ranges(like ordinary <, >, =), but failed to do it for nonlinear ranges like inet data type. So the Nested Loop node make an error in estmating number of rows (est: 3072574, real: 381), thus a sequential scan is obviously better under this estimation. I am thinking the historgram problem is not easy to fix, but is there a way to change Inet type a little bit to make it linear for your range operators? (for example, align the length to 000.000.000.000/00?) Regards, Qingqing
"Qingqing Zhou" <zhouqq@cs.toronto.edu> writes: > "Robert Edmonds" <edmonds42@bellsouth.net> wrote >> Instead of specifying explicit address ranges in the query, I'd like >> to store the ranges in a table: > Good illustration. I guess we have a problem of the historgram statistical > information. No, that's completely irrelevant to his problem. The reason we can't do this is that the transformation from "x << const" to a range check on x is a plan-time transformation; there's no mechanism in place to do it at runtime. This is not easy to fix, because the mechanism that's doing it is primarily intended for LIKE/regex index optimization, and in that case a runtime pattern might well not be optimizable at all. regards, tom lane
Re: performance of implicit join vs. explicit conditions on inet queries?
From
"Qingqing Zhou"
Date:
"Tom Lane" <tgl@sss.pgh.pa.us> wrote > > No, that's completely irrelevant to his problem. The reason we can't do > this is that the transformation from "x << const" to a range check on x > is a plan-time transformation; there's no mechanism in place to do it > at runtime. This is not easy to fix, because the mechanism that's doing > it is primarily intended for LIKE/regex index optimization, and in that > case a runtime pattern might well not be optimizable at all. > Not quite understand, sorry ... (1) For this query (in an as-is PG syntax, which find out all rectangles lie in a given rectangle) : SELECT r FROM all_rectangles WHERE r << rectangle('(1,9),(9,1)'); If there is a GiST/Rtree index associated with all_rectangles.r, how do optimizer estimate the cost to decide that we should use this index or not(then by a seqscan)? (2) Does your above explaination mean that we can't use GiST for a spatial join operation? Regards, Qingqing
"Qingqing Zhou" <zhouqq@cs.toronto.edu> writes: > "Tom Lane" <tgl@sss.pgh.pa.us> wrote >> No, that's completely irrelevant to his problem. The reason we can't do >> this is that the transformation from "x << const" to a range check on x >> is a plan-time transformation; there's no mechanism in place to do it >> at runtime. > Not quite understand, sorry ... > (1) For this query (in an as-is PG syntax, which find out all rectangles lie > in a given rectangle) : > SELECT r FROM all_rectangles > WHERE r << rectangle('(1,9),(9,1)'); No, you're thinking of the wrong << operator. The question was about the inet network inclusion operator. We have a special case in indxpath.c to transform "inetcol << inetconstant" into a range check on the inet variable, much like we can transform a left-anchored LIKE pattern into a range check on the text variable. regards, tom lane