Thread: Optimizer difference using function index between 7.3 and 7.4

Optimizer difference using function index between 7.3 and 7.4

From
Jeff Boes
Date:
We have a large (several million row) table with a field containing
URLs. Now, funny thing about URLs: they mostly start with a common
substring ("http://www."). But not all the rows start with this, so we
can't just lop off the first N characters. However, we noticed some time
ago that an index on this field wasn't as effective as an index on the
REVERSE of the field. So ...

CREATE OR REPLACE FUNCTION fn_urlrev(text) returns text as '
return reverse(lc($_[0]))
' language 'plperl' with (iscachable,isstrict);

and then

CREATE UNIQUE INDEX ix_links_3 ON links
(fn_urlrev(path_base));

seemed to be much faster. When we have to look up a single entry in
"links", we do so by something like --

SELECT * FROM links WHERE fn_urlrev(path_base) = ?;

and it's rather fast. When we have a bunch of them to do, under 7.3 we
found it useful to create a temporary table, fill it with reversed URLs,
and join:

INSERT INTO temp_link_urls VALUES (fn_urlrev(?));

SELECT l.path_base,l.link_id
        FROM links l
        JOIN temp_link_urls t
        ON (fn_urlrev(l.path_base) = t.rev_path_base);

Here are query plans from the two versions (using a temp table with 200
rows, after ANALYZE on the temp table):

7.3:

# explain select link_id from links l join clm_tmp_links t on
(fn_urlrev(l.path_base) = t.rev_path_base);
                                       QUERY PLAN
-----------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..3936411.13 rows=2000937 width=152)
   ->  Seq Scan on clm_tmp_links t  (cost=0.00..5.00 rows=200 width=74)
   ->  Index Scan using ix_links_3 on links l  (cost=0.00..19531.96
rows=10005 width=78)
         Index Cond: (fn_urlrev(l.path_base) = "outer".rev_path_base)
(4 rows)


7.4:

 # explain select link_id from links l join clm_tmp_links t on
(fn_urlrev(l.path_base) = t.rev_path_base);
                                  QUERY PLAN
------------------------------------------------------------------------------
 Hash Join  (cost=5.50..88832.88 rows=1705551 width=4)
   Hash Cond: (fn_urlrev("outer".path_base) = "inner".rev_path_base)
   ->  Seq Scan on links l  (cost=0.00..50452.50 rows=1705550 width=78)
   ->  Hash  (cost=5.00..5.00 rows=200 width=74)
         ->  Seq Scan on clm_tmp_links t  (cost=0.00..5.00 rows=200
width=74)
(5 rows)

Although the cost for the 7.4 query is lower, the 7.3 plan executes in
about 3 seconds, while the 7.4 plan executes in 59.8 seconds!

Now the odd part: if I change the query to this:

# explain analyze select link_id from links l join clm_tmp_links t on
(fn_urlrev(l.path_base) = fn_urlrev(t.rev_path_base));
                                                              QUERY
PLAN

--------------------------------------------------------------------------------------------------------------------------------------
 Merge Join  (cost=12.64..219974.16 rows=1705551 width=4) (actual
time=17.928..17.928 rows=0 loops=1)
   Merge Cond: (fn_urlrev("outer".path_base) = "inner"."?column2?")
   ->  Index Scan using ix_links_3 on links l  (cost=0.00..173058.87
rows=1705550 width=78) (actual time=0.229..0.285 rows=7 loops=1)
   ->  Sort  (cost=12.64..13.14 rows=200 width=74) (actual
time=9.652..9.871 rows=200 loops=1)
         Sort Key: fn_urlrev(t.rev_path_base)
         ->  Seq Scan on clm_tmp_links t  (cost=0.00..5.00 rows=200
width=74) (actual time=0.166..5.753 rows=200 loops=1)
 Total runtime: 18.125 ms

(i.e., apply the function to the data in the temp table), it runs a
whole lot faster! Is this a bug in the optimizer? Or did something
change about the way functional indexes are used?

--
Jeff Boes                                      vox 269.226.9550 ext 24
Database Engineer                                     fax 269.349.9076
Nexcerpt, Inc.                                 http://www.nexcerpt.com
           ...Nexcerpt... Extend your Expertise


Re: Optimizer difference using function index between 7.3 and 7.4

From
Tom Lane
Date:
Jeff Boes <jboes@nexcerpt.com> writes:
> Is this a bug in the optimizer? Or did something
> change about the way functional indexes are used?

In 7.3, the only possible plan for these queries was a nestloop or
nestloop with inner indexscan, because the planner could not generate
merge or hash joins on join conditions more complex than "var1 = var2".
You were fortunate that a nestloop was fast enough for your situation.

In 7.4 the planner can (as you see) generate both merge and hash options
for this query.  What it's not very good at yet is picking the best
option to use, because it doesn't have any statistics about the
distribution of functional indexes, and so it's pretty much guessing
about selectivity.

As of just a couple days ago, there is code in CVS tip that keeps and
uses stats about the values of functional index columns.  It seems
likely that this would help out tremendously in terms of estimating
the costs well for your problem.  Don't suppose you'd like to try
setting up a test system with your data and trying it ...

BTW, as best I can tell, the amazing speed for the mergejoin is a bit of
a fluke.

>  Merge Join  (cost=12.64..219974.16 rows=1705551 width=4) (actual
> time=17.928..17.928 rows=0 loops=1)
>    Merge Cond: (fn_urlrev("outer".path_base) = "inner"."?column2?")
>    ->  Index Scan using ix_links_3 on links l  (cost=0.00..173058.87
> rows=1705550 width=78) (actual time=0.229..0.285 rows=7 loops=1)
>    ->  Sort  (cost=12.64..13.14 rows=200 width=74) (actual
> time=9.652..9.871 rows=200 loops=1)
>          Sort Key: fn_urlrev(t.rev_path_base)
>          ->  Seq Scan on clm_tmp_links t  (cost=0.00..5.00 rows=200
> width=74) (actual time=0.166..5.753 rows=200 loops=1)
>  Total runtime: 18.125 ms

Notice how the indexscan on links is reporting that it only returned
7 rows.  Ordinarily you'd expect that it'd scan the whole table (and
that's what the cost estimate is expecting).  I think what must have
happened is that the scan stopped only a little way into the table,
because the sequence of values from the temp table ended with a value
that was close to the start of the range of values in the main table.
Mergejoin stops fetching as soon as it exhausts either input table.
This was good luck for you in this case but would likely not hold up
with another set of temp values.

            regards, tom lane

Re: Optimizer difference using function index between 7.3 and 7.4

From
"Simon Riggs"
Date:
>Jeff Boes writes
>  # explain select link_id from links l join clm_tmp_links t on
> (fn_urlrev(l.path_base) = t.rev_path_base);

> executes in 59.8 seconds!

> Now the odd part: if I change the query to this:
>
> # explain analyze select link_id from links l join clm_tmp_links t on
> (fn_urlrev(l.path_base) = fn_urlrev(t.rev_path_base));

>  Total runtime: 18.125 ms
>
> (i.e., apply the function to the data in the temp table), it runs a
> whole lot faster! Is this a bug in the optimizer? Or did something
> change about the way functional indexes are used?

Erm..I may have misunderstood your example, but surely the second
formulation of your query returns the wrong answer? It looks to me as if
you are comparing a reversed URL with a twice-reversed URL; if that's
true that would explain why it runs faster: They don't ever match. Is
that right?

Thanks for the idea of reversing the URLs, nice touch. I'd been thinking
about reverse key indexes as a way of relieving the hotspot down the
rightmost edge of an index during heavy insert traffic. I hadn't thought
this would also speed up the access also.

Best Regards, Simon Riggs