Thread: pg_trgm indexes giving bad estimations?
From: Ben
Date:
I've got another query I'm trying to optimize:

select aj.album
from public.track t
join public.albumjoin aj on (aj.track = t.id)
join (select id from public.albummeta am
      where tracks between 10 and 14) lam
  on (lam.id = aj.album)
where (t.name % '01New OrderEvil Dust' or t.name % '04OrbitalOpen Mind')
group by aj.album
having count(*) >= 9.6;

This gives an expensive (but still reasonable) plan of:

HashAggregate  (cost=76523.64..76602.25 rows=4492 width=4)
  Filter: ((count(*))::numeric >= 9.6)
  ->  Hash Join  (cost=63109.73..76501.18 rows=4492 width=4)
        Hash Cond: ("outer".id = "inner".album)
        ->  Bitmap Heap Scan on albummeta am  (cost=1810.10..9995.34 rows=187683 width=4)
              Recheck Cond: ((tracks >= 10) AND (tracks <= 14))
              ->  Bitmap Index Scan on albummeta_tracks_index  (cost=0.00..1810.10 rows=187683 width=0)
                    Index Cond: ((tracks >= 10) AND (tracks <= 14))
        ->  Hash  (cost=61274.03..61274.03 rows=10243 width=4)
              ->  Nested Loop  (cost=163.87..61274.03 rows=10243 width=4)
                    ->  Bitmap Heap Scan on track t  (cost=163.87..28551.33 rows=10243 width=4)
                          Recheck Cond: (((name)::text % '01New OrderEvil Dust'::text) OR ((name)::text % '04OrbitalOpen Mind'::text))
                          ->  BitmapOr  (cost=163.87..163.87 rows=10248 width=0)
                                ->  Bitmap Index Scan on track_name_trgm_idx  (cost=0.00..81.93 rows=5124 width=0)
                                      Index Cond: ((name)::text % '01New OrderEvil Dust'::text)
                                ->  Bitmap Index Scan on track_name_trgm_idx  (cost=0.00..81.93 rows=5124 width=0)
                                      Index Cond: ((name)::text % '04OrbitalOpen Mind'::text)
                    ->  Index Scan using albumjoin_trackindex on albumjoin aj  (cost=0.00..3.18 rows=1 width=8)
                          Index Cond: (aj.track = "outer".id)
(19 rows)

Unfortunately, when I modify this example to use a more typical
number of trigram searches OR'd together (anywhere from 10 to 20),
the planner thinks the bitmap heap scan on track t will return a lot
of rows, and so reverts to a sequential scan of albumjoin for the
next join. That would make sense... if there actually were a lot of
rows returned by the bitmap index scans. But here is where the
planner gets it really wrong, if I'm reading it right.

It seems to think both of my index scans will return 5124 rows when,
in reality, they return far fewer:

select count(*) from public.track where name % '01New OrderEvil Dust';
count
-------
     20
(1 row)

select count(*) from public.track where name % '04OrbitalOpen Mind';
count
-------
    123
(1 row)
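(As an aside, the estimate-vs-actual gap can be seen directly in a single plan; a sketch, using the same table and predicate as above:)

```sql
-- "rows=" in the plan line is the planner's estimate;
-- "actual rows=" is what the scan really produced.
explain analyze
select count(*)
from public.track
where name % '01New OrderEvil Dust';
```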


How can I get the planner to not expect so many rows to be returned?
A possibly related question: because pg_trgm lets me set the
matching threshold of the % operator, how does that affect the planner?
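(For reference, the threshold I mean is the one the pg_trgm contrib module exposes through its show_limit()/set_limit() functions; a sketch:)

```sql
-- Show the current similarity threshold used by the % operator
-- (0.3 by default), then raise it so % matches fewer rows.
select show_limit();
select set_limit(0.5);
```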

Re: pg_trgm indexes giving bad estimations?

From: Tom Lane
Date:
Ben <bench@silentmedia.com> writes:
> How can I get the planner to not expect so many rows to be returned?

Write an estimation function for the pg_trgm operator(s).  (Send in a
patch if you do!)  I see that % is using "contsel" which is only a stub,
and would likely be wrong for % even if it weren't.
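(For context, that binding is made when the operator is declared; the pg_trgm declaration of % looks roughly like this -- a sketch, not the verbatim contrib source:)

```sql
-- A selectivity-aware % would name a custom estimator
-- function in place of contsel here.
CREATE OPERATOR % (
    LEFTARG = text,
    RIGHTARG = text,
    PROCEDURE = similarity_op,
    COMMUTATOR = '%',
    RESTRICT = contsel,
    JOIN = contjoinsel
);
```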

> A possibly related question: because pg_trgm lets me set the
> matching threshold of the % operator, how does that affect the planner?

It hasn't a clue about that.

            regards, tom lane

Re: pg_trgm indexes giving bad estimations?

From: Ben
Date:
Now that I have a little time to work on this again, I've thought about it,
and it seems that an easy and somewhat accurate cop-out is to use
whatever the selectivity function for the LIKE operator would be,
multiplied by a scalar that pg_trgm should already have access to.

Unfortunately, it's not at all clear to me from reading
http://www.postgresql.org/docs/8.1/interactive/xoper-optimization.html#AEN33077
how LIKE implements selectivity. Any pointers on where to look?

On Wed, 4 Oct 2006, Tom Lane wrote:

> Ben <bench@silentmedia.com> writes:
>> How can I get the planner to not expect so many rows to be returned?
>
> Write an estimation function for the pg_trgm operator(s).  (Send in a
> patch if you do!)  I see that % is using "contsel" which is only a stub,
> and would likely be wrong for % even if it weren't.
>
>> A possibly related question: because pg_trgm lets me set the
>> matching threshold of the % operator, how does that affect the planner?
>
> It hasn't a clue about that.
>
>             regards, tom lane
>

Re: pg_trgm indexes giving bad estimations?

From: Tom Lane
Date:
Ben <bench@silentmedia.com> writes:
> Unfortunately, it's not at all clear to me from reading
> http://www.postgresql.org/docs/8.1/interactive/xoper-optimization.html#AEN33077
> how LIKE implements selectivity. Any pointers on where to look?

likesel() and subsidiary functions in src/backend/utils/adt/selfuncs.c

            regards, tom lane