Thread: query planner woes

From: Dmitry Karasik
Date:
hi all,

I've constructed a simple query which takes too long
to finish. EXPLAIN ANALYZE (all data shown below)
reports that at one stage there are >40K rows (on a toy database,
>2M on the real one), and while I understand that this is the
most probable cause of the slowness, I cannot think of any change to
the query to avoid this. Any help?

Thanks,
   Dmitry


mts=# \d addresses
                                    Table "public.addresses"
 Column  |           Type           |                         Modifiers
---------+--------------------------+-----------------------------------------------------------
 id      | integer                  | not null default nextval('public.addresses_id_seq'::text)
 address | text                     | not null
Indexes:
    "addresses_pkey" primary key, btree (id)
    "addresses_idx_address" btree (address)

mts=# \d recipients
                                    Table "public.recipients"
 Column  |           Type           |                         Modifiers
---------+--------------------------+------------------------------------------------------------
 id      | integer                  | not null default nextval('public.recipients_id_seq'::text)
 msgto   | integer                  | not null
Indexes:
    "recipients_pkey" primary key, btree (id)
    "recipients_idx_msgto" btree (msgto)

mts=# explain analyze select *
mts-#  from recipients,addresses
mts-#  where addresses.address ~ '@cat' and
mts-#  recipients.msgto = addresses.id;
                                                      QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=535.70..1603.96 rows=352 width=63)
            (actual time=88.387..422.922 rows=576 loops=1)
   Hash Cond: ("outer".msgto = "inner".id)
   ->  Seq Scan on recipients  (cost=0.00..737.42 rows=43642 width=24)
                               (actual time=0.012..178.258 rows=43642 loops=1)
                                                                ^^^^^
   ->  Hash  (cost=535.34..535.34 rows=147 width=39)
             (actual time=87.866..87.866 rows=0 loops=1)
         ->  Seq Scan on addresses  (cost=0.00..535.34 rows=147 width=39)
                                (actual time=0.958..87.806 rows=3 loops=1)
               Filter: (address ~ '@cat'::text)
 Total runtime: 425.061 ms
(7 rows)

mts=# select count(*) from recipients,addresses where
     addresses.address ~ '@cat' and recipients.msgto = addresses.id;
count
-----
576





Re: query planner woes

From: Tom Lane
Date:
Dmitry Karasik <dmitry@karasik.eu.org> writes:
> I've constructed a simple query which takes too long
> to finish.

> mts=# explain analyze select *
> mts-#  from recipients,addresses
> mts-#  where addresses.address ~ '@cat' and
> mts-#  recipients.msgto = addresses.id;
>                                                       QUERY PLAN
> -----------------------------------------------------------------------------------------------------------------------
>  Hash Join  (cost=535.70..1603.96 rows=352 width=63)
>             (actual time=88.387..422.922 rows=576 loops=1)
>    Hash Cond: ("outer".msgto = "inner".id)
>    ->  Seq Scan on recipients  (cost=0.00..737.42 rows=43642 width=24)
>                                (actual time=0.012..178.258 rows=43642 loops=1)
>                                                                 ^^^^^
>    ->  Hash  (cost=535.34..535.34 rows=147 width=39)
>              (actual time=87.866..87.866 rows=0 loops=1)
>          ->  Seq Scan on addresses  (cost=0.00..535.34 rows=147 width=39)
>                                 (actual time=0.958..87.806 rows=3 loops=1)
>                Filter: (address ~ '@cat'::text)
>  Total runtime: 425.061 ms
> (7 rows)

Given that there are actually only 3 rows in addresses matching '@cat',
it would have been best to use a plan like
    Nestloop
        Seq Scan on addresses
        Index Scan on recipients
            Index Cond: msgto = outer.id
ie, use the index to visit each of those three recipients.  However,
for the 147 rows that the planner was expecting, it's not clear that
the hash plan choice is wrong; and with an even-less-selective address
filter, the hash plan will considerably beat the nestloop.  So ISTM that
the real problem is the inaccuracy of the estimate about how many
rows will be pulled from addresses.
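
One way to check whether the nestloop plan would really win on this data
is to discourage hash joins for the current session and rerun the query.
This is a diagnostic sketch only: enable_hashjoin is a standard planner
setting, but the planner may then pick a merge join rather than the
nestloop, and nothing here is meant as a permanent configuration.

```sql
-- Diagnostic only: make the planner avoid hash joins in this session,
-- then compare the resulting plan and runtime with the original.
SET enable_hashjoin = off;

EXPLAIN ANALYZE
SELECT *
  FROM recipients, addresses
 WHERE addresses.address ~ '@cat'
   AND recipients.msgto = addresses.id;

-- Put the setting back to its default once the comparison is done.
RESET enable_hashjoin;
```

If the planner switches to a merge join instead, setting enable_mergejoin
to off in the same session should leave the nestloop as the remaining
choice for the comparison.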

Unfortunately there's not a lot you can do to really solve that problem
for arbitrary pattern-match queries :-(.  The planner has no stats that
would let it derive a non-guess estimate.

Depending on what your real workload will be like, you might be able to
formulate things so that common cases go faster.  For instance, I gather
that what you're doing here is looking at the site portion of email
addresses?  If you do that a lot, it might be worth splitting the site
portion out and storing it as a separate column.  Then the search
becomes an anchored-left match:
    ... where addresses.site ~ '^cat'
which the planner has a significantly better chance of handling well.
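
A minimal sketch of that split, under stated assumptions: the column name
"site" comes from the example above, the index name addresses_idx_site is
hypothetical, and the site is taken to be everything after the first '@'
in the address. The text_pattern_ops operator class is used so that the
index can serve anchored pattern matches even in a non-C locale.

```sql
-- Assumption: "site" is the part of the address after the first '@'.
ALTER TABLE addresses ADD COLUMN site text;
UPDATE addresses SET site = split_part(address, '@', 2);

-- Hypothetical index name; text_pattern_ops allows left-anchored
-- pattern matches to use the btree index when the locale is not C.
CREATE INDEX addresses_idx_site ON addresses (site text_pattern_ops);

-- Refresh statistics so the planner sees the new column.
ANALYZE addresses;

-- The original join, rewritten as an anchored-left match on site.
EXPLAIN ANALYZE
SELECT *
  FROM recipients, addresses
 WHERE addresses.site ~ '^cat'
   AND recipients.msgto = addresses.id;
```

The new column has to be kept in sync with address on insert and update,
for example from the application or a trigger; that part is left out of
this sketch.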

            regards, tom lane