Re: query planner woes - Mailing list pgsql-novice

From Tom Lane
Subject Re: query planner woes
Date
Msg-id 25238.1086882800@sss.pgh.pa.us
Whole thread Raw
In response to query planner woes  (Dmitry Karasik <dmitry@karasik.eu.org>)
List pgsql-novice
Dmitry Karasik <dmitry@karasik.eu.org> writes:
> I've constructed a simple query which takes too long
> to finish.

> mts=# explain analyze select *
> mts-#  from recipients,addresses
> mts-#  where addresses.address ~ '@cat' and
> mts-#  recipients.msgto = addresses.id;
>                                                       QUERY
> PLAN
>
-----------------------------------------------------------------------------------------------------------------------
>  Hash Join  (cost=535.70..1603.96 rows=352 width=63)
>             (actual time=88.387..422.922 rows=576 loops=1)
>    Hash Cond: ("outer".msgto = "inner".id)
>    ->  Seq Scan on recipients  (cost=0.00..737.42 rows=43642 width=24)
>                                (actual time=0.012..178.258 rows=43642
> loops=1)
>                                                                 ^^^^^
>    ->  Hash  (cost=535.34..535.34 rows=147 width=39)
>              (actual time=87.866..87.866 rows=0 loops=1)
>          ->  Seq Scan on addresses  (cost=0.00..535.34 rows=147 width=39)
>                                 (actual time=0.958..87.806 rows=3 loops=1)
>                Filter: (address ~ '@cat'::text)
>  Total runtime: 425.061 ms
> (7 rows)

Given that there are actually only 3 rows in addresses matching '@cat',
it would have been best to use a plan like
    Nestloop
        Seq Scan on addresses
        Index Scan on recipients
            Index Cond: msgto = outer.id
ie, use the index to visit each of those three recipients.  However,
for the 147 rows that the planner was expecting, it's not clear that
the hash plan choice is wrong; and with an even-less-selective address
filter, the hash plan will considerably beat the nestloop.  So ISTM that
the real problem is the inaccuracy of the estimate about how many
rows will be pulled from addresses.

Unfortunately there's not a lot you can do to really solve that problem
for arbitrary pattern-match queries :-(.  The planner has no stats that
would let it derive a non-guess estimate.

Depending on what your real workload will be like, you might be able to
formulate things so that common cases go faster.  For instance, I gather
that what you're doing here is looking at the site portion of email
addresses?  If you do that a lot, it might be worth splitting the site
portion out and storing it as a separate column.  Then the search
becomes an anchored-left match:
    ... where addresses.site ~ '^cat'
which the planner has significantly better chance of handling well.

            regards, tom lane

pgsql-novice by date:

Previous
From: "Edgar Saenz"
Date:
Subject: Help with JSP connection with postgreSQL
Next
From: Eduardo V. Rodríguez
Date:
Subject: Re: Help with JSP connection with postgreSQL