Re: Regexps - never completing join. - Mailing list pgsql-performance

From Rusty Conover
Subject Re: Regexps - never completing join.
Date
Msg-id DEA15B6C-2C0B-4139-964E-2BC6FFFA0F12@infogears.com
Whole thread Raw
In response to Regexps - never completing join.  (Rusty Conover <rconover@infogears.com>)
Responses Re: Regexps - never completing join.
List pgsql-performance


On May 13, 2008, at 11:45 PM, Rusty Conover wrote:

> Hi Guys,
>
> I'm using postgresql 8.3.1 and I'm seeing weird behavior between
> what I expect and what's happening when the query is executed
>
> I'm trying to match a table that contains regexps against another
> table that is full of the text to match against so my query is:
>
> select wc_rule.id from classifications, wc_rule where
> classifications.classification ~* wc_rule.regexp;
>
> When I run that the query takes a very very long time (never ending
> so far 20 minutes or so) to execute.
>
> But if I loop through all of the rules and a query for each rule:
>
> select wc_rule.id from classifications, wc_rule where
> classifications.classification ~* wc_rule.regexp and wc_rule.id = ?
>
> All of the rules when run individually can be matched in a little
> under then 3 minutes.  I'd assume postgres would be equal to or
> faster with the single row execution method.
>
> The table schema:
>
> CREATE TABLE wc_rule (
>    id integer NOT NULL,
>    regexp text,
> );
>
> CREATE TABLE classifications (
>    id integer NOT NULL,
>    classification text NOT NULL
> );
>
> gb_render_1_db=# explain  select wc_rule.id from classifications,
> wc_rule where classifications.classification ~* wc_rule.regexp;
>                                 QUERY PLAN
> -----------------------------------------------------------------------------
> Nested Loop  (cost=13.71..891401.71 rows=197843 width=4)
>   Join Filter: (classifications.classification ~* wc_rule.regexp)
>   ->  Seq Scan on classifications  (cost=0.00..1093.46 rows=56446
> width=42)
>   ->  Materialize  (cost=13.71..20.72 rows=701 width=22)
>         ->  Seq Scan on wc_rule  (cost=0.00..13.01 rows=701 width=22)
> (5 rows)
>
>


As a followup I did some digging:

by editing:

src/backend/utils/adt/regexp.c

and increasing the cache size for regular expressions to an
arbitrarily large number

#define MAX_CACHED_RES  3200

Rather then the default of

#define MAX_CACHED_RES  32

I was able to get the query to complete in a respectable amount of time:

gb_render_1_db=# explain analyze  select wc_rule.id from
classifications, wc_rule where classifications.classification ~*
wc_rule.regexp;
                                                           QUERY PLAN

------------------------------------------------------------------------------------------------------------------------------
  Nested Loop  (cost=13.71..891401.71 rows=197843 width=4) (actual
time=72.714..366899.913 rows=55052 loops=1)
    Join Filter: (classifications.classification ~* wc_rule.regexp)
    ->  Seq Scan on classifications  (cost=0.00..1093.46 rows=56446
width=42) (actual time=28.820..109.895 rows=56446 loops=1)
    ->  Materialize  (cost=13.71..20.72 rows=701 width=22) (actual
time=0.000..0.193 rows=701 loops=56446)
          ->  Seq Scan on wc_rule  (cost=0.00..13.01 rows=701
width=22) (actual time=0.030..0.593 rows=701 loops=1)
  Total runtime: 366916.632 ms
(6 rows)

Which is still > 6 minutes, but at least it completed.

I'll keep digging into what is causing this bad performance.

Thanks,

Rusty
--
Rusty Conover
InfoGears Inc.
http://www.infogears.com


pgsql-performance by date:

Previous
From: Rusty Conover
Date:
Subject: Regexps - never completing join.
Next
From: PFC
Date:
Subject: Re: can I move sort to first outer join ?