Thread: Regexps - never completing join.

Regexps - never completing join.

From
Rusty Conover
Date:
Hi Guys,

I'm using postgresql 8.3.1 and I'm seeing weird behavior between what
I expect and what's happening when the query is executed

I'm trying to match a table that contains regexps against another
table that is full of the text to match against so my query is:

select wc_rule.id from classifications, wc_rule where
classifications.classification ~* wc_rule.regexp;

When I run that the query takes a very very long time (never ending so
far 20 minutes or so) to execute.

But if I loop through all of the rules and a query for each rule:

select wc_rule.id from classifications, wc_rule where
classifications.classification ~* wc_rule.regexp and wc_rule.id = ?

All of the rules when run individually can be matched in a little
under then 3 minutes.  I'd assume postgres would be equal to or faster
with the single row execution method.

The table schema:

CREATE TABLE wc_rule (
     id integer NOT NULL,
     regexp text,
);

CREATE TABLE classifications (
     id integer NOT NULL,
     classification text NOT NULL
);

gb_render_1_db=# explain  select wc_rule.id from classifications,
wc_rule where classifications.classification ~* wc_rule.regexp;
                                  QUERY PLAN
-----------------------------------------------------------------------------
  Nested Loop  (cost=13.71..891401.71 rows=197843 width=4)
    Join Filter: (classifications.classification ~* wc_rule.regexp)
    ->  Seq Scan on classifications  (cost=0.00..1093.46 rows=56446
width=42)
    ->  Materialize  (cost=13.71..20.72 rows=701 width=22)
          ->  Seq Scan on wc_rule  (cost=0.00..13.01 rows=701 width=22)
(5 rows)


gb_render_1_db=# select count(*) from classifications;
  count
-------
  56446
(1 row)

gb_render_1_db=# select count(*) from wc_rule;
  count
-------
    701
(1 row)

I have exports of the tables up at so you can try it if you'd like.

http://rusty.devel.infogears.com/regexp-tables.tar.bz2

Any insight is greatly appreciated, even if it's just showing me how I
made a mistake in the query.

Thanks,

Rusty
--
Rusty Conover
InfoGears Inc.
http://www.infogears.com


An example script that shows how each rule was run individually in perl.

$dbh->begin_work();
eval {
   my $all_rules = $dbh->selectall_arrayref("select id from wc_rule");
   foreach my $row (@$all_rules) {
     print "Doing rule: $row->[0]\n";
     eval {
       local $SIG{ALRM} = sub { die("Alarm") };
       alarm(5);
       my $results = $dbh->selectall_arrayref("select wc_rule.id from
classifications, wc_rule where classifications.classification ~*
wc_rule.regexp and wc_rule.id = ?", undef, $row->[0]);
       alarm(0);
     };
     if ($@) {
       alarm(0);
       print "Got bad rule id of : $row->[0]\n";
       exit(0);
     }
     alarm(0);
     print "ok rule: $row->[0]\n";
   }
};
if ($@) {
   print "Failed to run rules:\n$@\n";
   $dbh->rollback();
   $dbh->disconnect();
   exit(-1);
}

$dbh->commit();
$dbh->disconnect();
exit(0);








Re: Regexps - never completing join.

From
Rusty Conover
Date:


On May 13, 2008, at 11:45 PM, Rusty Conover wrote:

> Hi Guys,
>
> I'm using postgresql 8.3.1 and I'm seeing weird behavior between
> what I expect and what's happening when the query is executed
>
> I'm trying to match a table that contains regexps against another
> table that is full of the text to match against so my query is:
>
> select wc_rule.id from classifications, wc_rule where
> classifications.classification ~* wc_rule.regexp;
>
> When I run that the query takes a very very long time (never ending
> so far 20 minutes or so) to execute.
>
> But if I loop through all of the rules and a query for each rule:
>
> select wc_rule.id from classifications, wc_rule where
> classifications.classification ~* wc_rule.regexp and wc_rule.id = ?
>
> All of the rules when run individually can be matched in a little
> under then 3 minutes.  I'd assume postgres would be equal to or
> faster with the single row execution method.
>
> The table schema:
>
> CREATE TABLE wc_rule (
>    id integer NOT NULL,
>    regexp text,
> );
>
> CREATE TABLE classifications (
>    id integer NOT NULL,
>    classification text NOT NULL
> );
>
> gb_render_1_db=# explain  select wc_rule.id from classifications,
> wc_rule where classifications.classification ~* wc_rule.regexp;
>                                 QUERY PLAN
> -----------------------------------------------------------------------------
> Nested Loop  (cost=13.71..891401.71 rows=197843 width=4)
>   Join Filter: (classifications.classification ~* wc_rule.regexp)
>   ->  Seq Scan on classifications  (cost=0.00..1093.46 rows=56446
> width=42)
>   ->  Materialize  (cost=13.71..20.72 rows=701 width=22)
>         ->  Seq Scan on wc_rule  (cost=0.00..13.01 rows=701 width=22)
> (5 rows)
>
>


As a followup I did some digging:

by editing:

src/backend/utils/adt/regexp.c

and increasing the cache size for regular expressions to an
arbitrarily large number

#define MAX_CACHED_RES  3200

Rather then the default of

#define MAX_CACHED_RES  32

I was able to get the query to complete in a respectable amount of time:

gb_render_1_db=# explain analyze  select wc_rule.id from
classifications, wc_rule where classifications.classification ~*
wc_rule.regexp;
                                                           QUERY PLAN

------------------------------------------------------------------------------------------------------------------------------
  Nested Loop  (cost=13.71..891401.71 rows=197843 width=4) (actual
time=72.714..366899.913 rows=55052 loops=1)
    Join Filter: (classifications.classification ~* wc_rule.regexp)
    ->  Seq Scan on classifications  (cost=0.00..1093.46 rows=56446
width=42) (actual time=28.820..109.895 rows=56446 loops=1)
    ->  Materialize  (cost=13.71..20.72 rows=701 width=22) (actual
time=0.000..0.193 rows=701 loops=56446)
          ->  Seq Scan on wc_rule  (cost=0.00..13.01 rows=701
width=22) (actual time=0.030..0.593 rows=701 loops=1)
  Total runtime: 366916.632 ms
(6 rows)

Which is still > 6 minutes, but at least it completed.

I'll keep digging into what is causing this bad performance.

Thanks,

Rusty
--
Rusty Conover
InfoGears Inc.
http://www.infogears.com


Re: Regexps - never completing join.

From
Rusty Conover
Date:
Returning to this problem this morning, I made some more insight.

The regexp cache isn't getting very many hits because the executor is
looping through all of the classification rows then looping through
all of the regular expressions, causing each expression to be
recompiled every time since the cache limit is only for 32 cached
regular expressions.  You can think of the behavior like:

foreach classification {
   foreach regexp {
     do match
   }
}

Obviously to make this perform better without requiring a bigger
regexp cache I'd like it to run like:

foreach regexp {
   foreach classification {
     do match
   }
}

That way the cache wouldn't have to be very big at all since the last
used regular expression would be at the top of the cache.

Various methods of changing the query don't seem to have the desired
effect.  Even with setting join_collapse_limit to 1.

select wc_rule.id from wc_rule cross join classifications on
classifications.classification ~* wc_rule.regexp;

                                  QUERY PLAN
-----------------------------------------------------------------------------
  Nested Loop  (cost=13.71..891401.71 rows=197843 width=4)
    Join Filter: (classifications.classification ~* wc_rule.regexp)
    ->  Seq Scan on classifications  (cost=0.00..1093.46 rows=56446
width=42)
    ->  Materialize  (cost=13.71..20.72 rows=701 width=22)
          ->  Seq Scan on wc_rule  (cost=0.00..13.01 rows=701 width=22)
(5 rows)



select wc_rule.id from classifications cross join wc_rule on
classifications.classification ~* wc_rule.regexp;

                                  QUERY PLAN
-----------------------------------------------------------------------------
  Nested Loop  (cost=13.71..891401.71 rows=197843 width=4)
    Join Filter: (classifications.classification ~* wc_rule.regexp)
    ->  Seq Scan on classifications  (cost=0.00..1093.46 rows=56446
width=42)
    ->  Materialize  (cost=13.71..20.72 rows=701 width=22)
          ->  Seq Scan on wc_rule  (cost=0.00..13.01 rows=701 width=22)
(5 rows)



Both of those queries execute in the same looping order, there doesn't
seem to be a control to say use this table as the inner table and this
table as the outer table for the join that I could find.

One way I did find that worked to control the loop (but doesn't yield
the same results because its a left join)

select wc_rule.id from wc_rule left join classifications on
classifications.classification ~* wc_rule.regexp;

                                                             QUERY PLAN

----------------------------------------------------------------------------------------------------------------------------------
  Nested Loop Left Join  (cost=1149.91..891457.45 rows=197843 width=4)
(actual time=0.627..149051.505 rows=55126 loops=1)
    Join Filter: (classifications.classification ~* wc_rule.regexp)
    ->  Seq Scan on wc_rule  (cost=0.00..13.01 rows=701 width=22)
(actual time=0.030..1.272 rows=701 loops=1)
    ->  Materialize  (cost=1149.91..1714.37 rows=56446 width=42)
(actual time=0.001..14.244 rows=56446 loops=701)
          ->  Seq Scan on classifications  (cost=0.00..1093.46
rows=56446 width=42) (actual time=0.022..29.913 rows=56446 loops=1)
  Total runtime: 149067.764 ms
(6 rows)

Thanks,

Rusty
--
Rusty Conover
InfoGears Inc.
http://www.infogears.com







Re: Regexps - never completing join.

From
"Scott Marlowe"
Date:
On Wed, May 14, 2008 at 9:33 AM, Rusty Conover <rconover@infogears.com> wrote:
> Returning to this problem this morning, I made some more insight.
>
> One way I did find that worked to control the loop (but doesn't yield the
> same results because its a left join)
>
> select wc_rule.id from wc_rule left join classifications on
> classifications.classification ~* wc_rule.regexp;

If you do that and exclude the extra rows added to the right with somthing like

and wc_rule.somefield IS NOT NULL

does it run fast and give you the same answers as the regular join?

I'm guessing that this could be optimized to use a hash agg method of
joining for text, but I'm no expert on the subject.

Re: Regexps - never completing join.

From
Rusty Conover
Date:
On May 16, 2008, at 2:35 PM, Scott Marlowe wrote:

> On Wed, May 14, 2008 at 9:33 AM, Rusty Conover
> <rconover@infogears.com> wrote:
>> Returning to this problem this morning, I made some more insight.
>>
>> One way I did find that worked to control the loop (but doesn't
>> yield the
>> same results because its a left join)
>>
>> select wc_rule.id from wc_rule left join classifications on
>> classifications.classification ~* wc_rule.regexp;
>
> If you do that and exclude the extra rows added to the right with
> somthing like
>
> and wc_rule.somefield IS NOT NULL
>
> does it run fast and give you the same answers as the regular join?
>
> I'm guessing that this could be optimized to use a hash agg method of
> joining for text, but I'm no expert on the subject.

Hi Scott,

It's not really a hash agg problem really just a looping inside/
outside table selection problem.

The slowdown is really the compilation of the regexp repeatedly by
RE_compile_and_cache() because the regexps are being run on the inside
of the loop rather then the outside.  And since the regexp cache is
only 32 items big, the every match is resulting in a recompilation of
the regexp since I have about 700 regexps.

Thanks,

Rusty
--
Rusty Conover
InfoGears Inc.
http://www.infogears.com



Re: Regexps - never completing join.

From
"Scott Marlowe"
Date:
On Fri, May 16, 2008 at 3:37 PM, Rusty Conover <rconover@infogears.com> wrote:
>
> On May 16, 2008, at 2:35 PM, Scott Marlowe wrote:
>
>> On Wed, May 14, 2008 at 9:33 AM, Rusty Conover <rconover@infogears.com>
>> wrote:
>>>
>>> Returning to this problem this morning, I made some more insight.
>>>
>>> One way I did find that worked to control the loop (but doesn't yield the
>>> same results because its a left join)
>>>
>>> select wc_rule.id from wc_rule left join classifications on
>>> classifications.classification ~* wc_rule.regexp;
>>
>> If you do that and exclude the extra rows added to the right with somthing
>> like
>>
>> and wc_rule.somefield IS NOT NULL
>>
>> does it run fast and give you the same answers as the regular join?
>>
>> I'm guessing that this could be optimized to use a hash agg method of
>> joining for text, but I'm no expert on the subject.
>
> Hi Scott,
>
> It's not really a hash agg problem really just a looping inside/outside
> table selection problem.
>
> The slowdown is really the compilation of the regexp repeatedly by
> RE_compile_and_cache() because the regexps are being run on the inside of
> the loop rather then the outside.  And since the regexp cache is only 32
> items big, the every match is resulting in a recompilation of the regexp
> since I have about 700 regexps.

That's not what I meant.  What I meant was it seems like a good
candidate for a hash aggregate solution.  I'm pretty sure pgsql can't
use hashagg for something like this right now.

If you hashagged each regexp and each column fed through it, you could
probably get good performance.  but that's a backend hacker thing, not
something I'd know how to do.