Re: Some regular-expression performance hacking - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Some regular-expression performance hacking
Date
Msg-id 2840118.1613671847@sss.pgh.pa.us
Whole thread Raw
In response to Re: Some regular-expression performance hacking  ("Joel Jacobson" <joel@compiler.org>)
Responses Re: Some regular-expression performance hacking
Re: Some regular-expression performance hacking
List pgsql-hackers
"Joel Jacobson" <joel@compiler.org> writes:
>> I've produced a new dataset which now also includes the regex flags (if
>> any) used for each subject applied to a pattern.

Again, thanks for collecting this data!  I'm a little confused about
how you produced the results in the "tests" table, though.  It sort
of looks like you tried to feed the Javascript flags to regexp_match(),
which unsurprisingly doesn't work all that well.  Even discounting
that, I'm not getting quite the same results, and I don't understand
why not.  So how was that made from the raw "patterns" and "subjects"
tables?

> PostgreSQL 13.2 on x86_64-apple-darwin19.6.0, compiled by Apple clang version 11.0.3 (clang-1103.0.32.62), 64-bit
> Time: 758514.703 ms (12:38.515)
> Time: 755883.600 ms (12:35.884)
> Time: 746522.107 ms (12:26.522)
>
> PostgreSQL 14devel on x86_64-apple-darwin20.3.0, compiled by Apple clang version 12.0.0 (clang-1200.0.32.29), 64-bit
> HEAD (4e703d671)
> Time: 519620.646 ms (08:39.621)
> Time: 518998.366 ms (08:38.998)
> Time: 519696.129 ms (08:39.696)

Hmmm ... we haven't yet committed any performance-relevant changes to the
regex code, so it can't take any credit for this improvement from 13.2 to
HEAD.  I speculate that this is due to some change in our parallelism
stuff (since I observe that this query is producing a parallelized hash
plan).  Still, the next drop to circa 2:21 runtime is impressive enough
by itself.

> Heh, what a funny coincidence:
> The regex I used to shrink the very-long-pattern,
> actually happens to run a lot faster with the patches.

Yeah, that just happens to be a poster child for the MATCHALL idea:

> EXPLAIN ANALYZE SELECT regexp_matches(repeat('a',100000),'^(.{1,80})(.*?)(.{1,80})$');

Each of the parenthesized subexpressions of the RE is successfully
recognized as being MATCHALL, with length range 1..80 for two of them and
0..infinity for the middle one.  That means the engine doesn't have to
physically scan the text to determine whether a possible division point
satisfies the sub-regexp; and that means we can find the correct division
points in O(N) not O(N^2) time.

            regards, tom lane



pgsql-hackers by date:

Previous
From: Pavel Stehule
Date:
Subject: Re: Problem with accessing TOAST data in stored procedures
Next
From: Peter Eisentraut
Date:
Subject: Re: cursor sensitivity misunderstanding