Home > mailing lists

Re: Some regular-expression performance hacking - Mailing list pgsql-hackers

From	Tom Lane
Subject	Re: Some regular-expression performance hacking
Date	February 18, 2021 18:10:47
Msg-id	2840118.1613671847@sss.pgh.pa.us Whole thread Raw
In response to	Re: Some regular-expression performance hacking ("Joel Jacobson" <joel@compiler.org>)
Responses	Re: Some regular-expression performance hacking Re: Some regular-expression performance hacking
List	pgsql-hackers

Tree view

"Joel Jacobson" <joel@compiler.org> writes:
>> I've produced a new dataset which now also includes the regex flags (if
>> any) used for each subject applied to a pattern.

Again, thanks for collecting this data!  I'm a little confused about
how you produced the results in the "tests" table, though.  It sort
of looks like you tried to feed the Javascript flags to regexp_match(),
which unsurprisingly doesn't work all that well.  Even discounting
that, I'm not getting quite the same results, and I don't understand
why not.  So how was that made from the raw "patterns" and "subjects"
tables?

> PostgreSQL 13.2 on x86_64-apple-darwin19.6.0, compiled by Apple clang version 11.0.3 (clang-1103.0.32.62), 64-bit
> Time: 758514.703 ms (12:38.515)
> Time: 755883.600 ms (12:35.884)
> Time: 746522.107 ms (12:26.522)
>
> PostgreSQL 14devel on x86_64-apple-darwin20.3.0, compiled by Apple clang version 12.0.0 (clang-1200.0.32.29), 64-bit
> HEAD (4e703d671)
> Time: 519620.646 ms (08:39.621)
> Time: 518998.366 ms (08:38.998)
> Time: 519696.129 ms (08:39.696)

Hmmm ... we haven't yet committed any performance-relevant changes to the
regex code, so it can't take any credit for this improvement from 13.2 to
HEAD.  I speculate that this is due to some change in our parallelism
stuff (since I observe that this query is producing a parallelized hash
plan).  Still, the next drop to circa 2:21 runtime is impressive enough
by itself.

> Heh, what a funny coincidence:
> The regex I used to shrink the very-long-pattern,
> actually happens to run a lot faster with the patches.

Yeah, that just happens to be a poster child for the MATCHALL idea:

> EXPLAIN ANALYZE SELECT regexp_matches(repeat('a',100000),'^(.{1,80})(.*?)(.{1,80})$');

Each of the parenthesized subexpressions of the RE is successfully
recognized as being MATCHALL, with length range 1..80 for two of them and
0..infinity for the middle one.  That means the engine doesn't have to
physically scan the text to determine whether a possible division point
satisfies the sub-regexp; and that means we can find the correct division
points in O(N) not O(N^2) time.

            regards, tom lane

pgsql-hackers by date:

From: Pavel Stehule
Date: 18 February 2021, 17:25:53
Subject: Re: Problem with accessing TOAST data in stored procedures

From: Peter Eisentraut
Date: 18 February 2021, 18:14:30
Subject: Re: cursor sensitivity misunderstanding

Re: Some regular-expression performance hacking - Mailing list pgsql-hackers

Previous

Next