Re: Some regular-expression performance hacking - Mailing list pgsql-hackers

From Joel Jacobson
Subject Re: Some regular-expression performance hacking
Date
Msg-id d8899c66-5492-43e9-8eff-a07d1d593ad0@www.fastmail.com
In response to Re: Some regular-expression performance hacking  (Tom Lane <tgl@sss.pgh.pa.us>)
On Thu, Feb 18, 2021, at 19:10, Tom Lane wrote:
"Joel Jacobson" <joel@compiler.org> writes:
>> I've produced a new dataset which now also includes the regex flags (if
>> any) used for each subject applied to a pattern.

> Again, thanks for collecting this data!  I'm a little confused about
> how you produced the results in the "tests" table, though.  It sort
> of looks like you tried to feed the Javascript flags to regexp_match(),
> which unsurprisingly doesn't work all that well.

That's exactly what I did. Some of the flags work the same between Javascript and PostgreSQL, others don't.

I thought maybe something interesting would surface in just trying them blindly.

Flags that aren't supported and give errors are reported as tests where error is not null.

Most patterns have no flags, and the second most popular is just the "i" flag, which should work the same.

SELECT flags, COUNT(*) FROM patterns GROUP BY 1 ORDER BY 2 DESC;
flags | count
-------+--------
       | 151927
i     | 120336
gi    |  26057
g     |  13263
gm    |   4606
gim   |    699
im    |    491
y     |    367
m     |    365
gy    |    105
u     |     50
giy   |     38
giu   |     20
gimu  |     14
iy    |     11
iu    |      6
gimy  |      3
gu    |      2
gmy   |      2
imy   |      1
my    |      1
(21 rows)

This query shows which Javascript regex flags could be used as-is without errors:

SELECT
  patterns.flags,
  COUNT(*)
FROM tests
JOIN subjects ON subjects.subject_id = tests.subject_id
JOIN patterns ON patterns.pattern_id = subjects.pattern_id
WHERE tests.error IS NULL
GROUP BY 1
ORDER BY 2;

flags |  count
-------+---------
im    |    2534
m     |    4460
i     |  543598
       | 2704177
(4 rows)

I considered filtering/converting the flags to PostgreSQL syntax;
maybe that would be an interesting approach to try as well.
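
For example, a rough, untested sketch of such a conversion (my own guess at a mapping;
the flag semantics don't line up exactly, so take it with a grain of salt) could be:
keep "i" as-is, map Javascript's "m" to PostgreSQL's "w" (^ and $ match at newlines),
drop "s" and "g" (. already matches newline by default in PostgreSQL, and "global"
is a matching mode rather than a compile flag), and treat "u" and "y" as untranslatable:

CREATE FUNCTION js_flags_to_pg(js text) RETURNS text
LANGUAGE sql IMMUTABLE AS
$$
  SELECT CASE WHEN js ~ '[uy]' THEN NULL       -- sticky/unicode: no equivalent
              ELSE translate(js, 'msg', 'w')   -- m -> w, drop s and g, keep i
         END
$$;

-- e.g. SELECT js_flags_to_pg('gim');  -- gives 'iw'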


> Even discounting
> that, I'm not getting quite the same results, and I don't understand
> why not.  So how was that made from the raw "patterns" and "subjects"
> tables?

The rows in the tests table were generated by the create_regexp_tests() function [1].

Each subject now has a foreign key to a specific pattern,
where the (pattern, flags) combination is unique in patterns.
The actual unique constraint is on (pattern_hash, flags) to avoid
an index directly on pattern, which can be huge as we've seen.

So, for each subject, it is known via the pattern_id
exactly what flags were used when the regex was compiled
(and later executed/applied with the subject).
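
For clarity, the structure looks roughly like this (a simplified sketch;
column names/types are illustrative, pattern_hash is assumed to be e.g. a
sha256 of the pattern text, and the real tests table has more columns):

CREATE TABLE patterns (
    pattern_id   bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    pattern      text NOT NULL,
    pattern_hash bytea NOT NULL,      -- e.g. sha256(pattern), small enough to index
    flags        text NOT NULL DEFAULT '',
    UNIQUE (pattern_hash, flags)      -- not (pattern, flags): patterns can be huge
);

CREATE TABLE subjects (
    subject_id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    pattern_id bigint NOT NULL REFERENCES patterns,
    subject    text NOT NULL
);

CREATE TABLE tests (
    subject_id bigint NOT NULL REFERENCES subjects,
    error      text                   -- NULL when regexp_match() raised no error
);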



>> PostgreSQL 13.2 on x86_64-apple-darwin19.6.0, compiled by Apple clang version 11.0.3 (clang-1103.0.32.62), 64-bit
>> Time: 758514.703 ms (12:38.515)
>> Time: 755883.600 ms (12:35.884)
>> Time: 746522.107 ms (12:26.522)

>> PostgreSQL 14devel on x86_64-apple-darwin20.3.0, compiled by Apple clang version 12.0.0 (clang-1200.0.32.29), 64-bit
>> HEAD (4e703d671)
>> Time: 519620.646 ms (08:39.621)
>> Time: 518998.366 ms (08:38.998)
>> Time: 519696.129 ms (08:39.696)

> Hmmm ... we haven't yet committed any performance-relevant changes to the
> regex code, so it can't take any credit for this improvement from 13.2 to
> HEAD.  I speculate that this is due to some change in our parallelism
> stuff (since I observe that this query is producing a parallelized hash
> plan).  Still, the next drop to circa 2:21 runtime is impressive enough
> by itself.

OK. Another factor might be that the PostgreSQL 10, 11, 12 and 13 versions were compiled elsewhere:
I used the OS X binaries from https://postgresapp.com/, whereas version 14 I of course compiled myself.
Maybe I should have compiled 10, 11, 12 and 13 myself for a better comparison,
but I mostly wanted to verify whether I could find any differences; the performance comparison was a bonus.


>> Heh, what a funny coincidence:
>> The regex I used to shrink the very-long-pattern,
>> actually happens to run a lot faster with the patches.

> Yeah, that just happens to be a poster child for the MATCHALL idea:

>> EXPLAIN ANALYZE SELECT regexp_matches(repeat('a',100000),'^(.{1,80})(.*?)(.{1,80})$');

> Each of the parenthesized subexpressions of the RE is successfully
> recognized as being MATCHALL, with length range 1..80 for two of them and
> 0..infinity for the middle one.  That means the engine doesn't have to
> physically scan the text to determine whether a possible division point
> satisfies the sub-regexp; and that means we can find the correct division
> points in O(N) not O(N^2) time.

Very nice.
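
For comparison, a sub-RE that constrains which characters may appear (and is
therefore not MATCHALL) shouldn't get that shortcut. Something like this
(an untested sketch, just varying the pattern above) could be used to see
the difference on a patched build:

-- MATCHALL sub-REs: only the lengths matter, so division points can be
-- found without rescanning the text
EXPLAIN ANALYZE
SELECT regexp_matches(repeat('a',100000), '^(.{1,80})(.*?)(.{1,80})$');

-- Not MATCHALL: [a-z]{1,80} constrains the characters, so each candidate
-- division point has to be verified against the text
EXPLAIN ANALYZE
SELECT regexp_matches(repeat('a',100000), '^([a-z]{1,80})(.*?)([a-z]{1,80})$');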

Like you said earlier, perhaps the regex engine has been optimized enough for this time.
If not, you might want to investigate an additional idea,
which I think can be seen as a generalization of the optimization trick for (.*),
if I've understood how it works correctly.

Let's see if I can explain the idea:

One of the problems with representing regexes containing large bracket expressions, like [a-z],
is that you get an explosion of edges if the model can only represent state transitions for single characters.

If we could instead let a single edge (for a state transition) represent a set of characters,
or, usually even more efficiently, a set of character ranges, then we could reduce the
number of edges needed to represent the graph.

The naive approach of just using the ranges as-is doesn't work.

Instead, the graph must first be created with single-character edges.

We then work out which ranges can be constructed such that no range
overlaps any other range, so that every range can be seen as a character in an alphabet.

Perhaps a bit of fiddling with some examples is the easiest way
to get a grip on the idea.

Here is a live demo of the idea:

The graphs are rendered live when typing in the regex,
using a Javascript port of GraphViz.

For example, try entering the regex: t[a-z]*m

This generates this range-optimized graph for the regex:

              /--[a-ln-su-z]-----------------\
              |/------t--------------------\ |
              ||                           | |
-->(0)--t-->({0,1})----m-------->({0 1 2}) | |
               ^---[a-ln-su-z]--/          | |
               ^-------t-------/           | |
               ^---------------------------/ |
               ^-----------------------------/
Notice how the [a-z] bracket expression has been split up,
and we now have 3 distinct "ranges":
t
m
[a-ln-su-z]

Since no ranges are overlapping, each such range can safely be seen as a letter in an alphabet.
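
Just to make the splitting step concrete, here is a rough SQL sketch of it
(purely illustrative, not how the demo implements it): treat each single-character
edge as a range of character codes, group every covered character by the set of
edges containing it, and each distinct set becomes one "letter":

WITH edge_ranges(id, lo, hi) AS (
    VALUES (1, 'a', 'z'),   -- the [a-z] bracket expression
           (2, 't', 't'),   -- the literal t
           (3, 'm', 'm')    -- the literal m
),
points AS (      -- every character covered by at least one edge
    SELECT DISTINCT chr(g) AS c
    FROM edge_ranges, generate_series(ascii(lo), ascii(hi)) AS g
),
signatures AS (  -- the set of edges containing each character
    SELECT p.c, array_agg(e.id ORDER BY e.id) AS sig
    FROM points p
    JOIN edge_ranges e ON p.c BETWEEN e.lo AND e.hi
    GROUP BY p.c
),
runs AS (        -- maximal consecutive runs sharing the same signature
    SELECT c, sig,
           ascii(c) - row_number() OVER (PARTITION BY sig ORDER BY c) AS run
    FROM signatures
),
run_ranges AS (
    SELECT sig, min(c) AS lo, max(c) AS hi
    FROM runs
    GROUP BY sig, run
)
SELECT string_agg(CASE WHEN lo = hi THEN lo ELSE lo || '-' || hi END,
                  '' ORDER BY lo) AS letter
FROM run_ranges
GROUP BY sig
ORDER BY 1;

This gives exactly the three "letters" above: a-ln-su-z, m and t.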

Once we have our final graph, but before we proceed to generate the machine code for it,
we can shrink the graph further by merging ranges together, which eliminates some edges:

              /--------------\
              |              |
--->(0)--t-->(1)<--[a-ln-z]--/
              |^-[a-lnz]-\
              \----m-->((2))<----\
                         |       |
                         \---m---/

Notice how [a-ln-su-z]+t becomes [a-ln-z].
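
The merging can be sketched in the same spirit (again purely illustrative):
when two parallel edges between the same pair of states are combined, adjacent
pieces collapse back into wider ranges:

WITH pieces(lo, hi) AS (
    VALUES ('a','l'), ('n','s'), ('t','t'), ('u','z')   -- [a-ln-su-z] plus t
),
flagged AS (   -- a new group starts wherever there is a gap to the previous piece
    SELECT lo, hi,
           CASE WHEN ascii(lo) > ascii(lag(hi) OVER (ORDER BY lo)) + 1
                THEN 1 ELSE 0 END AS new_grp
    FROM pieces
),
grouped AS (
    SELECT lo, hi, sum(new_grp) OVER (ORDER BY lo) AS grp
    FROM flagged
)
SELECT CASE WHEN min(lo) = max(hi) THEN min(lo)
            ELSE min(lo) || '-' || max(hi) END AS merged
FROM grouped
GROUP BY grp
ORDER BY 1;

This returns a-l and n-z, i.e. [a-ln-z] as in the graph above.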

Another optimization I've come up with (or probably re-invented, because it feels quite obvious)
is to read more than one character at a time, when we know for sure multiple characters in a row
are expected, by concatenating edges having only one parent and one child.

In our example, we know for sure at least two characters will be read for the regex t[a-z]*m,
so with this optimization enabled, we get this graph:

                        /--[a-ln-z]
                        |     |
--->(0)---t[a-ln-z]--->(1)<---+--[a-ln-z]
     |                  |             /
     |                   \---m--->((2))<------\
     \--------------tm------------^ |         |
                                    \----m----/


This doesn't make much difference for just a few characters,
but if we have a long pattern containing a long literal sequence
of characters, we could e.g. read in 32 bytes
and compare them all in one operation,
if our machine has 256-bit SIMD registers/instructions.

This idea has also been implemented in the online demo.

There is a level, located in the "[+]dfa5 = merge_linear(dfa4)" step,
which can be adjusted from 0 to 32 to control how many bytes to merge at most.

Anyway, I can totally understand if you've had enough of regex optimizations for this time,
but in case not, I wanted to share my work in this field, in case it's interesting to look at now or in the future.

/Joel
