Re: Some regular-expression performance hacking - Mailing list pgsql-hackers

From Joel Jacobson
Subject Re: Some regular-expression performance hacking
Date
Msg-id 5deabab3-a335-4b57-a046-8a9ad8f2f142@www.fastmail.com
Whole thread Raw
In response to Re: Some regular-expression performance hacking  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Some regular-expression performance hacking
List pgsql-hackers
On Wed, Feb 17, 2021, at 22:00, Tom Lane wrote:
Attached is an updated patch series; it's rebased over 4e703d671
which took care of some not-really-related fixes, and I made a
pass of cleanup and comment improvements.  I think this is pretty
much ready to commit, unless you want to do more testing or
code-reading.

I've produced a new dataset which now also includes the regex flags (if any) used for each subject applied to a pattern.

The new dataset contains 318364 patterns and 4474520 subjects.
(The old one had 235204 patterns and 1489489 subjects.)

I've tested the new dataset against PostgreSQL 10.16, 11.11, 12.6, 13.2, HEAD (4e703d671) and HEAD+patches.

I based the comparisons on the subjects that didn't cause an error on 13.2:

CREATE TABLE performance_test AS
SELECT
  subjects.subject,
  patterns.pattern,
  patterns.flags,
  tests.is_match,
  tests.captured
FROM tests
JOIN subjects ON subjects.subject_id = tests.subject_id
JOIN patterns ON patterns.pattern_id = subjects.pattern_id
WHERE tests.error IS NULL
;

I then measured the query below for each PostgreSQL version:

\timing
SELECT version();
SELECT
  is_match <> (subject ~ pattern) AS is_match_diff,
  captured IS DISTINCT FROM regexp_match(subject, pattern, flags) AS captured_diff,
  COUNT(*)
FROM performance_test
GROUP BY 1,2
ORDER BY 1,2
;

All versions produces the same result:

is_match_diff | captured_diff |  count
---------------+---------------+---------
f             | f             | 3254769
(1 row)

Good! Not a single case that differs of over 3 million different regex pattern/subject combinations,
between five major PostgreSQL versions! That's a very stable regex engine.

To get a feeling for the standard deviation of the timings,
I executed the same query above three times for each PostgreSQL version:

PostgreSQL 10.16 on x86_64-apple-darwin14.5.0, compiled by Apple LLVM version 7.0.2 (clang-700.1.81), 64-bit
Time: 795674.830 ms (13:15.675)
Time: 794249.704 ms (13:14.250)
Time: 771036.707 ms (12:51.037)

PostgreSQL 11.11 on x86_64-apple-darwin16.7.0, compiled by Apple LLVM version 8.1.0 (clang-802.0.42), 64-bit
Time: 765466.191 ms (12:45.466)
Time: 787135.316 ms (13:07.135)
Time: 779582.635 ms (12:59.583)

PostgreSQL 12.6 on x86_64-apple-darwin16.7.0, compiled by Apple LLVM version 8.1.0 (clang-802.0.42), 64-bit
Time: 785500.516 ms (13:05.501)
Time: 784511.591 ms (13:04.512)
Time: 786727.973 ms (13:06.728)

PostgreSQL 13.2 on x86_64-apple-darwin19.6.0, compiled by Apple clang version 11.0.3 (clang-1103.0.32.62), 64-bit
Time: 758514.703 ms (12:38.515)
Time: 755883.600 ms (12:35.884)
Time: 746522.107 ms (12:26.522)

PostgreSQL 14devel on x86_64-apple-darwin20.3.0, compiled by Apple clang version 12.0.0 (clang-1200.0.32.29), 64-bit
HEAD (4e703d671)
Time: 519620.646 ms (08:39.621)
Time: 518998.366 ms (08:38.998)
Time: 519696.129 ms (08:39.696)

HEAD (4e703d671)+0001+0002+0003+0004+0005
Time: 141290.329 ms (02:21.290)
Time: 141849.709 ms (02:21.850)
Time: 141630.819 ms (02:21.631)

That's a mind-blowing speed-up!

I also ran the more detailed test between 13.2 and HEAD+patches,
that also tests for differences in errors.

Like before, one similar improvement was found,
which previously resulted in an error, but now goes through OK:

SELECT * FROM vdeviations;
-[ RECORD 1 ]----+-------------------------------------------------------------------------------------------------------
pattern          | \.(ac|com\.ac|edu\.ac|gov\.ac|net\.ac|mi ... 100497 chars ... abs\.org|yolasite\.com|za\.net|za\.org)$
flags            |
subject          | www.aeroexpo.online
count            | 1
a_server_version | 13.2
a_duration       | 00:00:00.298253
a_is_match       |
a_captured       |
a_error          | invalid regular expression: regular expression is too complex
b_server_version | 14devel
b_duration       | 00:00:00.665958
b_is_match       | t
b_captured       | {online}
b_error          |

Very nice.

I've uploaded the new dataset to the same place as before.

The schema for it can be found at https://github.com/truthly/regexes-in-the-wild

If anyone else would like a copy of the 715MB dataset, please let me know.

/Joel

pgsql-hackers by date:

Previous
From: Guillaume Lelarge
Date:
Subject: Re: Extensions not dumped when --schema is used
Next
From: Masahiko Sawada
Date:
Subject: Re: 64-bit XIDs in deleted nbtree pages