Re: Some regular-expression performance hacking - Mailing list pgsql-hackers

From Joel Jacobson
Subject Re: Some regular-expression performance hacking
Date
Msg-id 913516b8-c3d7-4dea-bf60-d735cf4673f1@www.fastmail.com
Whole thread Raw
In response to Re: Some regular-expression performance hacking  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Some regular-expression performance hacking
Re: Some regular-expression performance hacking
List pgsql-hackers
On Sat, Feb 13, 2021, at 22:11, Tom Lane wrote:
>0001-invent-rainbow-arcs-2.patch
>0002-recognize-matchall-NFAs-2.patch

I've successfully tested both patches against the 1.5M regexes-in-the-wild dataset.

Out of the 1489489 (pattern, text string) pairs tested,
there was only one single deviation:

This 100577 bytes big regex (pattern_id = 207811)...

\.(ac|com\.ac|edu\.ac|gov\.ac|net\.ac|mil\.ac| ... |wmflabs\.org|yolasite\.com|za\.net|za\.org)$

...previously raised...

    error invalid regular expression: regular expression is too complex

...but now goes through:

is_match <NULL> => t
captured <NULL> => {de}
error invalid regular expression: regular expression is too complex => <NULL>

Nice. The patched regex engine is apparently capable of handling even more complex regexes than before.

The test that found the deviation tests each (pattern, text string) individually,
to catch errors. But I've also made a separate query to just test regexes
known to not cause errors, to allow testing all regexes in one big query,
which fully utilizes the CPU cores and also runs quicker.

Below is the result of the performance test query:

\timing

SELECT
  tests.is_match IS NOT DISTINCT FROM (subjects.subject ~ patterns.pattern),
  tests.captured IS NOT DISTINCT FROM regexp_match(subjects.subject, patterns.pattern),
  COUNT(*)
FROM tests
JOIN subjects ON subjects.subject_id = tests.subject_id
JOIN patterns ON patterns.pattern_id = subjects.pattern_id
JOIN server_versions ON server_versions.server_version_num = tests.server_version_num
WHERE server_versions.server_version = current_setting('server_version')
AND tests.error IS NULL
GROUP BY 1,2
ORDER BY 1,2;

-- 8facf1ea00b7a0c08c755a0392212b83e04ae28a:

?column? | ?column? |  count
----------+----------+---------
t        | t        | 1448212
(1 row)

Time: 592196.145 ms (09:52.196)

-- 8facf1ea00b7a0c08c755a0392212b83e04ae28a+patches:

?column? | ?column? |  count
----------+----------+---------
t        | t        | 1448212
(1 row)

Time: 461739.364 ms (07:41.739)

That's an impressive 22% speed-up!

/Joel

pgsql-hackers by date:

Previous
From: Michael Paquier
Date:
Subject: Re: How to get Relation tuples in C function
Next
From: Ranier Vilela
Date:
Subject: Re: pg_cryptohash_final possible out-of-bounds access (per Coverity)