Re: Some regular-expression performance hacking - Mailing list pgsql-hackers

From Joel Jacobson
Subject Re: Some regular-expression performance hacking
Date
Msg-id 99f6cbe0-6e58-4e41-add0-977271a5cf84@www.fastmail.com
Whole thread Raw
In response to Re: Some regular-expression performance hacking  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Some regular-expression performance hacking
List pgsql-hackers
On Mon, Feb 15, 2021, at 04:11, Tom Lane wrote:
>I got these runtimes (non-cassert builds):
>
>HEAD 313661.149 ms (05:13.661)
>+0001 297397.293 ms (04:57.397) 5% better than HEAD
>+0002 151995.803 ms (02:31.996) 51% better than HEAD
>+0003 139843.934 ms (02:19.844) 55% better than HEAD
>+0004 95034.611 ms (01:35.035) 69% better than HEAD
>
>Since I don't have all the tables used in your query, I can't
>try to reproduce your results exactly.  I suspect the reason
>I'm getting a better percentage improvement than you did is
>that the joining/grouping/ordering involved in your query
>creates a higher baseline query cost.

Mind blowing speed-up, wow!

I've tested all 4 patches successfully.

To eliminate the baseline cost of the join,
I first created this table:

CREATE TABLE performance_test AS
SELECT
  subjects.subject,
  patterns.pattern,
  tests.is_match,
  tests.captured
FROM tests
JOIN subjects ON subjects.subject_id = tests.subject_id
JOIN patterns ON patterns.pattern_id = subjects.pattern_id
JOIN server_versions ON server_versions.server_version_num = tests.server_version_num
WHERE server_versions.server_version = current_setting('server_version')
AND tests.error IS NULL
;

Then I ran this query:

\timing

SELECT
  is_match <> (subject ~ pattern),
  captured IS DISTINCT FROM regexp_match(subject, pattern),
  COUNT(*)
FROM performance_test
GROUP BY 1,2
ORDER BY 1,2
;

All patches gave the same result:

?column? | ?column? |  count
----------+----------+---------
f        | f        | 1448212
(1 row)

I.e., no detected semantic differences.

Timing differences:

HEAD  570632.722 ms (09:30.633)
+0001 472938.857 ms (07:52.939) 17% better than HEAD
+0002 451638.049 ms (07:31.638) 20% better than HEAD
+0003 439377.813 ms (07:19.378) 23% better than HEAD
+0004 96447.038 ms (01:36.447) 83% better than HEAD

I tested on my MacBook Pro 2.4GHz 8-Core Intel Core i9, 32 GB 2400 MHz DDR4 running macOS Big Sur 11.1:

SELECT version();
                                                       version
----------------------------------------------------------------------------------------------------------------------
PostgreSQL 14devel on x86_64-apple-darwin20.2.0, compiled by Apple clang version 12.0.0 (clang-1200.0.32.29), 64-bit
(1 row)

My HEAD = 46d6e5f567906389c31c4fb3a2653da1885c18ee.

PostgreSQL was compiled with just ./configure, no parameters, and the only non-default postgresql.conf settings were these:
log_destination = 'csvlog'
logging_collector = on
log_filename = 'postgresql.log'

Amazing work!

I hope to have a new dataset ready soon with regex flags for applied subjects as well.

/Joel

pgsql-hackers by date:

Previous
From: Greg Nancarrow
Date:
Subject: Re: Parallel INSERT (INTO ... SELECT ...)
Next
From: Michael Paquier
Date:
Subject: Re: ERROR: invalid spinlock number: 0