Re: Row pattern recognition - Mailing list pgsql-hackers
| From | SungJun Jang |
|---|---|
| Subject | Re: Row pattern recognition |
| Date | |
| Msg-id | CAE+cgNgSG6oiAXT=FL+gK71Squ_eacyTCxSVXaAUTwODrkzqFg@mail.gmail.com Whole thread |
| In response to | Re: Row pattern recognition (Tatsuo Ishii <ishii@postgresql.org>) |
| Responses |
Re: Row pattern recognition
|
| List | pgsql-hackers |
Hi hackers,
Thank you for the excellent report! I am glad to see that PostgreSQL
RPR is much faster than Trino, especially in the match failure cases.
Is it possible to share the data generation script and the query for
PostgreSQL so that I could locally perform the tests?
Here is a self-contained guide to reproduce the RPR ABCD pattern test locally
using PostgreSQL, Trino, and Oracle.
Requirements:
PostgreSQL 19devel build with the RPR patch applied (local)
Trino and Oracle can be set up via Docker (see README for details):
https://github.com/assam258-5892/docker-databases
Start all services:
cd ~/docker-databases && docker compose up -d trino-service oracle-service
Note: docker compose up does not pull images automatically if they are already
cached locally. To fetch the latest images, run docker compose pull first.
Note: The Oracle image requires an Oracle account. Register at
https://container-registry.oracle.com, accept the license for the database
image, then log in before pulling:
docker login container-registry.oracle.com
Connect to each database shell:
Trino: docker compose exec -it trino-service trino
Oracle: docker compose exec -it oracle-service sqlplus / as sysdba
Step 1: Create the test table and data
PostgreSQL (1x scale, 20,000 rows):
DROP TABLE IF EXISTS abcd_test;
CREATE TABLE abcd_test AS
SELECT v,
CASE
WHEN v % 10000 < 3333 THEN 'A'
WHEN v % 10000 >= 3333 AND v % 10000 < 6666 THEN 'B'
WHEN v % 10000 >= 6666 AND v % 10000 < 9999 THEN 'C'
WHEN v % 10000 = 9999 THEN 'D'
END AS cat
FROM generate_series(0, 19999) AS v;
ANALYZE abcd_test;
Trino (1x scale, 20,000 rows):
CREATE SCHEMA IF NOT EXISTS memory.test;
DROP TABLE IF EXISTS memory.test.abcd_test;
CREATE TABLE memory.test.abcd_test AS
WITH nums AS (
SELECT a.v * 10000 + b.v AS v
FROM UNNEST(sequence(0, 1)) AS a(v)
CROSS JOIN UNNEST(sequence(0, 9999)) AS b(v)
)
SELECT CAST(v AS INTEGER) AS v,
CASE
WHEN v % 10000 < 3333 THEN 'A'
WHEN v % 10000 >= 3333 AND v % 10000 < 6666 THEN 'B'
WHEN v % 10000 >= 6666 AND v % 10000 < 9999 THEN 'C'
WHEN v % 10000 = 9999 THEN 'D'
END AS cat
FROM nums;
Note: Trino sequence() is limited to 10,000 elements per call, so a CROSS JOIN
is used. For scale Sx, change sequence(0, 1) to sequence(0, S*2-1).
Oracle (1x scale, 20,000 rows):
DROP TABLE abcd_test PURGE;
CREATE TABLE abcd_test AS
SELECT v,
CASE
WHEN MOD(v, 10000) < 3333 THEN 'A'
WHEN MOD(v, 10000) >= 3333 AND MOD(v, 10000) < 6666 THEN 'B'
WHEN MOD(v, 10000) >= 6666 AND MOD(v, 10000) < 9999 THEN 'C'
WHEN MOD(v, 10000) = 9999 THEN 'D'
END AS cat
FROM (SELECT LEVEL - 1 AS v FROM dual CONNECT BY LEVEL <= 20000);
EXEC DBMS_STATS.GATHER_TABLE_STATS(USER, 'ABCD_TEST');
Verify data distribution (all engines):
SELECT cat, COUNT(*) AS cnt FROM abcd_test GROUP BY cat ORDER BY cat;
Expected (1x): A=6666, B=6666, C=6666, D=2
Step 2: Run Test 1 — A+ B+ C+ D (match expected)
Expected: 2 rows returned (one match per segment)
PostgreSQL:
SELECT match_first, match_last, match_len
FROM (
SELECT v,
first_value(v) OVER w AS match_first,
last_value(v) OVER w AS match_last,
count(*) OVER w AS match_len
FROM abcd_test
WINDOW w AS (
ORDER BY v
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
AFTER MATCH SKIP PAST LAST ROW
PATTERN (A+ B+ C+ D)
DEFINE
A AS cat = 'A',
B AS cat = 'B',
C AS cat = 'C',
D AS cat = 'D'
)
) result
WHERE match_len > 0;
Trino / Oracle:
SELECT match_first, match_last, match_len
FROM abcd_test
MATCH_RECOGNIZE (
ORDER BY v
MEASURES
FIRST(v) AS match_first,
LAST(v) AS match_last,
COUNT(*) AS match_len
ONE ROW PER MATCH
AFTER MATCH SKIP PAST LAST ROW
PATTERN (A+ B+ C+ D)
DEFINE
A AS cat = 'A',
B AS cat = 'B',
C AS cat = 'C',
D AS cat = 'D'
) mr;
Note: Trino uses memory.test.abcd_test as the table name.
Expected result (1x):
match_first | match_last | match_len
------------|------------|----------
0 | 9999 | 10000
10000 | 19999 | 10000
Step 3: Run Test 2 — A+ B+ C+ E (match failure)
Expected: 0 rows (E does not exist)
Use the same queries as Test 1 with two changes:
PATTERN: (A+ B+ C+ D) → (A+ B+ C+ E)
DEFINE: cat = 'D' → cat = 'E'
Warning: Trino Test 2 at 1x scale takes approximately 5-6 minutes.
Step 4: Scale up (optional)
Re-create the test table at 2x scale (40,000 rows) and then repeat Step 2 and
Step 3.
PostgreSQL (2x scale, 40,000 rows):
DROP TABLE IF EXISTS abcd_test;
CREATE TABLE abcd_test AS
SELECT v,
CASE
WHEN v % 20000 < 6666 THEN 'A'
WHEN v % 20000 >= 6666 AND v % 20000 < 13332 THEN 'B'
WHEN v % 20000 >= 13332 AND v % 20000 < 19999 THEN 'C'
WHEN v % 20000 = 19999 THEN 'D'
END AS cat
FROM generate_series(0, 39999) AS v;
ANALYZE abcd_test;
Trino (2x scale, 40,000 rows):
CREATE SCHEMA IF NOT EXISTS memory.test;
DROP TABLE IF EXISTS memory.test.abcd_test;
CREATE TABLE memory.test.abcd_test AS
WITH nums AS (
SELECT a.v * 10000 + b.v AS v
FROM UNNEST(sequence(0, 3)) AS a(v)
CROSS JOIN UNNEST(sequence(0, 9999)) AS b(v)
)
SELECT CAST(v AS INTEGER) AS v,
CASE
WHEN v % 20000 < 6666 THEN 'A'
WHEN v % 20000 >= 6666 AND v % 20000 < 13332 THEN 'B'
WHEN v % 20000 >= 13332 AND v % 20000 < 19999 THEN 'C'
WHEN v % 20000 = 19999 THEN 'D'
END AS cat
FROM nums;
Oracle (2x scale, 40,000 rows):
DROP TABLE abcd_test PURGE;
CREATE TABLE abcd_test AS
SELECT v,
CASE
WHEN MOD(v, 20000) < 6666 THEN 'A'
WHEN MOD(v, 20000) >= 6666 AND MOD(v, 20000) < 13332 THEN 'B'
WHEN MOD(v, 20000) >= 13332 AND MOD(v, 20000) < 19999 THEN 'C'
WHEN MOD(v, 20000) = 19999 THEN 'D'
END AS cat
FROM (SELECT LEVEL - 1 AS v FROM dual CONNECT BY LEVEL <= 40000);
EXEC DBMS_STATS.GATHER_TABLE_STATS(USER, 'ABCD_TEST');
Expected (2x): A=13332, B=13332, C=13332, D=4
Then run Step 2 and Step 3 as-is.
Warning: Trino Test 2 at 2x scale takes approximately 20-25 minutes.
Please let me know if you encounter any issues reproducing this.
Best regards
SungJun
pgsql-hackers by date: