Re: Row pattern recognition - Mailing list pgsql-hackers

From SungJun Jang
Subject Re: Row pattern recognition
Date
Msg-id CAE+cgNgSG6oiAXT=FL+gK71Squ_eacyTCxSVXaAUTwODrkzqFg@mail.gmail.com
Whole thread
In response to Re: Row pattern recognition  (Tatsuo Ishii <ishii@postgresql.org>)
Responses Re: Row pattern recognition
List pgsql-hackers
Hi hackers,

Thank you for the excellent report! I am glad to see that PostgreSQL
RPR is much faster than Trino, especially in the match failure cases.

Is it possible to share the data generation script and the query for
PostgreSQL so that I could locally perform the tests?

Here is a self-contained guide to reproduce the RPR ABCD pattern test locally
using PostgreSQL, Trino, and Oracle.


Requirements:

PostgreSQL 19devel build with the RPR patch applied (local)

Trino and Oracle can be set up via Docker (see README for details):

  https://github.com/assam258-5892/docker-databases

Start all services:

  cd ~/docker-databases && docker compose up -d trino-service oracle-service

Note: docker compose up does not pull images automatically if they are already
cached locally. To fetch the latest images, run docker compose pull first.

Note: The Oracle image requires an Oracle account. Register at
https://container-registry.oracle.com, accept the license for the database
image, then log in before pulling:

  docker login container-registry.oracle.com

Connect to each database shell:

  Trino:  docker compose exec -it trino-service trino
  Oracle: docker compose exec -it oracle-service sqlplus / as sysdba



Step 1: Create the test table and data

PostgreSQL (1x scale, 20,000 rows):

DROP TABLE IF EXISTS abcd_test;
CREATE TABLE abcd_test AS
SELECT v,
       CASE
           WHEN v % 10000 < 3333 THEN 'A'
           WHEN v % 10000 >= 3333 AND v % 10000 < 6666 THEN 'B'
           WHEN v % 10000 >= 6666 AND v % 10000 < 9999 THEN 'C'
           WHEN v % 10000 = 9999 THEN 'D'
       END AS cat
FROM generate_series(0, 19999) AS v;

ANALYZE abcd_test;



Trino (1x scale, 20,000 rows):

CREATE SCHEMA IF NOT EXISTS memory.test;

DROP TABLE IF EXISTS memory.test.abcd_test;
CREATE TABLE memory.test.abcd_test AS
WITH nums AS (
    SELECT a.v * 10000 + b.v AS v
    FROM UNNEST(sequence(0, 1)) AS a(v)
    CROSS JOIN UNNEST(sequence(0, 9999)) AS b(v)
)
SELECT CAST(v AS INTEGER) AS v,
       CASE
           WHEN v % 10000 < 3333 THEN 'A'
           WHEN v % 10000 >= 3333 AND v % 10000 < 6666 THEN 'B'
           WHEN v % 10000 >= 6666 AND v % 10000 < 9999 THEN 'C'
           WHEN v % 10000 = 9999 THEN 'D'
       END AS cat
FROM nums;


Note: Trino sequence() is limited to 10,000 elements per call, so a CROSS JOIN
is used. For scale Sx, change sequence(0, 1) to sequence(0, S*2-1).


Oracle (1x scale, 20,000 rows):

DROP TABLE abcd_test PURGE;
CREATE TABLE abcd_test AS
SELECT v,
       CASE
           WHEN MOD(v, 10000) < 3333 THEN 'A'
           WHEN MOD(v, 10000) >= 3333 AND MOD(v, 10000) < 6666 THEN 'B'
           WHEN MOD(v, 10000) >= 6666 AND MOD(v, 10000) < 9999 THEN 'C'
           WHEN MOD(v, 10000) = 9999 THEN 'D'
       END AS cat
FROM (SELECT LEVEL - 1 AS v FROM dual CONNECT BY LEVEL <= 20000);

EXEC DBMS_STATS.GATHER_TABLE_STATS(USER, 'ABCD_TEST');



Verify data distribution (all engines):

SELECT cat, COUNT(*) AS cnt FROM abcd_test GROUP BY cat ORDER BY cat;

Expected (1x): A=6666, B=6666, C=6666, D=2


Step 2: Run Test 1 — A+ B+ C+ D (match expected)

Expected: 2 rows returned (one match per segment)

PostgreSQL:

SELECT match_first, match_last, match_len
FROM (
    SELECT v,
           first_value(v) OVER w AS match_first,
           last_value(v) OVER w AS match_last,
           count(*) OVER w AS match_len
    FROM abcd_test
    WINDOW w AS (
        ORDER BY v
        ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
        AFTER MATCH SKIP PAST LAST ROW
        PATTERN (A+ B+ C+ D)
        DEFINE
            A AS cat = 'A',
            B AS cat = 'B',
            C AS cat = 'C',
            D AS cat = 'D'
    )
) result
WHERE match_len > 0;



Trino / Oracle:

SELECT match_first, match_last, match_len
FROM abcd_test
MATCH_RECOGNIZE (
    ORDER BY v
    MEASURES
        FIRST(v) AS match_first,
        LAST(v) AS match_last,
        COUNT(*) AS match_len
    ONE ROW PER MATCH
    AFTER MATCH SKIP PAST LAST ROW
    PATTERN (A+ B+ C+ D)
    DEFINE
        A AS cat = 'A',
        B AS cat = 'B',
        C AS cat = 'C',
        D AS cat = 'D'
) mr;


Note: Trino uses memory.test.abcd_test as the table name.

Expected result (1x):

match_first | match_last | match_len
------------|------------|----------
          0 |       9999 |     10000
      10000 |      19999 |     10000



Step 3: Run Test 2 — A+ B+ C+ E (match failure)

Expected: 0 rows (E does not exist)

Use the same queries as Test 1 with two changes:

PATTERN: (A+ B+ C+ D) → (A+ B+ C+ E)

DEFINE: cat = 'D' → cat = 'E'

Warning: Trino Test 2 at 1x scale takes approximately 5-6 minutes.


Step 4: Scale up (optional)

Re-create the test table at 2x scale (40,000 rows) and then repeat Step 2 and
Step 3.

PostgreSQL (2x scale, 40,000 rows):

DROP TABLE IF EXISTS abcd_test;
CREATE TABLE abcd_test AS
SELECT v,
       CASE
           WHEN v % 20000 < 6666 THEN 'A'
           WHEN v % 20000 >= 6666 AND v % 20000 < 13332 THEN 'B'
           WHEN v % 20000 >= 13332 AND v % 20000 < 19999 THEN 'C'
           WHEN v % 20000 = 19999 THEN 'D'
       END AS cat
FROM generate_series(0, 39999) AS v;

ANALYZE abcd_test;



Trino (2x scale, 40,000 rows):

CREATE SCHEMA IF NOT EXISTS memory.test;

DROP TABLE IF EXISTS memory.test.abcd_test;
CREATE TABLE memory.test.abcd_test AS
WITH nums AS (
    SELECT a.v * 10000 + b.v AS v
    FROM UNNEST(sequence(0, 3)) AS a(v)
    CROSS JOIN UNNEST(sequence(0, 9999)) AS b(v)
)
SELECT CAST(v AS INTEGER) AS v,
       CASE
           WHEN v % 20000 < 6666 THEN 'A'
           WHEN v % 20000 >= 6666 AND v % 20000 < 13332 THEN 'B'
           WHEN v % 20000 >= 13332 AND v % 20000 < 19999 THEN 'C'
           WHEN v % 20000 = 19999 THEN 'D'
       END AS cat
FROM nums;



Oracle (2x scale, 40,000 rows):

DROP TABLE abcd_test PURGE;
CREATE TABLE abcd_test AS
SELECT v,
       CASE
           WHEN MOD(v, 20000) < 6666 THEN 'A'
           WHEN MOD(v, 20000) >= 6666 AND MOD(v, 20000) < 13332 THEN 'B'
           WHEN MOD(v, 20000) >= 13332 AND MOD(v, 20000) < 19999 THEN 'C'
           WHEN MOD(v, 20000) = 19999 THEN 'D'
       END AS cat
FROM (SELECT LEVEL - 1 AS v FROM dual CONNECT BY LEVEL <= 40000);

EXEC DBMS_STATS.GATHER_TABLE_STATS(USER, 'ABCD_TEST');

Expected (2x): A=13332, B=13332, C=13332, D=4

Then run Step 2 and Step 3 as-is.

Warning: Trino Test 2 at 2x scale takes approximately 20-25 minutes.


Please let me know if you encounter any issues reproducing this.

Best regards
SungJun
 

pgsql-hackers by date:

Previous
From: Ashutosh Bapat
Date:
Subject: Re: SQL Property Graph Queries (SQL/PGQ)
Next
From: Amit Kapila
Date:
Subject: Re: Report bytes and transactions actually sent downtream