Re: Row pattern recognition - Mailing list pgsql-hackers

From Henson Choi
Subject Re: Row pattern recognition
Date
Msg-id CAAAe_zChxC051WsfPrfTaJ0N362QPkj3vLnQ9pbAC-eYeco9NQ@mail.gmail.com
Whole thread Raw
In response to Re: Row pattern recognition  (Tatsuo Ishii <ishii@postgresql.org>)
Responses Re: Row pattern recognition
List pgsql-hackers
Subject: Re: Row pattern recognition

Hi Ishii-san,

I've been working on the pattern grammar and AST for row pattern recognition.
The attached patch extends the PATTERN clause to support more regex-like syntax.

The main additions are:

- Alternation: (A | B)
- Grouping with quantifiers: (A B)+, (A | B)*
- Full quantifier support: +, *, ?, {n}, {n,}, {n,m}
- Reluctant quantifiers: +?, *?, ?? (parsed but not executed yet)

I've introduced RPRPatternNode to represent the pattern as a Thompson-style
AST with four node types: VAR, SEQ, ALT, and GROUP. Each node stores min/max
bounds for quantifiers and a reluctant flag. This should make the eventual
NFA implementation more straightforward.

The parser also does some basic pattern optimizations. For example:
  - (A (B C)) gets flattened to (A B C)
  - ((A)) unwraps to just A
  - (A{2}){3} becomes A{6}
  - (A | B | A) simplifies to (A | B)
  - A A A merges into A{3,3}

I also updated ruleutils.c to deparse patterns properly.


Current state:

The executor still uses regex matching, which works but has some limitations.
Complex patterns with lots of alternations can be slow.

The 26-variable limit is still there since we're mapping to [a-z] for the regex.

Reluctant quantifiers (+?, *?, ??) are parsed but will raise an error if you
try to use them. They'll need proper NFA support to work correctly.

Some optimization code was removed in the process, but it should work better
once NFA matching is in place.


Next steps:

The plan is to replace regex matching with NFA-based execution. The AST
structure is already set up for that.

The first step is converting the Thompson AST into a flat representation -
flattening pattern elements and building actual NFA states with transitions,
then replacing the regex engine with direct NFA matching. This should handle all
the pattern syntax properly, including reluctant quantifiers.

Once that's working, the next step is handling multiple contexts during matching -
tracking which states are active and absorbing context as we go through rows.
This is essential for proper pattern matching behavior.

After that, I'll implement PATH and CLASSIFIER functions, which depend on having
the match context available.

Then there's room for various optimizations - PATH space optimization,
incremental aggregate computation over matched rows, better state merging, etc.

Longer-term, the goal is to get to MATCH_RECOGNIZE in the FROM clause for
full SQL standard compliance.

Let me know if you have any concerns or suggestions about the approach.

Best regards,
Henson
Attachment

pgsql-hackers by date:

Previous
From: Konstantin Knizhnik
Date:
Subject: Re: global temporary table (GTT) - are there some ideas how to implement it?
Next
From: "zengman"
Date:
Subject: Re: Use correct macro for accessing offset numbers.