Home > mailing lists

Re: Row pattern recognition - Mailing list pgsql-hackers

From	Henson Choi
Subject	Re: Row pattern recognition
Date	January 2 20:19:40
Msg-id	CAAAe_zC-53U0rwHCsvLT5=t0QtKFeBGNd2PCVsQSua2L2uUqpA@mail.gmail.com Whole thread Raw
In response to	Re: Row pattern recognition (Henson Choi <assam258@gmail.com>)
Responses	Re: Row pattern recognition
List	pgsql-hackers

Tree view

Hi hackers,

I've been working on an alternative implementation approach for Row
Pattern Recognition (RPR) that addresses some limitations in the
current regex-based patch.

Key differences from the existing approach:

1. Streaming NFA instead of regex engine
- Process rows incrementally without accumulating encoded strings
- Avoids O(V^N) combinatorial explosion in multi-variable scenarios

2. No 26-variable limit
- Variables identified by index, not alphabet encoding

3. Proper Lexical Order support
- Respects PATTERN alternative order for ONE ROW PER MATCH

4. GREEDY/RELUCTANT quantifier support
- Longest vs shortest match semantics

5. Incremental MEASURES computation
- Aggregate values computed during matching, no rescan needed

The attached document describes the design in detail, including:
- Data structures (Pattern, MatchState, MatchContext)
- Execution flow and state transitions
- Context absorption optimization for greedy patterns
- AST optimization passes

This is still a work in progress. I'd appreciate any feedback on
the approach before proceeding with PostgreSQL integration.

Best regards,
Henson

------------------------------------------------------------------------
Attachment
------------------------------------------------------------------------

========================================================================
RPR Streaming NFA Pattern Matching System Overview (DRAFT)
========================================================================

0. Motivation and Approach
------------------------------------------------------------------------

0.1 Background: Limitations of PostgreSQL RPR Patch

The existing PostgreSQL RPR implementation (RPR-base branch) uses
a regex-based approach:

┌─────────────────────────────────────────────────────────────────┐
│ Existing Implementation (Regex-based) │
├─────────────────────────────────────────────────────────────────┤
│ │
│ 1. Evaluate all DEFINE conditions for each row │
│ → Encode matching variables as alphabets (a, b, c, ...) │
│ │
│ 2. Accumulate encoded string │
│ "aabbc" (A match, A match, B match, B match, C match) │
│ │
│ 3. Convert PATTERN to regex │
│ PATTERN (A+ B+ C) → "^a+b+c" │
│ │
│ 4. Match using PostgreSQL regex engine │
│ pg_regexec(&preg, encoded_str, ...) │
│ │
└─────────────────────────────────────────────────────────────────┘

Limitations (based on nodeWindowAgg.c analysis):

┌──────────────────┬──────────────────────────────────────────────┐
│ Limitation │ Cause │
├──────────────────┼──────────────────────────────────────────────┤
│ Combinatorial │ Cartesian product per row in │
│ Explosion O(V^N) │ generate_patterns(). 3 vars, 20 rows → 3.4B │
├──────────────────┼──────────────────────────────────────────────┤
│ 26 Variable │ a-z alphabet encoding (NUM_ALPHABETS=26) │
│ Limit │ │
├──────────────────┼──────────────────────────────────────────────┤
│ Lexical Order │ Encoded by DEFINE declaration order, │
│ Not Guaranteed │ PATTERN alternative order ignored. │
│ │ In (B|A) pattern, A encoded first │
├──────────────────┼──────────────────────────────────────────────┤
│ No RELUCTANT │ Only greedy flag exists, no shortest match │
│ Support │ logic │
├──────────────────┼──────────────────────────────────────────────┤
│ MEASURES Not │ Rescan-based workaround, no incremental │
│ Implemented │ aggregation │
├──────────────────┼──────────────────────────────────────────────┤
│ O(N^2) Retry on │ No intermediate state maintained. On failure │
│ Match Failure │ at row N, retry from row K+1 from scratch │
└──────────────────┴──────────────────────────────────────────────┘

0.2 Approach: Streaming NFA-based Pattern Matching

We adopt an NFA-based approach similar to Apache Flink CEP:

┌─────────────────────────────────────────────────────────────────┐
│ NFA-based Approach │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Core Ideas: │
│ - Compile pattern to flat array (PatternElement[]) │
│ - Perform state transitions per row (no regex matching) │
│ - Merge identical states to prevent redundant computation │
│ │
│ Time Complexity: O(N x S x E) │
│ N: Number of input rows │
│ S: Number of concurrent active states (bounded by merge) │
│ E: Number of pattern elements │
│ │
└─────────────────────────────────────────────────────────────────┘

Resolving Limitations:

┌──────────────────┬──────────────────────────────────────────────┐
│ Original Limit │ Resolution with Streaming NFA │
├──────────────────┼──────────────────────────────────────────────┤
│ Combinatorial │ State transition based, identical state merge│
│ Explosion O(V^N) │ → O(N×S×E), S proportional to pattern size │
├──────────────────┼──────────────────────────────────────────────┤
│ 26 Variable │ Integer varId, size adjustable as needed │
│ Limit │ │
├──────────────────┼──────────────────────────────────────────────┤
│ Lexical Order │ #ALT branches generated in PATTERN order │
│ Not Guaranteed │ → Alternative order preserved in paths[][] │
├──────────────────┼──────────────────────────────────────────────┤
│ No RELUCTANT │ Custom NFA enables shortest/longest match │
│ Support │ control │
├──────────────────┼──────────────────────────────────────────────┤
│ MEASURES Not │ Incremental aggregation computes SUM, COUNT │
│ Implemented │ etc. in real-time │
├──────────────────┼──────────────────────────────────────────────┤
│ O(N^2) Retry on │ Multiple Contexts maintained concurrently │
│ Match Failure │ → Parallel tracking per start point, O(N) │
└──────────────────┴──────────────────────────────────────────────┘

New Advantages:

┌──────────────────┬──────────────────────────────────────────────┐
│ Advantage │ Description │
├──────────────────┼──────────────────────────────────────────────┤
│ ALL ROWS Mode │ Track all possible match paths concurrently │
│ │ → Each Context branches/completes separately │
├──────────────────┼──────────────────────────────────────────────┤
│ Context │ Merge Contexts upon reaching identical state │
│ Absorption │ → Eliminate redundant computation, memory │
├──────────────────┼──────────────────────────────────────────────┤
│ Incremental │ Compute SUM, COUNT etc. during matching │
│ Aggregation │ → No rescan needed after completion │
├──────────────────┼──────────────────────────────────────────────┤
│ Flat Array │ Contiguous memory layout for states[], │
│ Structure │ contexts[] → Good cache locality, no pointer │
├──────────────────┼──────────────────────────────────────────────┤
│ Path Sharing │ Chunk tree + hash table for path sharing │
│ │ → O(1) reference on fork, memory dedup │
└──────────────────┴──────────────────────────────────────────────┘

0.3 Difference from Flink: History Elimination

┌─────────────────────────────────────────────────────────────────┐
│ Flink CEP (Streaming) PostgreSQL (Batch) │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Original data flows away Materialized data │
│ → Cannot re-access → Rescan anytime │
│ → Store row data in History → No History needed │
│ → Pointer tree + ref counting → Store match_start/end only│
│ │
│ Fork: pointer copy O(1) Fork: copy counts[] │
│ MEASURES: backtrack History MEASURES: rescan range │
│ │
└─────────────────────────────────────────────────────────────────┘

Our Implementation Choices:
- Adopt Incremental Aggregation
- Compute SUM, COUNT etc. in real-time during matching
- Maintain only aggregate values without History pointers
- Store path info as varId arrays in paths[][]

0.4 Design Goals

┌──────────────┬──────────────────────────────────────────────────┐
│ Goal │ Description │
├──────────────┼──────────────────────────────────────────────────┤
│ Linear │ O(N×S×E), no combinatorial explosion │
│ Complexity │ │
├──────────────┼──────────────────────────────────────────────────┤
│ Standard │ SQL:2016 RPR semantics (GREEDY, RELUCTANT, SKIP) │
│ Compliance │ │
├──────────────┼──────────────────────────────────────────────────┤
│ Extensibility│ Future extensions: MEASURES, SUBSET, etc. │
├──────────────┼──────────────────────────────────────────────────┤
│ Simplicity │ Flat array pattern, clear state transition rules │
└──────────────┴──────────────────────────────────────────────────┘

0.5 Path Storage Optimization

Risk of memory explosion when paths[][] are copied on Context fork.
Share identical paths via chunk tree + hash table:

┌─────────────────────────────────────────────────────────────────┐
│ Chunk Tree Structure │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Chunk: fixed-size(2) array + parent pointer + ref count │
│ │
│ Example: Two paths [A,A,C,D] and [A,A,C,E] │
│ │
│ Chunk1[A,A] ← Chunk2[C,D] (Path1) │
│ RC:2 ↖ │
│ Chunk3[C,E] (Path2) │
│ │
│ → Share parent chunk (Chunk1), create new chunk from fork │
│ │
│ Hash table: (parent, values) → reuse identical chunks │
│ │
└─────────────────────────────────────────────────────────────────┘

1. System Architecture
------------------------------------------------------------------------

┌────────────────────────────────────────────────────────────────────────┐

│ Processing Pipeline │
├────────────────────────────────────────────────────────────────────────┤

│ │
│ Pattern String Parser Pattern Executor Result │
│ "A+ B | C" ───▶ (AST) ───▶ (elements) ───▶ (Contexts) ───▶ Match │
│ │
│ ┌──────────────────┐ ┌───────────────────┐ ┌───────────────────┐ │
│ │ Parser │ │ Context │ │ Executor │ │
│ │ │ │ │ │ │ │
│ │ Tokenize │ │ MatchState │ │ NFAExecutor │ │
│ │ Build AST │ │ MatchContext │ │ Multi-Context │ │
│ │ Optimize │ │ State Transition │ │ Absorb/SKIP │ │
│ │ Compile │ │ Greedy/Reluctant │ │ Output Mgmt │ │
│ └──────────────────┘ └───────────────────┘ └───────────────────┘ │
│ │
└────────────────────────────────────────────────────────────────────────┘

2. Document Structure
------------------------------------------------------------------------

┌──────────────┬────────────────────────────────────────────────────┐
│ Parser │ Pattern string → PatternElement[] transformation │
│ │ - Tokenizer: string → tokens │
│ │ - Parser: tokens → AST │
│ │ - Optimizer: AST simplification │
│ │ - Compiler: AST → PatternElement[] │
├──────────────┼────────────────────────────────────────────────────┤
│ Context │ Single Context internal behavior │
│ │ - MatchState: current position + counters + path │
│ │ - MatchContext: collection of States │
│ │ - State transition: variable match, #ALT, #END, │
│ │ #FIN handling │
│ │ - Greedy/Reluctant quantifier behavior │
├──────────────┼────────────────────────────────────────────────────┤
│ Executor │ Multi-Context management │
│ │ - NFAExecutor: main execution engine │
│ │ - Context creation/absorption/removal │
│ │ - SKIP modes (PAST LAST / TO NEXT) │
│ │ - Output rows (ONE ROW / ALL ROWS) │
├──────────────┼────────────────────────────────────────────────────┤
│ Improvements │ Future enhancements │
│ │ - Path storage optimization (chunk tree + hash) │
└──────────────┴────────────────────────────────────────────────────┘

3. Key Data Structures
------------------------------------------------------------------------

3.1 Pattern
The entire compiled pattern.

Fields:
- maxDepth: maximum nesting depth
- elements[]: PatternElement array
- variables[]: variable name array (variables[varId] → name)
- firstMatchIsGreedy: whether first match is Greedy
(Context absorption condition)

3.2 PatternElement
A single element of the compiled pattern.

Fields:
- varId: variable identifier or special marker (0+ variable,
negative special)
- reluctant: flag (true = shortest, false = longest)
- min, max: quantifier range
- depth: nesting depth
- next: next element index
- jump: alternative/repeat jump target

3.3 MatchState
A single state during NFA execution.

Fields:
- elementIndex: current PatternElement position
- counts[]: per-depth repeat counters
- summaries[]: Summary array (creation order preserved)

State equivalence: identical if elementIndex + counts[] match
On merge, summaries are combined

3.4 Summary
Aggregate values and path information.

Fields:
- aggregates{}: incremental aggregates (SUM, COUNT, FIRST,
LAST, MIN, MAX)
- paths[][]: matched variable paths (creation order preserved)

Merge rules:
- If aggregate values identical → merge Summaries, combine paths
- summaries[0].paths[0] = Lexical Order priority path

3.5 MatchContext
Collection of States with the same start point.

Fields:
- matchStart: match start row (unique identifier)
- matchEnd: match end row (stores currentRow when matchedState
updated)
- states[]: active States (creation order preserved, all branch
paths)
- matchedState: MatchState | null (for Greedy fallback)

3.6 NFAExecutor
Main execution engine.

Fields:
- pattern: compiled pattern
- contexts[]: active Contexts (creation order preserved)
- completedContexts[]: completed Contexts (sorted by start,
pending emit)
- skipMode: PAST_LAST / TO_NEXT
- currentRow: currently processing row number
- history[]: per-row execution snapshots (for debugging)

4. Execution Flow
------------------------------------------------------------------------

┌─────────────────────────────────────────────────────────────────────┐
│ Per-Row Processing │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ FOR EACH row IN data: │
│ │ │
│ ├─▶ 1. tryStartNewContext(row) │
│ │ └─ Create Context if new match can start │
│ │ │
│ ├─▶ 2. FOR EACH context: │
│ │ └─ processContext(context, row) │
│ │ └─ State transitions + incremental aggregation │
│ │ │
│ ├─▶ 3. mergeStates() + absorbContexts() │
│ │ ├─ State Merge (same elementIndex+counts) │
│ │ ├─ Summary Merge (same aggregates → combine paths) │
│ │ └─ Absorb later Context into earlier one │
│ │ │
│ ├─▶ 4. Remove dead/completed Contexts │
│ │ │
│ └─▶ 5. emitRows() │
│ ├─ contexts[1+] complete → queue in completedContexts │
│ └─ contexts[0] complete → emit immediately, process │
│ queue │
│ └─ Iterate items where start < contexts[0].start │
│ ├─ PAST LAST: discard if start <= prev end │
│ └─ TO NEXT: end >= ctx[0].start → stop (wait) │
│ end < ctx[0].start → emit │
│ │
└─────────────────────────────────────────────────────────────────────┘

5. Core Concepts
------------------------------------------------------------------------

5.1 State Transitions

Variable element: If condition true, move to next, add to paths
#ALT: Branch to all alternatives (next, jump chain)
#END: Check counter, repeat (jump) or escape (next)
#FIN: Match complete

5.2 Matching Modes (3 Axes)

┌──────────────────────────────────────────────────────────────────┐
│ [Axis 1] Output Rows: ONE ROW / ALL ROWS │
│ [Axis 2] Quantifier: GREEDY / RELUCTANT │
│ [Axis 3] SKIP Option: PAST LAST / TO NEXT │
└──────────────────────────────────────────────────────────────────┘

Output Rows:
- ONE ROW: Output paths[0] (Lexical Order priority path)
- ALL ROWS: Output all paths

Quantifier Greediness:
- GREEDY: Longest match first, preserve completion and continue
- RELUCTANT: Shortest match first, return immediately on completion

SKIP Option:
- PAST LAST: Multiple Contexts active, discard overlaps on emit
- TO NEXT: Start possible every row, allow overlapping matches

5.3 Merge Rules

State Merge (within Context):
- Condition: same elementIndex + counts[]
- Action: combine summaries (preserve creation order)
- Result: reduce State count at same position, prevent redundant
computation

Summary Merge (within State):
- Condition: same aggregates{} values
- Action: combine paths (preserve creation order)
- Result: summaries[0].paths[0] = Lexical Order priority path

5.4 Context Absorption

Condition: First match in pattern is Greedy (max=Infinity AND
reluctant=false)
(same elementIndex AND earlier.counts >= later.counts)
Action: Remove later Context
Result: Earlier continues with longer match, prevent redundant
computation

5.5 Emit Flow

- contexts[0] complete → emit immediately
- contexts[1+] complete → queue in completedContexts
- After contexts[0] emits, process queue (by start order):
1. start >= contexts[0].start → stop (not yet eligible)
2. PAST LAST: start <= previous emit's end → discard, continue
3. TO NEXT: end >= contexts[0].start (overlaps with ctx[0])
→ stop (this item waits until ctx[0] clears)
4. TO NEXT: end < contexts[0].start (no overlap)
→ emit, continue to next item

6. Module Dependencies
------------------------------------------------------------------------

Parser ────────────────────────────────────────────────────────────
Output: PatternElement[]
→ Context: State references PatternElement
→ Executor: passed as pattern field

Context ───────────────────────────────────────────────────────────
Input: PatternElement[] (read-only)
Output: MatchContext (completed/in-progress status)
→ Executor: State transitions in processContext()

Executor ──────────────────────────────────────────────────────────
Input: Pattern, data rows
Internal: Context creation/management
Output: Match results (completedContexts)

7. Context Absorption
------------------------------------------------------------------------

When the pattern starts with a Greedy quantifier (e.g., A+), later
Contexts can be absorbed into earlier ones if they reach the same
state with fewer repetitions. This eliminates redundant computation.

┌─────────────────────────────────────────────────────────────────────┐
│ Context Absorption │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ Pattern: A+ B (A is Greedy: max=Infinity, reluctant=false) │
│ │
│ Row 1: A matched │
│ contexts[0]: start=1, elementIndex=A, counts=[1] │
│ │
│ Row 2: A matched │
│ contexts[0]: start=1, elementIndex=A, counts=[2] ← continue │
│ contexts[1]: start=2, elementIndex=A, counts=[1] ← new Context │
│ │
│ Absorption check: │
│ - First element is Greedy? YES (A+) │
│ - Same elementIndex? YES (both at A) │
│ - earlier.counts[0] >= later.counts[0]? YES (2 >= 1) │
│ → Absorb: Remove contexts[1] │
│ │
│ Result: │
│ contexts[0]: start=1, counts=[2] (only survivor) │
│ │
│ Why: Later Context's path is subset of earlier's longer match │
│ │
└─────────────────────────────────────────────────────────────────────┘

Condition:
1. Pattern's first match is Greedy (max=Infinity AND reluctant=false)
- Simple: A+ B
- Group: (A B)+ C (group repeat is also Greedy)
2. Both Contexts at same elementIndex
3. earlier.counts[depth] >= later.counts[depth]

Action: Remove later Context

Result: Earlier Context continues with longer match,
prevents redundant computation

8. State Fork and Merge
------------------------------------------------------------------------

States fork when multiple paths are possible (e.g., at #ALT or when
min <= count < max at #END). States merge when they reach identical
positions, combining their paths while preventing redundant work.

┌─────────────────────────────────────────────────────────────────────┐
│ State Fork │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ Pattern: A{1,2} B (A can match 1 or 2 times) │
│ │
│ At #END element (after A matched once, count=1): │
│ - count(1) >= min(1)? YES → can escape to B (next) │
│ - count(1) < max(2)? YES → can repeat A (jump) │
│ │
│ Fork into two States: │
│ State1: elementIndex=B, counts=[1] ← escaped (try B) │
│ State2: elementIndex=A, counts=[1] ← repeat (try more A) │
│ │
│ Both States continue in parallel within same Context │
│ │
└─────────────────────────────────────────────────────────────────────┘

Fork Trigger:
- #END element where min <= count < max
- #ALT element (branch to all alternatives)

Fork Action: Create new State for each branch path

Copy-on-Write Optimization:
- Forked States share summaries[] via refcount
- Only copied when one branch modifies (adds path/updates aggregate)
- Reduces memory allocation during fork-heavy patterns

┌─────────────────────────────────────────────────────────────────────┐
│ State Merge │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ Pattern: (A | B) C │
│ │
│ After row matches both A and B conditions: │
│ State1: path=[A], elementIndex=C, counts=[1] │
│ State2: path=[B], elementIndex=C, counts=[1] │
│ │
│ Merge check: │
│ - Same elementIndex? YES (both at C) │
│ - Same counts[]? YES ([1] == [1]) │
│ → Merge States │
│ │
│ Merged State: │
│ elementIndex=C, counts=[1] │
│ summaries: [ │
│ { paths: [[A]] }, ← from State1 (creation order) │
│ { paths: [[B]] } ← from State2 │
│ ] │
│ │
│ Result: Single State, multiple paths preserved │
│ │
└─────────────────────────────────────────────────────────────────────┘

Merge Condition:
- Same elementIndex
- Same counts[] array

Merge Action: Combine summaries (preserve creation order)

Result: Reduces State count, prevents redundant computation
summaries[0].paths[0] = Lexical Order priority path

┌─────────────────────────────────────────────────────────────────────┐
│ Summary Array Merge (during State Merge) │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ Same State = Same Future │
│ States with identical (elementIndex + counts[]) will produce │
│ identical future paths. Only past paths differ. │
│ │
│ Before State merge: │
│ State1: elementIndex=C, counts=[1] │
│ summaries: [{ agg:{sum:10}, paths:[[A,C]] }] │
│ State2: elementIndex=C, counts=[1] │
│ summaries: [{ agg:{sum:20}, paths:[[B,C]] }] │
│ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ │
│ same state = same future │
│ │
│ After State merge: │
│ Merged: elementIndex=C, counts=[1] │
│ summaries: [ │
│ { agg:{sum:10}, paths:[[A,C]] }, │
│ { agg:{sum:20}, paths:[[B,C]] } │
│ ] │
│ │
│ Why: Merge States now, they will have identical continuations │
│ summaries[] combined as array (preserve creation order) │
│ │
└─────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────┐
│ Summary Merge (= Path Merge) │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ When: Two Summaries have identical aggregates{} │
│ → Merge Summaries, which merges their paths │
│ │
│ Before merge: │
│ Summary1: { aggregates: {sum:10}, paths: [[A,A]] } │
│ Summary2: { aggregates: {sum:10}, paths: [[A,B]] } │
│ │
│ After merge: │
│ Summary: { aggregates: {sum:10}, paths: [[A,A], [A,B]] } │
│ │
│ Why: Same aggregate = same "value", only paths differ │
│ Merging Summaries automatically combines their paths │
│ │
└─────────────────────────────────────────────────────────────────────┘

Summary Merge Condition:
- Same aggregates{} values (SUM, COUNT, etc. all equal)

Summary Merge Action:
- Merge two Summaries into one
- paths[][] arrays combined (preserve creation order)
- paths[0] = Lexical Order priority path

┌─────────────────────────────────────────────────────────────────────┐
│ Path Deduplication │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ When: Duplicate paths exist after merging │
│ │
│ Before dedup: │
│ paths: [[A,B,C], [A,B,C], [A,B,D]] │
│ ~~~~~~ ~~~~~~ │
│ duplicate │
│ │
│ After dedup: │
│ paths: [[A,B,C], [A,B,D]] │
│ │
│ Why: Duplicate paths provide no additional information │
│ Remove to reduce memory and output redundancy │
│ │
└─────────────────────────────────────────────────────────────────────┘

Deduplication Condition:
- Identical variable sequence

Deduplication Action:
- Keep first occurrence (preserve creation order)
- Discard duplicates

9. Completion Handling Algorithm
------------------------------------------------------------------------

When a State reaches #FIN (elementIndex = -1), the pattern has matched.
Whether to finalize or continue depends on quantifier mode. GREEDY
preserves completion and continues seeking longer matches, while
RELUCTANT completes immediately with the shortest match.

┌──────────────────────────────────────────────────────────────────────┐
│ Completion Handling Algorithm │
├──────────────────────────────────────────────────────────────────────┤
│ │
│ Completion State Occurs (elementIndex = -1) │
│ │ │
│ ▼ │
│ ┌──────────────────────────────┐ │
│ │ Active states remain AND │ │
│ │ can match pattern input? │ │
│ └──────────────┬───────────────┘ │
│ │ │
│ YES ┌──────────────┴─────────────────┐ NO │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────────────────┐ ┌─────────────────────────────┐ │
│ │ Can continue more │ │ Cannot continue more │ │
│ └────────────┬────────────┘ └──────────────┬──────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────────────────┐ ┌─────────────────────────────┐ │
│ │ Is GREEDY quantifier? │ │ Finalize result │ │
│ └───────────┬─────────────┘ │ │ │
│ │ │ If preserved → fallback │ │
│ │ NO │ If none → match failed │ │
│ YES ┌─┴──────────────┐ │ → Output Lexical Order 1 │ │
│ │ │ └─────────────────────────────┘ │
│ ▼ ▼ │
│ ┌────────────────────┐ ┌───────────────────────┐ │
│ │ Preserve, continue │ │ Complete immediately │ │
│ │ (for fallback) │ │ (RELUCTANT) │ │
│ └────────────────────┘ └───────────────────────┘ │
│ │
├──────────────────────────────────────────────────────────────────────┤
│ Preservation Rule (GREEDY): │
│ 1. Preserve 1 State with longest completion path │
│ 2. If same length, choose Lexical Order priority State │
│ 3. Replace when new longer completion occurs │
│ │
│ Preserved Content: │
│ - state: completed MatchState (elementIndex=-1, ...) │
│ - matchEnd: currentRow at matchedState update time │
│ │
│ Fallback Condition (GREEDY): │
│ - All active states dead (input unmatched) │
│ - matchedState exists → finalize with preserved State │
│ - Use preserved matchEnd (saved at update time) │
└──────────────────────────────────────────────────────────────────────┘

10. AST Optimization
------------------------------------------------------------------------

After parsing, the AST undergoes three optimization passes to simplify
structure before compilation. These reduce PatternElement count,
improving execution performance.

┌─────────────────────────────────────────────────────────────────────┐
│ 1. unwrapGroups (Unnecessary Group Removal) │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ Case 1: Remove {1,1} Group wrapper │
│ │
│ Before: ((A)) After: A │
│ │
│ GROUP {1,1} VAR:A {1,1} │
│ └── GROUP {1,1} │
│ └── VAR:A {1,1} │
│ │
│ Case 2: Flatten nested SEQ │
│ │
│ Before: (A B) C After: A B C │
│ │
│ SEQ SEQ │
│ ├── GROUP {1,1} ├── VAR:A {1,1} │
│ │ └── SEQ ├── VAR:B {1,1} │
│ │ ├── VAR:A └── VAR:C {1,1} │
│ │ └── VAR:B │
│ └── VAR:C │
│ │
│ Case 3: Unwrap single-item SEQ/ALT │
│ │
│ Before: SEQ[A] After: A │
│ Before: ALT[A] After: A │
│ │
└─────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────┐
│ 2. removeDuplicates (Duplicate Alternative Removal) │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ Before: A | B | A After: A | B │
│ │
│ ALT ALT │
│ ├── VAR:A {1,1} ├── VAR:A {1,1} │
│ ├── VAR:B {1,1} └── VAR:B {1,1} │
│ └── VAR:A {1,1} ← duplicate │
│ │
│ Comparison: astEqual() for structural equality │
│ │
│ Before: (A B) | (A B) | C After: (A B) | C │
│ │
│ ALT ALT │
│ ├── SEQ[A,B] ├── SEQ[A,B] │
│ ├── SEQ[A,B] ← duplicate └── VAR:C │
│ └── VAR:C │
│ │
└─────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────┐
│ 3. optimizeQuantifiers (Quantifier Optimization) │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ Case 1: Merge consecutive identical variables │
│ │
│ Before: A A A B After: A{3} B │
│ │
│ SEQ SEQ │
│ ├── VAR:A {1,1} ├── VAR:A {3,3} │
│ ├── VAR:A {1,1} └── VAR:B {1,1} │
│ ├── VAR:A {1,1} │
│ └── VAR:B {1,1} │
│ │
│ Case 2: Merge nested quantifiers (when one is fixed) │
│ │
│ Before: (A{2}){3} After: A{6} │
│ │
│ GROUP {3,3} VAR:A {6,6} │
│ └── VAR:A {2,2} │
│ │
│ Calculation: min = 2*3 = 6, max = 2*3 = 6 │
│ │
│ Note: Only safe when inner or outer quantifier has min == max │
│ │
└─────────────────────────────────────────────────────────────────────┘

Optimization Order:
1. unwrapGroups - Remove unnecessary structure
2. removeDuplicates - Eliminate redundant alternatives
3. optimizeQuantifiers - Merge quantifiers

Combined Example:

Input: "((A | A) B B)"

Step 1 (unwrapGroups):
GROUP{1,1} -> SEQ, inner GROUP{1,1} -> ALT
Result: SEQ[ALT[A,A], B, B]

Step 2 (removeDuplicates):
ALT[A,A] -> A
Result: SEQ[A, B, B]

Step 3 (optimizeQuantifiers):
B B -> B{2}
Result: SEQ[A, B{2}]

11. AST to PatternElement[] Compilation
------------------------------------------------------------------------

The optimized AST is flattened into a linear PatternElement array.
Each element contains pointers (next, jump) for navigation and depth
for counter indexing. This flat structure enables efficient state
transitions during NFA execution.

┌─────────────────────────────────────────────────────────────────────┐
│ AST Flattening Example │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ Pattern: "(A | B)+ C" │
│ │
│ AST: PatternElement[]: │
│ │
│ SEQ [0] #ALT next:1 d:1 │
│ ├── GROUP {1,∞} [1] A next:3, jump:2 d:2 │
│ │ └── ALT [2] B next:3, jump:-1 d:2 │
│ │ ├── VAR:A [3] #END next:4, jump:0 d:0 │
│ │ └── VAR:B min:1, max:∞ │
│ └── VAR:C {0,1} [4] C next:5, min:0, max:1 d:0 │
│ [5] #FIN next:-1 │
│ │
└─────────────────────────────────────────────────────────────────────┘

Pointer Rules:

next pointer:
- Sequential elements: index of next element
- Last element before #FIN: points to #FIN
- #FIN: -1 (terminal)

jump pointer:
- #END: group start index (loop back)
- ALT branch first element: next alternative start
- Last alternative: -1

depth:
- Top level: 0
- Inside GROUP/ALT: parent depth + 1
- #END: parent depth (for repeat counter indexing)

Flattening Process:

1. Traverse AST recursively
2. For each node type:
- VAR: emit PatternElement with varId, min, max, depth
- GROUP: flatten content, emit #END with min/max/jump
- ALT: emit #ALT, flatten each alternative with jump chain
- SEQ: flatten each item sequentially
3. Emit #FIN at end
4. Resolve next pointers (forward references)

┌─────────────────────────────────────────────────────────────────────┐
│ Pointer Resolution Example │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ Pattern: "A | B | C" │
│ │
│ [0] #ALT next:1, jump:-1 │
│ │ │
│ ├──▶ [1] A next:4, jump:2 ──┐ │
│ │ │ │
│ ├──▶ [2] B next:4, jump:3 ──┤ (all point to #FIN) │
│ │ │ │
│ └──▶ [3] C next:4, jump:-1 ──┘ │
│ │
│ [4] #FIN next:-1 │
│ │
│ #ALT.next = first branch (A) │
│ Each branch.jump = next branch (or -1 for last) │
│ Each branch.next = element after ALT (#FIN) │
│ │
└─────────────────────────────────────────────────────────────────────┘

pgsql-hackers by date:

From: "Jelte Fennema-Nio"
Date: 02 January, 19:06:49
Subject: Re: Decouple C++ support in Meson's PGXS from LLVM enablement

From: Bryan Green
Date: 02 January, 20:40:11
Subject: Re: Use Python "Limited API" in PL/Python

Re: Row pattern recognition - Mailing list pgsql-hackers

Previous

Next