Re: Row pattern recognition - Mailing list pgsql-hackers
| From | Henson Choi |
|---|---|
| Subject | Re: Row pattern recognition |
| Date | |
| Msg-id | CAAAe_zDq7R8CaDfmh8C+H3_639Y5LtJD+2Z=1txDt=vaOr90rQ@mail.gmail.com Whole thread |
| In response to | Re: Row pattern recognition (Tatsuo Ishii <ishii@postgresql.org>) |
| Responses |
Re: Row pattern recognition
|
| List | pgsql-hackers |
Hi hackers,
I'd like to propose an extension to the context absorption optimization
for Row Pattern Recognition in the window clause: PREFIX pattern
absorption. This refines the "anchored pattern absorption" concept I
posted earlier [1], renaming it to "PREFIX pattern absorption" per
Ishii-san's feedback [2], and formalizing the absorption conditions
with count-dominance analysis.
It builds on the existing absorption framework to handle patterns where
a fixed-length prefix precedes the unbounded greedy repetition, using a
"shadow path" design.
[1] https://www.postgresql.org/message-id/CAAAe_zAEg7sVM=WDwXMyE-odGmQyXSVi5ZzWgye6SupSjdMKpg@mail.gmail.com
[2] https://www.postgresql.org/message-id/20260217.180053.64368434967157940.ishii@postgresql.org
1. Problem Definition
=====================
1.1 Current Absorption Scope
Context absorption currently applies to patterns with an unbounded-start
structure -- where an unbounded greedy repetition appears at the very
beginning of the pattern.
Example: A+ B -- the first element A+ is an unbounded repetition, so an
older context always has a repeat count greater than or equal to any
newer context, making immediate absorption possible.
1.2 The PREFIX Pattern Problem
For patterns like START SECOND A+ B+, where a finite-length prefix
(PREFIX) precedes the unbounded repetition (A+), absorption does not
currently work.
Why:
- A new context starts at the START element, so it is not at the same
pattern position as an older context in the A+ region.
- The allStatesAbsorbable condition is not satisfied, so absorption
never fires.
- Contexts accumulate while traversing the PREFIX, causing O(n^2)
regression.
2. Relationship to the Existing Framework
==========================================
2.1 Position in the Three-Phase Execution Model
The Match -> Absorb -> Advance model is preserved as-is. PREFIX
absorption is an extension of the Absorb phase's eligibility criteria,
not a new phase.
2.2 Prerequisites
The same four prerequisites as existing absorption apply:
1. Greedy quantifier -- A+ must be greedy.
2. SKIP TO PAST LAST ROW -- prior matches suppress subsequent start
points.
3. UNBOUNDED FOLLOWING -- all contexts see the same future rows.
4. Match-start-independent DEFINE predicates -- including predicates of
PREFIX elements.
2.3 Extension of the Dominance Order
In existing absorption, the dominance condition is:
C_old and C_new are in the absorbable region of the same element,
and C_old.counts >= C_new.counts.
In PREFIX patterns this condition cannot be directly satisfied: C_new is
in the PREFIX region while C_old is in the BODY region. We therefore
need a mechanism to track what C_new's state *would be* if it had
already reached the BODY region.
3. Design: Shadow Path
=======================
3.1 Core Idea
Each context logically executes two parallel paths:
- Original path: executes the full pattern in order (PREFIX -> BODY).
This is the path that produces the actual match result and evaluates
PREFIX predicates.
- Shadow path: skips the PREFIX and starts directly in the BODY region
(A+).
- Used solely for absorption eligibility decisions.
- Does not produce match results.
- Actually evaluates the BODY element's DEFINE predicates to track
counts accurately.
- The shadow is placed at the BODY start state upon creation and
immediately advances, so its initial count is 1.
- Not created if there is no preceding context (nothing to be
absorbed into).
- Discarded when the preceding context dies.
- Discarded when the shadow leaves the absorbable region (BODY->TAIL
transition or match failure).
3.2 Absorption Conditions
For C_old to absorb C_new, all of the following must hold:
1. C_old is past the PREFIX and in the absorbable region (BODY).
- hasAbsorbableState = true (existing flag)
2. C_new was created at or after the row where C_old entered the
absorbable region.
- "Same row" means: in the Match phase C_old enters BODY and C_new
is created; the subsequent Absorb phase of that same row can then
perform the absorption check.
3. C_new's shadow path is alive in the absorbable region.
- allStatesAbsorbable = true (evaluated on the shadow)
4. C_old.counts[depth] >= C_new.shadow.counts[depth]
- Same count-dominance condition as existing absorption (equality
counts as dominance).
- The shadow's count increases as it processes rows, so the
comparison uses actual counts at absorption time.
Condition 2 is the key insight. Only contexts created after C_old has
passed through the PREFIX into the absorbable region are eligible for
absorption. Contexts created while C_old is still in the PREFIX cannot
be absorbed.
3.3 Common Fate
Soundness argument for absorption:
C_old's original path and C_new's shadow path are at the same BODY
element (A+). By match-start independence, the DEFINE predicates of
both paths produce identical results on the same row. If C_new's
shadow can match at some future row r, then C_old can also complete
a match at row r (it has matched at least as many times, sees the
same future rows, and DEFINE evaluations are identical). Under SKIP
TO PAST LAST ROW, C_old's match includes C_new's start point, so
removing C_new does not lose any reported match.
3.4 Complexity
With PREFIX absorption, the maximum number of simultaneously active
contexts is bounded by PREFIX_length + 1. Every C_new created after
C_old enters BODY is absorbed immediately, so the only contexts that can
coexist are those still traversing the PREFIX plus the single C_old in
BODY. Since PREFIX length is a per-pattern constant, overall execution
complexity is O(n).
3.5 Removal of Non-Absorbable Contexts
Contexts that do not satisfy condition 2 -- i.e., those created before
C_old entered BODY -- cannot be removed by PREFIX absorption.
These contexts are removed by the existing SKIP TO PAST LAST ROW
mechanism, which is separate from PREFIX absorption. When C_old
successfully matches, all subsequent contexts whose start points fall
within C_old's match range are removed as overlaps. Since contexts
created during the PREFIX region have start points within C_old's match
range, they are automatically removed upon C_old's match completion.
The number of such contexts is at most PREFIX_length, which does not
affect the O(n) complexity (see Section 3.4).
4. Compile-Time Analysis
=========================
4.1 Pattern Structure Recognition
At compile time, traverse the pattern elements from the start until the
first unbounded greedy repetition is found. All preceding fixed-length
elements form the PREFIX; the unbounded element is the BODY; everything
after it is the TAIL. If the first element is already unbounded, the
pattern has no PREFIX and existing absorption applies directly.
Pattern: E1 E2 ... Ek A+ B+ ...
\__________/ \__/ \___/
PREFIX BODY TAIL
(absorbable) (non-absorbable)
5. Worked Example
==================
Pattern: START SECOND A+ B+
PREFIX: START, SECOND (length 2)
BODY: A+ (absorbable)
TAIL: B+ (non-absorbable)
Data: r0-r3 satisfy A only, r4-r5 satisfy B only, r6 satisfies neither
A nor B.
Row C1(orig) C2(orig) C2(shadow) C3(orig) C3(shadow) Active
r0 START - - - - 1
r1 SECOND START A(cnt=1) - - 2
r2 A(cnt=1) SECOND A(cnt=2) START A(cnt=1) 3
^ C1 reaches BODY
-> C3: shadow A(cnt=1), created at C1's BODY entry
-> absorbed (1 >= 1)
-> C2: created before C1's BODY entry
-> not absorbable 2
r3 A(cnt=2) A(cnt=1) A(cnt=3) - - 2
r4 B(cnt=1) B(cnt=1) (discard) - - 2
^ TAIL ^ TAIL ^ shadow A+->B+ -> TAIL -> discard
r5 B(cnt=2) B(cnt=2) - - - 2
r6 neither A nor B -> B+ ends -> C1 match (r0-r5) 0
C2 removed by SKIP TO PAST LAST ROW overlap
(C2 start r1 falls within C1 match range r0-r5)
Max concurrent contexts: 3 = PREFIX_length(2) + 1
Key observations:
- C3 is immediately removed at r2 by PREFIX absorption (shadow-path-
based).
- C2's shadow is discarded at r4 when A+->B+ transition moves it into
TAIL.
- C2 is not absorbable (condition 2 not met), survives until r6, then
removed by SKIP TO PAST LAST ROW overlap when C1 matches
(Section 3.5).
This is a design proposal at this stage. Thoughts and feedback welcome.
Best regards,
Henson
I'd like to propose an extension to the context absorption optimization
for Row Pattern Recognition in the window clause: PREFIX pattern
absorption. This refines the "anchored pattern absorption" concept I
posted earlier [1], renaming it to "PREFIX pattern absorption" per
Ishii-san's feedback [2], and formalizing the absorption conditions
with count-dominance analysis.
It builds on the existing absorption framework to handle patterns where
a fixed-length prefix precedes the unbounded greedy repetition, using a
"shadow path" design.
[1] https://www.postgresql.org/message-id/CAAAe_zAEg7sVM=WDwXMyE-odGmQyXSVi5ZzWgye6SupSjdMKpg@mail.gmail.com
[2] https://www.postgresql.org/message-id/20260217.180053.64368434967157940.ishii@postgresql.org
1. Problem Definition
=====================
1.1 Current Absorption Scope
Context absorption currently applies to patterns with an unbounded-start
structure -- where an unbounded greedy repetition appears at the very
beginning of the pattern.
Example: A+ B -- the first element A+ is an unbounded repetition, so an
older context always has a repeat count greater than or equal to any
newer context, making immediate absorption possible.
1.2 The PREFIX Pattern Problem
For patterns like START SECOND A+ B+, where a finite-length prefix
(PREFIX) precedes the unbounded repetition (A+), absorption does not
currently work.
Why:
- A new context starts at the START element, so it is not at the same
pattern position as an older context in the A+ region.
- The allStatesAbsorbable condition is not satisfied, so absorption
never fires.
- Contexts accumulate while traversing the PREFIX, causing O(n^2)
regression.
2. Relationship to the Existing Framework
==========================================
2.1 Position in the Three-Phase Execution Model
The Match -> Absorb -> Advance model is preserved as-is. PREFIX
absorption is an extension of the Absorb phase's eligibility criteria,
not a new phase.
2.2 Prerequisites
The same four prerequisites as existing absorption apply:
1. Greedy quantifier -- A+ must be greedy.
2. SKIP TO PAST LAST ROW -- prior matches suppress subsequent start
points.
3. UNBOUNDED FOLLOWING -- all contexts see the same future rows.
4. Match-start-independent DEFINE predicates -- including predicates of
PREFIX elements.
2.3 Extension of the Dominance Order
In existing absorption, the dominance condition is:
C_old and C_new are in the absorbable region of the same element,
and C_old.counts >= C_new.counts.
In PREFIX patterns this condition cannot be directly satisfied: C_new is
in the PREFIX region while C_old is in the BODY region. We therefore
need a mechanism to track what C_new's state *would be* if it had
already reached the BODY region.
3. Design: Shadow Path
=======================
3.1 Core Idea
Each context logically executes two parallel paths:
- Original path: executes the full pattern in order (PREFIX -> BODY).
This is the path that produces the actual match result and evaluates
PREFIX predicates.
- Shadow path: skips the PREFIX and starts directly in the BODY region
(A+).
- Used solely for absorption eligibility decisions.
- Does not produce match results.
- Actually evaluates the BODY element's DEFINE predicates to track
counts accurately.
- The shadow is placed at the BODY start state upon creation and
immediately advances, so its initial count is 1.
- Not created if there is no preceding context (nothing to be
absorbed into).
- Discarded when the preceding context dies.
- Discarded when the shadow leaves the absorbable region (BODY->TAIL
transition or match failure).
3.2 Absorption Conditions
For C_old to absorb C_new, all of the following must hold:
1. C_old is past the PREFIX and in the absorbable region (BODY).
- hasAbsorbableState = true (existing flag)
2. C_new was created at or after the row where C_old entered the
absorbable region.
- "Same row" means: in the Match phase C_old enters BODY and C_new
is created; the subsequent Absorb phase of that same row can then
perform the absorption check.
3. C_new's shadow path is alive in the absorbable region.
- allStatesAbsorbable = true (evaluated on the shadow)
4. C_old.counts[depth] >= C_new.shadow.counts[depth]
- Same count-dominance condition as existing absorption (equality
counts as dominance).
- The shadow's count increases as it processes rows, so the
comparison uses actual counts at absorption time.
Condition 2 is the key insight. Only contexts created after C_old has
passed through the PREFIX into the absorbable region are eligible for
absorption. Contexts created while C_old is still in the PREFIX cannot
be absorbed.
3.3 Common Fate
Soundness argument for absorption:
C_old's original path and C_new's shadow path are at the same BODY
element (A+). By match-start independence, the DEFINE predicates of
both paths produce identical results on the same row. If C_new's
shadow can match at some future row r, then C_old can also complete
a match at row r (it has matched at least as many times, sees the
same future rows, and DEFINE evaluations are identical). Under SKIP
TO PAST LAST ROW, C_old's match includes C_new's start point, so
removing C_new does not lose any reported match.
3.4 Complexity
With PREFIX absorption, the maximum number of simultaneously active
contexts is bounded by PREFIX_length + 1. Every C_new created after
C_old enters BODY is absorbed immediately, so the only contexts that can
coexist are those still traversing the PREFIX plus the single C_old in
BODY. Since PREFIX length is a per-pattern constant, overall execution
complexity is O(n).
3.5 Removal of Non-Absorbable Contexts
Contexts that do not satisfy condition 2 -- i.e., those created before
C_old entered BODY -- cannot be removed by PREFIX absorption.
These contexts are removed by the existing SKIP TO PAST LAST ROW
mechanism, which is separate from PREFIX absorption. When C_old
successfully matches, all subsequent contexts whose start points fall
within C_old's match range are removed as overlaps. Since contexts
created during the PREFIX region have start points within C_old's match
range, they are automatically removed upon C_old's match completion.
The number of such contexts is at most PREFIX_length, which does not
affect the O(n) complexity (see Section 3.4).
4. Compile-Time Analysis
=========================
4.1 Pattern Structure Recognition
At compile time, traverse the pattern elements from the start until the
first unbounded greedy repetition is found. All preceding fixed-length
elements form the PREFIX; the unbounded element is the BODY; everything
after it is the TAIL. If the first element is already unbounded, the
pattern has no PREFIX and existing absorption applies directly.
Pattern: E1 E2 ... Ek A+ B+ ...
\__________/ \__/ \___/
PREFIX BODY TAIL
(absorbable) (non-absorbable)
5. Worked Example
==================
Pattern: START SECOND A+ B+
PREFIX: START, SECOND (length 2)
BODY: A+ (absorbable)
TAIL: B+ (non-absorbable)
Data: r0-r3 satisfy A only, r4-r5 satisfy B only, r6 satisfies neither
A nor B.
Row C1(orig) C2(orig) C2(shadow) C3(orig) C3(shadow) Active
r0 START - - - - 1
r1 SECOND START A(cnt=1) - - 2
r2 A(cnt=1) SECOND A(cnt=2) START A(cnt=1) 3
^ C1 reaches BODY
-> C3: shadow A(cnt=1), created at C1's BODY entry
-> absorbed (1 >= 1)
-> C2: created before C1's BODY entry
-> not absorbable 2
r3 A(cnt=2) A(cnt=1) A(cnt=3) - - 2
r4 B(cnt=1) B(cnt=1) (discard) - - 2
^ TAIL ^ TAIL ^ shadow A+->B+ -> TAIL -> discard
r5 B(cnt=2) B(cnt=2) - - - 2
r6 neither A nor B -> B+ ends -> C1 match (r0-r5) 0
C2 removed by SKIP TO PAST LAST ROW overlap
(C2 start r1 falls within C1 match range r0-r5)
Max concurrent contexts: 3 = PREFIX_length(2) + 1
Key observations:
- C3 is immediately removed at r2 by PREFIX absorption (shadow-path-
based).
- C2's shadow is discarded at r4 when A+->B+ transition moves it into
TAIL.
- C2 is not absorbable (condition 2 not met), survives until r6, then
removed by SKIP TO PAST LAST ROW overlap when C1 matches
(Section 3.5).
This is a design proposal at this stage. Thoughts and feedback welcome.
Best regards,
Henson
pgsql-hackers by date: