Home > mailing lists

Re: Row pattern recognition - Mailing list pgsql-hackers

From	Henson Choi
Subject	Re: Row pattern recognition
Date	February 28 19:59:56
Msg-id	CAAAe_zDq7R8CaDfmh8C+H3_639Y5LtJD+2Z=1txDt=vaOr90rQ@mail.gmail.com Whole thread
In response to	Re: Row pattern recognition (Tatsuo Ishii <ishii@postgresql.org>)
Responses	Re: Row pattern recognition
List	pgsql-hackers

Tree view

Hi hackers,

I'd like to propose an extension to the context absorption optimization
for Row Pattern Recognition in the window clause: PREFIX pattern
absorption. This refines the "anchored pattern absorption" concept I
posted earlier [1], renaming it to "PREFIX pattern absorption" per
Ishii-san's feedback [2], and formalizing the absorption conditions
with count-dominance analysis.

It builds on the existing absorption framework to handle patterns where
a fixed-length prefix precedes the unbounded greedy repetition, using a
"shadow path" design.

[1] https://www.postgresql.org/message-id/CAAAe_zAEg7sVM=WDwXMyE-odGmQyXSVi5ZzWgye6SupSjdMKpg@mail.gmail.com
[2] https://www.postgresql.org/message-id/20260217.180053.64368434967157940.ishii@postgresql.org

1. Problem Definition
=====================

1.1 Current Absorption Scope

Context absorption currently applies to patterns with an unbounded-start
structure -- where an unbounded greedy repetition appears at the very
beginning of the pattern.

Example: A+ B -- the first element A+ is an unbounded repetition, so an
older context always has a repeat count greater than or equal to any
newer context, making immediate absorption possible.

1.2 The PREFIX Pattern Problem

For patterns like START SECOND A+ B+, where a finite-length prefix
(PREFIX) precedes the unbounded repetition (A+), absorption does not
currently work.

Why:
- A new context starts at the START element, so it is not at the same
pattern position as an older context in the A+ region.
- The allStatesAbsorbable condition is not satisfied, so absorption
never fires.
- Contexts accumulate while traversing the PREFIX, causing O(n^2)
regression.

2. Relationship to the Existing Framework
==========================================

2.1 Position in the Three-Phase Execution Model

The Match -> Absorb -> Advance model is preserved as-is. PREFIX
absorption is an extension of the Absorb phase's eligibility criteria,
not a new phase.

2.2 Prerequisites

The same four prerequisites as existing absorption apply:

1. Greedy quantifier -- A+ must be greedy.
2. SKIP TO PAST LAST ROW -- prior matches suppress subsequent start
points.
3. UNBOUNDED FOLLOWING -- all contexts see the same future rows.
4. Match-start-independent DEFINE predicates -- including predicates of
PREFIX elements.

2.3 Extension of the Dominance Order

In existing absorption, the dominance condition is:

C_old and C_new are in the absorbable region of the same element,
and C_old.counts >= C_new.counts.

In PREFIX patterns this condition cannot be directly satisfied: C_new is
in the PREFIX region while C_old is in the BODY region. We therefore
need a mechanism to track what C_new's state *would be* if it had
already reached the BODY region.

3. Design: Shadow Path
=======================

3.1 Core Idea

Each context logically executes two parallel paths:

- Original path: executes the full pattern in order (PREFIX -> BODY).
This is the path that produces the actual match result and evaluates
PREFIX predicates.

- Shadow path: skips the PREFIX and starts directly in the BODY region
(A+).
- Used solely for absorption eligibility decisions.
- Does not produce match results.
- Actually evaluates the BODY element's DEFINE predicates to track
counts accurately.
- The shadow is placed at the BODY start state upon creation and
immediately advances, so its initial count is 1.
- Not created if there is no preceding context (nothing to be
absorbed into).
- Discarded when the preceding context dies.
- Discarded when the shadow leaves the absorbable region (BODY->TAIL
transition or match failure).

3.2 Absorption Conditions

For C_old to absorb C_new, all of the following must hold:

1. C_old is past the PREFIX and in the absorbable region (BODY).
- hasAbsorbableState = true (existing flag)

2. C_new was created at or after the row where C_old entered the
absorbable region.
- "Same row" means: in the Match phase C_old enters BODY and C_new
is created; the subsequent Absorb phase of that same row can then
perform the absorption check.

3. C_new's shadow path is alive in the absorbable region.
- allStatesAbsorbable = true (evaluated on the shadow)

4. C_old.counts[depth] >= C_new.shadow.counts[depth]
- Same count-dominance condition as existing absorption (equality
counts as dominance).
- The shadow's count increases as it processes rows, so the
comparison uses actual counts at absorption time.

Condition 2 is the key insight. Only contexts created after C_old has
passed through the PREFIX into the absorbable region are eligible for
absorption. Contexts created while C_old is still in the PREFIX cannot
be absorbed.

3.3 Common Fate

Soundness argument for absorption:

C_old's original path and C_new's shadow path are at the same BODY
element (A+). By match-start independence, the DEFINE predicates of
both paths produce identical results on the same row. If C_new's
shadow can match at some future row r, then C_old can also complete
a match at row r (it has matched at least as many times, sees the
same future rows, and DEFINE evaluations are identical). Under SKIP
TO PAST LAST ROW, C_old's match includes C_new's start point, so
removing C_new does not lose any reported match.

3.4 Complexity

With PREFIX absorption, the maximum number of simultaneously active
contexts is bounded by PREFIX_length + 1. Every C_new created after
C_old enters BODY is absorbed immediately, so the only contexts that can
coexist are those still traversing the PREFIX plus the single C_old in
BODY. Since PREFIX length is a per-pattern constant, overall execution
complexity is O(n).

3.5 Removal of Non-Absorbable Contexts

Contexts that do not satisfy condition 2 -- i.e., those created before
C_old entered BODY -- cannot be removed by PREFIX absorption.

These contexts are removed by the existing SKIP TO PAST LAST ROW
mechanism, which is separate from PREFIX absorption. When C_old
successfully matches, all subsequent contexts whose start points fall
within C_old's match range are removed as overlaps. Since contexts
created during the PREFIX region have start points within C_old's match
range, they are automatically removed upon C_old's match completion.

The number of such contexts is at most PREFIX_length, which does not
affect the O(n) complexity (see Section 3.4).

4. Compile-Time Analysis
=========================

4.1 Pattern Structure Recognition

At compile time, traverse the pattern elements from the start until the
first unbounded greedy repetition is found. All preceding fixed-length
elements form the PREFIX; the unbounded element is the BODY; everything
after it is the TAIL. If the first element is already unbounded, the
pattern has no PREFIX and existing absorption applies directly.

Pattern: E1 E2 ... Ek A+ B+ ...
\__________/ \__/ \___/
PREFIX BODY TAIL
(absorbable) (non-absorbable)

5. Worked Example
==================

Pattern: START SECOND A+ B+
PREFIX: START, SECOND (length 2)
BODY: A+ (absorbable)
TAIL: B+ (non-absorbable)

Data: r0-r3 satisfy A only, r4-r5 satisfy B only, r6 satisfies neither
A nor B.

Row C1(orig) C2(orig) C2(shadow) C3(orig) C3(shadow) Active
r0 START - - - - 1
r1 SECOND START A(cnt=1) - - 2
r2 A(cnt=1) SECOND A(cnt=2) START A(cnt=1) 3
^ C1 reaches BODY
-> C3: shadow A(cnt=1), created at C1's BODY entry
-> absorbed (1 >= 1)
-> C2: created before C1's BODY entry
-> not absorbable 2
r3 A(cnt=2) A(cnt=1) A(cnt=3) - - 2
r4 B(cnt=1) B(cnt=1) (discard) - - 2
^ TAIL ^ TAIL ^ shadow A+->B+ -> TAIL -> discard
r5 B(cnt=2) B(cnt=2) - - - 2
r6 neither A nor B -> B+ ends -> C1 match (r0-r5) 0
C2 removed by SKIP TO PAST LAST ROW overlap
(C2 start r1 falls within C1 match range r0-r5)

Max concurrent contexts: 3 = PREFIX_length(2) + 1

Key observations:
- C3 is immediately removed at r2 by PREFIX absorption (shadow-path-
based).
- C2's shadow is discarded at r4 when A+->B+ transition moves it into
TAIL.
- C2 is not absorbable (condition 2 not met), survives until r6, then
removed by SKIP TO PAST LAST ROW overlap when C1 matches
(Section 3.5).

This is a design proposal at this stage. Thoughts and feedback welcome.

Best regards,
Henson

pgsql-hackers by date:

From: Andrey Borodin
Date: 28 February, 18:26:05
Subject: Re: [WiP] B-tree page merge during vacuum to reduce index bloat

From: Andres Freund
Date: 28 February, 21:01:49
Subject: Re: index prefetching

Re: Row pattern recognition - Mailing list pgsql-hackers

Previous

Next