Re: Row pattern recognition - Mailing list pgsql-hackers

From Henson Choi
Subject Re: Row pattern recognition
Date
Msg-id CAAAe_zDU5G3T1CbpTzGsZvxyY6FXnPBjmaWuSoayO_GaDKfZnA@mail.gmail.com
Whole thread Raw
In response to Re: Row pattern recognition  (Tatsuo Ishii <ishii@postgresql.org>)
List pgsql-hackers
SQL/RPR Patch Review Report
============================

Patch: Row Pattern Recognition (SQL:2016)
Commitfest: https://commitfest.postgresql.org/patch/4460

Review Methodology:
  This review focused on quality assessment, not line-by-line code audit.
  Key code paths and quality issues were examined with surrounding context
  when concerns arose. Documentation files were reviewed with AI-assisted
  grammar and typo checking. Code coverage was measured using gcov and
  custom analysis tools.

Limitations:
  - Not a security audit or formal verification
  - Parser and executor internals reviewed at module level, not exhaustively
  - Performance characteristics not benchmarked


TABLE OF CONTENTS
-----------------

1. Executive Summary
2. Issues Found
   2.1 Critical / Major
   2.2 Minor Issues (Review Needed)
   2.3 Minor Issues (Style)
   2.4 Suggestions for Discussion
3. Test Coverage
   3.1 Covered Areas
   3.2 Untested Items
   3.3 Unimplemented Features (No Test Needed)
4. Code Coverage Analysis
   4.1 Overall Coverage
   4.2 Coverage by File
   4.3 Uncovered Code Risk Assessment
5. Commit Analysis
6. Recommendations
7. Conclusion


1. EXECUTIVE SUMMARY
--------------------

Overall assessment: GOOD

The SQL/RPR patch demonstrates solid implementation quality within its
defined scope. Code follows PostgreSQL coding standards (with minor style
issues), test coverage is comprehensive at 96.4%, and documentation is
thorough with only minor typos.

Rating by Area:
- Code Quality:     Good (PostgreSQL style compliant, 26 static style fixes recommended)
- Test Coverage:    Good (96.4% line coverage, 28 test cases)
- Documentation:    Good (Complete, 1 minor typo)
- Build/Regress:    Pass (make check-world, 753 tests passed)


2. ISSUES FOUND
---------------

2.1 Critical / Major

#1 [Code] Greedy pattern combinatorial explosion
   Pattern: PATTERN (A+ B+ C+) with DEFINE A AS TRUE, B AS TRUE, C AS TRUE
   Impact: Memory exhaustion or infinite wait
   Recommendation: Add work_mem-based memory limit (error on exceed)

   Evidence - No memory limit in current code:
   - nodeWindowAgg.c:5718-5722 string_set_init(): palloc() unconditional
   - nodeWindowAgg.c:5741-5750 string_set_add(): set_size *= 2; repalloc() unlimited
   - nodeWindowAgg.c:5095-5174 generate_patterns(): Triple loop, no limit

   Only work_mem usage in RPR (nodeWindowAgg.c:1297):
     winstate->buffer = tuplestore_begin_heap(false, false, work_mem);
   -> For tuple buffer, unrelated to pattern combination memory (StringSet)

2.2 Minor Issues (Review Needed)

#1 [Code] nodeWindowAgg.c:5849,5909,5912
   pos > NUM_ALPHABETS check - intent unclear, causes error at 28 PATTERN elements

   Reproduction:
   - PATTERN (A B C ... Z A) (27 elements) -> OK
   - PATTERN (A B C ... Z A B) (28 elements) -> ERROR "initial is not valid char: b"

2.3 Minor Issues (Style)

#1 [Code] nodeWindowAgg.c (25 blocks)
   #ifdef RPR_DEBUG -> recommend elog(DEBUG2, ...) or remove

#2 [Code] src/backend/executor/nodeWindowAgg.c
   static keyword on separate line (26 functions)

#3 [Code] src/backend/utils/adt/ruleutils.c
   Whitespace typo: "regexp !=NULL" -> "regexp != NULL"

#4 [Code] nodeWindowAgg.c:4619
   Error message case: "Unrecognized" -> "unrecognized" (PostgreSQL style)

#5 [Doc] doc/src/sgml/ref/select.sgml:1128
   Typo: "maximu" -> "maximum"

2.4 Suggestions for Discussion

#1 Incremental matching with streaming NFA redesign
   Benefits:
   - O(n) time complexity per row (current: exponential in worst case)
   - Bounded memory via state merging and context absorption
   - Natural extension for OR patterns, {n,m} quantifiers, nested groups
   - Enables MEASURES clause with incremental aggregates
   - Proven approach in CEP engines (Flink, Esper)


3. TEST COVERAGE
----------------

3.1 Covered Areas

- PATTERN clause: +, *, ? quantifiers (line 41, 71, 86)
- DEFINE clause: Variable conditions, auto-TRUE for missing (line 120-133)
- PREV/NEXT functions: Single argument (line 44, 173)
- AFTER MATCH SKIP: TO NEXT ROW (line 182), PAST LAST ROW (line 198)
- Aggregate integration: sum, avg, count, max, min (line 258-277)
- Window function integration: first_value, last_value, nth_value (line 34-35)
- JOIN/CTE: JOIN (line 313), WITH (line 324)
- View: VIEW creation, pg_get_viewdef (line 390-406)
- Error cases: 7 error scenarios (line 409-532)
- Large partition: 5000 row smoke test (line 360-387)
- ROWS BETWEEN offset: CURRENT ROW AND 2 FOLLOWING (line 244)

3.2 Untested Items

Feature                  Priority   Recommendation
-------------------------------------------------------------------------------
26 variable limit        Medium     Test 26 variables success, 27th variable error
NULL value handling      Low        Test PREV(col) where col or previous row is NULL

Sample test for 26 variable limit:

    -- Should fail with 27th variable (parser error, no table needed)
    SELECT * FROM (SELECT 1 AS x) t
    WINDOW w AS (
      ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
      PATTERN (a b c d e f g h i j k l m n o p q r s t u v w x y z aa)
      DEFINE a AS TRUE, b AS TRUE, c AS TRUE, d AS TRUE, e AS TRUE, f AS TRUE,
             g AS TRUE, h AS TRUE, i AS TRUE, j AS TRUE, k AS TRUE, l AS TRUE,
             m AS TRUE, n AS TRUE, o AS TRUE, p AS TRUE, q AS TRUE, r AS TRUE,
             s AS TRUE, t AS TRUE, u AS TRUE, v AS TRUE, w AS TRUE, x AS TRUE,
             y AS TRUE, z AS TRUE, aa AS TRUE
    );
    -- ERROR: number of row pattern definition variable names exceeds 26

Sample test for NULL handling:

    CREATE TEMP TABLE stock_null (company TEXT, tdate DATE, price INTEGER);
    INSERT INTO stock_null VALUES ('c1', '2023-07-01', 100);
    INSERT INTO stock_null VALUES ('c1', '2023-07-02', NULL);  -- NULL in middle
    INSERT INTO stock_null VALUES ('c1', '2023-07-03', 200);
    INSERT INTO stock_null VALUES ('c1', '2023-07-04', 150);

    SELECT company, tdate, price, count(*) OVER w AS match_count
    FROM stock_null
    WINDOW w AS (
      PARTITION BY company
      ORDER BY tdate
      ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
      PATTERN (START UP DOWN)
      DEFINE START AS TRUE, UP AS price > PREV(price), DOWN AS price < PREV(price)
    );
    -- Result: all rows show match_count = 0 (NULL breaks pattern matching)

3.3 Unimplemented Features (No Test Needed)

Per documentation (select.sgml:1133-1136):
- MEASURES clause: Not implemented
- SUBSET clause: Not implemented
- AFTER MATCH variants: Only SKIP TO NEXT ROW, PAST LAST ROW supported

Not in documentation, verified by testing:
- {n,m} quantifier: Parser error ("syntax error at or near {")
- (A|B) OR pattern: Not supported (parsed as single variable name "a|b")
- (A B)+ compound repetition: Parser accepts, but execution returns 0 matches


4. CODE COVERAGE ANALYSIS
-------------------------

4.1 Overall Coverage

96.4% (732 / 759 lines)

4.2 Coverage by File (major RPR-modified files)

nodeWindowAgg.c:    96.6% (560/580) - Pattern matching execution engine
parse_clause.c:     96.7% (88/91) - PATTERN/DEFINE analysis
ruleutils.c:        93.1% (54/58) - pg_get_viewdef output

4.3 Uncovered Code Risk Assessment

Total: 27 lines uncovered

Medium Risk (2 items) - Test addition recommended (see section 3.2):
- parse_clause.c:4093: transformDefineClause - 27th variable error
- nodeWindowAgg.c:5623: get_slots - null_slot for PREV at partition first row

Low Risk (25 items) - Defensive code / Unreachable:
- nodeWindowAgg.c:3074: attno_map_walker - Parser validates arg count
- nodeWindowAgg.c:3137-3138: rpr_navigation_walker - switch default
- nodeWindowAgg.c:3566: window_gettupleslot - Before mark position
- nodeWindowAgg.c:4289: WinGetFuncArgInFrame - isout flag
- nodeWindowAgg.c:4393: WinGetSlotInFrame - Boundary check
- nodeWindowAgg.c:4618-4619: row_is_in_reduced_frame - switch default
- nodeWindowAgg.c:4697: register_reduced_frame_map - pos < 0 check
- nodeWindowAgg.c:5007: search_str_set - NULL return continue
- nodeWindowAgg.c:5405: do_pattern_match - Regex compile error
- nodeWindowAgg.c:5435,5437-5438: do_pattern_match - Regex exec error
- nodeWindowAgg.c:5700: pattern_initial - Variable not found
- nodeWindowAgg.c:5776: string_set_get - Index range check
- nodeWindowAgg.c:5850: variable_pos_register - a-z range check
- nodeWindowAgg.c:5910: variable_pos_fetch - a-z range check
- nodeWindowAgg.c:5913: variable_pos_fetch - Index range check
- parse_clause.c:3989: transformDefineClause - A_Expr type check
- parse_clause.c:4145: transformPatternClause - A_Expr type check
- ruleutils.c:6904-6908: get_rule_windowspec - SKIP TO NEXT ROW output

Conclusion: Most uncovered code consists of defensive error handling or
code unreachable due to parser pre-validation. No security or functional risk.


5. COMMIT ANALYSIS
------------------

8 sequential commits:

Commit  Area            Files   +/-         Key Content
-------------------------------------------------------------------------------
1       Raw Parser      4       +174/-16    gram.y grammar (PATTERN/DEFINE)
2       Parse/Analysis  4       +277/-1     parse_clause.c analysis logic
3       Rewriter        1       +109/-0     pg_get_viewdef extension
4       Planner         5       +73/-3      WindowAgg plan node extension
5       Executor        4       +1,942/-11  CORE: Pattern matching engine (+1,850)
6       Docs            3       +192/-7     advanced.sgml, func-window.sgml
7       Tests           3       +1,585/-1   rpr.sql (532), rpr.out (1,052)
8       typedefs        1       +6/-0       pgindent support

Code Change Statistics:
- Total files: 25
- Lines added: 4,358
- Lines deleted: 39
- Net increase: +4,319 lines


6. RECOMMENDATIONS
------------------

6.1 Combinatorial Explosion Prevention (Major, Required)

Add work_mem-based memory limit for StringSet allocation.
Location: string_set_add() in nodeWindowAgg.c:5741-5750
Consistent with existing PostgreSQL approach (Hash Join, Sort, etc.)

6.2 Code Review Required (Minor, Decision Needed)

Location: nodeWindowAgg.c:5849,5909,5912
Issue: pos > NUM_ALPHABETS check intent unclear
Current: PATTERN with 28 elements causes error
Question: Is 27 element limit intentional?

6.3 Code Style Fixes (Minor)

- #ifdef RPR_DEBUG: Use elog(DEBUG2, ...) or remove (25 blocks)
- static keyword on separate line: 26 functions to fix
- Whitespace: "regexp !=NULL" -> "regexp != NULL"
- Error message case: "Unrecognized" -> "unrecognized"

6.4 Documentation Fixes (Minor)

- select.sgml: "maximu" -> "maximum"

6.5 Test Additions (Optional)

Black-box Tests (Functional):

Feature              Test Case                                    Priority
-------------------------------------------------------------------------------
Variable limit       26 variables success, 27 error               Medium
NULL boundary        PREV at partition first row                  Medium

White-box Tests (Coverage) - covered by above black-box tests:

File:Line                          Target Branch
-------------------------------------------------------------------------------
parse_clause.c:4093                Limit error branch (Variable limit test)
nodeWindowAgg.c:5623               null_slot usage (NULL boundary test)


7. CONCLUSION
-------------

Test Quality: GOOD

Core functionality is thoroughly tested with comprehensive error case coverage.

The patch is well-implemented within its defined scope. Identified issues include
one major concern (combinatorial explosion) and minor style/documentation items.

Recommended actions before commit:
1. Add work_mem-based memory limit for pattern combinations (required)
2. Clarify pos > NUM_ALPHABETS check intent (review needed)
3. Fix code style issues (optional but recommended)
4. Fix documentation typo (trivial)
5. Add tests for variable limit and NULL handling (optional)

Points for discussion (optional):
6. Incremental matching with streaming NFA redesign


Attachment:
- coverage.tgz (gcov HTML report, RPR-modified code only)


---
End of Report

2025년 9월 24일 (수) PM 7:36, Tatsuo Ishii <ishii@postgresql.org>님이 작성:
Attached are the v33 patches for Row pattern recognition.  The
difference from v32 is that the raw parse tree printing patch is not
included in v33. This is because now that we have
debug_print_raw_parse.

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp
Attachment

pgsql-hackers by date:

Previous
From: "Jelte Fennema-Nio"
Date:
Subject: Add "format" target to make and ninja to run pgindent and pgperltidy
Next
From: Nazir Bilal Yavuz
Date:
Subject: Re: Speed up COPY FROM text/CSV parsing using SIMD