Re: Row pattern recognition - Mailing list pgsql-hackers

From Henson Choi
Subject Re: Row pattern recognition
Date
Msg-id CAAAe_zB+W+xBJpYNzRLzh6BH9HrFycTjoPUnVvnU1BT_7RR8Bw@mail.gmail.com
Whole thread
In response to Re: Row pattern recognition  (Tatsuo Ishii <ishii@postgresql.org>)
Responses Re: Row pattern recognition
List pgsql-hackers
Hi Tatsuo,

This round focused on reviewing nodeWindowAgg.c -- the NFA executor
-- which was the last piece remaining on my review list.  I've also
added a comprehensive NFA runtime test suite.

The attached patch continues the incremental series on top of v42.
Here are the changes:


FIXME issues documented:

These require non-trivial structural changes, so I've documented
them as FIXME comments for now rather than attempting a fix.

- altPriority tracks only the last ALT choice, so repeated or nested
  ALTs like (A|B)+ cannot correctly implement SQL standard lexical
  ordering.  A full-path classifier structure is needed.

- Cycle prevention condition (count == 0 && min == 0) is insufficient
  for patterns like (A*)* where cycles occur at count > 0.  Currently
  relies on implicit duplicate detection in nfa_add_state_unique.


Bugs fixed:

- Fix nfa_advance_begin() routing order: enter group first (lexically
  first), then skip path (lexically second).

- Add ALT scope boundary check in nfa_advance_alt() and
  computeAbsorbabilityRecursive() to stop branch traversal when
  element depth exits ALT scope.

- Disallow RANGE and GROUPS frame types with row pattern
  recognition, as the SQL standard only requires ROWS.
  Also remove the now-dead frame-type determination code.


NFA executor refactoring (nodeWindowAgg.c):

- Simplify nfa_process_row() phase 1: remove nfa_advance() call
  after forced mismatch at frame boundary, since all states are
  already at VAR positions and mismatch removes them.  Replace
  the redundant phase 3 frame boundary check with Assert.

- Simplify update_reduced_frame() result registration: mismatch
  path now returns early, and the post-processing SKIP mode cleanup
  block (nfa_remove_contexts_up_to vs nfa_context_free) is removed
  since eager pruning handles SKIP PAST LAST ROW and both paths
  now simply call nfa_context_free().

- Replace nfa_remove_contexts_up_to() with eager pruning inside
  nfa_add_matched_state().  When matchEndRow is extended during
  greedy matching, pending contexts within the match range are
  pruned incrementally instead of after match completion.  For
  SKIP PAST LAST ROW patterns like START UP+, this reduces the
  number of live contexts at each row from O(n) to O(1), avoiding
  O(n^2) per-row processing of contexts that would be skipped
  anyway.

- Optimize nfa_states_equal() to compare counts only up to the
  current element's depth instead of the full maxDepth.

- Rename nfa_state_clone -> nfa_state_create,
  nfa_find_context_for_pos -> nfa_get_head_context for clarity.

- Add nfa_record_context_success/skipped/absorbed statistics helpers.

- Remove unused RPRPattern parameter from
  nfa_update_absorption_flags().


Absorbability refactoring (rpr.c):

- Remove parentDepth parameter from isUnboundedStart() and
  computeAbsorbabilityRecursive().  Scope boundaries are now
  determined by element depth comparison instead of an explicit
  parent depth parameter.

- Simplify isUnboundedStart(): check simple VAR case first, then
  GROUP case with a depth-bounded loop instead of a FIN-terminated
  traversal with multiple break conditions.


Test changes:

- Add rpr_nfa.sql: comprehensive NFA runtime test suite covering
  quantifier boundaries (min/max/exact), alternation priority, nested
  patterns, frame boundary variations, INITIAL mode (syntax error
  expected), pathological patterns, context absorption, and FIXME
  reproduction cases.

- Add nth_value out-of-frame tests, ReScan/LATERAL test, and
  nth_value IGNORE NULLS test to rpr.sql.

- Change selected EXPLAIN test queries in rpr_explain.sql from
  EXPLAIN (COSTS OFF) to EXPLAIN (ANALYZE, ...) to verify actual
  NFA execution statistics.

- Fix stale comments across rpr.sql, rpr_base.sql, and rpr_nfa.sql:
  remove resolved BUG annotations, update error messages to match
  actual output, correct optimization result descriptions, and
  standardize Expected comment placement to after SQL statements.


Other changes:

- Run pgindent on RPR source files.  Add /*----------*/ comment
  guards to protect structured comments from reformatting.


Coverage:

I ran gcov on the modified lines (diff-only coverage).  The attached
coverage.zip contains an HTML report.  Summary:

- 18 files, 124 functions, 2238 modified lines analyzed
- Overall: 92.1% line coverage (2061/2238)
- Core files (nodeWindowAgg.c, rpr.c, explain.c, parse_rpr.c):
  98.0-98.5% each
- Node serialization (outfuncs.c, readfuncs.c, equalfuncs.c):
  0% -- these implement RPRPattern serialization for plan caching,
  but regression tests don't exercise prepared statements with RPR
- ruleutils.c: 96.3% -- untested lines are the reluctant quantifier
  display path ({1}?), which is currently rejected at parse time

The node serialization functions (141 lines, 0% coverage) are the
largest untested area.  I'm not sure how to trigger these paths
in the regression test framework.  Any suggestions?

I'll send a separate email within a few days listing the FIXME
issues and other unresolved items from the mailing list discussion
for your review.


Best regards,
Henson
Attachment

pgsql-hackers by date:

Previous
From: Tatsuo Ishii
Date:
Subject: Re: Questionable description about character sets
Next
From: KAZAR Ayoub
Date:
Subject: Re: Speed up COPY TO text/CSV parsing using SIMD