Re: Row pattern recognition - Mailing list pgsql-hackers

From Henson Choi
Subject Re: Row pattern recognition
Msg-id CAAAe_zDgbDQf=RMtGw_axaNm1d_HY-e4=0ddDy809p7kTuUwHQ@mail.gmail.com
In response to Re: Row pattern recognition  (Tatsuo Ishii <ishii@postgresql.org>)
Responses Re: Row pattern recognition
List pgsql-hackers
Hi, Tatsuo

> Does "a zero-length match" mean "an empty match"?

Yes, they refer to the same thing.  "Zero-length match" is the more
common term in general regex implementations (PCRE2, Perl, Python,
Java, etc.[1]), but the RPR standard (ISO/IEC 19075-5, Section 4.12.2)
uses "empty match" exclusively.

[1] https://www.regular-expressions.info/zerolength.html
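
To make the term concrete, here is a minimal sketch using Python's re
module purely as a stand-in for a general regex engine (this is not RPR
syntax, and the variable names are only illustrative).  The pattern a*
succeeds on input that contains no "a" at all, so the match starts and
ends at the same position and consumes nothing:

```python
import re

# "a*" may match zero characters, so the search succeeds at position 0
# of "bbb" even though the input contains no "a".
m = re.search(r"a*", "bbb")

# A zero-length (empty) match: start and end offsets coincide.
is_empty_match = m is not None and m.start() == m.end()

print(m.span(), is_empty_match)
```

The match object is not None (the pattern did match), but its span has
zero width, which is the situation both terms describe.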
 

> BTW, currently we place all nfa_* functions at the bottom of
> nodeWindowAgg.c.  However nodeWindowAgg.c in master branch places "API
> exposed to window functions" at the bottom of the file. Do you think
> we should follow the way?

Yes, we should follow master's convention.  I see three options:

  (a) Reorder within nodeWindowAgg.c: move the nfa_* functions up and
      keep the "API exposed to window functions" section at the bottom,
      matching master's layout.

  (b) A separate file under src/backend/executor/, keeping it close to
      nodeWindowAgg.c while making the boundary explicit.

  (c) A dedicated src/backend/rpr/ directory modeled on
      src/backend/regex/, giving the NFA engine its own namespace.
      This could also be an opportunity to consolidate the existing
      src/backend/optimizer/plan/rpr.c into the same directory.

For now, (a) is the safest change.  Longer term, (b) or (c) would make
more sense -- especially when we extend to MATCH_RECOGNIZE (R010),
where the NFA engine will need to be shared across both code paths.
Either way, the NFA engine can be exposed via a header so that R010
can share it without further restructuring.

Since the NFA algorithm is not familiar territory for most DBMS
developers, it would also be worth preserving the detailed algorithm
description posted earlier in this thread -- either as structured
comments or as a dedicated README alongside the code.

What do you think?  Should we start with (a) now and revisit the
broader restructuring approaches -- (b) or (c) -- later, or would you
prefer to discuss them first?  Either (b) or (c) would also resolve the
file layout convention issue naturally, since new files would follow
the proper conventions from the start.


One more thing: there are no ECPG example programs or regression tests
for RPR yet.  I'd like to propose adding them.  Shall I draft an
initial set, or would you prefer to coordinate with the ECPG
maintainers first?


Best regards,
Henson
