Re: Row pattern recognition - Mailing list pgsql-hackers

From Vik Fearing
Subject Re: Row pattern recognition
Date
Msg-id c9ebc3d0-c3d1-e8eb-4a57-0ec099cbda17@postgresfriends.org
Whole thread Raw
In response to Re: Row pattern recognition  (Tatsuo Ishii <ishii@sraoss.co.jp>)
Responses Re: Row pattern recognition
List pgsql-hackers
On 7/26/23 14:21, Tatsuo Ishii wrote:
> Attached is the v3 patch. In this patch following changes are made.

Excellent.  Thanks!

A few quick comments:

- PERMUTE is still misspelled as PREMUTE

- PATTERN variables do not have to exist in the DEFINE clause.  They are 
considered TRUE if not present.

> (1) I completely changed the pattern matching engine so that it
> performs backtracking. Now the engine evaluates all pattern elements
> defined in PATTER against each row, saving matched pattern variables
> in a string per row. For example if the pattern element A and B
> evaluated to true, a string "AB" is created for current row.
> 
> This continues until all pattern matching fails or encounters the end
> of full window frame/partition. After that, the pattern matching
> engine creates all possible "pattern strings" and apply the regular
> expression matching to each. For example if we have row 0 = "AB" row 1
> = "C", possible pattern strings are: "AC" and "BC".
> 
> If it matches, the length of matching substring is saved. After all
> possible trials are done, the longest matching substring is chosen and
> it becomes the width (number of rows) in the reduced window frame.
> 
> See row_is_in_reduced_frame, search_str_set and search_str_set_recurse
> in nodeWindowAggs.c for more details. For now I use a naive depth
> first search and probably there is a lot of rooms for optimization
> (for example rewriting it without using
> recursion). Suggestions/patches are welcome.

My own reviews will only focus on correctness for now.  Once we get a 
good set of regression tests all passing, I will focus more on 
optimization.  Of course, others might want to review the performance now.

> Vik Fearing wrote:
>> I strongly disagree with this.  Window function do not need to know
>> how the frame is defined, and indeed they should not.
>> WinGetFuncArgInFrame should answer yes or no and the window function
>> just works on that. Otherwise we will get extension (and possibly even
>> core) functions that don't handle the frame properly.
> 
> So I moved row_is_in_reduce_frame into WinGetFuncArgInFrame so that
> those window functions are not needed to be changed.
> 
> (3) Window function rpr was removed. We can use first_value instead.

Excellent.

> (4) Remaining tasks/issues.
> 
> - For now I disable WinSetMarkPosition because RPR often needs to
>    access a row before the mark is set. We need to fix this in the
>    future.

Noted, and agreed.

> - I am working on making window aggregates RPR aware now. The
>    implementation is in progress and far from completeness. An example
>    is below. I think row 2, 3, 4 of "count" column should be NULL
>    instead of 3, 2, 0, 0. Same thing can be said to other
>    rows. Probably this is an effect of moving aggregate but I still
>    studying the window aggregation code.

This tells me again that RPR is not being run in the right place.  All 
windowed aggregates and frame-level window functions should Just Work 
with no modification.

> SELECT company, tdate, first_value(price) OVER W, count(*) OVER w FROM stock
>   WINDOW w AS (
>   PARTITION BY company
>   ORDER BY tdate
>   ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
>   AFTER MATCH SKIP PAST LAST ROW
>   INITIAL
>   PATTERN (START UP+ DOWN+)
>   DEFINE
>    START AS TRUE,
>    UP AS price > PREV(price),
>    DOWN AS price < PREV(price)
> );
>   company  |   tdate    | first_value | count
> ----------+------------+-------------+-------
>   company1 | 2023-07-01 |         100 |     4
>   company1 | 2023-07-02 |             |     3
>   company1 | 2023-07-03 |             |     2
>   company1 | 2023-07-04 |             |     0
>   company1 | 2023-07-05 |             |     0
>   company1 | 2023-07-06 |          90 |     4
>   company1 | 2023-07-07 |             |     3
>   company1 | 2023-07-08 |             |     2
>   company1 | 2023-07-09 |             |     0
>   company1 | 2023-07-10 |             |     0
>   company2 | 2023-07-01 |          50 |     4
>   company2 | 2023-07-02 |             |     3
>   company2 | 2023-07-03 |             |     2
>   company2 | 2023-07-04 |             |     0
>   company2 | 2023-07-05 |             |     0
>   company2 | 2023-07-06 |          60 |     4
>   company2 | 2023-07-07 |             |     3
>   company2 | 2023-07-08 |             |     2
>   company2 | 2023-07-09 |             |     0
>   company2 | 2023-07-10 |             |     0

In this scenario, row 1's frame is the first 5 rows and specified SKIP 
PAST LAST ROW, so rows 2-5 don't have *any* frame (because they are 
skipped) and the result of the outer count should be 0 for all of them 
because there are no rows in the frame.

When we get to adding count in the MEASURES clause, there will be a 
difference between no match and empty match, but that does not apply here.

> I am going to add this thread to CommitFest and plan to add both of
> you as reviewers. Thanks in advance.

My pleasure.  Thank you for working on this difficult feature.
-- 
Vik Fearing




pgsql-hackers by date:

Previous
From: Michael Paquier
Date:
Subject: Re: Support worker_spi to execute the function dynamically.
Next
From: Amit Kapila
Date:
Subject: Re: logical decoding and replication of sequences, take 2