Re: Row pattern recognition - Mailing list pgsql-hackers
From | Vik Fearing |
---|---|
Subject | Re: Row pattern recognition |
Date | |
Msg-id | c9ebc3d0-c3d1-e8eb-4a57-0ec099cbda17@postgresfriends.org Whole thread Raw |
In response to | Re: Row pattern recognition (Tatsuo Ishii <ishii@sraoss.co.jp>) |
Responses |
Re: Row pattern recognition
|
List | pgsql-hackers |
On 7/26/23 14:21, Tatsuo Ishii wrote: > Attached is the v3 patch. In this patch following changes are made. Excellent. Thanks! A few quick comments: - PERMUTE is still misspelled as PREMUTE - PATTERN variables do not have to exist in the DEFINE clause. They are considered TRUE if not present. > (1) I completely changed the pattern matching engine so that it > performs backtracking. Now the engine evaluates all pattern elements > defined in PATTER against each row, saving matched pattern variables > in a string per row. For example if the pattern element A and B > evaluated to true, a string "AB" is created for current row. > > This continues until all pattern matching fails or encounters the end > of full window frame/partition. After that, the pattern matching > engine creates all possible "pattern strings" and apply the regular > expression matching to each. For example if we have row 0 = "AB" row 1 > = "C", possible pattern strings are: "AC" and "BC". > > If it matches, the length of matching substring is saved. After all > possible trials are done, the longest matching substring is chosen and > it becomes the width (number of rows) in the reduced window frame. > > See row_is_in_reduced_frame, search_str_set and search_str_set_recurse > in nodeWindowAggs.c for more details. For now I use a naive depth > first search and probably there is a lot of rooms for optimization > (for example rewriting it without using > recursion). Suggestions/patches are welcome. My own reviews will only focus on correctness for now. Once we get a good set of regression tests all passing, I will focus more on optimization. Of course, others might want to review the performance now. > Vik Fearing wrote: >> I strongly disagree with this. Window function do not need to know >> how the frame is defined, and indeed they should not. >> WinGetFuncArgInFrame should answer yes or no and the window function >> just works on that. Otherwise we will get extension (and possibly even >> core) functions that don't handle the frame properly. > > So I moved row_is_in_reduce_frame into WinGetFuncArgInFrame so that > those window functions are not needed to be changed. > > (3) Window function rpr was removed. We can use first_value instead. Excellent. > (4) Remaining tasks/issues. > > - For now I disable WinSetMarkPosition because RPR often needs to > access a row before the mark is set. We need to fix this in the > future. Noted, and agreed. > - I am working on making window aggregates RPR aware now. The > implementation is in progress and far from completeness. An example > is below. I think row 2, 3, 4 of "count" column should be NULL > instead of 3, 2, 0, 0. Same thing can be said to other > rows. Probably this is an effect of moving aggregate but I still > studying the window aggregation code. This tells me again that RPR is not being run in the right place. All windowed aggregates and frame-level window functions should Just Work with no modification. > SELECT company, tdate, first_value(price) OVER W, count(*) OVER w FROM stock > WINDOW w AS ( > PARTITION BY company > ORDER BY tdate > ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING > AFTER MATCH SKIP PAST LAST ROW > INITIAL > PATTERN (START UP+ DOWN+) > DEFINE > START AS TRUE, > UP AS price > PREV(price), > DOWN AS price < PREV(price) > ); > company | tdate | first_value | count > ----------+------------+-------------+------- > company1 | 2023-07-01 | 100 | 4 > company1 | 2023-07-02 | | 3 > company1 | 2023-07-03 | | 2 > company1 | 2023-07-04 | | 0 > company1 | 2023-07-05 | | 0 > company1 | 2023-07-06 | 90 | 4 > company1 | 2023-07-07 | | 3 > company1 | 2023-07-08 | | 2 > company1 | 2023-07-09 | | 0 > company1 | 2023-07-10 | | 0 > company2 | 2023-07-01 | 50 | 4 > company2 | 2023-07-02 | | 3 > company2 | 2023-07-03 | | 2 > company2 | 2023-07-04 | | 0 > company2 | 2023-07-05 | | 0 > company2 | 2023-07-06 | 60 | 4 > company2 | 2023-07-07 | | 3 > company2 | 2023-07-08 | | 2 > company2 | 2023-07-09 | | 0 > company2 | 2023-07-10 | | 0 In this scenario, row 1's frame is the first 5 rows and specified SKIP PAST LAST ROW, so rows 2-5 don't have *any* frame (because they are skipped) and the result of the outer count should be 0 for all of them because there are no rows in the frame. When we get to adding count in the MEASURES clause, there will be a difference between no match and empty match, but that does not apply here. > I am going to add this thread to CommitFest and plan to add both of > you as reviewers. Thanks in advance. My pleasure. Thank you for working on this difficult feature. -- Vik Fearing
pgsql-hackers by date: