Re: Row pattern recognition - Mailing list pgsql-hackers
From | Vik Fearing |
---|---|
Subject | Re: Row pattern recognition |
Date | |
Msg-id | fdf57a8d-65d5-e8a5-361f-3095d6899099@postgresfriends.org Whole thread Raw |
In response to | Row pattern recognition (Tatsuo Ishii <ishii@sraoss.co.jp>) |
Responses |
Re: Row pattern recognition
|
List | pgsql-hackers |
On 6/25/23 14:05, Tatsuo Ishii wrote: > Attached is a PoC patch to implement "Row pattern recognition" (RPR) > in SQL:2016 (I know SQL:2023 is already out, but I don't have access > to it). Actually SQL:2016 defines RPR in two places[1]: > > Feature R010, “Row pattern recognition: FROM clause” > Feature R020, “Row pattern recognition: WINDOW clause” > > The patch includes partial support for R020 part. I have been dreaming of and lobbying for someone to pick up this feature. I will be sure to review it from a standards perspective and will try my best to help with the technical aspect, but I am not sure to have the qualifications for that. THANK YOU! > (I know SQL:2023 is already out, but I don't have access to it) If you can, try to get ISO/IEC 19075-5 which is a guide to RPR instead of just its technical specification. https://www.iso.org/standard/78936.html > - What is RPR? > > RPR provides a way to search series of data using regular expression > patterns. Suppose you have a stock database. > > CREATE TABLE stock ( > company TEXT, > tdate DATE, > price BIGINT); > > You want to find a "V-shaped" pattern: i.e. price goes down for 1 or > more days, then goes up for 1 or more days. If we try to implement the > query in PostgreSQL, it could be quite complex and inefficient. > > RPR provides convenient way to implement the query. > > SELECT company, tdate, price, rpr(price) OVER w FROM stock > WINDOW w AS ( > PARTITION BY company > ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING > PATTERN (START DOWN+ UP+) > DEFINE > START AS TRUE, > UP AS price > PREV(price), > DOWN AS price < PREV(price) > ); > > "PATTERN" and "DEFINE" are the key clauses of RPR. DEFINE defines 3 > "row pattern variables" namely START, UP and DOWN. They are associated > with logical expressions namely "TRUE", "price > PREV(price)", and > "price < PREV(price)". Note that "PREV" function returns price column > in the previous row. So, UP is true if price is higher than previous > day. On the other hand, DOWN is true if price is lower than previous > day. PATTERN uses the row pattern variables to create a necessary > pattern. In this case, the first row is always match because START is > always true, and second or more rows match with "UP" ('+' is a regular > expression representing one or more), and subsequent rows match with > "DOWN". > > Here is the sample output. > > company | tdate | price | rpr > ----------+------------+-------+------ > company1 | 2023-07-01 | 100 | > company1 | 2023-07-02 | 200 | 200 -- 200->150->140->150 > company1 | 2023-07-03 | 150 | 150 -- 150->140->150 > company1 | 2023-07-04 | 140 | > company1 | 2023-07-05 | 150 | 150 -- 150->90->110->130 > company1 | 2023-07-06 | 90 | > company1 | 2023-07-07 | 110 | > company1 | 2023-07-08 | 130 | > company1 | 2023-07-09 | 120 | > company1 | 2023-07-10 | 130 | > > rpr shows the first row if all the patterns are satisfied. In the > example above 200, 150, 150 are the cases. Other rows are shown as > NULL. For example, on 2023-07-02 price starts with 200, then goes down > to 150 then 140 but goes up 150 next day. I don't understand this. RPR in a window specification limits the window to the matched rows, so this looks like your rpr() function is just the regular first_value() window function that we already have? > As far as I know, only Oracle implements RPR (only R010. R020 is not > implemented) among OSS/commercial RDBMSs. There are a few DWH software > having RPR. According to [2] they are Snowflake and MS Stream > Analytics. It seems Trino is another one[3]. I thought DuckDB had it already, but it looks like I was wrong. > - Note about the patch > > The current implementation is not only a subset of the standard, but > is different from it in some places. The example above is written as > follows according to the standard: > > SELECT company, tdate, startprice OVER w FROM stock > WINDOW w AS ( > PARTITION BY company > MEASURES > START.price AS startprice > ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING > PATTERN (START DOWN+ UP+) > DEFINE > START AS TRUE, > UP AS UP.price > PREV(UP.price), > DOWN AS DOWN.price < PREV(DOWN.price) > ); > > Notice that rpr(price) is written as START.price and startprice in the > standard. MEASURES defines variable names used in the target list used > with "OVER window". As OVER only allows functions in PostgreSQL, I had > to make up a window function "rpr" which performs the row pattern > recognition task. I was not able to find a way to implement > expressions like START.price (START is not a table alias). Any > suggestion is greatly appreciated. As in your example, you cannot have START.price outside of the window specification; it can only go in the MEASURES clause. Only startprice is allowed outside and it gets its qualification from the OVER. Using w.startprice might have been better but that would require window names to be in the same namespace as range tables. This currently works in Postgres: SELECT RANK() OVER w FROM (VALUES (1)) AS w (x) WINDOW w AS (ORDER BY w.x); > The differences from the standard include: > > o MEASURES is not supported > o FIRST, LAST, CLASSIFIER are not supported > o PREV/NEXT in the standard accept more complex arguments > o INITIAL or SEEK are not supported ((behaves as if INITIAL is specified) Okay, for now. > o SUBSET is not supported Is this because you haven't done it yet, or because you ran into problems trying to do it? > o Regular expressions other than "+" are not supported This is what I had a hard time imagining how to do while thinking about it. The grammar is so different here and we allow many more operators (like "?" which is also the standard parameter symbol). People more expert than me will have to help here. > o Only AFTER MATCH SKIP TO NEXT ROW is supported (if AFTER MATCH is > not specified, AFTER MATCH SKIP TO NEXT ROW is assumed. In the > standard AFTER MATCH SKIP PAST LAST ROW is assumed in this case). I > have a plan to implement AFTER MATCH SKIP PAST LAST ROW though. In this case, we should require the user to specify AFTER MATCH SKIP TO NEXT ROW so that behavior doesn't change when we implement the standard default. (Your patch might do this already.) > o Aggregate functions associated with window clause using RPR do not respect RPR I do not understand what this means. > It seems RPR in the standard is quite complex. I think we can start > with a small subset of RPR then we could gradually enhance the > implementation. I have no problem with that as long as we don't paint ourselves into a corner. > Comments and suggestions are welcome. I have not looked at the patch yet, but is the reason for doing R020 before R010 because you haven't done the MEASURES clause yet? In any case, I will be watching this with a close eye, and I am eager to help in any way I can. -- Vik Fearing
pgsql-hackers by date: