Re: Row pattern recognition - Mailing list pgsql-hackers

From Vik Fearing
Subject Re: Row pattern recognition
Date
Msg-id 90bff6fa-01b7-cf60-f4f4-3412514a7cd2@postgresfriends.org
Whole thread Raw
In response to Re: Row pattern recognition  (Tatsuo Ishii <ishii@sraoss.co.jp>)
Responses Re: Row pattern recognition
Re: Row pattern recognition
List pgsql-hackers
On 6/26/23 03:05, Tatsuo Ishii wrote:
>> I don't understand this.  RPR in a window specification limits the
>> window to the matched rows, so this looks like your rpr() function is
>> just the regular first_value() window function that we already have?
> 
> No, rpr() is different from first_value(). rpr() returns the argument
> value at the first row in a frame only when matched rows found. On the
> other hand first_value() returns the argument value at the first row
> in a frame unconditionally.
> 
> company  |   tdate    | price | rpr  | first_value
> ----------+------------+-------+------+-------------
>   company1 | 2023-07-01 |   100 |      |         100
>   company1 | 2023-07-02 |   200 |  200 |         200
>   company1 | 2023-07-03 |   150 |  150 |         150
>   company1 | 2023-07-04 |   140 |      |         140
>   company1 | 2023-07-05 |   150 |  150 |         150
>   company1 | 2023-07-06 |    90 |      |          90
>   company1 | 2023-07-07 |   110 |      |         110
>   company1 | 2023-07-08 |   130 |      |         130
>   company1 | 2023-07-09 |   120 |      |         120
>   company1 | 2023-07-10 |   130 |      |         130
> 
> For example, a frame starting with (tdate = 2023-07-02, price = 200)
> consists of rows (price = 200, 150, 140, 150) satisfying the pattern,
> thus rpr returns 200. Since in this example frame option "ROWS BETWEEN
> CURRENT ROW AND UNBOUNDED FOLLOWING" is specified, next frame starts
> with (tdate = 2023-07-03, price = 150). This frame satisfies the
> pattern too (price = 150, 140, 150), and rpr retus 150... and so on.


Okay, I see the problem now, and why you need the rpr() function.

You are doing this as something that happens over a window frame, but it 
is actually something that *reduces* the window frame.  The pattern 
matching needs to be done when the frame is calculated and not when any 
particular function is applied over it.

This query (with all the defaults made explicit):

SELECT s.company, s.tdate, s.price,
        FIRST_VALUE(s.tdate) OVER w,
        LAST_VALUE(s.tdate) OVER w,
        lowest OVER w
FROM stock AS s
WINDOW w AS (
   PARTITION BY s.company
   ORDER BY s.tdate
   ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
   EXCLUDE NO OTHERS
   MEASURES
     LAST(DOWN) AS lowest
   AFTER MATCH SKIP PAST LAST ROW
   INITIAL PATTERN (START DOWN+ UP+)
   DEFINE
     START AS TRUE,
     UP AS price > PREV(price),
     DOWN AS price < PREV(price)
);

Should produce this result:

  company  |   tdate    | price | first_value | last_value | lowest
----------+------------+-------+-------------+------------+--------
  company1 | 07-01-2023 |   100 |             |            |
  company1 | 07-02-2023 |   200 | 07-02-2023  | 07-05-2023 |    140
  company1 | 07-03-2023 |   150 |             |            |
  company1 | 07-04-2023 |   140 |             |            |
  company1 | 07-05-2023 |   150 |             |            |
  company1 | 07-06-2023 |    90 |             |            |
  company1 | 07-07-2023 |   110 |             |            |
  company1 | 07-08-2023 |   130 | 07-05-2023  | 07-05-2023 |    120
  company1 | 07-09-2023 |   120 |             |            |
  company1 | 07-10-2023 |   130 |             |            |
(10 rows)

Or if we switch to AFTER MATCH SKIP TO NEXT ROW, then we get:

  company  |   tdate    | price | first_value | last_value | lowest
----------+------------+-------+-------------+------------+--------
  company1 | 07-01-2023 |   100 |             |            |
  company1 | 07-02-2023 |   200 | 07-02-2023  | 07-05-2023 |    140
  company1 | 07-03-2023 |   150 | 07-03-2023  | 07-05-2023 |    140
  company1 | 07-04-2023 |   140 |             |            |
  company1 | 07-05-2023 |   150 | 07-05-2023  | 07-08-2023 |     90
  company1 | 07-06-2023 |    90 |             |            |
  company1 | 07-07-2023 |   110 |             |            |
  company1 | 07-08-2023 |   130 | 07-08-2023  | 07-10-2023 |    120
  company1 | 07-09-2023 |   120 |             |            |
  company1 | 07-10-2023 |   130 |             |            |
(10 rows)

And then if we change INITIAL to SEEK:

  company  |   tdate    | price | first_value | last_value | lowest
----------+------------+-------+-------------+------------+--------
  company1 | 07-01-2023 |   100 | 07-02-2023  | 07-05-2023 |    140
  company1 | 07-02-2023 |   200 | 07-02-2023  | 07-05-2023 |    140
  company1 | 07-03-2023 |   150 | 07-03-2023  | 07-05-2023 |    140
  company1 | 07-04-2023 |   140 | 07-05-2023  | 07-08-2023 |     90
  company1 | 07-05-2023 |   150 | 07-05-2023  | 07-08-2023 |     90
  company1 | 07-06-2023 |    90 | 07-08-2023  | 07-10-2023 |    120
  company1 | 07-07-2023 |   110 | 07-08-2023  | 07-10-2023 |    120
  company1 | 07-08-2023 |   130 | 07-08-2023  | 07-10-2023 |    120
  company1 | 07-09-2023 |   120 |             |            |
  company1 | 07-10-2023 |   130 |             |            |
(10 rows)

Since the pattern recognition is part of the frame, the window 
aggregates should Just Work.


>>> o SUBSET is not supported
>>
>> Is this because you haven't done it yet, or because you ran into
>> problems trying to do it?
> 
> Because it seems SUBSET is not useful without MEASURES support. Thus
> my plan is, firstly implement MEASURES, then SUBSET. What do you
> think?


SUBSET elements can be used in DEFINE clauses, but I do not think this 
is important compared to other features.


>>> Comments and suggestions are welcome.
>>
>> I have not looked at the patch yet, but is the reason for doing R020
>> before R010 because you haven't done the MEASURES clause yet?
> 
> One of the reasons is, implementing MATCH_RECOGNIZE (R010) looked
> harder for me because modifying main SELECT clause could be a hard
> work. Another reason is, I had no idea how to implement PREV/NEXT in
> other than in WINDOW clause. Other people might feel differently
> though.


I think we could do this with a single tuplesort if we use backtracking 
(which might be really slow for some patterns).  I have not looked into 
it in any detail.

We would need to be able to remove tuples from the end (even if only 
logically), and be able to update tuples inside the store.  Both of 
those needs come from backtracking and possibly changing the classifier.

Without backtracking, I don't see how we could do it without have a 
separate tuplestore for every current possible match.


>> In any case, I will be watching this with a close eye, and I am eager
>> to help in any way I can.
> 
> Thank you! I am looking forward to comments on my patch.  Also any
> idea how to implement MEASURES clause is welcome.


I looked at your v2 patches a little bit and the only comment that I 
currently have on the code is you spelled PERMUTE as PREMUTE. 
Everything else is hopefully explained above.
-- 
Vik Fearing




pgsql-hackers by date:

Previous
From: Kirk Wolak
Date:
Subject: Re: Do we want a hashset type?
Next
From: Vik Fearing
Date:
Subject: Re: Fixing tab-complete for dollar-names