Re: Row pattern recognition - Mailing list pgsql-hackers

From Vik Fearing
Subject Re: Row pattern recognition
Date
Msg-id 9e135bd6-252a-e39a-bf7d-5a9726427769@postgresfriends.org
Whole thread Raw
In response to Re: Row pattern recognition  (Tatsuo Ishii <ishii@sraoss.co.jp>)
Responses Re: Row pattern recognition
List pgsql-hackers
On 7/28/23 13:02, Tatsuo Ishii wrote:
>>> Attached is the v3 patch. In this patch following changes are made.
>>
>> - PATTERN variables do not have to exist in the DEFINE clause.  They are
>> - considered TRUE if not present.
> 
> Do you think we really need this? I found a criticism regarding this.
> 
> https://link.springer.com/article/10.1007/s13222-022-00404-3
> "3.2 Explicit Definition of All Row Pattern Variables"
> 
> What do you think?

I think that a large part of obeying the standard is to allow queries 
from other engines to run the same on ours.  The standard does not 
require the pattern variables to be defined and so there are certainly 
queries out there without them, and that hurts migrating to PostgreSQL.

>>> - I am working on making window aggregates RPR aware now. The
>>>     implementation is in progress and far from completeness. An example
>>>     is below. I think row 2, 3, 4 of "count" column should be NULL
>>>     instead of 3, 2, 0, 0. Same thing can be said to other
>>>     rows. Probably this is an effect of moving aggregate but I still
>>>     studying the window aggregation code.
>>
>> This tells me again that RPR is not being run in the right place.  All
>> windowed aggregates and frame-level window functions should Just Work
>> with no modification.
> 
> I am not touching each aggregate function. I am modifying
> eval_windowaggregates() in nodeWindowAgg.c, which calls each aggregate
> function. Do you think it's not the right place to make window
> aggregates RPR aware?

Oh, okay.

>>> SELECT company, tdate, first_value(price) OVER W, count(*) OVER w FROM
>>> stock
>>>    WINDOW w AS (
>>>    PARTITION BY company
>>>    ORDER BY tdate
>>>    ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
>>>    AFTER MATCH SKIP PAST LAST ROW
>>>    INITIAL
>>>    PATTERN (START UP+ DOWN+)
>>>    DEFINE
>>>     START AS TRUE,
>>>     UP AS price > PREV(price),
>>>     DOWN AS price < PREV(price)
>>> );
>>>    company  |   tdate    | first_value | count
>>> ----------+------------+-------------+-------
>>>    company1 | 2023-07-01 |         100 |     4
>>>    company1 | 2023-07-02 |             |     3
>>>    company1 | 2023-07-03 |             |     2
>>>    company1 | 2023-07-04 |             |     0
>>>    company1 | 2023-07-05 |             |     0
>>>    company1 | 2023-07-06 |          90 |     4
>>>    company1 | 2023-07-07 |             |     3
>>>    company1 | 2023-07-08 |             |     2
>>>    company1 | 2023-07-09 |             |     0
>>>    company1 | 2023-07-10 |             |     0
>>>    company2 | 2023-07-01 |          50 |     4
>>>    company2 | 2023-07-02 |             |     3
>>>    company2 | 2023-07-03 |             |     2
>>>    company2 | 2023-07-04 |             |     0
>>>    company2 | 2023-07-05 |             |     0
>>>    company2 | 2023-07-06 |          60 |     4
>>>    company2 | 2023-07-07 |             |     3
>>>    company2 | 2023-07-08 |             |     2
>>>    company2 | 2023-07-09 |             |     0
>>>    company2 | 2023-07-10 |             |     0
>>
>> In this scenario, row 1's frame is the first 5 rows and specified SKIP
>> PAST LAST ROW, so rows 2-5 don't have *any* frame (because they are
>> skipped) and the result of the outer count should be 0 for all of them
>> because there are no rows in the frame.
> 
> Ok. Just I want to make sure. If it's other aggregates like sum or
> avg, the result of the outer aggregates should be NULL.

They all behave the same way as in a normal query when they receive no 
rows as input.

>> When we get to adding count in the MEASURES clause, there will be a
>> difference between no match and empty match, but that does not apply
>> here.
> 
> Can you elaborate more? I understand that "no match" and "empty match"
> are different things. But I do not understand how the difference
> affects the result of count.

This query:

SELECT v.a, wcnt OVER w, count(*) OVER w
FROM (VALUES ('A')) AS v (a)
WINDOW w AS (
   ORDER BY v.a
   MEASURES count(*) AS wcnt
   ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
   PATTERN (B)
   DEFINE B AS B.a = 'B'
)

produces this result:

  a | wcnt | count
---+------+-------
  A |      |     0
(1 row)

Inside the window specification, *no match* was found and so all of the 
MEASURES are null.  The count(*) in the target list however, still 
exists and operates over zero rows.

This very similar query:

SELECT v.a, wcnt OVER w, count(*) OVER w
FROM (VALUES ('A')) AS v (a)
WINDOW w AS (
   ORDER BY v.a
   MEASURES count(*) AS wcnt
   ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
   PATTERN (B?)
   DEFINE B AS B.a = 'B'
)

produces this result:

  a | wcnt | count
---+------+-------
  A |    0 |     0
(1 row)

In this case, the pattern is B? instead of just B, which produces an 
*empty match* for the MEASURES to be applied over.
-- 
Vik Fearing




pgsql-hackers by date:

Previous
From: Ashutosh Bapat
Date:
Subject: Re: logical decoding and replication of sequences, take 2
Next
From: Tomas Vondra
Date:
Subject: Re: logical decoding and replication of sequences, take 2