Re: Row pattern recognition - Mailing list pgsql-hackers
From | Tatsuo Ishii |
---|---|
Subject | Re: Row pattern recognition |
Date | |
Msg-id | 20240928.194359.1327378331465099656.ishii@postgresql.org Whole thread Raw |
In response to | Row pattern recognition (Tatsuo Ishii <ishii@sraoss.co.jp>) |
Responses |
Re: Row pattern recognition
|
List | pgsql-hackers |
> With some bigger partitions, I hit an `ERROR: wrong pos: 1024`. A > test that reproduces it is attached. Thanks for the report. Attached is a patch on top of v22 patches to fix the bug. We keep info in an array (WindowAggState.reduced_frame_map) to track the rpr pattern match result status for each row in a frame. If pattern match succeeds, the first row in the reduced frame has status RF_FRAME_HEAD and rest of rows have RF_SKIPPED state. A row with pattern match failure state has RF_UNMATCHED state. Any row which is never tested has state RF_NOT_DETERMINED. At begining the map is initialized with 1024 entries with all RF_NOT_DETERMINED state. Eventually they are replaced with other than RF_NOT_DETERMINED state. In the error case rpr engine tries to find 1024 th row's state in the map and failed because the row's state has not been tested yet. I think we should treat it as RF_NOT_DETERMINED rather than an error. Attached patch does it. > While playing with the feature, I've been trying to identify runs of > matched rows by eye. But it's pretty difficult -- the best I can do is > manually count rows using a `COUNT(*) OVER ...`. So I'd like to > suggest that MEASURES be part of the eventual v1 feature, if there's > no other way to determine whether a row was skipped by a previous > match. (That was less obvious to me before the fix in v17.) I think implementing MEASURES is challenging. Especially we need to find how our parser accepts "colname OVER window_definition". Currently PostgreSQL's parser only accepts "func() OVER window_definition" Even it is technically possible, I think the v1 patch size will become much larger than now due to this. How about inventing new window function that returns row state instead? - match found (yes/no) - skipped due to AFTER MATCH SKIP PAST LAST ROW (no match) For the rest of the mail I need more time to understand. I will reply back after studying it. For now, I just want to thank you for the valuable information! > -- > > I've been working on an implementation [1] of SQL/RPR's "parenthesized > language" and preferment order. (These are defined in SQL/Foundation > 2023, section 9.41.) The tool gives you a way to figure out, for a > given pattern, what matches are supposed to be attempted and in what > order: > > $ ./src/test/modules/rpr/rpr_prefer "a b? a" > ( ( a ( b ) ) a ) > ( ( a ( ) ) a ) > > Many simple patterns result in an infinite set of possible matches. So > if you use an unbounded quantifiers, you have to also use --max-rows > to limit the size of the hypothetical window frame: > > $ ./src/test/modules/rpr/rpr_prefer --max-rows 2 "^ PERMUTE(a*, b+)? $" > ( ( ^ ( ( ( ( ( ( a ) ( b ) ) ) - ) ) ) ) $ ) > ( ( ^ ( ( ( ( ( ( ) ( b b ) ) ) - ) ) ) ) $ ) > ( ( ^ ( ( ( ( ( ( ) ( b ) ) ) - ) ) ) ) $ ) > ( ( ^ ( ( ( - ( ( ( b b ) ( ) ) ) ) ) ) ) $ ) > ( ( ^ ( ( ( - ( ( ( b ) ( a ) ) ) ) ) ) ) $ ) > ( ( ^ ( ( ( - ( ( ( b ) ( ) ) ) ) ) ) ) $ ) > ( ( ^ ( ) ) $ ) > > I've found this useful to check my personal understanding of the spec > and the match behavior, but it could also potentially be used to > generate test cases, or to help users debug their own patterns. For > example, a pattern that has a bunch of duplicate sequences in its PL > is probably not very well optimized: > > $ ./src/test/modules/rpr/rpr_prefer --max-rows 4 "a+ a+" > ( ( a a a ) ( a ) ) > ( ( a a ) ( a a ) ) > ( ( a a ) ( a ) ) > ( ( a ) ( a a a ) ) > ( ( a ) ( a a ) ) > ( ( a ) ( a ) ) > > And patterns with catastrophic backtracking behavior tend to show a > "sawtooth" pattern in the output, with a huge number of potential > matches being generated relative to the number of rows in the frame. > > My implementation is really messy -- it leaks memory like a sieve, and > I cannibalized the parser from ECPG, which just ended up as an > exercise in teaching myself flex/bison. But if there's interest in > having this kind of tool in the tree, I can work on making it > reviewable. Either way, I should be able to use it to double-check > more complicated test cases. > > A while back [2], you were wondering whether our Bison implementation > would be able to parse the PATTERN grammar directly. I think this tool > proves that the answer is "yes", but PERMUTE in particular causes a > shift/reduce conflict. To fix it, I applied the same precedence > workaround that we use for CUBE and ROLLUP. > > Thanks again! > --Jacob > > [1] https://github.com/jchampio/postgres/tree/dev/rpr > [2] https://www.postgresql.org/message-id/20230721.151648.412762379013769790.t-ishii%40sranhm.sra.co.jp diff --git a/src/backend/executor/nodeWindowAgg.c b/src/backend/executor/nodeWindowAgg.c index e46a3dd1b7..4a5a6fbf07 100644 --- a/src/backend/executor/nodeWindowAgg.c +++ b/src/backend/executor/nodeWindowAgg.c @@ -4148,9 +4148,15 @@ int get_reduced_frame_map(WindowAggState *winstate, int64 pos) { Assert(winstate->reduced_frame_map != NULL); + Assert(pos >= 0); - if (pos < 0 || pos >= winstate->alloc_sz) - elog(ERROR, "wrong pos: " INT64_FORMAT, pos); + /* + * If pos is not in the reduced frame map, it means that any info + * regarding the pos has not been registered yet. So we return + * RF_NOT_DETERMINED. + */ + if (pos >= winstate->alloc_sz) + return RF_NOT_DETERMINED; return winstate->reduced_frame_map[pos]; }
pgsql-hackers by date: