Re: Row pattern recognition - Mailing list pgsql-hackers
From | Tatsuo Ishii |
---|---|
Subject | Re: Row pattern recognition |
Date | |
Msg-id | 20230909.202121.983745512339833395.t-ishii@sranhm.sra.co.jp Whole thread Raw |
In response to | Re: Row pattern recognition (Jacob Champion <jchampion@timescale.com>) |
Responses |
Re: Row pattern recognition
Re: Row pattern recognition |
List | pgsql-hackers |
Hi, >> But: >> >> UP AS price > PREV(price) >> >> also depends on previous row, no? > > PREV(CLASSIFIER()) depends not on the value of the previous row but the > state of the match so far. To take an example from the patch: > >> * Example: >> * str_set[0] = "AB"; >> * str_set[1] = "AC"; >> * In this case at row 0 A and B are true, and A and C are true in row 1. > > With these str_sets and my example DEFINE, row[1] is only classifiable > as 'A' if row[0] is *not* classified as 'A' at this point in the match. > "AA" is not a valid candidate string, even if it matches the PATTERN. Ok, Let me clarify my understanding. Suppose we have: PATTER (A B) DEFINE A AS PREV(CLASSIFIER()) IS DISTINCT FROM 'A', B AS price > 100 and the target table has price column values: row[0]: 110 row[1]: 110 row[2]: 110 row[3]: 110 Then we will get for str_set: r0: B r1: AB Because r0 only has classifier B, r1 can have A and B. Problem is, r2. If we choose A at r1, then r2 = B. But if we choose B at t1, then r2 = AB. I guess this is the issue you pointed out. > So if we don't reevaluate the pattern variable condition for the row, we > at least have to prune the combinations that search_str_set() visits, so > that we don't generate a logically impossible combination. That seems > like it could be pretty fragile, and it may be difficult for us to prove > compliance. Yeah, probably we have delay evaluation of such pattern variables like A, then reevaluate A after the first scan. What about leaving this (reevaluation) for now? Because: 1) we don't have CLASSIFIER 2) we don't allow to give CLASSIFIER to PREV as its arggument so I think we don't need to worry about this for now. >> Can you explain a little bit more? I think 'aaa' matches a regular >> expression 'a+(b|a)+' and should be no problem before "aab" is >> considered. > > Assuming I've understood the rules correctly, we're not allowed to > classify the last row as 'A' if it also matches 'B'. Lexicographic > ordering takes precedence, so we have to try "aab" first. Otherwise our > query could return different results compared to another implementation. > >>> Similarly, the assumption that we want to match the longest string only >>> works because we don't allow alternation yet. >> >> Can you please clarify more on this? > > Sure: for the pattern > > PATTERN ( (A|B)+ ) > > we have to consider the candidate "a" over the candidate "ba", even > though the latter is longer. Like the prior example, lexicographic > ordering is considered more important than the greedy quantifier. > Quoting ISO/IEC 9075-2:2016: > > More precisely, with both reluctant and greedy quantifiers, the set > of matches is ordered lexicographically, but when one match is an > initial substring of another match, reluctant quantifiers prefer the > shorter match (the substring), whereas greedy quantifiers prefer the > longer match (the “superstring”). > > Here, "ba" doesn't have "a" as a prefix, so "ba" doesn't get priority. > ISO/IEC 19075-5:2021 has a big section on this (7.2) with worked examples. > > (The "lexicographic order matters more than greediness" concept was the > most mind-bending part for me so far, probably because I haven't figured > out how to translate the concept into POSIX EREs. It wouldn't make sense > to say "the letter 't' can match 'a', 'B', or '3' in this regex", but > that's what RPR is doing.) Thanks for the explanation. Surprising concet of the standard:-) Is it different from SIMILAR TO REs too? What if we don't follow the standard, instead we follow POSIX EREs? I think this is better for users unless RPR's REs has significant merit for users. >> Do you mean you want to provide a better patch for the pattern >> matching part? That will be helpfull. > > No guarantees that I'll find a better patch :D But yes, I will give it a > try. Ok. >> Because I am currently working >> on the aggregation part and have no time to do it. However, the >> aggregation work affects the v5 patch: it needs a refactoring. So can >> you wait until I release v6 patch? I hope it will be released in two >> weeks or so. > > Absolutely! Thanks. > Heh, I think it would be pretty foolish for me to code an NFA, from > scratch, and then try to convince the community to maintain it. > > But: > - I think we have to implement a parallel parser regardless (RPR PATTERN > syntax looks incompatible with POSIX) I am not sure if we need to worry about this because of the reason I mentioned above. > - I suspect we need more control over the backtracking than the current > pg_reg* API is going to give us, or else I'm worried performance is > going to fall off a cliff with usefully-large partitions Agreed. > - there's a lot of stuff in POSIX EREs that we don't need, and of the > features we do need, the + quantifier is probably one of the easiest > - it seems easier to prove the correctness of a slow, naive, > row-at-a-time engine, because we can compare it directly to the spec > > So what I'm thinking is: if I start by open-coding the + quantifier, and > slowly add more pieces in, then it might be easier to see the parts of > src/backend/regex that I've duplicated. We can try to expose those parts > directly from the internal API to replace my bad implementation. And if > there are parts that aren't duplicated, then it'll be easier to explain > why we need something different from the current engine. > > Does that seem like a workable approach? (Worst-case, my code is just > horrible, and we throw it in the trash.) Yes, it seems workable. I think for the first cut of RPR needs at least the +quantifier with reasonable performance. The current naive implementation seems to have issue because of exhaustive search. Best reagards, -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
pgsql-hackers by date: