Re: Row pattern recognition - Mailing list pgsql-hackers
From | Tatsuo Ishii |
---|---|
Subject | Re: Row pattern recognition |
Date | |
Msg-id | 20230908.125447.478562939918621603.t-ishii@sranhm.sra.co.jp Whole thread Raw |
In response to | Re: Row pattern recognition (Jacob Champion <jchampion@timescale.com>) |
Responses |
Re: Row pattern recognition
|
List | pgsql-hackers |
Hi, > Hello! > >> (1) I completely changed the pattern matching engine so that it >> performs backtracking. Now the engine evaluates all pattern elements >> defined in PATTER against each row, saving matched pattern variables >> in a string per row. For example if the pattern element A and B >> evaluated to true, a string "AB" is created for current row. > > If I understand correctly, this strategy assumes that one row's > membership in a pattern variable is independent of the other rows' > membership. But I don't think that's necessarily true: > > DEFINE > A AS PREV(CLASSIFIER()) IS DISTINCT FROM 'A', > ... But: UP AS price > PREV(price) also depends on previous row, no? Can you please elaborate how your example could break current implementation? I cannot test it because CLASSIFIER is not implemented yet. >> See row_is_in_reduced_frame, search_str_set and search_str_set_recurse >> in nodeWindowAggs.c for more details. For now I use a naive depth >> first search and probably there is a lot of rooms for optimization >> (for example rewriting it without using >> recursion). > > The depth-first match is doing a lot of subtle work here. For example, with > > PATTERN ( A+ B+ ) > DEFINE A AS TRUE, > B AS TRUE > > (i.e. all rows match both variables), and three rows in the partition, > our candidates will be tried in the order > > aaa > aab > aba > abb > ... > bbb > > The only possible matches against our regex `^a+b+` are "aab" and "abb", > and that order matches the preferment order, so it's fine. But it's easy > to come up with a pattern where that's the wrong order, like > > PATTERN ( A+ (B|A)+ ) > > Now "aaa" will be considered before "aab", which isn't correct. Can you explain a little bit more? I think 'aaa' matches a regular expression 'a+(b|a)+' and should be no problem before "aab" is considered. > Similarly, the assumption that we want to match the longest string only > works because we don't allow alternation yet. Can you please clarify more on this? > Cool, I will give this piece some more thought. Do you mind if I try to > add some more complicated pattern quantifiers to stress the > architecture, or would you prefer to tackle that later? Just alternation > by itself will open up a world of corner cases. Do you mean you want to provide a better patch for the pattern matching part? That will be helpfull. Because I am currently working on the aggregation part and have no time to do it. However, the aggregation work affects the v5 patch: it needs a refactoring. So can you wait until I release v6 patch? I hope it will be released in two weeks or so. > On 8/9/23 01:41, Tatsuo Ishii wrote: >> - I found a bug with pattern matching code. It creates a string for >> subsequent regular expression matching. It uses the initial letter >> of each define variable name. For example, if the varname is "foo", >> then "f" is used. Obviously this makes trouble if we have two or >> more variables starting with same "f" (e.g. "food"). To fix this, I >> assign [a-z] to each variable instead of its initial letter. However >> this way limits us not to have more than 26 variables. I hope 26 is >> enough for most use cases. > > There are still plenty of alphanumerics left that could be assigned... > > But I'm wondering if we might want to just implement the NFA directly? > The current implementation's Cartesian explosion can probably be pruned > aggressively, but replaying the entire regex match once for every > backtracked step will still duplicate a lot of work. Not sure if you mean implementing new regular expression engine besides src/backend/regexp. I am afraid it's not a trivial work. The current regexp code consists of over 10k lines. What do you think? > I've attached another test case; it looks like last_value() is depending > on some sort of side effect from either first_value() or nth_value(). I > know the window frame itself is still under construction, so apologies > if this is an expected failure. Thanks. Fortunately current code which I am working passes the new test. I will include it in the next (v6) patch. Best reagards, -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
pgsql-hackers by date: