Silliness in regexp's citerdissect/creviterdissect - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | Silliness in regexp's citerdissect/creviterdissect |
Date | |
Msg-id | 1808998.1629412269@sss.pgh.pa.us Whole thread Raw |
List | pgsql-hackers |
While poking at a report of slow regexp matching [1], I happened to notice that citerdissect and creviterdissect are being remarkably stupid about how to backtrack after a match failure. Specifically, having used the DFA logic to identify K possible submatches, they then start the slow recursive "cdissect" check of each submatch. But, if the I'th submatch fails dissection, their response is to see about adjusting the length of the last submatch ... which will do nothing at all to make the result of the I'th submatch change, if it's before K. When this happens, we should immediately start adjusting the length of the I'th submatch. Attached is a simple patch to do this. It passes check-world, and shows the same results as before on Jacobson's web-regexps corpus, as well as on a sample of random regexps made by Dilger's script. I don't really see any performance difference on Jacobson's corpus, but Dilger's script did find an example where this makes a huge difference: HEAD: regexp=# select regexp_match('nefajztngmsvfajztngmsvlwhjsq', '(.*)((\1)){9}'); regexp_match -------------- (1 row) Time: 9655.141 ms (00:09.655) regexp=# select regexp_match('nefajztngmsvfajztngmsvlwhxxxxxxxxxxjsq', '(.*)((\1)){9}'); regexp_match -------------- {x,x,x} (1 row) Time: 271106.324 ms (04:31.106) With patch: regexp=# select regexp_match('nefajztngmsvfajztngmsvlwhjsq', '(.*)((\1)){9}'); regexp_match -------------- (1 row) Time: 9.385 ms regexp=# select regexp_match('nefajztngmsvfajztngmsvlwhxxxxxxxxxxjsq', '(.*)((\1)){9}'); regexp_match -------------- {x,x,x} (1 row) Time: 25.103 ms Admittedly this is a bit contrived. (In v14/HEAD, you can make the performance problem go away by getting rid of the redundant capturing parens, as then we don't invoke citerdissect at all. That trick doesn't help in the back branches though.) I think this is a pretty clear performance bug, and unless there are objections I plan to push this fix into all branches. regards, tom lane [1] https://www.postgresql.org/message-id/CALZg0g4FA1Xc5UMLrGBKM--erUGEAhe8GGLE-YcN7%3DO6rw2%3D0A%40mail.gmail.com diff --git a/src/backend/regex/regexec.c b/src/backend/regex/regexec.c index db54ebfba4..aa5ba85514 100644 --- a/src/backend/regex/regexec.c +++ b/src/backend/regex/regexec.c @@ -1147,8 +1147,8 @@ citerdissect(struct vars *v, * Our strategy is to first find a set of sub-match endpoints that are * valid according to the child node's DFA, and then recursively dissect * each sub-match to confirm validity. If any validity check fails, - * backtrack the last sub-match and try again. And, when we next try for - * a validity check, we need not recheck any successfully verified + * backtrack that sub-match and try again. And, when we next try for a + * validity check, we need not recheck any successfully verified * sub-matches that we didn't move the endpoints of. nverified remembers * how many sub-matches are currently known okay. */ @@ -1236,12 +1236,13 @@ citerdissect(struct vars *v, return REG_OKAY; } - /* match failed to verify, so backtrack */ + /* i'th match failed to verify, so backtrack it */ + k = i; backtrack: /* - * Must consider shorter versions of the current sub-match. However, + * Must consider shorter versions of the k'th sub-match. However, * we'll only ask for a zero-length match if necessary. */ while (k > 0) @@ -1352,8 +1353,8 @@ creviterdissect(struct vars *v, * Our strategy is to first find a set of sub-match endpoints that are * valid according to the child node's DFA, and then recursively dissect * each sub-match to confirm validity. If any validity check fails, - * backtrack the last sub-match and try again. And, when we next try for - * a validity check, we need not recheck any successfully verified + * backtrack that sub-match and try again. And, when we next try for a + * validity check, we need not recheck any successfully verified * sub-matches that we didn't move the endpoints of. nverified remembers * how many sub-matches are currently known okay. */ @@ -1447,12 +1448,13 @@ creviterdissect(struct vars *v, return REG_OKAY; } - /* match failed to verify, so backtrack */ + /* i'th match failed to verify, so backtrack it */ + k = i; backtrack: /* - * Must consider longer versions of the current sub-match. + * Must consider longer versions of the k'th sub-match. */ while (k > 0) {
pgsql-hackers by date: