Home > mailing lists

Re: Some regular-expression performance hacking - Mailing list pgsql-hackers

From	Tom Lane
Subject	Re: Some regular-expression performance hacking
Date	February 18, 2021 23:44:47
Msg-id	2872014.1613681087@sss.pgh.pa.us Whole thread Raw
In response to	Re: Some regular-expression performance hacking ("Joel Jacobson" <joel@compiler.org>)
Responses	Re: Some regular-expression performance hacking
List	pgsql-hackers

Tree view

"Joel Jacobson" <joel@compiler.org> writes:
> Let's see if I can explain the idea:
> One of the problems with representing regexes with large bracket range expressions, like [a-z],
> is you get an explosion of edges, if the model can only represent state transitions for single characters.
> If we could instead let a single edge (for a state transition) represent a set of characters,
> or normally even more efficiently, a set of range of characters, then we could reduce the
> number of edges we need to represent the graph.
> The naive approach to just use the ranges as-is doesn't work.
> Instead, the graph must first be created with single-character edges.
> It is then examined what ranges can be constructed in a way that no single range
> overlaps any other range, so that every range can be seen as a character in an alphabet.

Hmm ... I might be misunderstanding, but I think our engine already
does a version of this.  See the discussion of "colors" in
src/backend/regex/README.

> Another optimization I've come up with (or probably re-invented because it feels quite obvious),
> is to read more than one character, when knowing for sure multiple characters-in-a-row
> are expected, by concatenating edges having only one parent and one child.

Maybe.  In practice the actual scanning tends to be tracking more than one
possible NFA state in parallel, so I'm not sure how often we could expect
to be able to use this idea.  That is, even if we know that state X can
only succeed by following an arc to Y and then another to Z, we might
also be interested in what happens if the NFA is in state Q at this point;
and it seems unlikely that Q would have exactly the same two following
arc colors.

I do have some ideas about possible future optimizations, and one reason
I'm grateful for this large set of real regexes is that it can provide a
concrete basis for deciding that particular optimizations are or are not
worth pursuing.  So thanks again for collecting it!

            regards, tom lane

pgsql-hackers by date:

From: "Joel Jacobson"
Date: 18 February 2021, 23:44:07
Subject: Re: Some regular-expression performance hacking

From: "Joel Jacobson"
Date: 18 February 2021, 23:54:38
Subject: Re: Some regular-expression performance hacking

Re: Some regular-expression performance hacking - Mailing list pgsql-hackers

Previous

Next