Re: WIP: index support for regexp search - Mailing list pgsql-hackers

From Alexander Korotkov
Subject Re: WIP: index support for regexp search
Date
Msg-id CAPpHfdvwGTEXHnxJXe-kne-Sr-DmJ0EiWJVYukgQR9Ophj3z=A@mail.gmail.com
Whole thread Raw
In response to Re: WIP: index support for regexp search  (Heikki Linnakangas <hlinnakangas@vmware.com>)
Responses Re: WIP: index support for regexp search  ("er" <er@xs4all.nl>)
Re: WIP: index support for regexp search  (Heikki Linnakangas <hlinnakangas@vmware.com>)
List pgsql-hackers
On Mon, Nov 26, 2012 at 4:55 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
Great, that top-level comment helped tremendously! I feel enlightened.

I fixed some spelling, formatting etc. trivial stuff while reading through the patch, see attached. Below is some feedback on the details:

* I don't like the PG_TRY/CATCH trick. It's not generally safe to catch an error, without propagating it further or rolling back the whole (sub)transation. It might work in this case, as you're only suppressing errors with the special sqlcode that are used in the same file, but it nevertheless feels naughty. I believe none of the limits that are being checked are strict; it's OK to exceed the limits somewhat, as long as you terminate the processing in a reasonable time, in case of pathological input. I'd suggest putting an explicit check for the limits somewhere, and not rely on ereport(). Something like this, in the code that recurses:

if (trgmCNFA->arcsCount > MAX_RESULT_ARCS ||
    hash_get_num_entries(trgmCNFA->states) > MAX_RESULT_STATES)
{
        trgmCNFA->overflowed = true;
        return;
}

PG_TRY/CATCH trick is replaced with some number of if/return. Code becomes a bit more complicated, but your notes does matter.

And then check for the overflowed flag at the top level.

* This part of the high-level comment was not clear to me:

 * States of the graph produced in the first stage are marked with "keys". Key is a pair
 * of a "prefix" and a state of the original automaton. "Prefix" is a last
 * characters. So, knowing the prefix is enough to know what is a trigram when we read some new
 * character. However, we can know single character of prefix or don't know any
 * characters of it. Each state of resulting graph have an "enter key" (with that
 * key we've entered this state) and a set of keys which are reachable without
 * reading any predictable trigram. The algorithm of processing each state
 * of resulting graph are so:
 * 1) Add all keys which achievable without reading of any predictable trigram.
 * 2) Add outgoing arcs labeled with trigrams.
 * Step 2 leads to creation of new states and recursively processing them. So,
 * we use depth-first algorithm.

I didn't understand that. Can you elaborate? It might help to work through an example, with some ascii art depicting the graph.

This comment is extended with example.
 
* It would be nice to add some comments to TrgmCNFA struct, explaining which fields are valid at which stages. For example, it seems that 'trgms' array is calculated only after building the CNFA, by getTrgmVector() function, while arcsCount is updated on the fly, while recursing in the getState() function.

TrgmCNFA comment is extended with this. 
 
* What is the representation used for the path matrix? Needs a comment.

Comments are added to PackedTrgmPaths and TrgmStatePath.
 
* What do the getColorinfo()
 
getColorInfo comment now references to ColorInfo struct which have comment. 

and scanColorMap() functions do?

 scanColorMap comment now have description of colormap structure.
 
What exactly does a color represent?

This is added to the top comment.
 
What's the tradeoff in choosing MAX_COLOR_CHARS?
 
Comment is added to the macro.

------
With best regards,
Alexander Korotkov.
Attachment

pgsql-hackers by date:

Previous
From: Hannu Krosing
Date:
Subject: Re: WIP json generation enhancements
Next
From: Andrew Dunstan
Date:
Subject: Re: WIP json generation enhancements