Re: trgm regex index peculiarity - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | Re: trgm regex index peculiarity |
Date | |
Msg-id | 4963.1389828856@sss.pgh.pa.us Whole thread Raw |
In response to | Re: trgm regex index peculiarity (Alexander Korotkov <aekorotkov@gmail.com>) |
Responses |
Re: trgm regex index peculiarity
|
List | pgsql-hackers |
Alexander Korotkov <aekorotkov@gmail.com> writes: > Revised version of patch with necessary comments. I looked at this patch a bit. It seems like this: + * BLANK_COLOR_SIZE - How much blank character is more frequent than + * other character in average + #define BLANK_COLOR_SIZE 32 is a drastic overestimate. Using the program Erik provided to generate some sample data, I find that blanks in trigrams are maybe about three times more common than other characters. Possibly Erik's data isn't very representative of real text, but if that's the case we'd better get a more representative sample case before we start optimizing ... Now maybe the above number is okay for this purpose anyway, but if so it needs some different justification; maybe call it a "penalty"? But nonetheless it seems like a darn large penalty, particularly for " x" type trigrams where the value would effectively be squared. Compared to the proposed values of MAX_TRGM_SIZE and WISH_TRGM_SIZE, it seems like you might as well forget the "sizing" approach and just flat out reject trigrams containing COLOR_BLANK, because they'll never get through the size filter. I'm inclined to think you need a larger MAX_TRGM_SIZE; and WISH_TRGM_SIZE seems mighty small as well, compared to what the code would have done before. Also, surely this bit: ! trgmNFA->totalTrgmCount = (int) totalTrgmSize; is flat out wrong? expandColorTrigrams() expects that trgmNFA->totalTrgmCount is the truth, not some penalty-bloated abstraction. The fact that the patch hasn't failed on you completely demonstrates that you've not tested any cases where blank-containing trigrams got through the filter, reinforcing my feeling that it's probably too strict now. I wonder if it would be more appropriate to continue to measure the limit MAX_TRGM_COUNT in terms of actual trigram counts, and just use the penalized "size" as the sort key while determining which color trigrams to discard first. Another thought is that it's not clear that you should apply the same penalty to blanks in all three positions. Because of the padding settings, any one word will normally lead to padded strings " a", " ab", "yz ", so that spaces in the first position are about twice as common as spaces in the other positions. (It's a little less than that because single-letter words produce only " a" and " a "; but I'd think that such words aren't very common in text that trigrams are effective for.) So I'm thinking the penalty ought to take that into account. I'm also inclined to think that it might be worth adding a size field to ColorTrgmInfo rather than repeatedly calculating the size estimate. We only allow a thousand and some ColorTrgmInfos at most, so the extra space wouldn't be that large, and I'm concerned by the expense the current patch adds to the sort comparator. Another point is that this comment: * Note: per-color-trigram counts cannot overflow an int so long as * COLOR_COUNT_LIMIT is not more than the cube root ofINT_MAX, ie about * 1290. However, the grand total totalTrgmCount might conceivably * overflow an int, so we use int64for that within this routine. is no longer valid, or at least it fails to demonstrate that the result of getColorTrigramSize() can't overflow an int; indeed I rather fear it can. The patch failed to even update the variable name in this comment, let alone address the substantive question. There are some other cosmetic things I didn't like, for instance colorTrgmInfoCountCmp() is no longer comparing counts but you didn't rename it. That's about it for substantive comments though. regards, tom lane
pgsql-hackers by date: