Re: trgm regex index peculiarity - Mailing list pgsql-hackers

From Alexander Korotkov
Subject Re: trgm regex index peculiarity
Date
Msg-id CAPpHfdtca-Jh5FX=HGG7D0Q9nf0SBK2FSfGr4mo1vKfSyGGgCw@mail.gmail.com
Whole thread Raw
In response to Re: trgm regex index peculiarity  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: trgm regex index peculiarity
List pgsql-hackers
On Thu, Jan 16, 2014 at 3:34 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Alexander Korotkov <aekorotkov@gmail.com> writes:
> Revised version of patch with necessary comments.

I looked at this patch a bit.  It seems like this:

+  *    BLANK_COLOR_SIZE - How much blank character is more frequent than
+  *                       other character in average
+ #define BLANK_COLOR_SIZE      32

is a drastic overestimate.  Using the program Erik provided to generate
some sample data, I find that blanks in trigrams are maybe about three
times more common than other characters.  Possibly Erik's data isn't
very representative of real text, but if that's the case we'd better
get a more representative sample case before we start optimizing ...

Now maybe the above number is okay for this purpose anyway, but if so
it needs some different justification; maybe call it a "penalty"?
But nonetheless it seems like a darn large penalty, particularly for " x"
type trigrams where the value would effectively be squared.  Compared to
the proposed values of MAX_TRGM_SIZE and WISH_TRGM_SIZE, it seems like
you might as well forget the "sizing" approach and just flat out reject
trigrams containing COLOR_BLANK, because they'll never get through the
size filter.

It seems to be right decision to me when we have other trigrams can reject them. But there are cases when we can't.
 
 I'm inclined to think you need a larger MAX_TRGM_SIZE;
and WISH_TRGM_SIZE seems mighty small as well, compared to what the
code would have done before.
 
OK, I would like to make more reasoning for penalty. 
Let's consider word of length n.
It contains n+1 trigrams including:
1 __x trigram
1 _xx trigram
1 xx_ trigram
n - 2 xxx trigrams

Assume alphabet of size m those trigrams have following average frequencies:
__x 1/((n+1)*m)
_xx 1/((n+1)*m^2)
xx_ 1/((n+1)*m^2)
xxx (n-2)/((n+1)*m^3)

The ratio of this frequencies with blanks to frequency of xxx is:
__x m^2/(n-2)
_xx and xx_ m/(n-2)

In order to estimate n I've analyzed some classics:
Ernest Hemingway "A farewell to arms" - 3.8
Leo Tolstoy "War and Peace" - 5.0

In english with alphabet size = 26 we have

__x m^2/(n-2) = 375
_xx and xx_ m/(n-2) = 14.4

In russian with alphabet size = 33 we have

__x m^2/(n-2) = 363
_xx and xx_ m/(n-2) = 11

Assuming these calculations is approximate, we can safely use same values for both languages.
Does such reasoning make sense?

Also, surely this bit:

!       trgmNFA->totalTrgmCount = (int) totalTrgmSize;

is flat out wrong?  expandColorTrigrams() expects that
trgmNFA->totalTrgmCount is the truth, not some penalty-bloated
abstraction.  The fact that the patch hasn't failed on you completely
demonstrates that you've not tested any cases where blank-containing
trigrams got through the filter, reinforcing my feeling that it's
probably too strict now.

Oh, I wonder how can I leave such weird bug in patch :-(
 
I wonder if it would be more appropriate to continue to measure the limit
MAX_TRGM_COUNT in terms of actual trigram counts, and just use the
penalized "size" as the sort key while determining which color trigrams
to discard first.
 
Agree. But I would like to save some equivalent of WISH_TRGM_SIZE.

Another thought is that it's not clear that you should apply the same
penalty to blanks in all three positions.  Because of the padding
settings, any one word will normally lead to padded strings "  a",
" ab", "yz ", so that spaces in the first position are about twice
as common as spaces in the other positions.  (It's a little less
than that because single-letter words produce only "  a" and " a ";
but I'd think that such words aren't very common in text that trigrams
are effective for.)  So I'm thinking the penalty ought to take that
into account.

I'm also inclined to think that it might be worth adding a size
field to ColorTrgmInfo rather than repeatedly calculating the
size estimate.  We only allow a thousand and some ColorTrgmInfos
at most, so the extra space wouldn't be that large, and I'm concerned
by the expense the current patch adds to the sort comparator.

Agree.
 

Another point is that this comment:

         * Note: per-color-trigram counts cannot overflow an int so long as
         * COLOR_COUNT_LIMIT is not more than the cube root of INT_MAX, ie about
         * 1290.  However, the grand total totalTrgmCount might conceivably
         * overflow an int, so we use int64 for that within this routine.

is no longer valid, or at least it fails to demonstrate that the result
of getColorTrigramSize() can't overflow an int; indeed I rather fear it can.
The patch failed to even update the variable name in this comment, let
alone address the substantive question.

There are some other cosmetic things I didn't like, for instance
colorTrgmInfoCountCmp() is no longer comparing counts but you didn't
rename it.  That's about it for substantive comments though.

Thanks, will be fixed.

------
With best regards,
Alexander Korotkov.

pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: narwhal and PGDLLIMPORT
Next
From: Tom Lane
Date:
Subject: Re: trgm regex index peculiarity