Thread: trgm regex index peculiarity
9.4devel (but same in 9.3) In a 112 MB test table (containing random generated text) with a trgm index (gin_trgm_ops), I consistently get these timings: select txt from azjunk6 where txt ~ '^abcd'; 130 ms select txt from azjunk6 where txt ~ 'abcd' and substr(txt,1,4) = 'abcd'; 3 ms (a similar performance difference occurs when using a regex, i.e. 'abc[de]' ) This difference is so large that I wonder if there is not something wrong in the first case. (The returned results are correct though) Here are the two explains: explain analyze select txt from azjunk6 where txt ~ '^abcd'; QUERYPLAN -------------------------------------------------------------------------------------------------------------------------------------Bitmap HeapScan on azjunk6 (cost=108.78..484.93 rows=100 width=81) (actual time=129.557..129.742 rows=1 loops=1) Recheck Cond:(txt ~ '^abcd'::text) Rows Removed by Index Recheck: 17 -> Bitmap Index Scan on azjunk6_trgm_re_idx (cost=0.00..108.75rows=100 width=0) (actual time=129.503..129.503 rows=18 loops=1) Index Cond: (txt ~ '^abcd'::text)Total runtime: 130.008 ms (6 rows) explain analyze select txt from azjunk6 where txt ~ 'abcd' and substr(txt,1,4) = 'abcd'; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------------Bitmap HeapScan on azjunk6 (cost=56.75..433.40 rows=1 width=81) (actual time=2.064..3.379 rows=1 loops=1) Recheck Cond: (txt ~'abcd'::text) Rows Removed by Index Recheck: 14 Filter: (substr(txt, 1, 4) = 'abcd'::text) Rows Removed by Filter: 112 -> Bitmap Index Scan on azjunk6_trgm_re_idx (cost=0.00..56.75 rows=100 width=0) (actual time=1.911..1.911 rows=127 loops=1) Index Cond: (txt ~ 'abcd'::text)Total runtime: 3.409 ms (8 rows) The results in both cases are correct, but does this difference not almost amount to a bug? ( Interestingly, the variant WHERE txt ~ 'abcd$' is as fast as the non-anchored variant ) Thanks, Erik Rijkers
"Erik Rijkers" <er@xs4all.nl> writes: > In a 112 MB test table (containing random generated text) with a trgm index (gin_trgm_ops), I consistently get these timings: > select txt from azjunk6 where txt ~ '^abcd'; > 130 ms > select txt from azjunk6 > where txt ~ 'abcd' and substr(txt,1,4) = 'abcd'; > 3 ms Hm, could you provide a self-contained test case? regards, tom lane
On Fri, June 21, 2013 05:25, Tom Lane wrote: > "Erik Rijkers" <er@xs4all.nl> writes: >> In a 112 MB test table (containing random generated text) with a trgm index (gin_trgm_ops), I consistently get these >> timings: >> select txt from azjunk6 where txt ~ '^abcd'; >> 130 ms >> select txt from azjunk6 >> where txt ~ 'abcd' and substr(txt,1,4) = 'abcd'; >> 3 ms > > Hm, could you provide a self-contained test case? > yes, sorry. I tested on a 1M row table: #!/bin/sh # create table: for power in 6; do table=azjunk${power} index=${table}_trgm_re_idx perl -E' sub ss{ join"",@_[ map{rand @_} 1 .. shift ] }; say(ss(80,"a".."g","","h".."m"," ","n".."s"," ","t".."z")) for 1 .. 1e'"${power};" \ | psql -aqXc " drop table if exists$table; create table $table(txt text); copy $table from stdin;"; echo "set session maintenance_work_mem='1GB'; create index $index on $table using gin (txt gin_trgm_ops); analyze $table;" | psql -qtAX; done # test: echo " \\timing on explain analyze select txt from azjunk6 where txt ~ '^abcd'; -- slow (140 ms) explain analyze select txt from azjunk6 where txt ~ 'abcd' and substr(txt,1,4) = 'abcd'; -- fast (5 ms) " | psql -Xqa
On Fri, Jun 21, 2013 at 2:40 PM, Erik Rijkers <er@xs4all.nl> wrote:
------
With best regards,
On Fri, June 21, 2013 05:25, Tom Lane wrote:yes, sorry. I tested on a 1M row table:
> "Erik Rijkers" <er@xs4all.nl> writes:
>> In a 112 MB test table (containing random generated text) with a trgm index (gin_trgm_ops), I consistently get these
>> timings:
>> select txt from azjunk6 where txt ~ '^abcd';
>> 130 ms
>> select txt from azjunk6
>> where txt ~ 'abcd' and substr(txt,1,4) = 'abcd';
>> 3 ms
>
> Hm, could you provide a self-contained test case?
>
#!/bin/sh
# create table:
for power in 6;
do
table=azjunk${power}
index=${table}_trgm_re_idx
perl -E'
sub ss{ join"",@_[ map{rand @_} 1 .. shift ] };
say(ss(80,"a".."g"," ","h".."m"," ","n".."s"," ","t".."z")) for 1 .. 1e'"${power};" \
| psql -aqXc "
drop table if exists $table;
create table $table(txt text);
copy $table from stdin;";
echo "set session maintenance_work_mem='1GB';
create index $index on $table using gin (txt gin_trgm_ops);
analyze $table;" | psql -qtAX;
done
# test:
echo "
\\timing on
explain analyze select txt from azjunk6 where txt ~ '^abcd'; -- slow (140 ms)
explain analyze select txt from azjunk6 where txt ~ 'abcd' and substr(txt,1,4) = 'abcd'; -- fast (5 ms)
" | psql -Xqa
Regex '^abcd' will be expanded into trigrams '__a', '_ab', 'abc' and 'bcd'. However trigrams '__a' is much more frequent than '_ab' which in turn is much more frequent than 'abc' and 'bcd'. Ommiting of ^ leads to ommiting of '__a' and '_ab' and that gives so significant speedup.
The problem is that trigram regex engine doesn't know that '__a' and '_ab' is more frequent than another trigrams because we don't know their distribution. However, simple assumption that blank character (in pg_trgm meaning) is much more frequent than others seems to be true for most cases.
Attached patch implementing this assumption. It introduces BLANK_COLOR_SIZE macro which is used in selectColorTrigrams like count of characters in COLOR_BLANK. Another change in this patch is split of MAX_TRGM_COUNT into WISH_TRGM_SIZE and MAX_TRGM_SIZE. The idea is to keep trying to remove color trigrams from graph even when it fits into MAX_TRGM_SIZE, because we are intending to scan less posting lists/trees.
Comments is not fixed yet, coming soon.
With best regards,
Alexander Korotkov.
Attachment
On Fri, June 21, 2013 15:11, Alexander Korotkov wrote: > On Fri, Jun 21, 2013 at 2:40 PM, Erik Rijkers <er@xs4all.nl> wrote: > >> On Fri, June 21, 2013 05:25, Tom Lane wrote: >> > "Erik Rijkers" <er@xs4all.nl> writes: >> >> In a 112 MB test table (containing random generated text) with a trgm >> index (gin_trgm_ops), I consistently get these >> >> timings: >> >> select txt from azjunk6 where txt ~ '^abcd'; >> >> 130 ms >> >> select txt from azjunk6 >> >> where txt ~ 'abcd' and substr(txt,1,4) = 'abcd'; >> >> 3 ms >> > > > Regex '^abcd' will be expanded into trigrams '__a', '_ab', 'abc' and 'bcd'. > However trigrams '__a' is much more frequent than '_ab' which in turn is > much more frequent than 'abc' and 'bcd'. Ommiting of ^ leads to ommiting of > '__a' and '_ab' and that gives so significant speedup. > [trgm_regex_optimize.1.patch ] Yes, that fixes the problem, thanks. Erik Rijkers
On Fri, Jun 21, 2013 at 5:39 PM, Erik Rijkers <er@xs4all.nl> wrote:
------
With best regards,
Alexander Korotkov.
On Fri, June 21, 2013 15:11, Alexander Korotkov wrote:
> On Fri, Jun 21, 2013 at 2:40 PM, Erik Rijkers <er@xs4all.nl> wrote:
>
>> On Fri, June 21, 2013 05:25, Tom Lane wrote:
>> > "Erik Rijkers" <er@xs4all.nl> writes:
>> >> In a 112 MB test table (containing random generated text) with a trgm
>> index (gin_trgm_ops), I consistently get these
>> >> timings:
>> >> select txt from azjunk6 where txt ~ '^abcd';
>> >> 130 ms
>> >> select txt from azjunk6
>> >> where txt ~ 'abcd' and substr(txt,1,4) = 'abcd';
>> >> 3 ms
>> >
>> Regex '^abcd' will be expanded into trigrams '__a', '_ab', 'abc' and 'bcd'.> [trgm_regex_optimize.1.patch ]
> However trigrams '__a' is much more frequent than '_ab' which in turn is
> much more frequent than 'abc' and 'bcd'. Ommiting of ^ leads to ommiting of
> '__a' and '_ab' and that gives so significant speedup.
Yes, that fixes the problem, thanks.
Revised version of patch with necessary comments.
With best regards,
Alexander Korotkov.
Attachment
Alexander Korotkov <aekorotkov@gmail.com> writes: > Revised version of patch with necessary comments. I looked at this patch a bit. It seems like this: + * BLANK_COLOR_SIZE - How much blank character is more frequent than + * other character in average + #define BLANK_COLOR_SIZE 32 is a drastic overestimate. Using the program Erik provided to generate some sample data, I find that blanks in trigrams are maybe about three times more common than other characters. Possibly Erik's data isn't very representative of real text, but if that's the case we'd better get a more representative sample case before we start optimizing ... Now maybe the above number is okay for this purpose anyway, but if so it needs some different justification; maybe call it a "penalty"? But nonetheless it seems like a darn large penalty, particularly for " x" type trigrams where the value would effectively be squared. Compared to the proposed values of MAX_TRGM_SIZE and WISH_TRGM_SIZE, it seems like you might as well forget the "sizing" approach and just flat out reject trigrams containing COLOR_BLANK, because they'll never get through the size filter. I'm inclined to think you need a larger MAX_TRGM_SIZE; and WISH_TRGM_SIZE seems mighty small as well, compared to what the code would have done before. Also, surely this bit: ! trgmNFA->totalTrgmCount = (int) totalTrgmSize; is flat out wrong? expandColorTrigrams() expects that trgmNFA->totalTrgmCount is the truth, not some penalty-bloated abstraction. The fact that the patch hasn't failed on you completely demonstrates that you've not tested any cases where blank-containing trigrams got through the filter, reinforcing my feeling that it's probably too strict now. I wonder if it would be more appropriate to continue to measure the limit MAX_TRGM_COUNT in terms of actual trigram counts, and just use the penalized "size" as the sort key while determining which color trigrams to discard first. Another thought is that it's not clear that you should apply the same penalty to blanks in all three positions. Because of the padding settings, any one word will normally lead to padded strings " a", " ab", "yz ", so that spaces in the first position are about twice as common as spaces in the other positions. (It's a little less than that because single-letter words produce only " a" and " a "; but I'd think that such words aren't very common in text that trigrams are effective for.) So I'm thinking the penalty ought to take that into account. I'm also inclined to think that it might be worth adding a size field to ColorTrgmInfo rather than repeatedly calculating the size estimate. We only allow a thousand and some ColorTrgmInfos at most, so the extra space wouldn't be that large, and I'm concerned by the expense the current patch adds to the sort comparator. Another point is that this comment: * Note: per-color-trigram counts cannot overflow an int so long as * COLOR_COUNT_LIMIT is not more than the cube root ofINT_MAX, ie about * 1290. However, the grand total totalTrgmCount might conceivably * overflow an int, so we use int64for that within this routine. is no longer valid, or at least it fails to demonstrate that the result of getColorTrigramSize() can't overflow an int; indeed I rather fear it can. The patch failed to even update the variable name in this comment, let alone address the substantive question. There are some other cosmetic things I didn't like, for instance colorTrgmInfoCountCmp() is no longer comparing counts but you didn't rename it. That's about it for substantive comments though. regards, tom lane
On Thu, Jan 16, 2014 at 3:34 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
------
With best regards,
Alexander Korotkov.
Alexander Korotkov <aekorotkov@gmail.com> writes:I looked at this patch a bit. It seems like this:
> Revised version of patch with necessary comments.
+ * BLANK_COLOR_SIZE - How much blank character is more frequent than
+ * other character in average
+ #define BLANK_COLOR_SIZE 32
is a drastic overestimate. Using the program Erik provided to generate
some sample data, I find that blanks in trigrams are maybe about three
times more common than other characters. Possibly Erik's data isn't
very representative of real text, but if that's the case we'd better
get a more representative sample case before we start optimizing ...
Now maybe the above number is okay for this purpose anyway, but if so
it needs some different justification; maybe call it a "penalty"?
But nonetheless it seems like a darn large penalty, particularly for " x"
type trigrams where the value would effectively be squared. Compared to
the proposed values of MAX_TRGM_SIZE and WISH_TRGM_SIZE, it seems like
you might as well forget the "sizing" approach and just flat out reject
trigrams containing COLOR_BLANK, because they'll never get through the
size filter.
It seems to be right decision to me when we have other trigrams can reject them. But there are cases when we can't.
I'm inclined to think you need a larger MAX_TRGM_SIZE;
and WISH_TRGM_SIZE seems mighty small as well, compared to what the
code would have done before.
OK, I would like to make more reasoning for penalty.
Let's consider word of length n.
It contains n+1 trigrams including:
1 __x trigram
1 _xx trigram
1 xx_ trigram
n - 2 xxx trigrams
Assume alphabet of size m those trigrams have following average frequencies:
__x 1/((n+1)*m)
_xx 1/((n+1)*m^2)
xx_ 1/((n+1)*m^2)
xxx (n-2)/((n+1)*m^3)
The ratio of this frequencies with blanks to frequency of xxx is:
__x m^2/(n-2)
_xx and xx_ m/(n-2)
In order to estimate n I've analyzed some classics:
Ernest Hemingway "A farewell to arms" - 3.8
Leo Tolstoy "War and Peace" - 5.0
In english with alphabet size = 26 we have
__x m^2/(n-2) = 375
_xx and xx_ m/(n-2) = 14.4
In russian with alphabet size = 33 we have
__x m^2/(n-2) = 363
_xx and xx_ m/(n-2) = 11
Assuming these calculations is approximate, we can safely use same values for both languages.
Does such reasoning make sense?
Also, surely this bit:
! trgmNFA->totalTrgmCount = (int) totalTrgmSize;
is flat out wrong? expandColorTrigrams() expects that
trgmNFA->totalTrgmCount is the truth, not some penalty-bloated
abstraction. The fact that the patch hasn't failed on you completely
demonstrates that you've not tested any cases where blank-containing
trigrams got through the filter, reinforcing my feeling that it's
probably too strict now.
Oh, I wonder how can I leave such weird bug in patch :-(
I wonder if it would be more appropriate to continue to measure the limit
MAX_TRGM_COUNT in terms of actual trigram counts, and just use the
penalized "size" as the sort key while determining which color trigrams
to discard first.
Agree. But I would like to save some equivalent of WISH_TRGM_SIZE.
Another thought is that it's not clear that you should apply the same
penalty to blanks in all three positions. Because of the padding
settings, any one word will normally lead to padded strings " a",
" ab", "yz ", so that spaces in the first position are about twice
as common as spaces in the other positions. (It's a little less
than that because single-letter words produce only " a" and " a ";
but I'd think that such words aren't very common in text that trigrams
are effective for.) So I'm thinking the penalty ought to take that
into account.
I'm also inclined to think that it might be worth adding a size
field to ColorTrgmInfo rather than repeatedly calculating the
size estimate. We only allow a thousand and some ColorTrgmInfos
at most, so the extra space wouldn't be that large, and I'm concerned
by the expense the current patch adds to the sort comparator.
Agree.
Another point is that this comment:
* Note: per-color-trigram counts cannot overflow an int so long as
* COLOR_COUNT_LIMIT is not more than the cube root of INT_MAX, ie about
* 1290. However, the grand total totalTrgmCount might conceivably
* overflow an int, so we use int64 for that within this routine.
is no longer valid, or at least it fails to demonstrate that the result
of getColorTrigramSize() can't overflow an int; indeed I rather fear it can.
The patch failed to even update the variable name in this comment, let
alone address the substantive question.
There are some other cosmetic things I didn't like, for instance
colorTrgmInfoCountCmp() is no longer comparing counts but you didn't
rename it. That's about it for substantive comments though.
Thanks, will be fixed.
With best regards,
Alexander Korotkov.
Alexander Korotkov <aekorotkov@gmail.com> writes: > On Thu, Jan 16, 2014 at 3:34 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I looked at this patch a bit. It seems like this: >> + * BLANK_COLOR_SIZE - How much blank character is more frequent than >> + * other character in average >> + #define BLANK_COLOR_SIZE 32 >> is a drastic overestimate. > OK, I would like to make more reasoning for penalty. > Let's consider word of length n. > It contains n+1 trigrams including: > 1 __x trigram > 1 _xx trigram > 1 xx_ trigram > n - 2 xxx trigrams > Assume alphabet of size m those trigrams have following average frequencies: > __x 1/((n+1)*m) > _xx 1/((n+1)*m^2) > xx_ 1/((n+1)*m^2) > xxx (n-2)/((n+1)*m^3) Hmm, but you're assuming that the m letters are all equally frequent, which is surely not true in normal text. I didn't have a machine-readable copy of Hemingway or Tolstoy at hand, but I do have a text file with the collected works of H.P. Lovecraft, so I tried analyzing the trigrams in that. I find that * The average word length is 4.78 characters. * There are 5652 distinct trigrams (discounting some containing digits), the most common one (' t') occurring 81222 times; the average occurrence count is 500.0. * Considering only trigrams not containing any blanks, there are 4937 of them, the most common one ('the') occurring 45832 times, with an average count of 266.9. * There are (unsurprisingly) exactly 26 trigrams of the form ' x', with an average count of 19598.5. * There are 689 trigrams of the form ' xx' or 'xx ', the most common one (' th') occurring 58200 times, with an average count of 1450.1. So, we've rediscovered the fact that 'the' is the most common word in English text ;-). But I think the significant conclusion for our purposes here is that single-space trigrams are on average about 1450.1/266.9 = 5.4 times more common than the average space-free trigram, and two-space trigrams about 19598.5/266.9 = 73.4 times more common. So this leads me to the conclusion that we should be using a BLANK_COLOR_SIZE value around 5 or 6 (with maybe something other than a square-law rule for two-space trigrams). Even using your numbers, it shouldn't be 32. regards, tom lane
On Mon, Feb 10, 2014 at 1:01 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
------
With best regards,
Alexander Korotkov.
Alexander Korotkov <aekorotkov@gmail.com> writes:
> On Thu, Jan 16, 2014 at 3:34 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:>> I looked at this patch a bit. It seems like this:
>> + * BLANK_COLOR_SIZE - How much blank character is more frequent than
>> + * other character in average
>> + #define BLANK_COLOR_SIZE 32
>> is a drastic overestimate.> OK, I would like to make more reasoning for penalty.Hmm, but you're assuming that the m letters are all equally frequent,
> Let's consider word of length n.
> It contains n+1 trigrams including:
> 1 __x trigram
> 1 _xx trigram
> 1 xx_ trigram
> n - 2 xxx trigrams
> Assume alphabet of size m those trigrams have following average frequencies:
> __x 1/((n+1)*m)
> _xx 1/((n+1)*m^2)
> xx_ 1/((n+1)*m^2)
> xxx (n-2)/((n+1)*m^3)
which is surely not true in normal text.
I didn't have a machine-readable copy of Hemingway or Tolstoy at hand,
but I do have a text file with the collected works of H.P. Lovecraft,
so I tried analyzing the trigrams in that. I find that
* The average word length is 4.78 characters.
* There are 5652 distinct trigrams (discounting some containing digits),
the most common one (' t') occurring 81222 times; the average
occurrence count is 500.0.
* Considering only trigrams not containing any blanks, there are
4937 of them, the most common one ('the') occurring 45832 times,
with an average count of 266.9.
* There are (unsurprisingly) exactly 26 trigrams of the form ' x',
with an average count of 19598.5.
* There are 689 trigrams of the form ' xx' or 'xx ', the most common one
(' th') occurring 58200 times, with an average count of 1450.1.
So, we've rediscovered the fact that 'the' is the most common word in
English text ;-). But I think the significant conclusion for our purposes
here is that single-space trigrams are on average about 1450.1/266.9 = 5.4
times more common than the average space-free trigram, and two-space
trigrams about 19598.5/266.9 = 73.4 times more common.
So this leads me to the conclusion that we should be using a
BLANK_COLOR_SIZE value around 5 or 6 (with maybe something other than
a square-law rule for two-space trigrams). Even using your numbers,
it shouldn't be 32.
Next revision of patch is attached. Changes are so:
1) Notion "penalty" is used instead of "size".
2) We try to reduce total penalty to WISH_TRGM_PENALTY, but restriction is MAX_TRGM_COUNT total trigrams count.
3) Penalties are assigned to particular color trigram classes. I.e. separate penalties for __a, _aa, _a_, aa_. It's based on analysis of trigram frequencies in Oscar Wilde writings. We can end up with different numbers, but I don't think they will be dramatically different.
------
With best regards,
Alexander Korotkov.
Attachment
I went back and tried Erik's original test (http://www.postgresql.org/message-id/dafad644f268ce1503e1b8b682aae38a.squirrel@webmail.xs4all.nl). With a fresh checkout from master, the difference between the slow and fast queries is much less dramatic than Erik reported. The reason is that Alexander's GIN "fast scan" patch is very effective on that query. Erik reported that the '^abcd' query took 140ms, vs 5ms for 'abcd'. On my laptop, the numbers with a fresh checkout are about 2.5 ms vs. 1 ms, and with fast scan disabled (by modifying the source code), 40ms vs 1ms. So thanks to the fast scan patch, I don't think this patch is worth pursuing anymore. Unless there are some other test case where this patch helps, but the fast scan patch doesn't. - Heikki
On Fri, March 28, 2014 09:31, Heikki Linnakangas wrote: > I went back and tried Erik's original test > (http://www.postgresql.org/message-id/dafad644f268ce1503e1b8b682aae38a.squirrel@webmail.xs4all.nl). > With a fresh checkout from master, the difference between the slow and > fast queries is much less dramatic than Erik reported. The reason is > that Alexander's GIN "fast scan" patch is very effective on that query. > Erik reported that the '^abcd' query took 140ms, vs 5ms for 'abcd'. On > my laptop, the numbers with a fresh checkout are about 2.5 ms vs. 1 ms, > and with fast scan disabled (by modifying the source code), 40ms vs 1ms. > > So thanks to the fast scan patch, I don't think this patch is worth > pursuing anymore. Unless there are some other test case where this patch > helps, but the fast scan patch doesn't. > for the same 2 statements of my original test: explain (analyze,buffers) select txt from azjunk6 where txt ~ '^abcd'; -- slow (140 ms) explain (analyze,buffers) select txt from azjunk6 where txt ~ 'abcd' and substr(txt,1,4) = 'abcd'; -- fast (5 ms) You mention (from HEAD, I suppose?) a difference of 2.5 ms vs. 1 ms. FWIW, for me the difference (from HEAD) remains quite a bit larger: for n in `seq 1 10`; do ./trgm_peculiarity.sh ; done | grep runtime Total runtime: 16.167 msTotal runtime: 2.188 msTotal runtime: 16.902 msTotal runtime: 2.203 msTotal runtime: 17.486 msTotalruntime: 2.201 msTotal runtime: 17.663 msTotal runtime: 2.441 msTotal runtime: 13.555 msTotal runtime: 2.204 msTotalruntime: 16.862 msTotal runtime: 2.225 msTotal runtime: 13.207 msTotal runtime: 2.550 msTotal runtime: 16.768 msTotalruntime: 2.172 msTotal runtime: 19.259 msTotal runtime: 2.180 msTotal runtime: 12.934 msTotal runtime: 2.198 ms That's a lot better than the original 140ms vs 5ms but your laptop's 2.5 ms vs. 1 ms is perhaps not representative. (for the full plans see below) Erik Rijkers ----------------------------------------------------------------------------------------------------------------------------------Bitmap HeapScan on azjunk6 (cost=56.77..432.93 rows=100 width=81) (actual time=15.898..15.925 rows=2 loops=1) Recheck Cond: (txt~ '^abcd'::text) Rows Removed by Index Recheck: 11 Heap Blocks: exact=13 Buffers: shared hit=105 -> Bitmap IndexScan on azjunk6_trgm_re_idx (cost=0.00..56.75 rows=100 width=0) (actual time=15.834..15.834 rows=13 loops=1) Index Cond: (txt ~ '^abcd'::text) Buffers: shared hit=92Planning time: 3.304 msTotal runtime: 16.179ms (10 rows) Time: 21.103 ms explain (analyze,buffers) select txt from azjunk6 where txt ~ 'abcd' and substr(txt,1,4) = 'abcd'; -- fast (5 ms) QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------------Bitmap HeapScan on azjunk6 (cost=28.75..405.40 rows=1 width=81) (actual time=1.681..2.164 rows=2 loops=1) Recheck Cond: (txt ~'abcd'::text) Rows Removed by Index Recheck: 11 Filter: (substr(txt, 1, 4) = 'abcd'::text) Rows Removed by Filter: 101 Heap Blocks: exact=113 Buffers: shared hit=120 -> Bitmap Index Scan on azjunk6_trgm_re_idx (cost=0.00..28.75 rows=100width=0) (actual time=1.171..1.171 rows=114 loops=1) Index Cond: (txt ~ 'abcd'::text) Buffers: shared hit=7Planning time: 0.516 msTotal runtime: 2.183ms (12 rows)
"Erik Rijkers" <er@xs4all.nl> writes: > On Fri, March 28, 2014 09:31, Heikki Linnakangas wrote: >> So thanks to the fast scan patch, I don't think this patch is worth >> pursuing anymore. Unless there are some other test case where this patch >> helps, but the fast scan patch doesn't. > FWIW, for me the difference (from HEAD) remains quite a bit larger: > ... > That's a lot better than the original 140ms vs 5ms but your laptop's 2.5 ms vs. 1 ms is perhaps not representative. I also feel that Alexander's patch is still worth pursuing. What I see (in an assert-disabled build of HEAD) is that Erik's "slow" query takes about 2 ms to plan and 5.5 ms to execute, while the "fast" query is about 0.7 ms to plan and 1.3 ms to execute. With the patch, the "slow" query is still about 2 ms to plan but only 0.3 ms to execute, while the "fast" query doesn't change much. There's a lot of noise in these numbers (successive executions seem to be bouncing around more than usual today), but still it looks like the runtime gain is impressive percentagewise. One thing that's interesting is that the planning time is so much more for the "slow" query, even though the "fast" one has an additional WHERE clause that the planner has to think about. I haven't tried to do perf measurements to be sure, but my guess is that this has nothing to do with GIN or pg_trgm directly, but represents the planner's efforts to get a selectivity estimate for the ~ operator --- that code works much differently for patterns that define fixed prefixes vs those that don't. Anyway, the important point here is that I think the planning time is helping to mask the fact that there's a pretty useful runtime improvement. regards, tom lane
Alexander Korotkov <aekorotkov@gmail.com> writes: > Next revision of patch is attached. Changes are so: > 1) Notion "penalty" is used instead of "size". > 2) We try to reduce total penalty to WISH_TRGM_PENALTY, but restriction is > MAX_TRGM_COUNT total trigrams count. > 3) Penalties are assigned to particular color trigram classes. I.e. > separate penalties for __a, _aa, _a_, aa_. It's based on analysis of > trigram frequencies in Oscar Wilde writings. We can end up with different > numbers, but I don't think they will be dramatically different. Committed with cosmetic improvements (adjusting the comments mostly). The new whitespace penalties look reasonably sane to me. I wonder though if WISH_TRGM_PENALTY is too small --- it seems like this code will tend to select many fewer trigrams than the old code did. What testing did you do that led you to select the specific value of 16? regards, tom lane