Thread: trgm regex index peculiarity

trgm regex index peculiarity

From

"Erik Rijkers"

Date:

21 June 2013, 01:38:32

9.4devel (but same in 9.3)

In a 112 MB test table (containing random generated text) with a trgm index (gin_trgm_ops), I consistently get these
timings:

select txt from azjunk6 where txt ~ '^abcd';  130 ms


select txt from azjunk6
where txt ~ 'abcd' and substr(txt,1,4) = 'abcd';  3 ms

(a similar performance difference occurs when using a regex, i.e. 'abc[de]'  )

This difference is so large that I wonder if there is not something wrong in the first case. (The returned results are
correct though)

Here are the two explains:

explain analyze select txt from azjunk6 where txt ~ '^abcd';
QUERYPLAN
 

-------------------------------------------------------------------------------------------------------------------------------------Bitmap
HeapScan on azjunk6  (cost=108.78..484.93 rows=100 width=81) (actual time=129.557..129.742 rows=1 loops=1)  Recheck
Cond:(txt ~ '^abcd'::text)  Rows Removed by Index Recheck: 17  ->  Bitmap Index Scan on azjunk6_trgm_re_idx
(cost=0.00..108.75rows=100 width=0) (actual time=129.503..129.503 rows=18
 
loops=1)        Index Cond: (txt ~ '^abcd'::text)Total runtime: 130.008 ms
(6 rows)

explain analyze select txt from azjunk6 where txt ~ 'abcd' and substr(txt,1,4) = 'abcd';
                         QUERY PLAN
 

---------------------------------------------------------------------------------------------------------------------------------Bitmap
HeapScan on azjunk6  (cost=56.75..433.40 rows=1 width=81) (actual time=2.064..3.379 rows=1 loops=1)  Recheck Cond: (txt
~'abcd'::text)  Rows Removed by Index Recheck: 14  Filter: (substr(txt, 1, 4) = 'abcd'::text)  Rows Removed by Filter:
112 ->  Bitmap Index Scan on azjunk6_trgm_re_idx  (cost=0.00..56.75 rows=100 width=0) (actual time=1.911..1.911
rows=127
loops=1)        Index Cond: (txt ~ 'abcd'::text)Total runtime: 3.409 ms
(8 rows)


The results in both cases are correct, but does this difference not almost amount to a bug?

( Interestingly, the variant     WHERE txt ~ 'abcd$'
is as fast as the non-anchored variant )


Thanks,

Erik Rijkers

Re: trgm regex index peculiarity

From

Tom Lane

Date:

21 June 2013, 03:26:03

"Erik Rijkers" <er@xs4all.nl> writes:
> In a 112 MB test table (containing random generated text) with a trgm index (gin_trgm_ops), I consistently get these
timings:
> select txt from azjunk6 where txt ~ '^abcd';
>    130 ms
> select txt from azjunk6
> where txt ~ 'abcd' and substr(txt,1,4) = 'abcd';
>    3 ms

Hm, could you provide a self-contained test case?
        regards, tom lane

Re: trgm regex index peculiarity

From

"Erik Rijkers"

Date:

21 June 2013, 10:40:43

On Fri, June 21, 2013 05:25, Tom Lane wrote:
> "Erik Rijkers" <er@xs4all.nl> writes:
>> In a 112 MB test table (containing random generated text) with a trgm index (gin_trgm_ops), I consistently get
these
>> timings:
>> select txt from azjunk6 where txt ~ '^abcd';
>>    130 ms
>> select txt from azjunk6
>> where txt ~ 'abcd' and substr(txt,1,4) = 'abcd';
>>    3 ms
>
> Hm, could you provide a self-contained test case?
>

yes, sorry.   I tested on a 1M row table:

#!/bin/sh

# create table:
for power in 6;
do table=azjunk${power} index=${table}_trgm_re_idx perl -E' sub ss{ join"",@_[ map{rand @_} 1 .. shift ] };
say(ss(80,"a".."g","","h".."m"," ","n".."s"," ","t".."z")) for 1 .. 1e'"${power};" \ | psql -aqXc "   drop table if
exists$table;   create table $table(txt text);   copy $table from stdin;"; echo "set session
maintenance_work_mem='1GB';  create index $index on $table using gin (txt gin_trgm_ops);   analyze $table;" | psql
-qtAX;
done

# test:
echo "
\\timing on
explain analyze select txt from azjunk6 where txt ~ '^abcd';  -- slow (140 ms)
explain analyze select txt from azjunk6 where txt ~ 'abcd' and substr(txt,1,4) = 'abcd'; -- fast (5 ms)
" | psql -Xqa

Re: trgm regex index peculiarity

From

Alexander Korotkov

Date:

21 June 2013, 13:11:27

On Fri, Jun 21, 2013 at 2:40 PM, Erik Rijkers <er@xs4all.nl> wrote:

On Fri, June 21, 2013 05:25, Tom Lane wrote:
> "Erik Rijkers" <er@xs4all.nl> writes:
>> In a 112 MB test table (containing random generated text) with a trgm index (gin_trgm_ops), I consistently get these
>> timings:
>> select txt from azjunk6 where txt ~ '^abcd';
>> 130 ms
>> select txt from azjunk6
>> where txt ~ 'abcd' and substr(txt,1,4) = 'abcd';
>> 3 ms
>
> Hm, could you provide a self-contained test case?
>

yes, sorry. I tested on a 1M row table:

#!/bin/sh

# create table:
for power in 6;
do
table=azjunk${power}
index=${table}_trgm_re_idx
perl -E'
sub ss{ join"",@_[ map{rand @_} 1 .. shift ] };
say(ss(80,"a".."g"," ","h".."m"," ","n".."s"," ","t".."z")) for 1 .. 1e'"${power};" \
| psql -aqXc "
drop table if exists $table;
create table $table(txt text);
copy $table from stdin;";
echo "set session maintenance_work_mem='1GB';
create index $index on $table using gin (txt gin_trgm_ops);
analyze $table;" | psql -qtAX;
done

# test:
echo "
\\timing on
explain analyze select txt from azjunk6 where txt ~ '^abcd'; -- slow (140 ms)
explain analyze select txt from azjunk6 where txt ~ 'abcd' and substr(txt,1,4) = 'abcd'; -- fast (5 ms)
" | psql -Xqa

Regex '^abcd' will be expanded into trigrams '__a', '_ab', 'abc' and 'bcd'. However trigrams '__a' is much more frequent than '_ab' which in turn is much more frequent than 'abc' and 'bcd'. Ommiting of ^ leads to ommiting of '__a' and '_ab' and that gives so significant speedup.

The problem is that trigram regex engine doesn't know that '__a' and '_ab' is more frequent than another trigrams because we don't know their distribution. However, simple assumption that blank character (in pg_trgm meaning) is much more frequent than others seems to be true for most cases.

Attached patch implementing this assumption. It introduces BLANK_COLOR_SIZE macro which is used in selectColorTrigrams like count of characters in COLOR_BLANK. Another change in this patch is split of MAX_TRGM_COUNT into WISH_TRGM_SIZE and MAX_TRGM_SIZE. The idea is to keep trying to remove color trigrams from graph even when it fits into MAX_TRGM_SIZE, because we are intending to scan less posting lists/trees.

Comments is not fixed yet, coming soon.

------
With best regards,

Alexander Korotkov.

Attachment

trgm_regex_optimize.1.patch

Re: trgm regex index peculiarity

From

"Erik Rijkers"

Date:

21 June 2013, 13:39:56

On Fri, June 21, 2013 15:11, Alexander Korotkov wrote:
> On Fri, Jun 21, 2013 at 2:40 PM, Erik Rijkers <er@xs4all.nl> wrote:
>
>> On Fri, June 21, 2013 05:25, Tom Lane wrote:
>> > "Erik Rijkers" <er@xs4all.nl> writes:
>> >> In a 112 MB test table (containing random generated text) with a trgm
>> index (gin_trgm_ops), I consistently get these
>> >> timings:
>> >> select txt from azjunk6 where txt ~ '^abcd';
>> >>    130 ms
>> >> select txt from azjunk6
>> >> where txt ~ 'abcd' and substr(txt,1,4) = 'abcd';
>> >>    3 ms
>> >
>
> Regex '^abcd' will be expanded into trigrams '__a', '_ab', 'abc' and 'bcd'.
> However trigrams '__a' is much more frequent than '_ab' which in turn is
> much more frequent than 'abc' and 'bcd'. Ommiting of ^ leads to ommiting of
> '__a' and '_ab' and that gives so significant speedup.

> [trgm_regex_optimize.1.patch ]

Yes, that fixes the problem, thanks.

Erik Rijkers

Re: trgm regex index peculiarity

From

Alexander Korotkov

Date:

15 January 2014, 13:36:21

On Fri, Jun 21, 2013 at 5:39 PM, Erik Rijkers <er@xs4all.nl> wrote:

On Fri, June 21, 2013 15:11, Alexander Korotkov wrote:
> On Fri, Jun 21, 2013 at 2:40 PM, Erik Rijkers <er@xs4all.nl> wrote:
>
>> On Fri, June 21, 2013 05:25, Tom Lane wrote:
>> > "Erik Rijkers" <er@xs4all.nl> writes:
>> >> In a 112 MB test table (containing random generated text) with a trgm
>> index (gin_trgm_ops), I consistently get these
>> >> timings:
>> >> select txt from azjunk6 where txt ~ '^abcd';
>> >> 130 ms
>> >> select txt from azjunk6
>> >> where txt ~ 'abcd' and substr(txt,1,4) = 'abcd';
>> >> 3 ms
>> >
>
> Regex '^abcd' will be expanded into trigrams '__a', '_ab', 'abc' and 'bcd'.
> However trigrams '__a' is much more frequent than '_ab' which in turn is
> much more frequent than 'abc' and 'bcd'. Ommiting of ^ leads to ommiting of
> '__a' and '_ab' and that gives so significant speedup.

> [trgm_regex_optimize.1.patch ]

Yes, that fixes the problem, thanks.

Revised version of patch with necessary comments.

------
With best regards,
Alexander Korotkov.

Attachment

trgm-regex-optimize.2.patch

Re: trgm regex index peculiarity

From

Tom Lane

Date:

15 January 2014, 23:34:22

Alexander Korotkov <aekorotkov@gmail.com> writes:
> Revised version of patch with necessary comments.

I looked at this patch a bit. It seems like this:

+ * BLANK_COLOR_SIZE - How much blank character is more frequent than
+ * other character in average
+ #define BLANK_COLOR_SIZE 32

is a drastic overestimate. Using the program Erik provided to generate
some sample data, I find that blanks in trigrams are maybe about three
times more common than other characters. Possibly Erik's data isn't
very representative of real text, but if that's the case we'd better
get a more representative sample case before we start optimizing ...

Now maybe the above number is okay for this purpose anyway, but if so
it needs some different justification; maybe call it a "penalty"?
But nonetheless it seems like a darn large penalty, particularly for " x"
type trigrams where the value would effectively be squared. Compared to
the proposed values of MAX_TRGM_SIZE and WISH_TRGM_SIZE, it seems like
you might as well forget the "sizing" approach and just flat out reject
trigrams containing COLOR_BLANK, because they'll never get through the
size filter. I'm inclined to think you need a larger MAX_TRGM_SIZE;
and WISH_TRGM_SIZE seems mighty small as well, compared to what the
code would have done before.

Also, surely this bit:

! trgmNFA->totalTrgmCount = (int) totalTrgmSize;

is flat out wrong? expandColorTrigrams() expects that
trgmNFA->totalTrgmCount is the truth, not some penalty-bloated
abstraction. The fact that the patch hasn't failed on you completely
demonstrates that you've not tested any cases where blank-containing
trigrams got through the filter, reinforcing my feeling that it's
probably too strict now.

I wonder if it would be more appropriate to continue to measure the limit
MAX_TRGM_COUNT in terms of actual trigram counts, and just use the
penalized "size" as the sort key while determining which color trigrams
to discard first.

Another thought is that it's not clear that you should apply the same
penalty to blanks in all three positions. Because of the padding
settings, any one word will normally lead to padded strings " a",
" ab", "yz ", so that spaces in the first position are about twice
as common as spaces in the other positions. (It's a little less
than that because single-letter words produce only " a" and " a ";
but I'd think that such words aren't very common in text that trigrams
are effective for.) So I'm thinking the penalty ought to take that
into account.

I'm also inclined to think that it might be worth adding a size
field to ColorTrgmInfo rather than repeatedly calculating the
size estimate. We only allow a thousand and some ColorTrgmInfos
at most, so the extra space wouldn't be that large, and I'm concerned
by the expense the current patch adds to the sort comparator.

Another point is that this comment:
* Note: per-color-trigram counts cannot overflow an int so long as * COLOR_COUNT_LIMIT is not more than the cube root
ofINT_MAX, ie about * 1290. However, the grand total totalTrgmCount might conceivably * overflow an int, so we use
int64for that within this routine.

is no longer valid, or at least it fails to demonstrate that the result
of getColorTrigramSize() can't overflow an int; indeed I rather fear it can.
The patch failed to even update the variable name in this comment, let
alone address the substantive question.

There are some other cosmetic things I didn't like, for instance
colorTrgmInfoCountCmp() is no longer comparing counts but you didn't
rename it. That's about it for substantive comments though.
regards, tom lane

Re: trgm regex index peculiarity

From

Alexander Korotkov

Date:

09 February 2014, 19:29:24

On Thu, Jan 16, 2014 at 3:34 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Alexander Korotkov <aekorotkov@gmail.com> writes:
> Revised version of patch with necessary comments.

I looked at this patch a bit. It seems like this:

+ * BLANK_COLOR_SIZE - How much blank character is more frequent than
+ * other character in average
+ #define BLANK_COLOR_SIZE 32

is a drastic overestimate. Using the program Erik provided to generate
some sample data, I find that blanks in trigrams are maybe about three
times more common than other characters. Possibly Erik's data isn't
very representative of real text, but if that's the case we'd better
get a more representative sample case before we start optimizing ...

Now maybe the above number is okay for this purpose anyway, but if so
it needs some different justification; maybe call it a "penalty"?
But nonetheless it seems like a darn large penalty, particularly for " x"
type trigrams where the value would effectively be squared. Compared to
the proposed values of MAX_TRGM_SIZE and WISH_TRGM_SIZE, it seems like
you might as well forget the "sizing" approach and just flat out reject
trigrams containing COLOR_BLANK, because they'll never get through the
size filter.

It seems to be right decision to me when we have other trigrams can reject them. But there are cases when we can't.

I'm inclined to think you need a larger MAX_TRGM_SIZE;
and WISH_TRGM_SIZE seems mighty small as well, compared to what the
code would have done before.

OK, I would like to make more reasoning for penalty.

Let's consider word of length n.

It contains n+1 trigrams including:

1 __x trigram

1 _xx trigram

1 xx_ trigram

n - 2 xxx trigrams

Assume alphabet of size m those trigrams have following average frequencies:

__x 1/((n+1)*m)

_xx 1/((n+1)*m^2)

xx_ 1/((n+1)*m^2)

xxx (n-2)/((n+1)*m^3)

The ratio of this frequencies with blanks to frequency of xxx is:

__x m^2/(n-2)

_xx and xx_ m/(n-2)

In order to estimate n I've analyzed some classics:

Ernest Hemingway "A farewell to arms" - 3.8

Leo Tolstoy "War and Peace" - 5.0

In english with alphabet size = 26 we have

__x m^2/(n-2) = 375

_xx and xx_ m/(n-2) = 14.4

In russian with alphabet size = 33 we have

__x m^2/(n-2) = 363

_xx and xx_ m/(n-2) = 11

Assuming these calculations is approximate, we can safely use same values for both languages.

Does such reasoning make sense?

Also, surely this bit:

! trgmNFA->totalTrgmCount = (int) totalTrgmSize;

is flat out wrong? expandColorTrigrams() expects that
trgmNFA->totalTrgmCount is the truth, not some penalty-bloated
abstraction. The fact that the patch hasn't failed on you completely
demonstrates that you've not tested any cases where blank-containing
trigrams got through the filter, reinforcing my feeling that it's
probably too strict now.

Oh, I wonder how can I leave such weird bug in patch :-(

I wonder if it would be more appropriate to continue to measure the limit
MAX_TRGM_COUNT in terms of actual trigram counts, and just use the
penalized "size" as the sort key while determining which color trigrams
to discard first.

Agree. But I would like to save some equivalent of WISH_TRGM_SIZE.

Another thought is that it's not clear that you should apply the same
penalty to blanks in all three positions. Because of the padding
settings, any one word will normally lead to padded strings " a",
" ab", "yz ", so that spaces in the first position are about twice
as common as spaces in the other positions. (It's a little less
than that because single-letter words produce only " a" and " a ";
but I'd think that such words aren't very common in text that trigrams
are effective for.) So I'm thinking the penalty ought to take that
into account.

I'm also inclined to think that it might be worth adding a size
field to ColorTrgmInfo rather than repeatedly calculating the
size estimate. We only allow a thousand and some ColorTrgmInfos
at most, so the extra space wouldn't be that large, and I'm concerned
by the expense the current patch adds to the sort comparator.

Agree.

Another point is that this comment:

* Note: per-color-trigram counts cannot overflow an int so long as
* COLOR_COUNT_LIMIT is not more than the cube root of INT_MAX, ie about
* 1290. However, the grand total totalTrgmCount might conceivably
* overflow an int, so we use int64 for that within this routine.

is no longer valid, or at least it fails to demonstrate that the result
of getColorTrigramSize() can't overflow an int; indeed I rather fear it can.
The patch failed to even update the variable name in this comment, let
alone address the substantive question.

There are some other cosmetic things I didn't like, for instance
colorTrgmInfoCountCmp() is no longer comparing counts but you didn't
rename it. That's about it for substantive comments though.

Thanks, will be fixed.

------
With best regards,
Alexander Korotkov.

Re: trgm regex index peculiarity

From

Tom Lane

Date:

09 February 2014, 21:02:10

Alexander Korotkov <aekorotkov@gmail.com> writes:
> On Thu, Jan 16, 2014 at 3:34 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I looked at this patch a bit.  It seems like this:
>> +  *    BLANK_COLOR_SIZE - How much blank character is more frequent than
>> +  *                       other character in average
>> + #define BLANK_COLOR_SIZE      32
>> is a drastic overestimate.

> OK, I would like to make more reasoning for penalty.
> Let's consider word of length n.
> It contains n+1 trigrams including:
> 1 __x trigram
> 1 _xx trigram
> 1 xx_ trigram
> n - 2 xxx trigrams

> Assume alphabet of size m those trigrams have following average frequencies:
> __x 1/((n+1)*m)
> _xx 1/((n+1)*m^2)
> xx_ 1/((n+1)*m^2)
> xxx (n-2)/((n+1)*m^3)

Hmm, but you're assuming that the m letters are all equally frequent,
which is surely not true in normal text.

I didn't have a machine-readable copy of Hemingway or Tolstoy at hand,
but I do have a text file with the collected works of H.P. Lovecraft,
so I tried analyzing the trigrams in that.  I find that
* The average word length is 4.78 characters.
* There are 5652 distinct trigrams (discounting some containing digits),
the most common one ('  t') occurring 81222 times; the average
occurrence count is 500.0.
* Considering only trigrams not containing any blanks, there are
4937 of them, the most common one ('the') occurring 45832 times,
with an average count of 266.9.
* There are (unsurprisingly) exactly 26 trigrams of the form '  x',
with an average count of 19598.5.
* There are 689 trigrams of the form ' xx' or 'xx ', the most common one
(' th') occurring 58200 times, with an average count of 1450.1.

So, we've rediscovered the fact that 'the' is the most common word in
English text ;-).  But I think the significant conclusion for our purposes
here is that single-space trigrams are on average about 1450.1/266.9 = 5.4
times more common than the average space-free trigram, and two-space
trigrams about 19598.5/266.9 = 73.4 times more common.

So this leads me to the conclusion that we should be using a
BLANK_COLOR_SIZE value around 5 or 6 (with maybe something other than
a square-law rule for two-space trigrams).  Even using your numbers,
it shouldn't be 32.
        regards, tom lane

Re: trgm regex index peculiarity

From

Alexander Korotkov

Date:

01 March 2014, 16:41:35

On Mon, Feb 10, 2014 at 1:01 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Alexander Korotkov <aekorotkov@gmail.com> writes:
> On Thu, Jan 16, 2014 at 3:34 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I looked at this patch a bit. It seems like this:
>> + * BLANK_COLOR_SIZE - How much blank character is more frequent than
>> + * other character in average
>> + #define BLANK_COLOR_SIZE 32
>> is a drastic overestimate.

> OK, I would like to make more reasoning for penalty.
> Let's consider word of length n.
> It contains n+1 trigrams including:
> 1 __x trigram
> 1 _xx trigram
> 1 xx_ trigram
> n - 2 xxx trigrams

> Assume alphabet of size m those trigrams have following average frequencies:
> __x 1/((n+1)*m)
> _xx 1/((n+1)*m^2)
> xx_ 1/((n+1)*m^2)
> xxx (n-2)/((n+1)*m^3)

Hmm, but you're assuming that the m letters are all equally frequent,
which is surely not true in normal text.

I didn't have a machine-readable copy of Hemingway or Tolstoy at hand,
but I do have a text file with the collected works of H.P. Lovecraft,
so I tried analyzing the trigrams in that. I find that
* The average word length is 4.78 characters.
* There are 5652 distinct trigrams (discounting some containing digits),
the most common one (' t') occurring 81222 times; the average
occurrence count is 500.0.
* Considering only trigrams not containing any blanks, there are
4937 of them, the most common one ('the') occurring 45832 times,
with an average count of 266.9.
* There are (unsurprisingly) exactly 26 trigrams of the form ' x',
with an average count of 19598.5.
* There are 689 trigrams of the form ' xx' or 'xx ', the most common one
(' th') occurring 58200 times, with an average count of 1450.1.

So, we've rediscovered the fact that 'the' is the most common word in
English text ;-). But I think the significant conclusion for our purposes
here is that single-space trigrams are on average about 1450.1/266.9 = 5.4
times more common than the average space-free trigram, and two-space
trigrams about 19598.5/266.9 = 73.4 times more common.

So this leads me to the conclusion that we should be using a
BLANK_COLOR_SIZE value around 5 or 6 (with maybe something other than
a square-law rule for two-space trigrams). Even using your numbers,
it shouldn't be 32.

Next revision of patch is attached. Changes are so:

1) Notion "penalty" is used instead of "size".

2) We try to reduce total penalty to WISH_TRGM_PENALTY, but restriction is MAX_TRGM_COUNT total trigrams count.

3) Penalties are assigned to particular color trigram classes. I.e. separate penalties for __a, _aa, _a_, aa_. It's based on analysis of trigram frequencies in Oscar Wilde writings. We can end up with different numbers, but I don't think they will be dramatically different.

------
With best regards,
Alexander Korotkov.

Attachment

trgm-regex-optimize.3.patch

Re: trgm regex index peculiarity

From

Heikki Linnakangas

Date:

28 March 2014, 08:31:31

I went back and tried Erik's original test 
(http://www.postgresql.org/message-id/dafad644f268ce1503e1b8b682aae38a.squirrel@webmail.xs4all.nl). 
With a fresh checkout from master, the difference between the slow and 
fast queries is much less dramatic than Erik reported. The reason is 
that Alexander's GIN "fast scan" patch is very effective on that query. 
Erik reported that the '^abcd' query took 140ms, vs 5ms for 'abcd'. On 
my laptop, the numbers with a fresh checkout are about 2.5 ms vs. 1 ms, 
and with fast scan disabled (by modifying the source code), 40ms vs 1ms.

So thanks to the fast scan patch, I don't think this patch is worth 
pursuing anymore. Unless there are some other test case where this patch 
helps, but the fast scan patch doesn't.

- Heikki

Re: trgm regex index peculiarity

From

"Erik Rijkers"

Date:

28 March 2014, 22:28:13

On Fri, March 28, 2014 09:31, Heikki Linnakangas wrote:
> I went back and tried Erik's original test
> (http://www.postgresql.org/message-id/dafad644f268ce1503e1b8b682aae38a.squirrel@webmail.xs4all.nl).
> With a fresh checkout from master, the difference between the slow and
> fast queries is much less dramatic than Erik reported. The reason is
> that Alexander's GIN "fast scan" patch is very effective on that query.
> Erik reported that the '^abcd' query took 140ms, vs 5ms for 'abcd'. On
> my laptop, the numbers with a fresh checkout are about 2.5 ms vs. 1 ms,
> and with fast scan disabled (by modifying the source code), 40ms vs 1ms.
>
> So thanks to the fast scan patch, I don't think this patch is worth
> pursuing anymore. Unless there are some other test case where this patch
> helps, but the fast scan patch doesn't.
>

for the same 2 statements of my original test:
explain (analyze,buffers) select txt from azjunk6 where txt ~ '^abcd'; -- slow (140 ms)
explain (analyze,buffers) select txt from azjunk6 where txt ~  'abcd' and substr(txt,1,4) = 'abcd'; -- fast (5 ms)

You mention (from HEAD, I suppose?) a difference of 2.5 ms vs. 1 ms.

FWIW, for me the difference (from HEAD) remains quite a bit larger:

for n in `seq 1 10`; do ./trgm_peculiarity.sh ; done | grep runtime
Total runtime: 16.167 msTotal runtime: 2.188 msTotal runtime: 16.902 msTotal runtime: 2.203 msTotal runtime: 17.486
msTotalruntime: 2.201 msTotal runtime: 17.663 msTotal runtime: 2.441 msTotal runtime: 13.555 msTotal runtime: 2.204
msTotalruntime: 16.862 msTotal runtime: 2.225 msTotal runtime: 13.207 msTotal runtime: 2.550 msTotal runtime: 16.768
msTotalruntime: 2.172 msTotal runtime: 19.259 msTotal runtime: 2.180 msTotal runtime: 12.934 msTotal runtime: 2.198 ms
 

That's a lot better than the original 140ms vs 5ms  but your laptop's  2.5 ms vs. 1 ms  is perhaps not representative.

(for the full plans see below)


Erik Rijkers




----------------------------------------------------------------------------------------------------------------------------------Bitmap
HeapScan on azjunk6  (cost=56.77..432.93 rows=100 width=81) (actual time=15.898..15.925 rows=2 loops=1)  Recheck Cond:
(txt~ '^abcd'::text)  Rows Removed by Index Recheck: 11  Heap Blocks: exact=13  Buffers: shared hit=105  ->  Bitmap
IndexScan on azjunk6_trgm_re_idx  (cost=0.00..56.75 rows=100 width=0) (actual time=15.834..15.834 rows=13
 
loops=1)        Index Cond: (txt ~ '^abcd'::text)        Buffers: shared hit=92Planning time: 3.304 msTotal runtime:
16.179ms
 
(10 rows)

Time: 21.103 ms
explain (analyze,buffers) select txt from azjunk6 where txt ~  'abcd' and substr(txt,1,4) = 'abcd'; -- fast (5 ms)
                                                   QUERY PLAN
 

---------------------------------------------------------------------------------------------------------------------------------Bitmap
HeapScan on azjunk6  (cost=28.75..405.40 rows=1 width=81) (actual time=1.681..2.164 rows=2 loops=1)  Recheck Cond: (txt
~'abcd'::text)  Rows Removed by Index Recheck: 11  Filter: (substr(txt, 1, 4) = 'abcd'::text)  Rows Removed by Filter:
101 Heap Blocks: exact=113  Buffers: shared hit=120  ->  Bitmap Index Scan on azjunk6_trgm_re_idx  (cost=0.00..28.75
rows=100width=0) (actual time=1.171..1.171 rows=114
 
loops=1)        Index Cond: (txt ~ 'abcd'::text)        Buffers: shared hit=7Planning time: 0.516 msTotal runtime:
2.183ms
 
(12 rows)

Re: trgm regex index peculiarity

From

Tom Lane

Date:

05 April 2014, 18:23:42

"Erik Rijkers" <er@xs4all.nl> writes:
> On Fri, March 28, 2014 09:31, Heikki Linnakangas wrote:
>> So thanks to the fast scan patch, I don't think this patch is worth
>> pursuing anymore. Unless there are some other test case where this patch
>> helps, but the fast scan patch doesn't.

> FWIW, for me the difference (from HEAD) remains quite a bit larger:
> ...
> That's a lot better than the original 140ms vs 5ms  but your laptop's  2.5 ms vs. 1 ms  is perhaps not
representative.

I also feel that Alexander's patch is still worth pursuing.  What I see
(in an assert-disabled build of HEAD) is that Erik's "slow" query takes
about 2 ms to plan and 5.5 ms to execute, while the "fast" query is about
0.7 ms to plan and 1.3 ms to execute.  With the patch, the "slow" query is
still about 2 ms to plan but only 0.3 ms to execute, while the "fast"
query doesn't change much.  There's a lot of noise in these numbers
(successive executions seem to be bouncing around more than usual today),
but still it looks like the runtime gain is impressive percentagewise.

One thing that's interesting is that the planning time is so much more for
the "slow" query, even though the "fast" one has an additional WHERE
clause that the planner has to think about.  I haven't tried to do perf
measurements to be sure, but my guess is that this has nothing to do with
GIN or pg_trgm directly, but represents the planner's efforts to get a
selectivity estimate for the ~ operator --- that code works much
differently for patterns that define fixed prefixes vs those that don't.

Anyway, the important point here is that I think the planning time
is helping to mask the fact that there's a pretty useful runtime
improvement.
        regards, tom lane

Re: trgm regex index peculiarity

From

Tom Lane

Date:

06 April 2014, 00:52:13

Alexander Korotkov <aekorotkov@gmail.com> writes:
> Next revision of patch is attached. Changes are so:
> 1) Notion "penalty" is used instead of "size".
> 2) We try to reduce total penalty to WISH_TRGM_PENALTY, but restriction is
> MAX_TRGM_COUNT total trigrams count.
> 3) Penalties are assigned to particular color trigram classes. I.e.
> separate penalties for __a, _aa, _a_, aa_. It's based on analysis of
> trigram frequencies in Oscar Wilde writings. We can end up with different
> numbers, but I don't think they will be dramatically different.

Committed with cosmetic improvements (adjusting the comments mostly).

The new whitespace penalties look reasonably sane to me.  I wonder though
if WISH_TRGM_PENALTY is too small --- it seems like this code will tend to
select many fewer trigrams than the old code did.  What testing did you do
that led you to select the specific value of 16?
        regards, tom lane