On Fri, Jun 21, 2013 at 2:40 PM, Erik Rijkers <er@xs4all.nl> wrote:
On Fri, June 21, 2013 05:25, Tom Lane wrote: > "Erik Rijkers" <er@xs4all.nl> writes: >> In a 112 MB test table (containing random generated text) with a trgm index (gin_trgm_ops), I consistently get these >> timings: >> select txt from azjunk6 where txt ~ '^abcd'; >> 130 ms >> select txt from azjunk6 >> where txt ~ 'abcd' and substr(txt,1,4) = 'abcd'; >> 3 ms > > Hm, could you provide a self-contained test case? >
yes, sorry. I tested on a 1M row table:
#!/bin/sh
# create table: for power in 6; do table=azjunk${power} index=${table}_trgm_re_idx perl -E' sub ss{ join"",@_[ map{rand @_} 1 .. shift ] }; say(ss(80,"a".."g"," ","h".."m"," ","n".."s"," ","t".."z")) for 1 .. 1e'"${power};" \ | psql -aqXc " drop table if exists $table; create table $table(txt text); copy $table from stdin;"; echo "set session maintenance_work_mem='1GB'; create index $index on $table using gin (txt gin_trgm_ops); analyze $table;" | psql -qtAX; done
# test: echo " \\timing on explain analyze select txt from azjunk6 where txt ~ '^abcd'; -- slow (140 ms) explain analyze select txt from azjunk6 where txt ~ 'abcd' and substr(txt,1,4) = 'abcd'; -- fast (5 ms) " | psql -Xqa
Regex '^abcd' will be expanded into trigrams '__a', '_ab', 'abc' and 'bcd'. However trigrams '__a' is much more frequent than '_ab' which in turn is much more frequent than 'abc' and 'bcd'. Ommiting of ^ leads to ommiting of '__a' and '_ab' and that gives so significant speedup.
The problem is that trigram regex engine doesn't know that '__a' and '_ab' is more frequent than another trigrams because we don't know their distribution. However, simple assumption that blank character (in pg_trgm meaning) is much more frequent than others seems to be true for most cases.
Attached patch implementing this assumption. It introduces BLANK_COLOR_SIZE macro which is used in selectColorTrigrams like count of characters in COLOR_BLANK. Another change in this patch is split of MAX_TRGM_COUNT into WISH_TRGM_SIZE and MAX_TRGM_SIZE. The idea is to keep trying to remove color trigrams from graph even when it fits into MAX_TRGM_SIZE, because we are intending to scan less posting lists/trees.