Thread: Extremely slow intarray index creation and inserts.
Short summary:
 * extremely slow intarray indexes with gist__int_ops
 * gist__intbig_ops seems much faster even with short arrays
 * gin seems much much faster for inserts and creation (and queries)

I was debugging a system with a table that had slowed to the point where
individual inserts were taking well over a second on a practically idle
system.  Dropping a gist index on an intarray column made the problem go
away.  Timing the creation of gist indexes on this 8.3.6 system makes me
think there's something broken with intarray's gist indexes.

This table summarizes some of the times, shown more completely
in a script below.
=================================================================
 create gist index on 10000 =   5 seconds
 create gist index on 20000 =  32 seconds
 create gist index on 30000 =  39 seconds
 create gist index on 40000 = 102 seconds
 create gist index on 70000 = I waited 10 minutes before giving up.

 create gin index on 40000  = 0.7 seconds
 create gist index on 40000 = 5 seconds using gist__intbig_ops

 create gin index on 70000  = 1.0 seconds
 create gist index on 70000 = 9 seconds using gist__intbig_ops
==================================================================

This surprised me for a number of reasons.  The longest array in the
table is 9 elements long, and most are 5 or 6, so I'd have thought the
default ops would have been better than the big_ops.  Secondly, I thought
gin inserts were expected to be slower than gist, but I'm finding them
much faster.

Nothing seems particularly strange about the data.  A dump of an excerpt
of the table can be found at
  http://0ape.com/tmp/int_array.dmp
(Yes, the production table had other columns; but this column alone is
enough to demonstrate the problem.)

Any thoughts what I'm doing wrong?

   Ron

psql output showing the timing follows.

===============================================================================
vm=# create table tmp_intarray_test as select tag_id_array as my_int_array from taggings;
SELECT
vm=# create table tmp_intarray_test_10000 as select * from tmp_intarray_test limit 10000;
SELECT
vm=# create table tmp_intarray_test_20000 as select * from tmp_intarray_test limit 20000;
SELECT
vm=# create table tmp_intarray_test_30000 as select * from tmp_intarray_test limit 30000;
SELECT
vm=# create table tmp_intarray_test_40000 as select * from tmp_intarray_test limit 40000;
SELECT
vm=# \timing
Timing is on.
vm=# create index "gist_10000 using GIST(my_int_array)" on tmp_intarray_test_10000 using GIST (my_int_array);
CREATE INDEX
Time: 5760.050 ms
vm=# create index "gist_20000 using GIST(my_int_array)" on tmp_intarray_test_20000 using GIST (my_int_array);
CREATE INDEX
Time: 32500.911 ms
vm=# create index "gist_30000 using GIST(my_int_array)" on tmp_intarray_test_30000 using GIST (my_int_array);
CREATE INDEX
Time: 39284.031 ms
vm=# create index "gist_40000 using GIST(my_int_array)" on tmp_intarray_test_40000 using GIST (my_int_array);
CREATE INDEX
Time: 102572.780 ms
vm=#
vm=# create index "gin_40000" on tmp_intarray_test_40000 using GIN (my_int_array gin__int_ops);
CREATE INDEX
Time: 696.668 ms
vm=# create index "gist_big_4000" on tmp_intarray_test_40000 using GIST (my_int_array gist__intbig_ops);
CREATE INDEX
Time: 5227.353 ms
vm=#
vm=# \d tmp_intarray_test
 Table "public.tmp_intarray_test"
    Column    |   Type    | Modifiers
--------------+-----------+-----------
 my_int_array | integer[] |

vm=# select max(array_dims(my_int_array)) from tmp_intarray_test_30000;
  max
-------
 [1:9]
(1 row)

Time: 119.607 ms
vm=# select version();
                                      version
-----------------------------------------------------------------------------------
 PostgreSQL 8.3.6 on i686-pc-linux-gnu, compiled by GCC gcc (Debian 4.3.3-1) 4.3.3
(1 row)

Time: 12.169 ms
vm=# create index "gistbig70000" on tmp_intarray_test using GIST (my_int_array gist__intbig_ops);
CREATE INDEX
Time: 9156.886 ms
vm=# create index "gin70000" on tmp_intarray_test using GIN (my_int_array gin__int_ops);
CREATE INDEX
Time: 1060.752 ms
vm=# create index "gist7000" on tmp_intarray_test using GIST (my_int_array gist__int_ops);
[.... it just sits here for 10 minutes or more ....]
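For anyone who wants to reproduce this without downloading the dump, something
along the following lines should build a table with a similar shape: short
integer arrays whose elements are drawn from a large pool of distinct values.
This is only an illustrative sketch, not part of the original report; the
table and index names are made up, and the intarray module must already be
installed.

-- Illustrative reproduction sketch (not from the original report).
-- Builds 40000 rows of 5-element integer arrays whose elements come from
-- a large pool of distinct values, i.e. a high-cardinality element set.
CREATE TABLE tmp_intarray_repro AS
SELECT ARRAY[(random() * 1000000)::int,
             (random() * 1000000)::int,
             (random() * 1000000)::int,
             (random() * 1000000)::int,
             (random() * 1000000)::int] AS my_int_array
FROM generate_series(1, 40000);

\timing
-- Default intarray GiST opclass (the slow case reported above):
CREATE INDEX repro_gist_int    ON tmp_intarray_repro USING gist (my_int_array gist__int_ops);
-- Signature-based GiST opclass and GIN, for comparison:
CREATE INDEX repro_gist_intbig ON tmp_intarray_repro USING gist (my_int_array gist__intbig_ops);
CREATE INDEX repro_gin         ON tmp_intarray_repro USING gin  (my_int_array gin__int_ops);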
Ron Mayer wrote:
> This table summarizes some of the times, shown more completely
> in a script below.
> =================================================================
>  create gist index on 10000 =   5 seconds
>  create gist index on 20000 =  32 seconds
>  create gist index on 30000 =  39 seconds
>  create gist index on 40000 = 102 seconds
>  create gist index on 70000 = I waited 10 minutes before giving up

Finished after 34 minutes.

vm=# create index "gist70000" on tmp_intarray_test using GIST (my_int_array gist__int_ops);
CREATE INDEX
Time: 2069836.856 ms

Is that expected, or does it sound like a bug to take over half an hour
to index 70000 rows of mostly 5- and 6-element integer arrays?

>  create gin index on 40000  = 0.7 seconds
>  create gist index on 40000 = 5 seconds using gist__intbig_ops
>
>  create gin index on 70000  = 1.0 seconds
>  create gist index on 70000 = 9 seconds using gist__intbig_ops
> ==================================================================
>
> This surprised me for a number of reasons.  The longest array in the
> table is 9 elements long, and most are 5 or 6, so I'd have thought the
> default ops would have been better than the big_ops.  Secondly, I thought
> gin inserts were expected to be slower than gist, but I'm finding them
> much faster.
>
> Nothing seems particularly strange about the data.  A dump of an excerpt
> of the table can be found at
>   http://0ape.com/tmp/int_array.dmp
> (Yes, the production table had other columns; but this column alone is
> enough to demonstrate the problem.)
>
> Any thoughts what I'm doing wrong?
>    Ron
Ron Mayer <rm_pg@cheapcomplexdevices.com> writes:
> vm=# create index "gist70000" on tmp_intarray_test using GIST (my_int_array gist__int_ops);
> CREATE INDEX
> Time: 2069836.856 ms

> Is that expected, or does it sound like a bug to take over half an hour
> to index 70000 rows of mostly 5- and 6-element integer arrays?

I poked at this example with oprofile.  It's entirely CPU-bound AFAICT,
and the CPU utilization is approximately

    55%  g_int_compress
    35%  memmove/memcpy (difficult to distinguish these)
     1%  pg_qsort
    <1%  anything else

Probably need to look at reducing the number of calls to g_int_compress
... it must be getting called a whole lot more than once per new index
entry, and I wonder why that should need to be.

            regards, tom lane
Tom Lane wrote:
> Ron Mayer <rm_pg@cheapcomplexdevices.com> writes:
>> vm=# create index "gist70000" on tmp_intarray_test using GIST (my_int_array gist__int_ops);
>> CREATE INDEX
>> Time: 2069836.856 ms
>
>> Is that expected, or does it sound like a bug to take over half an hour
>> to index 70000 rows of mostly 5- and 6-element integer arrays?
>
> I poked at this example with oprofile.  It's entirely CPU-bound AFAICT,

Oleg pointed out to me (off-list, I now see) that it's not totally
unexpected behavior and that I should have been using gist__intbig_ops,
since the "big" refers to the cardinality of the entire set (which was
large in my case) and not to the length of the arrays.

Oleg Bartunov wrote:
OB:> it's not about short or long arrays, it's about small or big
OB:> cardinality of the whole set (the number of unique elements)

I re-read the docs and it still wasn't obvious to me.  A potential docs
patch is attached below.

> and the CPU utilization is approximately
>
>     55%  g_int_compress
>     35%  memmove/memcpy (difficult to distinguish these)
>      1%  pg_qsort
>     <1%  anything else
>
> Probably need to look at reducing the number of calls to g_int_compress
> ... it must be getting called a whole lot more than once per new index
> entry, and I wonder why that should need to be.

Perhaps that's a separate issue, but we're working fine with
gist__intbig_ops for the time being.

Here's a proposed docs patch that makes this more obvious.

*** a/doc/src/sgml/intarray.sgml
--- b/doc/src/sgml/intarray.sgml
***************
*** 239,245 ****
     <literal>gist__int_ops</> (used by default) is suitable for
     small and medium-size arrays, while
     <literal>gist__intbig_ops</> uses a larger signature and is more
!    suitable for indexing large arrays.
    </para>

    <para>
--- 239,247 ----
     <literal>gist__int_ops</> (used by default) is suitable for
     small and medium-size arrays, while
     <literal>gist__intbig_ops</> uses a larger signature and is more
!    suitable for indexing high-cardinality data sets - where there
!    are a large number of unique elements across all rows being
!    indexed.
    </para>

    <para>
Ron Mayer <rm_pg@cheapcomplexdevices.com> writes:
> Oleg Bartunov wrote:
> OB:> it's not about short or long arrays, it's about small or big
> OB:> cardinality of the whole set (the number of unique elements)

> I re-read the docs and it still wasn't obvious to me.  A potential docs
> patch is attached below.

Done, though not in exactly those words.  I wonder though if we can
be less vague about it --- can we suggest a typical cutover point?
Like "use gist__intbig_ops if there are more than about 10,000 distinct
array values"?  Even a rough order of magnitude for where to worry
about this would save a lot of people time.

            regards, tom lane

Index: intarray.sgml
===================================================================
RCS file: /cvsroot/pgsql/doc/src/sgml/intarray.sgml,v
retrieving revision 1.5
retrieving revision 1.6
diff -c -r1.5 -r1.6
*** intarray.sgml   10 Dec 2007 05:32:51 -0000   1.5
--- intarray.sgml   18 Mar 2009 20:18:18 -0000   1.6
***************
*** 237,245 ****
   <para>
    Two GiST index operator classes are provided:
    <literal>gist__int_ops</> (used by default) is suitable for
!   small and medium-size arrays, while
    <literal>gist__intbig_ops</> uses a larger signature and is more
!   suitable for indexing large arrays.
   </para>

   <para>
--- 237,246 ----
   <para>
    Two GiST index operator classes are provided:
    <literal>gist__int_ops</> (used by default) is suitable for
!   small- to medium-size data sets, while
    <literal>gist__intbig_ops</> uses a larger signature and is more
!   suitable for indexing large data sets (i.e., columns containing
!   a large number of distinct array values).
   </para>

   <para>
We usually say about 200 unique values as a limit for gist__int_ops.

On Wed, 18 Mar 2009, Tom Lane wrote:

> Ron Mayer <rm_pg@cheapcomplexdevices.com> writes:
>> Oleg Bartunov wrote:
>> OB:> it's not about short or long arrays, it's about small or big
>> OB:> cardinality of the whole set (the number of unique elements)
>
>> I re-read the docs and it still wasn't obvious to me.  A potential docs
>> patch is attached below.
>
> Done, though not in exactly those words.  I wonder though if we can
> be less vague about it --- can we suggest a typical cutover point?
> Like "use gist__intbig_ops if there are more than about 10,000 distinct
> array values"?  Even a rough order of magnitude for where to worry
> about this would save a lot of people time.
>
>             regards, tom lane

    Regards,
        Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
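To see which side of that limit a given column falls on, one could count the
distinct elements across all of its arrays.  The query below is only an
illustrative sketch, not something from this thread: unnest() assumes
PostgreSQL 8.4 or later, and the table and column names are the ones used in
the test case above.

-- Count how many distinct integers appear anywhere in the column; this is
-- the "cardinality of the whole set" that decides between gist__int_ops
-- and gist__intbig_ops.
SELECT count(DISTINCT e) AS distinct_elements
FROM (SELECT unnest(my_int_array) AS e
      FROM tmp_intarray_test) AS s;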
Oleg Bartunov <oleg@sai.msu.su> writes:
> We usually say about 200 unique values as a limit for
> gist__int_ops.

That seems awfully small ... should we make gist__intbig_ops the default,
or more likely, raise the signature size of both opclasses?  Even at a
crossover point of 10000 I'm not sure that many real-world apps would
bother considering gist__int_ops.

            regards, tom lane
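As a side note well after the fact: in much later PostgreSQL releases (13 and
up), the signature length of gist__intbig_ops became settable per index
through an opclass parameter, so raising it no longer requires changing the
compiled-in default.  The sketch below is hedged accordingly; it applies only
to those newer versions, was not available on the 8.3 system discussed here,
and the index name is made up.

-- Hypothetical example for PostgreSQL 13 or later: gist__intbig_ops accepts
-- a siglen parameter (signature length in bytes), so a larger signature can
-- be chosen per index.
CREATE INDEX tmp_intarray_test_bigsig
    ON tmp_intarray_test
 USING gist (my_int_array gist__intbig_ops (siglen = 64));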
On Thu, 19 Mar 2009, Tom Lane wrote:

> Oleg Bartunov <oleg@sai.msu.su> writes:
>> We usually say about 200 unique values as a limit for
>> gist__int_ops.
>
> That seems awfully small ... should we make gist__intbig_ops the default,
> or more likely, raise the signature size of both opclasses?  Even at a
> crossover point of 10000 I'm not sure that many real-world apps would
> bother considering gist__int_ops.

gist__int_ops doesn't use signatures; it uses range compression, which
is not lossy, but not capacious.  Perhaps that's why we decided to use
it as the default opclass.

    Regards,
        Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
On Thu, 19 Mar 2009, Oleg Bartunov wrote:

> gist__int_ops doesn't use signatures; it uses range compression, which
> is not lossy, but not capacious.

sorry, it's lossy

> Perhaps that's why we decided to use it as the default opclass.

    Regards,
        Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83