Thread: Extremely slow intarray index creation and inserts.

Extremely slow intarray index creation and inserts.

From
Ron Mayer
Date:
Short summary:
    * extremely slow intarray indexes with gist__int_ops
    * gist__intbig_ops seems much faster even with short arrays
    * gin seems much much faster for inserts and creation(and queries)


I was debugging a system with a table that slowed to where individual
inserts were taking well over a second on a practically idle system.
Dropping an a gist index on an intarray made the problem go away.

Timing the creation of gist indexes on this 8.3.6 system makes me
think there's something broken with intarray's gist indexes.

This table summarizes some of the times, shown more completely
in a script below.
=================================================================
create gist index on 10000  =   5   seconds
create gist index on 20000  =  32   seconds
create gist index on 30000  =  39   seconds
create gist index on 40000  = 102   seconds
create gist index on 70000  = I waited 10 minutes before giving up.

create gin  index on 40000  =   0.7 seconds
create gist index on 40000  =   5   seconds using gist__intbig_ops

create gin  index on 70000  =   1.0 seconds
create gist index on 70000  =   9   seconds using gist__intbig_ops
==================================================================

This surprised me for a number of reasons.   The longest
array in the table is 9 elements long, and most are 5 or 6
so I'd have thought the default ops would have been better
than the big_ops.  Secondly, I thought gin inserts were expected
to be slower than gist, but I'm finding them much faster.

Nothing seems particular strange about the data.  A dump
of an excerpt of the table can be found at
http://0ape.com/tmp/int_array.dmp
(Yes, the production table had other columns; but this
column alone is enough to demonstrate the problem.)

   Any thoughts what I'm doing wrong?
   Ron

psql output showing the timing follows.

===============================================================================
vm=# create table tmp_intarray_test as select tag_id_array as my_int_array from taggings;
SELECT
vm=# create table tmp_intarray_test_10000 as select * from tmp_intarray_test limit 10000;
SELECT
vm=# create table tmp_intarray_test_20000 as select * from tmp_intarray_test limit 20000;
SELECT
vm=# create table tmp_intarray_test_30000 as select * from tmp_intarray_test limit 30000;
SELECT
vm=# create table tmp_intarray_test_40000 as select * from tmp_intarray_test limit 40000;
SELECT
vm=# \timing
Timing is on.
vm=#
vm=# create index "gist_10000 using GIST(my_int_array)" on tmp_intarray_test_10000 using GIST (my_int_array);
CREATE INDEX
Time: 5760.050 ms
vm=# create index "gist_20000 using GIST(my_int_array)" on tmp_intarray_test_20000 using GIST (my_int_array);
CREATE INDEX
Time: 32500.911 ms
vm=# create index "gist_30000 using GIST(my_int_array)" on tmp_intarray_test_30000 using GIST (my_int_array);
CREATE INDEX
Time: 39284.031 ms
vm=# create index "gist_40000 using GIST(my_int_array)" on tmp_intarray_test_40000 using GIST (my_int_array);
CREATE INDEX
Time: 102572.780 ms
vm=#
vm=#
vm=#
vm=#

vm=#
vm=#
vm=#  create index "gin_40000"     on tmp_intarray_test_40000 using GIN (my_int_array gin__int_ops);
CREATE INDEX
Time: 696.668 ms
vm=#  create index "gist_big_4000" on tmp_intarray_test_40000 using GIST (my_int_array gist__intbig_ops);
CREATE INDEX
Time: 5227.353 ms
vm=#
vm=#
vm=#
vm=# \d tmp_intarray_test
   Table "public.tmp_intarray_test"
    Column    |   Type    | Modifiers
--------------+-----------+-----------
 my_int_array | integer[] |

vm=# select max(array_dims(my_int_array)) from tmp_intarray_test_30000;
  max
-------
 [1:9]
(1 row)

Time: 119.607 ms
vm=#
vm=#
vm=# select version();
                                      version
-----------------------------------------------------------------------------------
 PostgreSQL 8.3.6 on i686-pc-linux-gnu, compiled by GCC gcc (Debian 4.3.3-1) 4.3.3
(1 row)

Time: 12.169 ms

vm=#  create index "gistbig70000" on tmp_intarray_test using GIST (my_int_array gist__intbig_ops);
CREATE INDEX
Time: 9156.886 ms
vm=#  create index "gin70000" on tmp_intarray_test using GIN (my_int_array gin__int_ops);
CREATE INDEX
Time: 1060.752 ms
vm=#  create index "gist7000" on tmp_intarray_test using GIST (my_int_array gist__int_ops);
      [.... it just sits here for 10 minutes or more ....]

Re: Extremely slow intarray index creation and inserts.

From
Ron Mayer
Date:
Ron Mayer wrote:
> This table summarizes some of the times, shown more completely
> in a script below.
> =================================================================
> create gist index on 10000  =   5   seconds
> create gist index on 20000  =  32   seconds
> create gist index on 30000  =  39   seconds
> create gist index on 40000  = 102   seconds
> create gist index on 70000  = I waited 10 minutes before giving up

  Finished after 34 minutes.

vm=# create index "gist70000" on tmp_intarray_test using GIST (my_int_array gist__int_ops);
CREATE INDEX
Time: 2069836.856 ms

Is that expected, or does it sound like a bug to take over
half an hour to index 70000 rows of mostly 5 and 6-element
integer arrays?


> create gin  index on 40000  =   0.7 seconds
> create gist index on 40000  =   5   seconds using gist__intbig_ops
>
> create gin  index on 70000  =   1.0 seconds
> create gist index on 70000  =   9   seconds using gist__intbig_ops
> ==================================================================
>
> This surprised me for a number of reasons.   The longest
> array in the table is 9 elements long, and most are 5 or 6
> so I'd have thought the default ops would have been better
> than the big_ops.  Secondly, I thought gin inserts were expected
> to be slower than gist, but I'm finding them much faster.
>
> Nothing seems particular strange about the data.  A dump
> of an excerpt of the table can be found at
> http://0ape.com/tmp/int_array.dmp
> (Yes, the production table had other columns; but this
> column alone is enough to demonstrate the problem.)
>
>    Any thoughts what I'm doing wrong?
>    Ron
>
> psql output showing the timing follows.
>
> ===============================================================================
> vm=# create table tmp_intarray_test as select tag_id_array as my_int_array from taggings;
> SELECT
> vm=# create table tmp_intarray_test_10000 as select * from tmp_intarray_test limit 10000;
> SELECT
> vm=# create table tmp_intarray_test_20000 as select * from tmp_intarray_test limit 20000;
> SELECT
> vm=# create table tmp_intarray_test_30000 as select * from tmp_intarray_test limit 30000;
> SELECT
> vm=# create table tmp_intarray_test_40000 as select * from tmp_intarray_test limit 40000;
> SELECT
> vm=# \timing
> Timing is on.
> vm=#
> vm=# create index "gist_10000 using GIST(my_int_array)" on tmp_intarray_test_10000 using GIST (my_int_array);
> CREATE INDEX
> Time: 5760.050 ms
> vm=# create index "gist_20000 using GIST(my_int_array)" on tmp_intarray_test_20000 using GIST (my_int_array);
> CREATE INDEX
> Time: 32500.911 ms
> vm=# create index "gist_30000 using GIST(my_int_array)" on tmp_intarray_test_30000 using GIST (my_int_array);
> CREATE INDEX
> Time: 39284.031 ms
> vm=# create index "gist_40000 using GIST(my_int_array)" on tmp_intarray_test_40000 using GIST (my_int_array);
> CREATE INDEX
> Time: 102572.780 ms
> vm=#
> vm=#
> vm=#
> vm=#
>
> vm=#
> vm=#
> vm=#  create index "gin_40000"     on tmp_intarray_test_40000 using GIN (my_int_array gin__int_ops);
> CREATE INDEX
> Time: 696.668 ms
> vm=#  create index "gist_big_4000" on tmp_intarray_test_40000 using GIST (my_int_array gist__intbig_ops);
> CREATE INDEX
> Time: 5227.353 ms
> vm=#
> vm=#
> vm=#
> vm=# \d tmp_intarray_test
>    Table "public.tmp_intarray_test"
>     Column    |   Type    | Modifiers
> --------------+-----------+-----------
>  my_int_array | integer[] |
>
> vm=# select max(array_dims(my_int_array)) from tmp_intarray_test_30000;
>   max
> -------
>  [1:9]
> (1 row)
>
> Time: 119.607 ms
> vm=#
> vm=#
> vm=# select version();
>                                       version
> -----------------------------------------------------------------------------------
>  PostgreSQL 8.3.6 on i686-pc-linux-gnu, compiled by GCC gcc (Debian 4.3.3-1) 4.3.3
> (1 row)
>
> Time: 12.169 ms
>
> vm=#  create index "gistbig70000" on tmp_intarray_test using GIST (my_int_array gist__intbig_ops);
> CREATE INDEX
> Time: 9156.886 ms
> vm=#  create index "gin70000" on tmp_intarray_test using GIN (my_int_array gin__int_ops);
> CREATE INDEX
> Time: 1060.752 ms
> vm=#  create index "gist7000" on tmp_intarray_test using GIST (my_int_array gist__int_ops);
>       [.... it just sits here for 10 minutes or more ....]
>


Re: Extremely slow intarray index creation and inserts.

From
Tom Lane
Date:
Ron Mayer <rm_pg@cheapcomplexdevices.com> writes:
> vm=# create index "gist70000" on tmp_intarray_test using GIST (my_int_array gist__int_ops);
> CREATE INDEX
> Time: 2069836.856 ms

> Is that expected, or does it sound like a bug to take over
> half an hour to index 70000 rows of mostly 5 and 6-element
> integer arrays?

I poked at this example with oprofile.  It's entirely CPU-bound AFAICT,
and the CPU utilization is approximately

    55%    g_int_compress
    35%    memmove/memcpy (difficult to distinguish these)
     1%    pg_qsort
    <1%    anything else

Probably need to look at reducing the number of calls to g_int_compress
... it must be getting called a whole lot more than once per new index
entry, and I wonder why that should need to be.

            regards, tom lane

Re: Extremely slow intarray index creation and inserts.

From
Ron Mayer
Date:
Tom Lane wrote:
> Ron Mayer <rm_pg@cheapcomplexdevices.com> writes:
>> vm=# create index "gist70000" on tmp_intarray_test using GIST (my_int_array gist__int_ops);
>> CREATE INDEX
>> Time: 2069836.856 ms
>
>> Is that expected, or does it sound like a bug to take over
>> half an hour to index 70000 rows of mostly 5 and 6-element
>> integer arrays?
>
> I poked at this example with oprofile.  It's entirely CPU-bound AFAICT,

Oleg pointed out to me (off-list I now see) that it's not totally
unexpected behavior and I should have been using gist__intbig_ops,
since the "big" refers to the cardinality of the entire set (which
was large, in my case) and not the length of the arrays.

Oleg Bartunov wrote:
OB:> it's not about short or long arrays, it's about small or big
OB:> cardinality of the whole set (the number of unique elements)

I'm re-reading the docs and still wasn't obvious to me.   A
potential docs patch is attached below.

> and the CPU utilization is approximately
>
>     55%    g_int_compress
>     35%    memmove/memcpy (difficult to distinguish these)
>      1%    pg_qsort
>     <1%    anything else
>
> Probably need to look at reducing the number of calls to g_int_compress
> ... it must be getting called a whole lot more than once per new index
> entry, and I wonder why that should need to be.

Perhaps that's a separate issue, but we're working
fine with gist__intbig_ops for the time being.



Here's a proposed docs patch that makes this more obvious.

*** a/doc/src/sgml/intarray.sgml
--- b/doc/src/sgml/intarray.sgml
***************
*** 239,245 ****
     <literal>gist__int_ops</> (used by default) is suitable for
     small and medium-size arrays, while
     <literal>gist__intbig_ops</> uses a larger signature and is more
!    suitable for indexing large arrays.
    </para>

    <para>
--- 239,247 ----
     <literal>gist__int_ops</> (used by default) is suitable for
     small and medium-size arrays, while
     <literal>gist__intbig_ops</> uses a larger signature and is more
!    suitable for indexing high-cardinality data sets - where there
!    are a large number of unique elements across all rows being
!    indexed.
    </para>

    <para>


Re: Extremely slow intarray index creation and inserts.

From
Tom Lane
Date:
Ron Mayer <rm_pg@cheapcomplexdevices.com> writes:
> Oleg Bartunov wrote:
> OB:> it's not about short or long arrays, it's about small or big
> OB:> cardinality of the whole set (the number of unique elements)

> I'm re-reading the docs and still wasn't obvious to me.   A
> potential docs patch is attached below.

Done, though not in exactly those words.  I wonder though if we can
be less vague about it --- can we suggest a typical cutover point?
Like "use gist__intbig_ops if there are more than about 10,000 distinct
array values"?  Even a rough order of magnitude for where to worry
about this would save a lot of people time.

            regards, tom lane

Index: intarray.sgml
===================================================================
RCS file: /cvsroot/pgsql/doc/src/sgml/intarray.sgml,v
retrieving revision 1.5
retrieving revision 1.6
diff -c -r1.5 -r1.6
*** intarray.sgml    10 Dec 2007 05:32:51 -0000    1.5
--- intarray.sgml    18 Mar 2009 20:18:18 -0000    1.6
***************
*** 237,245 ****
    <para>
     Two GiST index operator classes are provided:
     <literal>gist__int_ops</> (used by default) is suitable for
!    small and medium-size arrays, while
     <literal>gist__intbig_ops</> uses a larger signature and is more
!    suitable for indexing large arrays.
    </para>

    <para>
--- 237,246 ----
    <para>
     Two GiST index operator classes are provided:
     <literal>gist__int_ops</> (used by default) is suitable for
!    small- to medium-size data sets, while
     <literal>gist__intbig_ops</> uses a larger signature and is more
!    suitable for indexing large data sets (i.e., columns containing
!    a large number of distinct array values).
    </para>

    <para>

Re: Extremely slow intarray index creation and inserts.

From
Oleg Bartunov
Date:
We usually say about 200 unique values as a limit for
gist_int_ops.

On Wed, 18 Mar 2009, Tom Lane wrote:

> Ron Mayer <rm_pg@cheapcomplexdevices.com> writes:
>> Oleg Bartunov wrote:
>> OB:> it's not about short or long arrays, it's about small or big
>> OB:> cardinality of the whole set (the number of unique elements)
>
>> I'm re-reading the docs and still wasn't obvious to me.   A
>> potential docs patch is attached below.
>
> Done, though not in exactly those words.  I wonder though if we can
> be less vague about it --- can we suggest a typical cutover point?
> Like "use gist__intbig_ops if there are more than about 10,000 distinct
> array values"?  Even a rough order of magnitude for where to worry
> about this would save a lot of people time.
>
>             regards, tom lane
>
> Index: intarray.sgml
> ===================================================================
> RCS file: /cvsroot/pgsql/doc/src/sgml/intarray.sgml,v
> retrieving revision 1.5
> retrieving revision 1.6
> diff -c -r1.5 -r1.6
> *** intarray.sgml    10 Dec 2007 05:32:51 -0000    1.5
> --- intarray.sgml    18 Mar 2009 20:18:18 -0000    1.6
> ***************
> *** 237,245 ****
>    <para>
>     Two GiST index operator classes are provided:
>     <literal>gist__int_ops</> (used by default) is suitable for
> !    small and medium-size arrays, while
>     <literal>gist__intbig_ops</> uses a larger signature and is more
> !    suitable for indexing large arrays.
>    </para>
>
>    <para>
> --- 237,246 ----
>    <para>
>     Two GiST index operator classes are provided:
>     <literal>gist__int_ops</> (used by default) is suitable for
> !    small- to medium-size data sets, while
>     <literal>gist__intbig_ops</> uses a larger signature and is more
> !    suitable for indexing large data sets (i.e., columns containing
> !    a large number of distinct array values).
>    </para>
>
>    <para>
>

     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

Re: Extremely slow intarray index creation and inserts.

From
Tom Lane
Date:
Oleg Bartunov <oleg@sai.msu.su> writes:
> We usually say about 200 unique values as a limit for
> gist_int_ops.

That seems awfully small ... should we make gist_intbig_ops the default,
or more likely, raise the signature size of both opclasses?  Even at a
crossover point of 10000 I'm not sure that many real-world apps would
bother considering gist_int_ops.

            regards, tom lane

Re: Extremely slow intarray index creation and inserts.

From
Oleg Bartunov
Date:
On Thu, 19 Mar 2009, Tom Lane wrote:

> Oleg Bartunov <oleg@sai.msu.su> writes:
>> We usually say about 200 unique values as a limit for
>> gist_int_ops.
>
> That seems awfully small ... should we make gist_intbig_ops the default,
> or more likely, raise the signature size of both opclasses?  Even at a
> crossover point of 10000 I'm not sure that many real-world apps would
> bother considering gist_int_ops.

gist__int_ops doesn't uses signatures, it uses range compression, which
is not lossy, but not capacious. Perhaps, that's why we decided to use it
as default opclass.


     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

Re: Extremely slow intarray index creation and inserts.

From
Oleg Bartunov
Date:
On Thu, 19 Mar 2009, Oleg Bartunov wrote:

> On Thu, 19 Mar 2009, Tom Lane wrote:
>
>> Oleg Bartunov <oleg@sai.msu.su> writes:
>>> We usually say about 200 unique values as a limit for
>>> gist_int_ops.
>>
>> That seems awfully small ... should we make gist_intbig_ops the default,
>> or more likely, raise the signature size of both opclasses?  Even at a
>> crossover point of 10000 I'm not sure that many real-world apps would
>> bother considering gist_int_ops.
>
> gist__int_ops doesn't uses signatures, it uses range compression, which
> is not lossy, but not capacious. Perhaps, that's why we decided to use it as

sorry, it's lossy

> default opclass.
>
>
>     Regards,
>         Oleg
> _____________________________________________________________
> Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
> Sternberg Astronomical Institute, Moscow University, Russia
> Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
> phone: +007(495)939-16-83, +007(495)939-23-83
>
>

     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83