Re: Extremely slow intarray index creation and inserts. - Mailing list pgsql-performance

From Ron Mayer
Subject Re: Extremely slow intarray index creation and inserts.
Msg-id 49C11D6D.8020604@cheapcomplexdevices.com
In response to Re: Extremely slow intarray index creation and inserts.  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Extremely slow intarray index creation and inserts.
List pgsql-performance
Tom Lane wrote:
> Ron Mayer <rm_pg@cheapcomplexdevices.com> writes:
>> vm=# create index "gist70000" on tmp_intarray_test using GIST (my_int_array gist__int_ops);
>> CREATE INDEX
>> Time: 2069836.856 ms
>
>> Is that expected, or does it sound like a bug to take over
>> half an hour to index 70000 rows of mostly 5 and 6-element
>> integer arrays?
>
> I poked at this example with oprofile.  It's entirely CPU-bound AFAICT,

Oleg pointed out to me (off-list I now see) that it's not totally
unexpected behavior and I should have been using gist__intbig_ops,
since the "big" refers to the cardinality of the entire set (which
was large, in my case) and not the length of the arrays.

Oleg Bartunov wrote:
OB:> it's not about short or long arrays, it's about small or big
OB:> cardinality of the whole set (the number of unique elements)
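
For anyone wanting to check which case their own data falls into, here is a
rough sketch of how that cardinality could be measured. It assumes the table
and column from the CREATE INDEX command quoted above, and it assumes
PostgreSQL 8.4 or later, since it relies on unnest(); no threshold for what
counts as "big" is implied.

```sql
-- Number of distinct integers across ALL arrays in the column,
-- i.e. the cardinality of the whole set, not the array length.
SELECT count(DISTINCT elem) AS distinct_elements
FROM (
    SELECT unnest(my_int_array) AS elem
    FROM tmp_intarray_test
) AS s;
```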

I've re-read the docs and it still wasn't obvious to me.  A
potential docs patch is attached below.

> and the CPU utilization is approximately
>
>     55%    g_int_compress
>     35%    memmove/memcpy (difficult to distinguish these)
>      1%    pg_qsort
>     <1%    anything else
>
> Probably need to look at reducing the number of calls to g_int_compress
> ... it must be getting called a whole lot more than once per new index
> entry, and I wonder why that should need to be.

Perhaps that's a separate issue, but we're working
fine with gist__intbig_ops for the time being.
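
For reference, a minimal sketch of the same index built with the other
operator class. The table and column are the ones from the command quoted
at the top; the index name "gist_intbig_test" is made up for illustration,
and no timing is claimed for it here.

```sql
-- Same table and column as the original example, but using the
-- signature-based operator class that Oleg recommended for
-- data sets with many distinct elements.
CREATE INDEX gist_intbig_test
    ON tmp_intarray_test
    USING GIST (my_int_array gist__intbig_ops);
```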



Here's a proposed docs patch that makes this more obvious.

*** a/doc/src/sgml/intarray.sgml
--- b/doc/src/sgml/intarray.sgml
***************
*** 239,245 ****
     <literal>gist__int_ops</> (used by default) is suitable for
     small and medium-size arrays, while
     <literal>gist__intbig_ops</> uses a larger signature and is more
!    suitable for indexing large arrays.
    </para>

    <para>
--- 239,247 ----
     <literal>gist__int_ops</> (used by default) is suitable for
     small and medium-size arrays, while
     <literal>gist__intbig_ops</> uses a larger signature and is more
!    suitable for indexing high-cardinality data sets - where there
!    are a large number of unique elements across all rows being
!    indexed.
    </para>

    <para>

