On 09/15/2014 06:28 PM, Alexander Korotkov wrote:
> Hackers,
>
> some GIN opclasses uses collation-aware comparisons while they don't need
> to do especially collation-aware comparison. Examples are text[] and hstore
> opclasses.
Hmm. It would be nice to use the index for inequality searches, at least
on text[]. We don't support that currently, but it would require
collation-awareness.
> Depending on collation this may make them a much slower.
>
> See example.
>
> # show lc_collate ;
> lc_collate
> ─────────────
> ru_RU.UTF-8
> (1 row)
>
> # create table test as (select array_agg(i::text) from
> generate_series(1,1000000) i group by (i-1)/10);
> SELECT 100000
>
> # create index test_idx on test using gin(array_agg);
> CREATE INDEX
> Time: *26930,423 ms*
>
> # create index test_idx2 on test using gin(array_agg collate "C");
> CREATE INDEX
> Time: *5143,682 ms*
>
> Index creation with collation "ru_RU.UTF-8" is 5 times slower while
> collation has absolutely no effect on index functionality.
It occurs to me that practically all of those comparisons happen when we
populate the red-black Tree, during the index build. The purpose of the
red-black tree is to collect identical keys together, but there is
actually no requirement that the order of the red-black tree matches the
order of the index. It also isn't strictly required that it recognizes
equal keys as equal. The only requirement is that it doesn't incorrectly
put two keys that are equal according to the compare-function, into two
different nodes.
We could therefore use plain memcmp() to compare the Datums while
building the red-black tree. Keys that are bit-wise equal are surely
considered as equal by the compare-function. That makes the index build
a lot faster. With the attached quick patch:
postgres=# create index test_idx on test using gin(array_agg );
CREATE INDEX
Time: 880.620 ms
This is on my laptop. Without the patch, that takes about 4.7 seconds
with the C locale, so this is much faster than even using the C locale.
- Heikki