Thread: Re: [GENERAL] intarray internals

Re: [GENERAL] intarray internals

From
Teodor Sigaev
Date:
Hi!

Sorry for delay, I hadn't access to Internet.
> [3]> Again, in g_int_decompress(), I couldn't figure out the functionality of> below lines:

gist__int_ops use rangeset compression technique, read about in "THE RD-TREE: AN 
INDEX STRUCTURE FOR SETS", Joseph M. Hellerstein,
http://www.sai.msu.su/~megera/postgres/gist/papers/rd-tree.ps


About your patches:
* intarray_contains.patch.0 - applied

* intarray_union.patch.0 - doesn't applied, but make small optimization to 
reduce number non-unique values. I don't believe that one pass through array 
with a lot of ifs will be faster than two pass with simple ifs. Did you some tests?

* intarray_same.patch.0 - move SORT as you suggest, but don't touch algorithm. 1) if (A[0] == B[0] && A[1] == B[1] &&
...)
 2) if (A[0] == B[0] && A[  N] == B[  N] &&    A[1] == B[1] && A[N-1] == B[N-1] &&    ...)
  Why are you sure that second a much faster? Did you make tests? Number of 
comparisons is the same...

* intarray_sort.patch.0 - doesn't applied. isort() is very often called for 
already sorted and unique arrays (which comes from index), so it should be fast 
as possible for sorted arrays. As I remember ordered array is a worst case for 
qsort(). May be, it will be better choice to use mergesort.




-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
 


Re: intarray internals

From
Volkan YAZICI
Date:
Hi,

First, thanks so much for your reply.

On May 10 04:01, Teodor Sigaev wrote:
> > Again, in g_int_decompress(), I couldn't figure out the functionality of
> > below lines:
> 
> gist__int_ops use rangeset compression technique, read about in "THE 
> RD-TREE: AN INDEX STRUCTURE FOR SETS", Joseph M. Hellerstein,
> http://www.sai.msu.su/~megera/postgres/gist/papers/rd-tree.ps

Thanks so much for the papers. I read the related section (and will read
whole today or tomorrow).

> * intarray_union.patch.0 - doesn't applied, but make small optimization to 
> reduce number non-unique values. I don't believe that one pass through 
> array with a lot of ifs will be faster than two pass with simple ifs. Did 
> you some tests?

IMHO, the only significant improvement in my proposal about _int_union()
is that this method will visit arrays only once (with extra price of x2
condition checks), while current one will also make a second visit to
arrays to remove duplicates (with small condition checks).

You can be right, maybe it doesn't worth for worrying about. Improvement
(if there's any) will probably be available to see for very long arrays.
(Sorry, no tests for this proposal.)

> * intarray_same.patch.0 - move SORT as you suggest, but don't touch 
> algorithm.
>  1) if (A[0] == B[0] && A[1] == B[1] && ...)
> 
>  2) if (A[0] == B[0] && A[  N] == B[  N] &&
>     A[1] == B[1] && A[N-1] == B[N-1] &&
>     ...)
> 
>   Why are you sure that second a much faster? Did you make tests? Number of 
> comparisons is the same...

Yep, both algorithms have O(n) comparisions in their worst cases. But
for general purposes, AFAICS, second one will perform better. For
instance consider below examples:
[Best case for 2nd algo.]Input    : 1, 2, 3, ..., 6, 7, *91st algo.: O(n)2nd algo.: O(1)
[Worst case for 2nd algo.]Input    : 1, 2, 3, 4, *4, 6, 7, 8, 91st algo.: O(n/2)2nd algo.: O(n)

But as you can see, because of our arrays are sorted, any missing (or
additional) element in the target array will produce a padding in the
end of the array --- assuming that arrays generally don't hold
duplicate values. Therefore, making comparisons for the tail elements
will perform better beucause of the unmatched values caused by padding.

Hope I managed to explain what I try to mean. Actually, IIRC, I saw this
method (both hacks for small sized arrays and comparisons for the tail
elements of a sorted array) in another FOSS project's source code ---
probably PHP, but I'm not sure.

For about testing, if you'd supply suitable inputs there occurs a quite
much performance improve.

> * intarray_sort.patch.0 - doesn't applied. isort() is very often called for 
> already sorted and unique arrays (which comes from index), so it should be 
> fast as possible for sorted arrays.

Uh, sorry. I missed that point.

> As I remember ordered array is a worst 
> case for qsort(). May be, it will be better choice to use mergesort.

I'll investigate alternative methods to sort already sorted arrays.


Regards.