Re: intarray internals - Mailing list pgsql-hackers

From Volkan YAZICI
Subject Re: intarray internals
Date
Msg-id 20060511082529.GA223@alamut
Whole thread Raw
In response to Re: [GENERAL] intarray internals  (Teodor Sigaev <teodor@sigaev.ru>)
List pgsql-hackers
Hi,

First, thanks so much for your reply.

On May 10 04:01, Teodor Sigaev wrote:
> > Again, in g_int_decompress(), I couldn't figure out the functionality of
> > below lines:
> 
> gist__int_ops use rangeset compression technique, read about in "THE 
> RD-TREE: AN INDEX STRUCTURE FOR SETS", Joseph M. Hellerstein,
> http://www.sai.msu.su/~megera/postgres/gist/papers/rd-tree.ps

Thanks so much for the papers. I read the related section (and will read
whole today or tomorrow).

> * intarray_union.patch.0 - doesn't applied, but make small optimization to 
> reduce number non-unique values. I don't believe that one pass through 
> array with a lot of ifs will be faster than two pass with simple ifs. Did 
> you some tests?

IMHO, the only significant improvement in my proposal about _int_union()
is that this method will visit arrays only once (with extra price of x2
condition checks), while current one will also make a second visit to
arrays to remove duplicates (with small condition checks).

You can be right, maybe it doesn't worth for worrying about. Improvement
(if there's any) will probably be available to see for very long arrays.
(Sorry, no tests for this proposal.)

> * intarray_same.patch.0 - move SORT as you suggest, but don't touch 
> algorithm.
>  1) if (A[0] == B[0] && A[1] == B[1] && ...)
> 
>  2) if (A[0] == B[0] && A[  N] == B[  N] &&
>     A[1] == B[1] && A[N-1] == B[N-1] &&
>     ...)
> 
>   Why are you sure that second a much faster? Did you make tests? Number of 
> comparisons is the same...

Yep, both algorithms have O(n) comparisions in their worst cases. But
for general purposes, AFAICS, second one will perform better. For
instance consider below examples:
[Best case for 2nd algo.]Input    : 1, 2, 3, ..., 6, 7, *91st algo.: O(n)2nd algo.: O(1)
[Worst case for 2nd algo.]Input    : 1, 2, 3, 4, *4, 6, 7, 8, 91st algo.: O(n/2)2nd algo.: O(n)

But as you can see, because of our arrays are sorted, any missing (or
additional) element in the target array will produce a padding in the
end of the array --- assuming that arrays generally don't hold
duplicate values. Therefore, making comparisons for the tail elements
will perform better beucause of the unmatched values caused by padding.

Hope I managed to explain what I try to mean. Actually, IIRC, I saw this
method (both hacks for small sized arrays and comparisons for the tail
elements of a sorted array) in another FOSS project's source code ---
probably PHP, but I'm not sure.

For about testing, if you'd supply suitable inputs there occurs a quite
much performance improve.

> * intarray_sort.patch.0 - doesn't applied. isort() is very often called for 
> already sorted and unique arrays (which comes from index), so it should be 
> fast as possible for sorted arrays.

Uh, sorry. I missed that point.

> As I remember ordered array is a worst 
> case for qsort(). May be, it will be better choice to use mergesort.

I'll investigate alternative methods to sort already sorted arrays.


Regards.


pgsql-hackers by date:

Previous
From: "Zeugswetter Andreas DCP SD"
Date:
Subject: Re: [PERFORM] Big IN() clauses etc : feature proposal
Next
From: Martijn van Oosterhout
Date:
Subject: Re: [PERFORM] Big IN() clauses etc : feature proposal