Thread: has anyone looked at burstsort ?

has anyone looked at burstsort ?

From
Hannu Krosing
Date:
has anyone looked at burstsort
https://sourceforge.net/projects/burstsort

they claim that "Copy-Burstsort is a sorting algorithm for strings that
is cache-efficient. Burstsort and its variants are much faster than
Quicksort and Radixsort especially on large datasets. Copy-Burstsort
works best for sorting short strings such as genomes and words"

if the speed claim is true, and there are no other bad effects, like for
example very bad memory use,  we could try to talk the author into
allowing us to include it under BSD licens (currently it is GPL)

----------------
Hannu




Re: has anyone looked at burstsort ?

From
Tom Lane
Date:
Hannu Krosing <hannu@skype.net> writes:
> has anyone looked at burstsort
> https://sourceforge.net/projects/burstsort
> they claim that "Copy-Burstsort is a sorting algorithm for strings that
> is cache-efficient.

If its reason for living is cache efficiency, then I wonder

(1) how well does it work on data types other than strings, or for
that matter even strings when the comparison function is strcoll()
rather than memcmp().  A heavyweight comparison function is likely
to play hob with any assumptions about memory access patterns.

(2) does it scale to deal with problems larger than memory (ie,
how do you make it spill to disk).

> if the speed claim is true, and there are no other bad effects, like for
> example very bad memory use,  we could try to talk the author into
> allowing us to include it under BSD licens (currently it is GPL)

If we wrote our own implementation ... and realistically, fitting it
into postgres would likely require rewriting much of the code anyway
... then we don't have to worry about the copyright on someone else's
implementation.  What we do have to worry about is whose patent(s)
we might be infringing.
        regards, tom lane


Re: has anyone looked at burstsort ?

From
Gregory Stark
Date:
"Hannu Krosing" <hannu@skype.net> writes:

> has anyone looked at burstsort
> https://sourceforge.net/projects/burstsort
>
> they claim that "Copy-Burstsort is a sorting algorithm for strings that
> is cache-efficient. Burstsort and its variants are much faster than
> Quicksort and Radixsort especially on large datasets. Copy-Burstsort
> works best for sorting short strings such as genomes and words"
>
> if the speed claim is true, and there are no other bad effects, like for
> example very bad memory use,  we could try to talk the author into
> allowing us to include it under BSD licens (currently it is GPL)

The actual implementation isn't very interesting. It's C++ code written for
Visual C++ and it's pretty primitive, has no comments, and only supports
sorting ASCII strings containing only 26 letters...

In any case we can't make use of it for sorting strings unless we want to
special-case text/varchar in the tuplesort code. That might be something to
consider at some point but right now there's still a lot to do with the
generic tuplesort that requires only a comparison operator.

On the other hand it does seem like it might be possible to adapt this
algorithm to sorting multi-column keys. 

The key to the algorithm is that it uses a trie to bin rows with common
leading prefixes together. This avoids performing redundant comparisons
between those columns later.

If you have long keys with each column in the key being relatively low
cardinality you could get some mileage out of doing something like this
algorithm does on a column-by-column basis rather than on a
character-by-character basis.

But can you see any way to adapt this to a disk-sort?

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com



Re: has anyone looked at burstsort ?

From
Tom Lane
Date:
Gregory Stark <stark@enterprisedb.com> writes:
> The key to the algorithm is that it uses a trie to bin rows with common
> leading prefixes together. This avoids performing redundant comparisons
> between those columns later.

Interesting, but doesn't that make it utterly useless for sorting in
non-C locales?

I'm not that thrilled with introducing datatype-specific paths into the
sort code anyway; seems like a maintenance nightmare.
        regards, tom lane


Re: has anyone looked at burstsort ?

From
Alvaro Herrera
Date:
Tom Lane wrote:
> Gregory Stark <stark@enterprisedb.com> writes:
> > The key to the algorithm is that it uses a trie to bin rows with common
> > leading prefixes together. This avoids performing redundant comparisons
> > between those columns later.
> 
> Interesting, but doesn't that make it utterly useless for sorting in
> non-C locales?

It seems so.  But on the other hand it might prove helpful for
multicolumn sorts (which removes the "datatype-specific" objection).

-- 
Alvaro Herrera       Valdivia, Chile   ICBM: S 39º 49' 18.1", W 73º 13' 56.4"
"No hay cielo posible sin hundir nuestras raícesen la profundidad de la tierra"                        (Malucha Pinto)


Re: has anyone looked at burstsort ?

From
Martijn van Oosterhout
Date:
On Fri, Jul 13, 2007 at 03:29:16PM +0100, Gregory Stark wrote:
> The key to the algorithm is that it uses a trie to bin rows with common
> leading prefixes together. This avoids performing redundant comparisons
> between those columns later.

Sounds like a variation on the idea suggested before, which is to allow
each datatype to provide an xfrm function that returns a signed
integer, which would allow you to compare values without invoking the
actual datatype comparison function in most cases. That approach would
work on any datatype, not just strings.

Whether it's more efficient than the current method is another question
entirely.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.