Home > mailing lists
Re: [PERFORM] A Better External Sort? - Mailing list pgsql-hackers

From	Dann Corbit
Subject	Re: [PERFORM] A Better External Sort?
Date	September 30, 2005 20:52:46
Msg-id	D425483C2C5C9F49B5B7A41F8944154757D145@postal.corporate.connx.com Whole thread Raw
In response to	[PERFORM] A Better External Sort? (Ron Peacetree <rjpeace@earthlink.net>)
List	pgsql-hackers
Tree view
I have perused the tuple sort stuff.

The good:
The documentation of the sort algorithm from Knuth's TAOCP was
beautifully done.  Everyone who writes an algorithm should credit the
original source like this, and also where it deviates.  That was done
very nicely.

The bad:
With random access, tape style merging is not necessary.  A priority
queue based merge will be a lot faster.

The UGLY:
Please, someone, tell me I was hallucinating.  Is that code really
READING AND WRITING THE WHOLE TUPLE with every sort exchange?!  Maybe
there is a layer of abstraction that I am somehow missing.  I just can't
imagine that is what it is really doing.  If (somehow) it really is
doing that, a pointer based sort which forms a permutation based upon
the keys, would be a lot better.

The fundamental algorithm itself could also be improved somewhat.

Here is a {public domain} outline for an introspective quick sort that
Pete Filander and I wrote some time ago and contributed to FastDB.  It
is written as a C++ template, but it will take no effort to make it a
simple C routine.  It assumes that e_type has comparison operators, so
in C you would use a compare function instead.

/*
** Sorting stuff by Dann Corbit and Pete Filandr.
** (dcorbit@connx.com and pfilandr@mindspring.com)
** Use it however you like.
*/

//
//  The insertion sort template is used for small partitions.
//

template < class e_type >
void insertion_sort(e_type * array, size_t nmemb)
{
    e_type temp,
          *last,
          *first,
          *middle;
    if (nmemb > 1) {
        first = middle = 1 + array;
        last = nmemb - 1 + array;
        while (first != last) {
            ++first;
            if ((*(middle) > *(first))) {
                middle = first;
            }
        }
        if ((*(array) > *(middle))) {
            ((void) ((temp) = *(array), *(array) = *(middle), *(middle)
= (temp)));
        }
        ++array;
        while (array != last) {
            first = array++;
            if ((*(first) > *(array))) {
                middle = array;
                temp = *middle;
                do {
                    *middle-- = *first--;
                } while ((*(first) > *(&temp)));
                *middle = temp;
            }
        }
    }
}

//
// The median estimate is used to choose pivots for the quicksort
algorithm
//

template < class e_type >
void median_estimate(e_type * array, size_t n)
{
    e_type          temp;
    long unsigned   lu_seed = 123456789LU;
    const size_t    k = ((lu_seed) = 69069 * (lu_seed) + 362437) % --n;
    ((void) ((temp) = *(array), *(array) = *(array + k), *(array + k) =
(temp)));
    if ((*((array + 1)) > *((array)))) {
        (temp) = *(array + 1);
        if ((*((array + n)) > *((array)))) {
            *(array + 1) = *(array);
            if ((*(&(temp)) > *((array + n)))) {
                *(array) = *(array + n);
                *(array + n) = (temp);
            } else {
                *(array) = (temp);
            }
        } else {
            *(array + 1) = *(array + n);
            *(array + n) = (temp);
        }
    } else {
        if ((*((array)) > *((array + n)))) {
            if ((*((array + 1)) > *((array + n)))) {
                (temp) = *(array + 1);
                *(array + 1) = *(array + n);
                *(array + n) = *(array);
                *(array) = (temp);
            } else {
                ((void) (((temp)) = *((array)), *((array)) = *((array +
n)), *((array + n)) = ((temp))));
            }
        }
    }
}


//
// This is the heart of the quick sort algorithm used here.
// If the sort is going quadratic, we switch to heap sort.
// If the partition is small, we switch to insertion sort.
//

template < class e_type >
void qloop(e_type * array, size_t nmemb, size_t d)
{
    e_type temp,
          *first,
          *last;
    while (nmemb > 50) {
        if (sorted(array, nmemb)) {
            return;
        }
        if (!d--) {
            heapsort(array, nmemb);
            return;
        }
        median_estimate(array, nmemb);
        first = 1 + array;
        last = nmemb - 1 + array;
        do {
            ++first;
        } while ((*(array) > *(first)));
        do {
            --last;
        } while ((*(last) > *(array)));
        while (last > first) {
            ((void) ((temp) = *(last), *(last) = *(first), *(first) =
(temp)));
            do {
                ++first;
            } while ((*(array) > *(first)));
            do {
                --last;
            } while ((*(last) > *(array)));
        }
        ((void) ((temp) = *(array), *(array) = *(last), *(last) =
(temp)));
        qloop(last + 1, nmemb - 1 + array - last, d);
        nmemb = last - array;
    }
    insertion_sort(array, nmemb);
}

//
// This heap sort is better than average because it uses Lamont's heap.
//

template < class e_type >
void heapsort(e_type * array, size_t nmemb)
{
    size_t i,
           child,
           parent;
    e_type temp;
    if (nmemb > 1) {
        i = --nmemb / 2;
        do {
            {
                (parent) = (i);
                (temp) = (array)[(parent)];
                (child) = (parent) * 2;
                while ((nmemb) > (child)) {
                    if ((*((array) + (child) + 1) > *((array) +
(child)))) {
                        ++(child);
                    }
                    if ((*((array) + (child)) > *(&(temp)))) {
                        (array)[(parent)] = (array)[(child)];
                        (parent) = (child);
                        (child) *= 2;
                    } else {
                        --(child);
                        break;
                    }
                }
                if ((nmemb) == (child) && (*((array) + (child)) >
*(&(temp)))) {
                    (array)[(parent)] = (array)[(child)];
                    (parent) = (child);
                }
                (array)[(parent)] = (temp);
            }
        } while (i--);
        ((void) ((temp) = *(array), *(array) = *(array + nmemb), *(array
+ nmemb) = (temp)));
        for (--nmemb; nmemb; --nmemb) {
            {
                (parent) = (0);
                (temp) = (array)[(parent)];
                (child) = (parent) * 2;
                while ((nmemb) > (child)) {
                    if ((*((array) + (child) + 1) > *((array) +
(child)))) {
                        ++(child);
                    }
                    if ((*((array) + (child)) > *(&(temp)))) {
                        (array)[(parent)] = (array)[(child)];
                        (parent) = (child);
                        (child) *= 2;
                    } else {
                        --(child);
                        break;
                    }
                }
                if ((nmemb) == (child) && (*((array) + (child)) >
*(&(temp)))) {
                    (array)[(parent)] = (array)[(child)];
                    (parent) = (child);
                }
                (array)[(parent)] = (temp);
            }
            ((void) ((temp) = *(array), *(array) = *(array + nmemb),
*(array + nmemb) = (temp)));
        }
    }
}

//
// We use this to check to see if a partition is already sorted.
//

template < class e_type >
int sorted(e_type * array, size_t nmemb)
{
    for (--nmemb; nmemb; --nmemb) {
        if ((*(array) > *(array + 1))) {
            return 0;
        }
        ++array;
    }
    return 1;
}

//
// We use this to check to see if a partition is already reverse-sorted.
//

template < class e_type >
int             rev_sorted(e_type * array, size_t nmemb)
{
    for (--nmemb; nmemb; --nmemb) {
        if ((*(array + 1) > *(array))) {
            return 0;
        }
        ++array;
    }
    return 1;
}

//
// We use this to reverse a reverse-sorted partition.
//

template < class e_type >
void rev_array(e_type * array, size_t nmemb)
{
    e_type          temp,
        *end;
    for (end = array + nmemb - 1; end > array; ++array) {
        ((void) ((temp) = *(array), *(array) = *(end), *(end) =
(temp)));
        --end;
    }
}

//
// Introspective quick sort algorithm user entry point.
// You do not need to directly call any other sorting template.
// This sort will perform very well under all circumstances.
//

template < class e_type >
void iqsort(e_type * array, size_t nmemb)
{
    size_t d,
           n;
    if (nmemb > 1 && !sorted(array, nmemb)) {
        if (!rev_sorted(array, nmemb)) {
            n = nmemb / 4;
            d = 2;
            while (n) {
                ++d;
                n /= 2;
            }
            qloop(array, nmemb, 2 * d);
        } else {
            rev_array(array, nmemb);
        }
    }
}

> -----Original Message-----
> From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers-
> owner@postgresql.org] On Behalf Of Jignesh K. Shah
> Sent: Friday, September 30, 2005 1:38 PM
> To: Ron Peacetree
> Cc: Josh Berkus; pgsql-hackers@postgresql.org; pgsql-
> performance@postgresql.org
> Subject: Re: [HACKERS] [PERFORM] A Better External Sort?
>
> I have seen similar performance as Josh and my reasoning is as
follows:
>
> * WAL is the biggest bottleneck with its default size of 16MB. Many
> people hate to recompile the code   to change its default, and
> increasing checkpoint segments help but still there is lot of overhead
> in the rotation of WAL files (Even putting WAL on tmpfs shows that it
is
> still slow). Having an option for bigger size is helpful to a small
> extent percentagewise (and frees up CPU a bit in doing file rotation)
>
> * Growing files: Even though this is OS dependent but it does spend
lot
> of time doing small 8K block increases to grow files. If we can signal
> bigger chunks to grow or "pre-grow"  to expected size of  data files
> that will help a lot in such cases.
>
> * COPY command had restriction but that has been fixed to a large
> extent.(Great job)
>
> But ofcourse I have lost touch with programming and can't begin to
> understand PostgreSQL code to change it myself.
>
> Regards,
> Jignesh
>
>
>
>
> Ron Peacetree wrote:
>
> >That 11MBps was your =bulk load= speed.  If just loading a table
> >is this slow, then there are issues with basic physical IO, not just
> >IO during sort operations.
> >
> >As I said, the obvious candidates are inefficient physical layout
> >and/or flawed IO code.
> >
> >Until the basic IO issues are addressed, we could replace the
> >present sorting code with infinitely fast sorting code and we'd
> >still be scrod performance wise.
> >
> >So why does basic IO suck so badly?
> >
> >Ron
> >
> >
> >-----Original Message-----
> >From: Josh Berkus <josh@agliodbs.com>
> >Sent: Sep 30, 2005 1:23 PM
> >To: Ron Peacetree <rjpeace@earthlink.net>
> >Cc: pgsql-hackers@postgresql.org, pgsql-performance@postgresql.org
> >Subject: Re: [HACKERS] [PERFORM] A Better External Sort?
> >
> >Ron,
> >
> >
> >
> >>Hmmm.
> >>60GB/5400secs= 11MBps.  That's ssllooww.  So the first
> >>problem is evidently our physical layout and/or HD IO layer
> >>sucks.
> >>
> >>
> >
> >Actually, it's much worse than that, because the sort is only dealing
> >with one column.  As I said, monitoring the iostat our top speed was
> >2.2mb/s.
> >
> >--Josh
> >
> >
> >---------------------------(end of
broadcast)---------------------------
> >TIP 1: if posting/reading through Usenet, please send an appropriate
> >       subscribe-nomail command to majordomo@postgresql.org so that
your
> >       message can get through to the mailing list cleanly
> >
> >
>
> ---------------------------(end of
broadcast)---------------------------
> TIP 9: In versions below 8.0, the planner will ignore your desire to
>        choose an index scan if your joining column's datatypes do not
>        match
pgsql-hackers by date:
From: "Jim C. Nasby"
Date: 30 September 2005, 20:48:59
Subject: Re: Query in SQL statement
From: "Jim C. Nasby"
Date: 30 September 2005, 21:00:56
Subject: Re: PCTFree Results
Re: [PERFORM] A Better External Sort? - Mailing list pgsql-hackers

Previous

Next