Thread: Test coverage for external sorting

Test coverage for external sorting

From
Simon Riggs
Date:
PostgreSQL uses two different sorting algorithms, qsort and the external
sorting method in tuplesort.c. There are some possible improvements in
external sorting, so I'd like to check on the solidity of the testing
mechanisms. 

Whether external sorting can be improved upon is a different debate,
though I do have reason to believe it is possible. Initially, I am
interested in proving correctness of any change, though the eventual
goal would be performance.

I'm looking through the regression tests, but can't find anything that
explicitly tests both types of sort.

If you run the regression tests against an existing database instance it
would be possible to run the tests with various values of work_mem so as
to force the sorts to either be internal or external. 

...only problem is that the largest regression test table: tenk doesn't
occupy as much as 1 MB of space in total and work_mem cannot be set
lower than 1 MB.

Could anybody comment on whether the current tests appropriately cover
the correctness of the external sorting algorithms? Will any changes to
the external sort algorithms be appropriately tested by what is there
currently?

Best Regards, Simon Riggs





Re: Test coverage for external sorting

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
> Could anybody comment on whether the current tests appropriately cover
> the correctness of the external sorting algorithms?

It's highly unlikely that the regression tests stress external sorts
much, or that anyone would hold still for making them run long enough
to do so ;-)

It's not hard to create a stress test: just load a bunch of random
numbers into a table and create a b-tree index on it.  To check the
correctness of the sort, you could CLUSTER on the index and then read
out the table to see if it were now in sorted order.

BTW, as for your original question about performance, the current
external sort algorithm is mainly designed to conserve disk space,
not to be as fast as possible.  It could probably be a good bit faster
if we didn't mind taking twice as much space (mainly because the
physical disk access pattern would be a lot less random).  But I know
we will get push-back if we try to revert to doing that.
        regards, tom lane


Re: Test coverage for external sorting

From
Simon Riggs
Date:
On Tue, 2005-04-12 at 10:04 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > Could anybody comment on whether the current tests appropriately cover
> > the correctness of the external sorting algorithms?
> 
> It's highly unlikely that the regression tests stress external sorts
> much, or that anyone would hold still for making them run long enough
> to do so ;-)

OK

> It's not hard to create a stress test: just load a bunch of random
> numbers into a table and create a b-tree index on it.  To check the
> correctness of the sort, you could CLUSTER on the index and then read
> out the table to see if it were now in sorted order.

Just checking. No point starting anything until a test is in place. Yes,
they're fairly straightforward to do - I just didn't want to do it...

> BTW, as for your original question about performance, the current
> external sort algorithm is mainly designed to conserve disk space,
> not to be as fast as possible.  It could probably be a good bit faster
> if we didn't mind taking twice as much space (mainly because the
> physical disk access pattern would be a lot less random).  But I know
> we will get push-back if we try to revert to doing that.

That's roughly what I'm looking into now: just scoping for the time
being. Anything submitted would take the status quo as default and
present other functionality as an option only.

There's also some research into improved replacement selection
algorithms that may soon be submitted/submittable.

Best Regards, Simon Riggs



Re: Test coverage for external sorting

From
Josh Berkus
Date:
Tom,

> BTW, as for your original question about performance, the current
> external sort algorithm is mainly designed to conserve disk space,
> not to be as fast as possible.  It could probably be a good bit faster
> if we didn't mind taking twice as much space (mainly because the
> physical disk access pattern would be a lot less random).  But I know
> we will get push-back if we try to revert to doing that.

We could do it as a compile-time option or a GUC.  I know I wouldn't mind
taking extra disk space if my sorts ran faster on very large tables.
Currently, building a new btree index on a 100GB table takes about 2 hours on
a v40z.  And that's not becuase of I/O; disk bandwidth is less than 20% used.

--
Josh Berkus
Aglio Database Solutions
San Francisco