Thread: Test coverage for external sorting
PostgreSQL uses two different sorting algorithms: qsort, and the external sort method in tuplesort.c. There are some possible improvements to external sorting, so I'd like to check on the solidity of the testing mechanisms. Whether external sorting can be improved upon is a separate debate, though I do have reason to believe it is possible. Initially I am interested in proving the correctness of any change; the eventual goal would be performance.

I'm looking through the regression tests, but can't find anything that explicitly tests both types of sort. If you run the regression tests against an existing database instance, it is possible to run them with various values of work_mem so as to force the sorts to be either internal or external. The only problem is that the largest regression test table, tenk1, occupies less than 1 MB in total, and work_mem cannot be set lower than 1 MB.

Could anybody comment on whether the current tests appropriately cover the correctness of the external sorting algorithms? Will any changes to the external sort algorithms be appropriately tested by what is there currently?

Best Regards,

Simon Riggs
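A minimal sketch of the work_mem idea above (on 8.0-era servers the value is an integer in kB; with the stock regression data this still sorts in memory, since tenk1 falls under the 1 MB floor):

    SET work_mem = 1024;                  -- kB; the 1 MB minimum noted above
    SELECT * FROM tenk1 ORDER BY unique1; -- still an internal sort: tenk1 < 1 MB
    RESET work_mem;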
Simon Riggs <simon@2ndquadrant.com> writes:
> Could anybody comment on whether the current tests appropriately cover
> the correctness of the external sorting algorithms?

It's highly unlikely that the regression tests stress external sorts much, or that anyone would hold still for making them run long enough to do so ;-)

It's not hard to create a stress test: just load a bunch of random numbers into a table and create a b-tree index on it. To check the correctness of the sort, you could CLUSTER on the index and then read out the table to see if it were now in sorted order.

BTW, as for your original question about performance, the current external sort algorithm is mainly designed to conserve disk space, not to be as fast as possible. It could probably be a good bit faster if we didn't mind taking twice as much space (mainly because the physical disk access pattern would be a lot less random). But I know we will get push-back if we try to revert to doing that.

			regards, tom lane
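In SQL, Tom's recipe might look like the sketch below. The table name, row count, and the physical-order check are illustrative only, and the syntax assumes a modern server (in the 8.0 era, CLUSTER was spelled "CLUSTER indexname ON tablename" and window functions did not yet exist):

    -- Load random numbers; the row count just needs to comfortably
    -- exceed work_mem so the index build spills to an external sort.
    CREATE TABLE sort_stress AS
        SELECT (random() * 1000000)::int AS n
        FROM generate_series(1, 1000000) g;

    -- Building the b-tree index exercises tuplesort.c.
    CREATE INDEX sort_stress_n ON sort_stress (n);
    CLUSTER sort_stress USING sort_stress_n;

    -- After CLUSTER the heap should be in index order. Keep the scan
    -- in physical heap order (start at block 0, no parallel workers
    -- interleaving rows) and count adjacent out-of-order pairs;
    -- the expected result is zero.
    SET synchronize_seqscans = off;
    SET max_parallel_workers_per_gather = 0;
    SELECT count(*) AS out_of_order
    FROM (SELECT n, lag(n) OVER () AS prev FROM sort_stress) s
    WHERE s.prev > s.n;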
On Tue, 2005-04-12 at 10:04 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > Could anybody comment on whether the current tests appropriately cover
> > the correctness of the external sorting algorithms?
>
> It's highly unlikely that the regression tests stress external sorts
> much, or that anyone would hold still for making them run long enough
> to do so ;-)

OK.

> It's not hard to create a stress test: just load a bunch of random
> numbers into a table and create a b-tree index on it. To check the
> correctness of the sort, you could CLUSTER on the index and then read
> out the table to see if it were now in sorted order.

Just checking; there's no point starting anything until a test is in place. Yes, they're fairly straightforward to do - I just didn't want to do it...

> BTW, as for your original question about performance, the current
> external sort algorithm is mainly designed to conserve disk space,
> not to be as fast as possible. It could probably be a good bit faster
> if we didn't mind taking twice as much space (mainly because the
> physical disk access pattern would be a lot less random). But I know
> we will get push-back if we try to revert to doing that.

That's roughly what I'm looking into now: just scoping for the time being. Anything submitted would take the status quo as the default and present other behaviour as an option only. There's also some research into improved replacement selection algorithms that may soon be submittable.

Best Regards,

Simon Riggs
Tom,

> BTW, as for your original question about performance, the current
> external sort algorithm is mainly designed to conserve disk space,
> not to be as fast as possible. It could probably be a good bit faster
> if we didn't mind taking twice as much space (mainly because the
> physical disk access pattern would be a lot less random). But I know
> we will get push-back if we try to revert to doing that.

We could do it as a compile-time option or a GUC. I know I wouldn't mind taking extra disk space if my sorts ran faster on very large tables. Currently, building a new btree index on a 100 GB table takes about 2 hours on a v40z, and that's not because of I/O: disk bandwidth is less than 20% used.

--
Josh Berkus
Aglio Database Solutions
San Francisco
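For illustration only, the per-session GUC Josh is proposing might be used like this; the name sort_favor_speed is entirely hypothetical and no such setting exists in PostgreSQL:

    -- Hypothetical knob: trade extra temporary disk space for a
    -- faster external sort, enabled per-session before a big build.
    SET sort_favor_speed = on;           -- hypothetical name, not a real GUC
    CREATE INDEX bigtable_idx ON bigtable (indexed_col);
    RESET sort_favor_speed;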