Thread: Sorting Improvements for 8.4

Sorting Improvements for 8.4

From
Simon Riggs
Date:
Just wanted to review a few thoughts and ideas around improving external
sorts, as recently encouraged to do by Jim Nasby. 

Current issues/opportunities are these:

ISSUES

a) Memory is always in short supply, so using what we have more
effectively is going to be welcome.

b) Heap sort has a reasonably strong anti-memory effect: beyond a
certain point, giving the sort more memory makes it slower, so there is
an optimum amount of memory for any sort. This shows itself as CPU time
increasing during run forming, making that stage of the sort CPU bound.

c) Many sorts are performed prior to aggregation. It might be possible
to aggregate prior to writing to disk, as a way of reducing the overall
I/O cost. Benefit would occur when the total CPU cost was the same no matter
when aggregation occurred; that would not apply in all cases, so we
would need to sense when benefit was possible.

d) Generally reducing the I/O cost of sorting may help the merging
stages of a sort.


SOLUTIONS

The ideas that Greg Stark, Jim Nasby, Heikki and myself have discussed
to date were the following:

1. Sort I/O Compression
2. Aggregation during Sort
3. Memory Pools
4. Dynamic Heap Management
5. Dynamic Run Handling

I've added (5) to the list as well, which hasn't yet been discussed.

1. SORT I/O COMPRESSION

This idea is not dead yet, it just needs a full set of tests to confirm
that there is benefit in all cases. If there's not benefit in all cases,
we may be able to work out which cases those are, so we know when to use
it.
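
As a strawman, the compression itself could be as simple as running each
block of run data through zlib on its way to and from the tape. A toy
sketch (assuming zlib; this is not tuplesort.c code, and the block layout
is invented):

/* Minimal sketch of compressing a sort-tape block with zlib before it is
 * written out; purely illustrative. */
#include <stdio.h>
#include <string.h>
#include <zlib.h>

#define BLCKSZ 8192

int main(void)
{
    unsigned char block[BLCKSZ];
    unsigned char packed[BLCKSZ + 64];      /* worst case is slightly larger */
    unsigned char unpacked[BLCKSZ];
    uLongf packed_len = sizeof(packed);
    uLongf unpacked_len = sizeof(unpacked);

    memset(block, 'x', sizeof(block));      /* stand-in for run tuples */

    /* Compress the block; level 1 favours speed, which matters for sorts. */
    if (compress2(packed, &packed_len, block, BLCKSZ, 1) != Z_OK)
        return 1;
    printf("8192 -> %lu bytes written to tape\n", (unsigned long) packed_len);

    /* On read-back, decompress before handing tuples to the merge. */
    if (uncompress(unpacked, &unpacked_len, packed, packed_len) != Z_OK)
        return 1;
    return memcmp(block, unpacked, BLCKSZ) != 0;
}

The tests would then mostly be about where the CPU cost of compress2()
outweighs the I/O saved.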


2. AGGREGATION DURING SORT

Many sorts are preliminary steps before aggregation. Aggregation during
run forming would potentially reduce the size of the heap and the number of
comparisons. For many types of aggregate this would not theoretically
increase the number of ops, since sum(), avg(), min() and max() are all
insensitive to the order of their inputs. We would probably need to add
another option to Aggregate Functions to indicate the possibility of
calculating the aggregate in this way, since some aggregates might rely
on the current situation that they expect all their inputs at once in
sorted order. (Windowed aggregates are unlikely to be this way).
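
To make that concrete, here is a toy C sketch (all names invented, sum()
only) of folding the transition step into the point where the heap emits
tuples in key order, so that consecutive equal keys never reach the tape:

/* Illustrative sketch only: folding sum() into run formation. */
#include <stdio.h>
#include <stdbool.h>

typedef struct { int key; double agg; } RunTuple;

static RunTuple pending;            /* aggregate being accumulated */
static bool     have_pending = false;

/* Called once per tuple as the heap emits them in sorted order. */
static void emit_to_run(int key, double val, void (*write_tuple)(RunTuple))
{
    if (have_pending && pending.key == key)
    {
        pending.agg += val;         /* transition step for sum() */
        return;                     /* nothing written: I/O saved */
    }
    if (have_pending)
        write_tuple(pending);       /* key changed: flush previous group */
    pending.key = key;
    pending.agg = val;
    have_pending = true;
}

static void to_tape(RunTuple t) { printf("run tuple: key=%d sum=%g\n", t.key, t.agg); }

int main(void)
{
    int    keys[] = {1, 1, 2, 2, 2, 5};
    double vals[] = {10, 5, 1, 1, 1, 7};
    for (int i = 0; i < 6; i++)
        emit_to_run(keys[i], vals[i], to_tape);
    if (have_pending) to_tape(pending);   /* final flush */
    return 0;
}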


3. MEMORY POOLS

Solving a) could be done by sensible management and allocation of
resources. Discussed before, so not rehashed here.


4. DYNAMIC HEAP MANAGEMENT

The size of the active heap required to produce the fewest number of
runs varies as the sort progresses. For example, sorting an already
sorted input needs a trivial heap size. 

Larger heap sizes simply avoid forming more runs, which is not
necessarily a bad thing. More runs only become bad things when we go
beyond our ability to perform a single final merge (see Dynamic Run
Handling below).

Smaller heap sizes reduce the number of comparisons required, plus
increase the L2+ cache efficiencies. Those two things are the cause of
the anti-memory effect.

Because of b), optimising the size of the heap could potentially be a
good thing. This can make a considerable difference for nearly sorted
data (measurements required...).

When we have M amount of memory available to us, we don't start by using
it all. We start with m memory and only increase up to M if required.
Runs are built with memory set at m. If a tuple arrives that would force
the formation of a new run, we assess:

i) do we care if another run is formed? Use our knowledge of the likely
amount of data coming our way, compared with number of runs formed so
far and see if we really care. If we don't care, allow the new run to be
formed and carry on with just heap size of m. (see Dynamic Run Handling
later).

ii) if we do care about number of runs, then allow the heap to grow by
increments up to the full size of M. Increments would be at least x2 and
possibly x4. That way we always have work space to rearrange the heap.
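
In toy C form, the policy in (i) and (ii) might look like this (invented
thresholds; the real test would come from the forecast of the data volume
still to arrive):

/* Toy simulation: stay at m while extra runs are acceptable, otherwise
 * double toward M. */
#include <stdio.h>
#include <stdbool.h>
#include <stddef.h>

static size_t heap_mem = 1 << 20;        /* m: start small (1MB) */
static const size_t max_mem = 16 << 20;  /* M: the hard limit (work_mem) */
static int runs_formed = 1;

/* Stand-in for the real test: would one more run still leave us able to
 * do a single final merge, given the forecast of remaining input? */
static bool another_run_is_fine(void) { return runs_formed < 8; }

/* Called when an incoming tuple sorts before the current heap top and
 * would therefore have to start a new run at the current heap size. */
static void tuple_would_start_new_run(void)
{
    if (another_run_is_fine())
    {
        runs_formed++;                   /* (i) don't care: stay at m */
        printf("run %d formed at %zu bytes\n", runs_formed, heap_mem);
    }
    else if (heap_mem < max_mem)
    {
        heap_mem *= 2;                   /* (ii) x2 leaves rearrange room */
        if (heap_mem > max_mem)
            heap_mem = max_mem;
        printf("heap grown to %zu bytes\n", heap_mem);
    }
    else
        runs_formed++;                   /* already at M: accept the run */
}

int main(void)
{
    for (int i = 0; i < 14; i++)
        tuple_would_start_new_run();
    return 0;
}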

All of this dances too cleverly around the exact technique and potential
costs of rearranging the heap. That is not to be ignored and is the next
task in evaluating and accepting/dismissing this potential technique.

In combination with memory pooling this technique might also allow
memory to be better distributed to other users.


5. DYNAMIC RUN HANDLING (in Final Merge)

Another way of addressing a) is to simply make better use of memory
itself. Let's look at that in more detail:

Number of runs that can be merged at once is currently fixed, based upon
available memory. This has the underlying assumption that all runs will
be concurrently active during final merging, which may not always be
true.

If we have random data then almost all runs will overlap with all other
runs, i.e. the min and max values are sufficiently wide that the runs do
all overlap. In many cases, data arrives in somewhat sorted order, e.g.
financial data is fairly regular with some late payers but not many, and
those trail off with a fairly tight decay. In the somewhat sorted case
we find that the actual overlap is less than total, so there are many
later runs that don't overlap the earlier ones. In the best case we
would find run 1 and 2 overlap, runs 2 and 3 overlap, then 3 and 4
overlap.

This is also the point where I suggest breaking away from Knuth
completely. All of the main algorithms described by Knuth are tape
sorts. A run is written to a particular tape and then stays there until
"moved" to another tape. That means we have to get super-clever about
how runs should be written and formed (see Knuth). If we realise that
the runs aren't fixed to particular tapes they are all just independent
runs, we can radically rethink sorting. There is no need to implement
Cascade Sort, but we do need to rethink merging from the ground up. (All
of which is a relief, because Knuth et al are definitely smarter than
me, but I've got disks and lots of memory and those guys had tapes.)

If we track the min and max values for each run, when run building is
finished we will be able to build a merging plan that allows us to be
smart about the runs we should bring together. We start with the run
with the lowest min value, as well as all runs that overlap that run.
When that run is exhausted we move to the next lowest and at that point
start merging all runs that overlap that one. 
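
A much-simplified sketch of building such a plan from per-run bounds
(invented data; transitively overlapping runs end up in the same merge
group, while disjoint runs can simply be concatenated in min order):

/* Sketch of a merge plan built from per-run (min,max) bounds. */
#include <stdio.h>
#include <stdlib.h>

typedef struct { int id; int min; int max; } Run;

static int by_min(const void *a, const void *b)
{
    return ((const Run *) a)->min - ((const Run *) b)->min;
}

int main(void)
{
    Run runs[] = {
        {1, 0, 40}, {2, 35, 80}, {3, 75, 120},   /* chain of overlaps */
        {4, 200, 260}, {5, 500, 900}             /* disjoint: free to append */
    };
    int n = sizeof(runs) / sizeof(runs[0]);

    qsort(runs, n, sizeof(Run), by_min);

    /* Greedy pass: start at the lowest min, pull in every run that
     * overlaps the group's current high-water mark. */
    int group_max = runs[0].max;
    printf("merge group: run %d", runs[0].id);
    for (int i = 1; i < n; i++)
    {
        if (runs[i].min <= group_max)
            printf(" + run %d", runs[i].id);    /* overlaps: merge together */
        else
        {
            printf("\nmerge group: run %d", runs[i].id);  /* new group */
            group_max = runs[i].max;
            continue;
        }
        if (runs[i].max > group_max)
            group_max = runs[i].max;
    }
    printf("\n");
    return 0;
}

Here runs 1-3 must be merged together, but runs 4 and 5 cost nothing.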

This then means we may be able to begin final merging with more runs
than the current cut-off. It's possible that we could merge an arbitrarily
large number of runs in the final merge with fixed memory. If we *do* need to
merge we can work out which runs should be our best pre-merge
candidates, based upon how big they are and which other runs they
overlap. (That's much better than being forced to merge tapes 2, 7 and
17 because some bizarre math says so (see Knuth).)

Anyway, claiming to have found a better way than Knuth makes me feel a
little nervous, so some searching questions on this are very welcome.

Interestingly, if we combine this technique with dynamic heap management
we may be able to allow a very large number of efficiently written runs
to form without it causing any merging.

mac_man recently noted the possibility that some runs don't overlap at
all and so can be merged for free. That's true, though it doesn't actually
improve on the basic idea here, which is building a merge plan after runs
have been formed, with an eye on minimizing and potentially eliminating
the merge phase.

There are probably some typos or thinkos above, so go easy on me Greg!
They aren't there because I want to skim over anything.

I'm not likely to get a chance to do all of this in the near future, so
documenting it now should help others to carry things forward.

--  Simon Riggs 2ndQuadrant  http://www.2ndQuadrant.com



Re: Sorting Improvements for 8.4

From
Sam Mason
Date:
On Tue, Nov 27, 2007 at 06:03:46PM +0000, Simon Riggs wrote:
> Just wanted to review a few thoughts and ideas around improving external
> sorts, as recently encouraged to do by Jim Nasby. 

Is there any way of PG knowing that having an index on a subset of the
sorted columns is sometimes a win?  For example, if we have:
 CREATE TABLE foo (
   a INTEGER NOT NULL PRIMARY KEY,
   b INTEGER NOT NULL,
   c INTEGER
 );

and we request:
 SELECT * FROM foo ORDER BY a,b LIMIT 10;

then it may be a win to do smaller sorts for each value of "a", rather
than one big sort after all the data has been pulled out.  Obviously,
it would depend on the distribution of "a", large numbers of distinct
values for "a" being good, and a small number being bad.

I think this would help in a number of other situations as well, but
that's just the most obvious case.

 Sam


Re: Sorting Improvements for 8.4

From
Jeff Davis
Date:
On Tue, 2007-11-27 at 18:03 +0000, Simon Riggs wrote:
> 5. DYNAMIC RUN HANDLING (in Final Merge)
> 
> Another way of addressing a) is to simply make better use of memory
> itself. Let's look at that in more detail:
> 
> Number of runs that can be merged at once is currently fixed, based upon
> available memory. This has the underlying assumption that all runs will
> be concurrently active during final merging, which may not always be
> true.
> 
> If we have random data then almost all runs will overlap with all other
> runs, i.e. the min and max values are sufficiently wide that the runs do
> all overlap. In many cases, data arrives in somewhat sorted order, e.g.
> financial data is fairly regular with some late payers but not many, and
> those trail off with a fairly tight decay. In the somewhat sorted case
> we find that the actual overlap is less than total, so there are many
> later runs that don't overlap the earlier ones. In the best case we
> would find run 1 and 2 overlap, runs 2 and 3 overlap, then 3 and 4
> overlap.

I have spoken with Len Shapiro, a professor at Portland State
University, regarding sorting before.

He suggests that PostgreSQL should implement forecasting, which is
similar to what you're describing. Forecasting does not require that
entire runs are disjoint, it works by tracking the maximum values from
the last block read from every run. This allows you to know which run
you will need more blocks from the soonest.
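
In toy C form the rule is tiny (illustrative data only):

/* Forecasting: always refill from the run whose last-read block has the
 * lowest maximum key, since that run must run dry first. */
#include <stdio.h>

#define NRUNS 4

int main(void)
{
    /* max key seen in the last block read from each run */
    int last_block_max[NRUNS] = {42, 17, 99, 60};

    int next = 0;
    for (int i = 1; i < NRUNS; i++)
        if (last_block_max[i] < last_block_max[next])
            next = i;

    /* run 1 (max 17) will exhaust its buffered tuples soonest */
    printf("preread next from run %d\n", next);
    return 0;
}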

I'm still looking into the problem to understand it better, but the
algorithm is in Knuth Vol 3.

I can look at it in more detail, but have you already looked into this
idea? Is there a reason we don't do this currently?

Regards,
Jeff Davis



Re: Sorting Improvements for 8.4

From
Simon Riggs
Date:
On Fri, 2007-11-30 at 12:07 -0800, Jeff Davis wrote: 
> On Tue, 2007-11-27 at 18:03 +0000, Simon Riggs wrote:
> > 5. DYNAMIC RUN HANDLING (in Final Merge)
> > 
> > Another way of addressing a) is to simply make better use of memory
> > itself. Let's look at that in more detail:
> > 
> > Number of runs that can be merged at once is currently fixed, based upon
> > available memory. This has the underlying assumption that all runs will
> > be concurrently active during final merging, which may not always be
> > true.
> > 
> > If we have random data then almost all runs will overlap with all other
> > runs, i.e. the min and max values are sufficiently wide that the runs do
> > all overlap. In many cases, data arrives in somewhat sorted order, e.g.
> > financial data is fairly regular with some late payers but not many, and
> > those trail off with a fairly tight decay. In the somewhat sorted case
> > we find that the actual overlap is less than total, so there are many
> > later runs that don't overlap the earlier ones. In the best case we
> > would find run 1 and 2 overlap, runs 2 and 3 overlap, then 3 and 4
> > overlap.
> 
> I have spoken with Len Shapiro, a professor at Portland State
> University, regarding sorting before.
> 
> He suggests that PostgreSQL should implement forecasting, which is
> similar to what you're describing. Forecasting does not require that
> entire runs are disjoint, it works by tracking the maximum values from
> the last block read from every run. This allows you to know which run
> you will need more blocks from the soonest.
> 
> I'm still looking into the problem to understand it better, but the
> algorithm is in Knuth Vol 3.
> 
> I can look at it in more detail, but have you already looked into this
> idea? Is there a reason we don't do this currently?

Interesting, I hadn't read that part.

Knuth's Algorithm F covers how to do a P-way merge using 2P + 2 buffers.
My ideas cover how to do a P-way merge when you don't have enough memory
for that many buffers. 

The current sort code makes two assumptions, amongst others:

1. minimizing number of runs is always worth it
2. there is a single fixed maximum size of P, depending upon memory

I'm challenging both of those. Only runs that overlap need to be merged
simultaneously, so if the runs aren't overlapping then it's OK to allow
more runs to be formed. If it's OK to allow more runs, then reducing heap
size to allow better CPU efficiency is possible.

So Algorithm F is somewhat orthogonal to what I've proposed, though
maybe still interesting. What we do now is fairly close, but you should
look at the code in tuplesort.c and logtape.c to see how well it
matches. That might lead to an increase in the limit of the number of
concurrent runs mergeable at any one time.

--  Simon Riggs 2ndQuadrant  http://www.2ndQuadrant.com



Re: Sorting Improvements for 8.4

From
Jeff Davis
Date:
On Mon, 2007-12-03 at 11:51 +0000, Simon Riggs wrote:
> So Algorithm F is somewhat orthogonal to what I've proposed, though
> maybe still interesting. What we do now is fairly close, but you should
> look at the code in tuplesort.c and logtape.c to see how well it
> matches. That might lead to an increase in the limit of the number of
> concurrent runs mergeable at any one time.
> 

tuplesort.c:
* When merging runs, we use a heap containing just the frontmost tuple from
* each source run; we repeatedly output the smallest tuple and insert the
* next tuple from its source tape (if any).  When the heap empties, the merge
* is complete.  The basic merge algorithm thus needs very little memory ---
* only M tuples for an M-way merge, and M is constrained to a small number.
* However, we can still make good use of our full workMem allocation by
* pre-reading additional tuples from each source tape.  Without prereading,
* our access pattern to the temporary file would be very erratic; on average
* we'd read one block from each of M source tapes during the same time that
* we're writing M blocks to the output tape, so there is no sequentiality of
* access at all, defeating the read-ahead methods used by most Unix kernels.
* Worse, the output tape gets written into a very random sequence of blocks
* of the temp file, ensuring that things will be even worse when it comes
* time to read that tape.  A straightforward merge pass thus ends up doing a
* lot of waiting for disk seeks.  We can improve matters by prereading from
* each source tape sequentially, loading about workMem/M bytes from each tape
* in turn.  Then we run the merge algorithm, writing but not reading until
* one of the preloaded tuple series runs out.  Then we switch back to preread
* mode, fill memory again, and repeat.  This approach helps to localize both
* read and write accesses.

The idea of prefetching, as I understand it, is that we don't blindly
preread workMem/M bytes from each of M tapes; instead we predict
which tapes we will need tuples from next through forecasting.

If I understand correctly, we just keep track of the maximum value of
the last block read from each run, and then always read from the run in
which the last block read has the lowest maximum.

It seems as if this would allow a variable number of runs to be merged
at once, but if the data really *is* random, we'd want it to degrade
gracefully to something resembling the current implementation.

I'm being somewhat vague here because I haven't taken the time to
really understand it. If you think this idea has potential I will look
into it in more detail.

Regards,
Jeff Davis





Re: Sorting Improvements for 8.4

From
Simon Riggs
Date:
On Mon, 2007-12-03 at 10:32 -0800, Jeff Davis wrote:

> If I understand correctly, we just keep track of the maximum value of
> the last block read from each run, and then always read from the run in
> which the last block read has the lowest maximum.

Yep, sounds like Algorithm F

> It seems as if this would allow a variable number of runs to be merged
> at once, but if the data really *is* random, we'd want it to degrade
> gracefully to something resembling the current implementation.

If we also keep track of the endpoints of runs that we haven't yet read
from, then yes that would link my ideas with Algorithm F, so we just
have a single implementation. (F++ ?)

Probably easiest to store the endpoint tuples directly, with some sane
limits for when we have very large tuples.

You'll still need to do run-level forecasting as I had proposed to tell
whether you need to do any intermediate merging prior to the final
merge. So the two sets of ideas can't be brought together completely.

> I'm being somewhat vague here because I haven't taken the time to
> really understand it. If you think this idea has potential I will look
> into it in more detail.

Yes, F++ sounds like it will use memory more effectively than we do
currently. That's likely to improve performance when the number of runs
approaches the limit for the size of work_mem. So this will improve
external sorts with too small memory allocations, but it won't do
anything about sorts with too large a memory allocation. That's probably
the order of importance for tackling sort performance, so that's good.

Probably best to test with 
- 1M - 4M work_mem, so we see the full benefit of any improvements in
memory utilisation in a typical context
- number of runs is nearly at limit for memory
- total sort is very large, so we see real I/O issues starkly

You'll need to instrument things carefully so you can tell how many runs
are being merged at any one time and how that affects elapsed time/row.

--  Simon Riggs 2ndQuadrant  http://www.2ndQuadrant.com



Re: Sorting Improvements for 8.4

From
Gregory Stark
Date:
"Simon Riggs" <simon@2ndquadrant.com> writes:

> On Mon, 2007-12-03 at 10:32 -0800, Jeff Davis wrote:
>
>> If I understand correctly, we just keep track of the maximum value of
>> the last block read from each run, and then always read from the run in
>> which the last block read has the lowest maximum.

So it sounds like the use case where this is the biggest win would be a
situation where you have presorted input which has been sliced up. So for
example sorting by "zip code" in a table which was clustered by city. The
alphabetic order of the cities isn't correlated to the results but all the zip
codes for a city are in a contiguous block somewhere in the output.

In such a case after doing a single pass we would have a bunch of tapes each
of which corresponded to a single city and was able to completely reorder the
zip codes in that city to be ordered. So the desired results would be, for
example, all the tuples from tape 17 (NYC) followed by all the tuples from
tape 3 (Buffalo) followed by all the tuples from tape 1 (Albuquerque), etc.

We currently preread an equal amount from each tape and then would empty all
the preread tuples from tape 17, refill them, preread them again, repeat until
tape 17 is empty, then move on to tape 3. All the preread tuples except
those from the currently active tape sit completely idle.

I think the way to do what you're proposing is to preread one tuple from each
tape, then when one preread bunch is emptied refill it with twice as many and
repeat. In this case you would end up with nearly all of workmem full of
tuples from NYC until you're done with NYC. That would increase the preread
block size by a factor of 20 here.
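
In toy C form, that doubling policy might look like this (invented names;
not the current tuplesort.c behaviour):

/* Each refill doubles, so whichever tape the merge keeps draining soon
 * dominates workMem. */
#include <stdio.h>

typedef struct { int buffered; int refill; } TapeBuf;

static void on_buffer_empty(TapeBuf *t)
{
    t->refill *= 2;              /* double the preread for the busy tape */
    t->buffered = t->refill;     /* real code would cap this at workMem */
}

int main(void)
{
    TapeBuf hot = {1, 1};        /* start with a single preread tuple */
    for (int i = 0; i < 5; i++)
    {
        on_buffer_empty(&hot);   /* merge drained this tape again */
        printf("refill with %d tuples\n", hot.refill);
    }
    return 0;                    /* prints 2, 4, 8, 16, 32 */
}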

So the question is just how many seeks are we doing during sorting. If we're
doing 0.1% seeks and 99.9% sequential i/o then eliminating that 0.1% entirely
(which we can't do) isn't going to speed up seeking all that much. If we're
doing 20% seeks and can get that down to 10% it might be worthwhile.

I'm not sure where the idea of keeping the current bounds of the input tapes
comes into it. We only preread when we run out of tuples anyways and then we
don't really have a choice about which tape we want to preread from. And it's
a good thing too since maintaining such a list of bounds and finding the
lowest or highest would mean maintaining a second heap which would basically
double the cpu cost of sorting.

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com Ask me about EnterpriseDB's 24x7 Postgres support!


Re: Sorting Improvements for 8.4

From
Simon Riggs
Date:
On Mon, 2007-12-03 at 20:40 +0000, Gregory Stark wrote:

> I think the way to do what you're proposing is...

Don't understand that. Algorithm F covers that already, doesn't it?

> So the question is just how many seeks are we doing during sorting. If we're
> doing 0.1% seeks and 99.9% sequential i/o then eliminating that 0.1% entirely
> (which we can't do) isn't going to speed up seeking all that much. If we're
> doing 20% seeks and can get that down to 10% it might be worthwhile.

The buffer size at max tapes is an optimum - a trade off between
avoiding intermediate merging and merging efficiently. Freeing more
memory is definitely going to help in the case of low work_mem and lots
of runs.

You're right that there is a limit to the benefit you can get. I wrote a
patch in 2005/6 to optimise the memory usage when there were few runs
and lots of memory. I still think there's value in that.

> I'm not sure where the idea of keeping the current bounds of the input tapes
> comes into it. We only preread when we run out of tuples anyways and then we
> don't really have a choice about which tape we want to preread from.

You have to decide whether to perform intermediate merges or whether you
can do everything at the final merge. Otherwise you can't merge more
runs than you have buffers for, since you'd at some point freeze up and
not be able to read further input.

> And it's
> a good thing too since maintaining such a list of bounds and finding the
> lowest or highest would mean maintaining a second heap which would basically
> double the cpu cost of sorting.

I think you're not understanding me.

You only need to record the lowest or highest when a run
completes/starts. When all runs have been written we then have a table
of the highest and lowest values for each run. We then scan that to see
whether we can perform merging in one pass, or if not what kind of
intermediate merging is required. We keep the merge plan in memory and
then follow it. So probably very small % of total sort cost, though
might save you doing intermediate merges with huge costs.

--  Simon Riggs 2ndQuadrant  http://www.2ndQuadrant.com



Re: Sorting Improvements for 8.4

From
Gregory Stark
Date:
"Simon Riggs" <simon@2ndquadrant.com> writes:

> The buffer size at max tapes is an optimum - a trade off between
> avoiding intermediate merging and merging efficiently. Freeing more
> memory is definitely going to help in the case of low work_mem and lots
> of runs.

I can't follow these abstract arguments. That's why I tried to spell out a
concrete example.

> I think you're not understanding me.
>
> You only need to record the lowest or highest when a run
> completes/starts. When all runs have been written we then have a table
> of the highest and lowest values for each run. We then scan that to see
> whether we can perform merging in one pass, or if not what kind of
> intermediate merging is required. We keep the merge plan in memory and
> then follow it. So probably very small % of total sort cost, though
> might save you doing intermediate merges with huge costs.

Ok, that's a very different concept than what I was thinking.

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com Ask me about EnterpriseDB's On-Demand Production
Tuning


Re: Sorting Improvements for 8.4

From
Jeff Davis
Date:
On Mon, 2007-12-03 at 20:40 +0000, Gregory Stark wrote:
> So the question is just how many seeks are we doing during sorting. If we're
> doing 0.1% seeks and 99.9% sequential i/o then eliminating that 0.1% entirely
> (which we can't do) isn't going to speed up seeking all that much. If we're
> doing 20% seeks and can get that down to 10% it might be worthwhile.

It's not just about eliminating seeks, it's about being able to merge
more runs at one time.

If you are merging 10 runs at once, and only two of those runs overlap
and the rest contain much greater values, you might be spending 99% of the
time in sequential I/O. 

But the point is, we're wasting the memory holding those other 8 runs
(wasting 80% of the memory you're using), so we really could be
merging a lot more than 10 runs at once. This might eliminate stages
from the merge process.

My point is just that "how many seeks are we doing" is not the only
question. We could be doing 99% sequential I/O and still make huge wins.

In reality, of course, the runs aren't going to be disjoint completely,
but they may be partially disjoint. That's where forecasting comes in:
you preread from the tapes you will actually need tuples from soonest.

Regards,
Jeff Davis



Re: Sorting Improvements for 8.4

From
Jeff Davis
Date:
On Mon, 2007-12-03 at 20:40 +0000, Gregory Stark wrote:
> I'm not sure where the idea of keeping the current bounds of the input tapes
> comes into it. We only preread when we run out of tuples anyways and then we
> don't really have a choice about which tape we want to preread from. And it's
> a good thing too since maintaining such a list of bounds and finding the
> lowest or highest would mean maintaining a second heap which would basically
> double the cpu cost of sorting.
> 

You're only keeping track of the maximum value for each run, which
should be cheap to track. The only time it changes is when you're
reading more data from that run, in which case it increases.

The tradeoff that's happening right now is: we want to merge many runs
at once because it reduces the number of merge phases, but the problem
is that it increases the seeking because we read one block from one run,
then one block from another run, etc., especially if the input is
random.

If we reduce the number of runs, then we can preread more efficiently.
See:

tuplesort.c:

* as sorted runs, we can eliminate any repeated I/O at all.  In the current
* code we determine the number of tapes M on the basis of workMem: we want
* workMem/M to be large enough that we read a fair amount of data each time
* we preread from a tape, so as to maintain the locality of access described
* above.  Nonetheless, with large workMem we can have many tapes.

So, for workMem/M to be "large enough", M has to be small enough. But a
small M means we have to do more merge phases, which is expensive.

Forecasting improves this tradeoff. It no longer _blindly_
prereads from each tape; it uses information that it already has (the
max value of the last block read from each run) to determine the runs
from which we need tuples the soonest.

Then, it prereads the _correct_ data.

Regards,
Jeff Davis




Re: Sorting Improvements for 8.4

From
Ron Mayer
Date:
Has anyone looked into sorting algorithms that could use
more than one CPU or core at a time?

Benchmarks I see[1][2] suggest that sorting is an area that
improves greatly with multiple processors and even with
multi-threading on a single core processor.
  "For 1-processor and 2-threads (1p2t), the algorithm sorts  the relation about 48% faster than the single-threaded
version with a speedup of 31% during the quicksort and 58% during the  mergesort. The dual-processor (2p2t) version
providesan even  faster total speedup of 86% over the single-threaded version  with a speedup of 60% during the
quicksortand 100% during  the merge sort."       [from the acm paper on link 2 below]
 

PS: Yeah, I know multi-threading is a hot-button on these
lists; but sorting seems a relatively isolated part of the code
and I'd wonder if it'd be isolate-able enough that multiple
CPUs could be used there.

[1]
http://www.cs.cmu.edu/~damon2005/damonpdf/4%20best%20paper%20-%20multithreaded%20architectures%20and%20the%20sort%20benchmark.pdf
[2]
http://delivery.acm.org/10.1145/1120000/1114254/DaMoN_103.pdf?key1=1114254&key2=5713023711&coll=&dl=ACM&CFID=15151515&CFTOKEN=6184618




Re: Sorting Improvements for 8.4

From
Dimitri Fontaine
Date:
Hi,

On Tuesday 18 December 2007, Ron Mayer wrote:
> Has anyone looked into sorting algorithms that could use
> more than one CPU or core at a time?
[...]
> PS: Yeah, I know multi-threading is a hot-button on these
> lists; but sorting seems a relatively isolated part of the code
> and I'd wonder if it'd be isolate-able enough that multiple
> CPUs could be used there.

And before that objection to multi-threading implementation and portability
concerns arise, what about using a coroutine BSD-licenced portable
implementation such as Protothreads to have backend code use several CPUs at a
time? http://www.sics.se/~adam/pt/

With such a tool, would it be possible to think about producer/consumer
parallel executions for sorting, aggregate nodes or other parts of the
executor?

Hope this helps, regards,
--
dim

Re: Sorting Improvements for 8.4

From
Simon Riggs
Date:
On Mon, 2007-12-17 at 16:34 -0800, Ron Mayer wrote:

> PS: Yeah, I know multi-threading is a hot-button on these
> lists; but sorting seems a relatively isolated part of the code
> and I'd wonder if it'd be isolate-able enough that multiple
> CPUs could be used there.

I'm not sure multi-threading is the issue you think it is. Threads are, but
only for architectural reasons. Using multiple processes to complete a
task seems very sensible to me.

Yeh, sorting is isolated enough to try out some of those ideas on. I was
unaware of the work on finding medians, so that's a good way of dividing
the workloads for parallelism.

--  Simon Riggs 2ndQuadrant  http://www.2ndQuadrant.com



Re: Sorting Improvements for 8.4

From
Jeff Davis
Date:
On Tue, 2007-12-18 at 09:31 +0000, Simon Riggs wrote:
> On Mon, 2007-12-17 at 16:34 -0800, Ron Mayer wrote:
> 
> > PS: Yeah, I know multi-threading is a hot-button on these
> > lists; but sorting seems a relatively isolated part of the code
> > and I'd wonder if it'd be isolate-able enough that multiple
> > CPUs could be used there.
> 
> I'm not sure multi-threading is the issue you think. Threads is, but
> only for architectural reasons. Using multiple processes to complete a
> task seems very sensible to me.

My first thought would be that we would need a new executor node (e.g.
"ParallelSort") that would only be chosen when the cost of the sort is
large enough to outweigh other factors (such as process creation time,
dividing available work_mem, and any necessary IPC).

It seems to me the simplest way to do it would be to allow each sub
process to allocate work_mem/P where P is the degree of parallelization.
However, that somewhat works against our schemes for dynamic run
handling and forecasting, and may lead to more random reading from disk.
Any other scheme I can think of would involve more IPC, which might make
the idea just too complex.

Regards,
Jeff Davis



Re: Sorting Improvements for 8.4

From
Mark Mielke
Date:
Jeff Davis wrote:
> My first thought would be that we would need a new executor node (e.g.
> "ParallelSort") that would only be chosen when the cost of the sort is
> large enough to outweigh other factors (such as process creation time,
> dividing available work_mem, and any necessary IPC).
>
> It seems to me the simplest way to do it would be to allow each sub
> process to allocate work_mem/P where P is the degree of parallelization.
> However, that somewhat works against our schemes for dynamic run
> handling and forecasting, and may lead to more random reading from disk.
> Any other scheme I can think of would involve more IPC, which might make
> the idea just too complex.
>   
I am curious - what algorithms exist to efficiently do a parallel sort? 
Do you mean if sorting 1 million items, it is possible to separate this 
into  2 sets of 500 thousand each, execute them in separate threads 
(with task administration and synchronization overhead), combine the
results, and complete the task in significantly less time than doing it 
in one thread? I am skeptical that this is possible, and suspect that 
the overall efficiency of the system would go down even if the 
throughput of a single execution increases.

Or do you mean being able to perform parts of the query plan fully in 
parallel? If this, then one would need a lot more than ParallelSort...

Cheers,
mark

-- 
Mark Mielke <mark@mielke.cc>


Re: Sorting Improvements for 8.4

From
"Michał Zaborowski"
Date:
2007/12/19, Mark Mielke <mark@mark.mielke.cc>:
> Jeff Davis wrote:
> > My first thought would be that we would need a new executor node (e.g.
> > "ParallelSort") that would only be chosen when the cost of the sort is
> > large enough to outweigh other factors (such as process creation time,
> > dividing available work_mem, and any necessary IPC).
> >
> > It seems to me the simplest way to do it would be to allow each sub
> > process to allocate work_mem/P where P is the degree of parallelization.
> > However, that somewhat works against our schemes for dynamic run
> > handling and forecasting, and may lead to more random reading from disk.
> > Any other scheme I can think of would involve more IPC, which might make
> > the idea just too complex.
> >
> I am curious - what algorithms exist to efficiently do a parallel sort?
> Do you mean if sorting 1 million items, it is possible to separate this
> into  2 sets of 500 thousand each, execute them in separate threads
> (with task administration and synchronization overhead), combine the
> results, and complete the task in significantly less time than doing it
> in one thread? I am skeptical that this is possible, and suspect that
> the overall efficiency of the system would go down even if the
> throughput of a single execution increases.
>
Ok - we want to sort a table with quicksort, on N threads. Every thread
gets the begin and end indices of its part of the table. The first step
starts at 0 and ends at count - 1. A single step: find the median value
and move lower values before it and bigger values after it. In the normal
case we use a recursive call, so the same procedure is called for those
two parts. With threads we can put the index ranges on a side list, and
use a queue of threads to pick up data from the list. We can use a common
table; access to the side list of indices has to be serialized.
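
A rough C sketch of that scheme (purely illustrative: pthreads, a
mutex-protected task list sized generously for random input, no error
handling; a condition variable would be better than the spin):

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define NITEMS   100000
#define NTHREADS 4
#define MAXTASKS 4096

typedef struct { int lo, hi; } Range;

static int data[NITEMS];
static Range tasks[MAXTASKS];
static int ntasks, active;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void push(int lo, int hi)
{
    pthread_mutex_lock(&lock);
    tasks[ntasks++] = (Range) {lo, hi};     /* assumes MAXTASKS suffices */
    pthread_mutex_unlock(&lock);
}

static void *worker(void *arg)
{
    (void) arg;
    for (;;)
    {
        Range r;
        int got = 0, done = 0;

        pthread_mutex_lock(&lock);
        if (ntasks > 0) { r = tasks[--ntasks]; active++; got = 1; }
        else done = (active == 0);
        pthread_mutex_unlock(&lock);

        if (done) return NULL;
        if (!got) continue;                 /* spin until work or done */

        while (r.lo < r.hi)
        {
            /* Hoare partition around the middle element */
            int pivot = data[(r.lo + r.hi) / 2];
            int i = r.lo, j = r.hi;
            while (i <= j)
            {
                while (data[i] < pivot) i++;
                while (data[j] > pivot) j--;
                if (i <= j)
                {
                    int t = data[i]; data[i] = data[j]; data[j] = t;
                    i++; j--;
                }
            }
            push(i, r.hi);                  /* one half goes to the pool */
            r.hi = j;                       /* keep working on the other */
        }

        pthread_mutex_lock(&lock);
        active--;
        pthread_mutex_unlock(&lock);
    }
}

int main(void)
{
    pthread_t th[NTHREADS];

    for (int i = 0; i < NITEMS; i++) data[i] = rand();
    push(0, NITEMS - 1);
    for (int i = 0; i < NTHREADS; i++) pthread_create(&th[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++) pthread_join(th[i], NULL);
    for (int i = 1; i < NITEMS; i++)
        if (data[i - 1] > data[i]) { printf("NOT sorted\n"); return 1; }
    printf("sorted\n");
    return 0;
}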

> Or do you mean being able to perform parts of the query plan fully in
> parallel? If this, then one would need a lot more than ParallelSort...
>
Nice to have, but rather for data warehouses. In other cases - e.g. a backend
for the Internet - there are many requests and every processor/core is
already well utilized.

--
Regards, Michał Zaborowski (TeXXaS)

Re: Sorting Improvements for 8.4

From
Andreas Joseph Krogh
Date:
On Tuesday 18 December 2007 10:03:25 Dimitri Fontaine wrote:
> Hi,
>
> > On Tuesday 18 December 2007, Ron Mayer wrote:
> > Has anyone looked into sorting algorithms that could use
> > more than one CPU or core at a time?
>
> [...]
>
> > PS: Yeah, I know multi-threading is a hot-button on these
> > lists; but sorting seems a relatively isolated part of the code
> > and I'd wonder if it'd be isolate-able enough that multiple
> > CPUs could be used there.
>
> And before that objection to multi-threading implementation and portability
> concerns arise, what about using a coroutine BSD-licenced portable
> implementation such as Protothreads to have backend code use several CPUs at
> a time?
>   http://www.sics.se/~adam/pt/
>
> With such a tool, would it be possible to think about producer/consumer
> parallel executions for sorting, aggregate nodes or other parts of the
> executor?
>
> Hope this helps, regards,

And remember: users don't care about portability issues, they care about
performance. If multi-threading is a way to speed up sorting considerably, it
should, IMHO, be considered seriously.

--
Andreas Joseph Krogh


Re: Sorting Improvements for 8.4

From
Tom Lane
Date:
Andreas Joseph Krogh <andreak@officenet.no> writes:
> And remember: users don't care about portability issues, they care about
> performance.

Nonsense.  If Postgres stops working on their machine, they'll care.
        regards, tom lane


Re: Sorting Improvements for 8.4

From
Mark Mielke
Date:
Michał Zaborowski wrote:
> Ok - we want to sort a table with quicksort, on N threads. Every thread
> gets the begin and end indices of its part of the table. The first step
> starts at 0 and ends at count - 1. A single step: find the median value
> and move lower values before it and bigger values after it. In the
> normal case we use a recursive call, so the same procedure is called
> for those two parts. With threads we can put the index ranges on a side
> list, and use a queue of threads to pick up data from the list. We can
> use a common table; access to the side list of indices has to be
> serialized.

Stupid question #2: Is it well recognized that the CPU is the bottleneck
in the PostgreSQL sorting mechanism? Or might it be memory bandwidth and
I/O?

It would seem to me that any sort worth parallelizing (administrative and
synchronization overhead), must have data larger than the L2 cache. If
larger than the L2 cache, it becomes real memory speed. If real memory
speed, wouldn't one CPU without hardware synchronization be able to fill
the memory read/write pipe? If using 'divide and conquer' to parallelize,
wouldn't the values written from one thread often (1 / N) need to be read
from another thread, requiring hardware data synchronization?

I see the wikipedia.org page describes how easy it is to parallelize
quicksort, and scale performance linearly with the number of processors,
but I don't see references to back this claim.
At least some of these steps seem difficult or impractical to parallelize.
For example, the initial partition reorder that moves items lower than
the pivot to the left, and items higher than the pivot to the right,
would not be easy to parallelize using an in-place re-order. It needs to
move one partition down before it can 'divide and conquer'. They say no
synchronization is required, but I think they are missing the hardware
synchronization required (especially in the innermost loops where the
thread task becomes shorter, and starts to fit in L1/L2). They say
linear, but then talk about a 'new thread being created'. New thread
creation has a cost, and if reduced to using a thread pool, then
synchronization *is* required.

It sounds like a 'nice in theory' idea. :-) Which doesn't mean it is
wrong...

I am curious enough to write a test...

>> Or do you mean being able to perform parts of the query plan fully in
>> parallel? If this, then one would need a lot more than ParallelSort...
>
> Nice to have, but rather for data warehouses. In other cases - e.g. a
> backend for the Internet - there are many requests and every
> processor/core is already well utilized.

I'm a fan of the 'each plan item is a task, that is assigned to the pool,
with each CPU grabbing tasks from the pool'. Another 'nice in theory'
idea (used by DB2?). As it is, though, I think PostgreSQL planning is
heavily designed to maximize performance on a single CPU, and single
queries would not easily scale to multiple CPUs. (Perhaps hashing could
be done on another CPU, or as you describe above, sorting)

Cheers,
mark

--
Mark Mielke <mark@mielke.cc>

Re: Sorting Improvements for 8.4

From
Jeff Davis
Date:
On Wed, 2007-12-19 at 12:08 -0500, Mark Mielke wrote:
> Stupid question #2: Is it well recognized that the CPU is the
> bottleneck in the PostgreSQL sorting mechanism? Or might it be memory
> bandwidth and I/O?
> 

I think it depends a lot on several factors. It's probably a different
bottleneck for integers versus localized text, and depends on the
available memory and I/O characteristics.

> 
> It would seem to me that any sort worth parallelizing (administrative
> and synchronization overhead), must have data larger than the L2
> cache. If larger than the L2 cache, it becomes real memory speed. If
> real memory speed, wouldn't one CPU without hardware synchronization,
> be able to fill the memory read/write pipe? 

We do an external merge sort, which involves merging M runs together.
You seem to be implying that we can generate the output run at disk
speed, and therefore the CPU speed is unimportant.

I suspect that the comparison costs are enough that the above statement
isn't true in all cases, particularly in the case of localized text. 

Also, there is probably a lot of memory copying going on, and that
probably destroys a lot of the effectiveness of L2 caching. When L2
caching is ineffective, the CPU spends a lot of time just waiting on
memory. In that case, it's better to have P threads of execution all
waiting on memory operations in parallel.

This would explain why "1p2t" would outperform a "1p1t" in Ron's
reference above.

These are just my first thoughts, however. There is a lot of existing
research out there that we can look into, and also a lot of tests that
we can run before jumping into this.

I think parallel sorting can be looked into separately from the other
sorting improvements.

Regards,
Jeff Davis





Re: Sorting Improvements for 8.4

From
Ron Mayer
Date:
Mark Mielke wrote:
> I am curious - what algorithms exist to efficiently do a parallel sort?
> Do you mean if sorting 1 million items, it is possible to separate this
> into  2 sets of 500 thousand each, execute them in separate threads
> (with task administration and synchronization overhead), combine the
> results, and complete the task in significantly less time than doing it
> in one thread? I am skeptical that this is possible...

The link in the beginning of the thread points to articles
that seem to describe one such algorithm; along with benchmarks.
(http://tinyurl.com/3bvu4u, http://tinyurl.com/32wg2m)
The improvements were pretty consistent from set sizes ranging
from very small sets (hundreds) to quite large ones (hundreds of K).

Interestingly, even multi-threading helped a lot.
  "Our tests correlate well with previous research that showed   Intel’s implementation of SMT (Hyper-Threading) to be
adept at hiding this latency [6, 20, 12].Table 4 shows that by   having two threads access memory at the same time,
performance  improved over 80% when compared to the singlethreaded version.
 

It uses both quicksort phases and merge phases; for the merge phase,
using 2 CPUs (no hyperthreading) apparently gave more than a 2X speed
improvement, apparently because it could overlap memory access with
CPU work.

> Or do you mean being able to perform parts of the query plan fully in
> parallel? If this, then one would need a lot more than ParallelSort...

I wouldn't recommend that - it seems like a Hard Problem.

My guess is that the best way to use multiple threads in one backend
would be to find specific algorithms like sorting that  would be
easier to isolate.


Re: Sorting Improvements for 8.4

From
Mark Mielke
Date:
Ron Mayer wrote:
> The link in the beginning of the thread points to articles
> that seem to describe one such algorithm; along with benchmarks.
> (http://tinyurl.com/3bvu4u, http://tinyurl.com/32wg2m)
> The improvements were pretty consistent from set sizes ranging
> from very small sets (hundreds) to quite large ones (hundreds of K).
>
> Interestingly, even multi-threading helped a lot.
>   "Our tests correlate well with previous research that showed
>   Intel's implementation of SMT (Hyper-Threading) to be adept
>   at hiding this latency [6, 20, 12]. Table 4 shows that by
>   having two threads access memory at the same time, performance
>   improved over 80% when compared to the single-threaded version."
>
> It uses both quicksort phases and merge phases; for the merge phase,
> using 2 CPUs (no hyperthreading) apparently gave more than a 2X speed
> improvement, apparently because it could overlap memory access with
> CPU work.

Good points. I had forgotten about DDR and DDR2 having high throughput at
the cost of high latency. Somewhere in there, having the most number of
memory requests in the queue would allow hardware to eliminate this high
latency effect.

>> Or do you mean being able to perform parts of the query plan fully in
>> parallel? If this, then one would need a lot more than ParallelSort...
>
> I wouldn't recommend that - it seems like a Hard Problem.
>
> My guess is that the best way to use multiple threads in one backend
> would be to find specific algorithms like sorting that would be
> easier to isolate.

Also a good point. :-)

Cheers,
mark

--
Mark Mielke <mark@mielke.cc>

Re: Sorting Improvements for 8.4

From
Mark Mielke
Date:
Jeff Davis wrote:
> On Wed, 2007-12-19 at 12:08 -0500, Mark Mielke wrote:
>> Stupid question #2: Is it well recognized that the CPU is the
>> bottleneck in the PostgreSQL sorting mechanism? Or might it be memory
>> bandwidth and I/O?
>
> I think it depends a lot on several factors. It's probably a different
> bottleneck for integers versus localized text, and depends on the
> available memory and I/O characteristics.

Makes sense.

>> It would seem to me that any sort worth parallelizing (administrative
>> and synchronization overhead), must have data larger than the L2
>> cache. If larger than the L2 cache, it becomes real memory speed. If
>> real memory speed, wouldn't one CPU without hardware synchronization
>> be able to fill the memory read/write pipe?
>
> We do an external merge sort, which involves merging M runs together.
> You seem to be implying that we can generate the output run at disk
> speed, and therefore the CPU speed is unimportant.

Correct. Or, alternatively, you could achieve the same effect using
asynchronous I/O or read ahead.

> I suspect that the comparison costs are enough that the above statement
> isn't true in all cases, particularly in the case of localized text.

That sounds possible, but I still feel myself suspecting that disk reads
will be much slower than localized text comparison. Perhaps I am
overestimating the performance of the comparison function?

> Also, there is probably a lot of memory copying going on, and that
> probably destroys a lot of the effectiveness of L2 caching. When L2
> caching is ineffective, the CPU spends a lot of time just waiting on
> memory. In that case, it's better to have P threads of execution all
> waiting on memory operations in parallel.

I didn't consider the high throughput / high latency effect. This could
be true if the CPU prefetch isn't effective enough.

> This would explain why "1p2t" would outperform a "1p1t" in Ron's
> reference above.
>
> These are just my first thoughts, however. There is a lot of existing
> research out there that we can look into, and also a lot of tests that
> we can run before jumping into this.
>
> I think parallel sorting can be looked into separately from the other
> sorting improvements.

Yep - I started to read up on it. It still sounds like it's a hard-ish
problem (to achieve near N times speedup for N CPU cores without
degrading performance for existing loads), but that doesn't mean
impossible. :-)

Cheers,
mark

--
Mark Mielke <mark@mielke.cc>

Re: Sorting Improvements for 8.4

From
"Dann Corbit"
Date:
As long as sorting improvements are being considered, may I suggest an
experiment that uses a very simple model?

Assuming that you have K subfiles created by the initial sorting pass,
insert the top record of each file into a priority queue.

Then, emit records from the queue until the priority queue is empty.

Now, there will be the objection that we will be jumping willy-nilly all
over the disk because of reading one record at a time, but (depending on
how it is implemented) generally several records are buffered during a
read.

So (as a gentle suggestion) I suggest testing the model.  It works great
for a single CPU or multiple CPU system for the work that *I* do.  I
have no idea if it will be a benefit for PostgreSQL or not, but it
should be a very simple matter to try it.  As long as someone is doing
the work right now, it would be a good time to give it a go.

I am not very familiar with PostgreSQL internals, but I would be willing
to give a hand with it (not really sure how much time I can guarantee,
though, since I would be doing it on my free time).



Re: Sorting Improvements for 8.4

From
Jeff Davis
Date:
On Wed, 2007-12-19 at 15:51 -0500, Mark Mielke wrote: 
> That sounds possible, but I still feel myself suspecting that disk
> reads will be much slower than localized text comparison. Perhaps I am
> overestimating the performance of the comparison function?

I think this simple test will change your perceptions:

Do an initdb with --locale="en_US.UTF-8" and start postgres.

test=> create table sorter(t text, b bytea, f float);
CREATE TABLE
test=> insert into sorter select r AS rt, r::text::bytea AS rb, r AS rf
from (select random() as r from generate_series(1,1000000)) a;
INSERT 0 1000000
test=> select pg_size_pretty(pg_total_relation_size('sorter'));
pg_size_pretty
----------------
70 MB
(1 row)

test=> explain analyze select * from sorter order by t; 
test=> explain analyze select * from sorter order by b;
test=> explain analyze select * from sorter order by f;

On my machine this table fits easily in memory (so there aren't any disk
reads at all). Sorting takes 7 seconds for floats, 9 seconds for binary
data, and 20 seconds for localized text. That's much longer than it
would take to read that data from disk, since it's only 70MB (which
takes a fraction of a second on my machine).

I think this disproves your hypothesis that sorting happens at disk
speed.

> Yep - I started to read up on it. It still sounds like it's a hard-ish
> problem (to achieve near N times speedup for N CPU cores without
> degrading performance for existing loads), but that doesn't mean
> impossible. :-)
> 

You don't even need multiple cores to achieve a speedup, according to
Ron's reference.

Regards,
Jeff Davis




Re: Sorting Improvements for 8.4

From
Jeff Davis
Date:
On Wed, 2007-12-19 at 14:41 -0800, Dann Corbit wrote:
> As long as sorting improvements are being considered, may I suggest an
> experiment that uses a very simple model?
> 
> Assuming that you have K subfiles created by the initial sorting pass,
> insert the top record of each file into a priority queue.
> 
> Then, emit records from the queue until the priority queue is empty.
> 

What is the principal difference between that idea and our existing sort
algorithm?

There's a good explanation in the comment at the top of tuplesort.c.

Regards,
Jeff Davis



Re: Sorting Improvements for 8.4

From
"Dann Corbit"
Date:
> -----Original Message-----
> From: Jeff Davis [mailto:pgsql@j-davis.com]
> Sent: Wednesday, December 19, 2007 3:10 PM
> To: Dann Corbit
> Cc: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Sorting Improvements for 8.4
>
> On Wed, 2007-12-19 at 14:41 -0800, Dann Corbit wrote:
> > As long as sorting improvements are being considered, may I suggest an
> > experiment that uses a very simple model?
> >
> > Assuming that you have K subfiles created by the initial sorting pass,
> > insert the top record of each file into a priority queue.
> >
> > Then, emit records from the queue until the priority queue is empty.
> >
>
> What is the principal difference between that idea and our existing sort
> algorithm?
>
> There's a good explanation in the comment at the top of tuplesort.c.


According to the comments, PostgreSQL uses replacement selection.
Replacement selection is a wonderful thing because it creates runs that
are twice as long as normal due to the snowplow effect.  See (for
instance):
http://ieeexplore.ieee.org/Xplore/login.jsp?url=/iel5/69/27216/01209012.pdf

Then, the merge routine will have half as many runs to merge the files
together.

So (for instance) where simple run formation would create 1024
subfiles, replacement selection will create 512.  That saves one
merge pass.
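
For reference, replacement selection itself can be sketched like this
(toy code of mine, not PostgreSQL's: a linear scan stands in for the real
heap, and the final heap contents are not flushed):

#include <stdio.h>
#include <stdlib.h>

#define HEAPSZ 1000
#define NINPUT 20000

int main(void)
{
    int heap[HEAPSZ], runtag[HEAPSZ];
    int currun = 0, last = 0, runs = 0, runlen = 0;

    for (int i = 0; i < HEAPSZ; i++) { heap[i] = rand(); runtag[i] = 0; }

    for (int n = HEAPSZ; n < NINPUT; n++)
    {
        /* smallest value still eligible for the current run */
        int best = -1;
        for (int i = 0; i < HEAPSZ; i++)
            if (runtag[i] == currun && (best < 0 || heap[i] < heap[best]))
                best = i;

        if (best < 0)               /* everything is tagged next-run */
        {
            printf("run %d: %d tuples\n", runs++, runlen);
            runlen = 0;
            currun++;
            n--;                    /* retry this input tuple */
            continue;
        }

        last = heap[best];          /* "write" it to the current run */
        runlen++;

        int next = rand();          /* replace it with the next input */
        heap[best] = next;
        /* a tuple below the last value written must wait for the next
         * run: this is the snowplow effect that doubles run length */
        runtag[best] = (next >= last) ? currun : currun + 1;
    }
    printf("(expect completed runs to average about %d tuples)\n", 2 * HEAPSZ);
    return 0;
}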

The algorithm that I am suggesting will take exactly one pass to merge
all of the files.

It works like this...

Imagine an array of pointers to the subfiles:
[*subfile][*subfile]...[*subfile]

Step 0:
We sort the array by a comparison operator that examines the top element
of each subfile.  So now the array is ordered such that the record with
the smallest key is in array slot 0.

Step 1:
We remove the first record from the subfile in array slot 0.  Now, the
priority of the first element *may* have changed.  So if it is no longer
smaller than the subfile immediately to the right, we do a binary
insertion to put this subfile in its new location, moving the contents
of array slot 1 to array slot 0 if it is needed.

Step 2:
Is the entire list of subfiles empty?  If yes, then terminate, if no
then go to Step 1.

Like I said, it is ultra-simple and it sorts the entire contents of all
subfiles to the output with a single pass.
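
In toy C form (in-memory arrays standing in for the subfiles;
illustrative only):

#include <stdio.h>
#include <limits.h>

#define NFILES 3

typedef struct { const int *rec; int left; } Subfile;

static int head(const Subfile *f) { return f->left ? f->rec[0] : INT_MAX; }

int main(void)
{
    static const int a[] = {1, 4, 9}, b[] = {2, 3, 8}, c[] = {5, 6, 7};
    Subfile files[NFILES] = { {a, 3}, {b, 3}, {c, 3} };

    /* Step 0: order the array by head record (tiny N: insertion sort). */
    for (int i = 1; i < NFILES; i++)
        for (int j = i; j > 0 && head(&files[j]) < head(&files[j - 1]); j--)
        { Subfile t = files[j]; files[j] = files[j - 1]; files[j - 1] = t; }

    /* Steps 1-2: emit from slot 0, advance that subfile, re-place it. */
    while (files[0].left > 0)
    {
        printf("%d ", files[0].rec[0]);
        files[0].rec++;
        files[0].left--;

        /* binary search for the new position of slot 0's subfile */
        int lo = 1, hi = NFILES;
        while (lo < hi)
        {
            int mid = (lo + hi) / 2;
            if (head(&files[mid]) < head(&files[0])) lo = mid + 1;
            else hi = mid;
        }
        /* rotate slots 1..lo-1 left by one and re-insert at lo-1 */
        Subfile moved = files[0];
        for (int k = 1; k < lo; k++) files[k - 1] = files[k];
        files[lo - 1] = moved;
    }
    printf("\n");   /* prints 1 2 3 4 5 6 7 8 9 */
    return 0;
}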

Consider the way that current replacement selection works.  The actual
O(f(N)) behavior of replacement selection is just terrible O(n^2).  But
because we save one full merge pass, it is usually worth it anyway,
since memory access is much faster than disk.  And if we only have a few
subfiles, the savings will be large.  In the case of a priority queue
merge, we only have one single merge pass no matter how many subfiles
there are.




Re: Sorting Improvements for 8.4

From
"Dann Corbit"
Date:
P.S.
A beautiful paper on replacement selection is found here:
http://students.fim.uni-passau.de/~fickensc/Proseminar/Proseminar.pdf



Re: Sorting Improvements for 8.4

From
Jeff Davis
Date:
On Wed, 2007-12-19 at 15:19 -0800, Dann Corbit wrote:
> The algorithm that I am suggesting will take exactly one pass to merge
> all of the files.
> 

From tuplesort.c:

"In the current code we determine the number of tapes M on the basis of
workMem: we want workMem/M to be large enough that we read a fair amount
of data each time we preread from a tape, so as to maintain the locality
of access described above.  Nonetheless, with large workMem we can have
many tapes."

It seems like you are just choosing M to be equal to the number of
initial runs, whereas the current code takes into account the cost of
having workMem/M too small.

We do want to increase the number of runs that can be merged at once;
that's what dynamic run handling and forecasting are all about. But we
want to avoid unnecessary seeking, also.

Regards,
Jeff Davis



Re: Sorting Improvements for 8.4

From
Tom Lane
Date:
Mark Mielke <mark@mark.mielke.cc> writes:
> Jeff Davis wrote:
>> Also, there is probably a lot of memory copying going on, and that
>> probably destroys a lot of the effectiveness of L2 caching. When L2
>> caching is ineffective, the CPU spends a lot of time just waiting on
>> memory. In that case, it's better to have P threads of execution all
>> waiting on memory operations in parallel.
>> 
> I didn't consider the high throughput / high latency effect. This could 
> be true if the CPU prefetch isn't effective enough.

Note that if this is the argument, then there's a ceiling on the speedup
you can expect to get: it's just the extent of mismatch between the CPU
and memory speeds.  I can believe that suitable test cases would show
2X improvement for 2 threads, but it doesn't follow that you will get
10X improvement with 10 threads, or even 4X with 4.
        regards, tom lane
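
One crude way to model that ceiling (an Amdahl-style sketch with
made-up numbers, added here for illustration): if a fraction s of
single-threaded sort time is memory stalls that extra threads can
overlap, the speedup with n threads is at most 1/((1-s) + s/n), which
tops out at 1/(1-s) no matter how many threads are added:

#include <stdio.h>

/* Upper bound on speedup when only the stall fraction s overlaps. */
static double bound(double s, int n)
{
    return 1.0 / ((1.0 - s) + s / n);
}

int main(void)
{
    double s = 0.5;             /* assumed: half the time is stalls */

    printf("2 threads: %.2fx, 4: %.2fx, 10: %.2fx, ceiling: %.2fx\n",
           bound(s, 2), bound(s, 4), bound(s, 10), 1.0 / (1.0 - s));
    return 0;
}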


Re: Sorting Improvements for 8.4

From
Gregory Stark
Date:
"Jeff Davis" <pgsql@j-davis.com> writes:

> test=> explain analyze select * from sorter order by t; 
> test=> explain analyze select * from sorter order by b;
> test=> explain analyze select * from sorter order by f;
>
> On my machine this table fits easily in memory (so there aren't any disk
> reads at all). Sorting takes 7 seconds for floats, 9 seconds for binary
> data, and 20 seconds for localized text. That's much longer than it
> would take to read that data from disk, since it's only 70MB (which
> takes a fraction of a second on my machine).
>
> I think this disproves your hypothesis that sorting happens at disk
> speed.

I suspect most of that is spent just copying the data around. Which would not
be helped by having multiple threads doing the copying -- and in fact might be
exacerbated if it required an extra copy to consolidate all the data in the
end.

How long does a simple "explain analyze select * from sorter" take?

And assuming you're doing disk sorts (in disk cache) you're doing quite a lot
of copying to temporary files (in disk cache) and then back to memory.


Note that speeding up a query from 20s to 5s isn't terribly useful. If it's
OLTP you can't be using all your cores for each user anyways. And if it's DSS
20s isn't a problem.

Where parallel processing like this becomes attractive is when you're running
a 2 hour query on a machine sequentially running scheduled batch jobs which
can be sped up to 30 minutes. But in that case you're almost certainly being
limited by your disk bandwidth, not your cpu speed.

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com Ask me about EnterpriseDB's PostGIS support!


Re: Sorting Improvements for 8.4

From
"Dann Corbit"
Date:
> -----Original Message-----
> From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers-
> owner@postgresql.org] On Behalf Of Gregory Stark
> Sent: Wednesday, December 19, 2007 5:26 PM
> To: Jeff Davis
> Cc: Mark Mielke; Michał Zaborowski; Simon Riggs; Ron Mayer; pgsql-
> hackers@postgresql.org
> Subject: Re: [HACKERS] Sorting Improvements for 8.4
>
>
> "Jeff Davis" <pgsql@j-davis.com> writes:
>
> > test=> explain analyze select * from sorter order by t;
> > test=> explain analyze select * from sorter order by b;
> > test=> explain analyze select * from sorter order by f;
> >
> > On my machine this table fits easily in memory (so there aren't any disk
> > reads at all). Sorting takes 7 seconds for floats, 9 seconds for binary
> > data, and 20 seconds for localized text. That's much longer than it
> > would take to read that data from disk, since it's only 70MB (which
> > takes a fraction of a second on my machine).
> >
> > I think this disproves your hypothesis that sorting happens at disk
> > speed.
>
> I suspect most of that is spent just copying the data around. Which would not
> be helped by having multiple threads doing the copying -- and in fact might be
> exacerbated if it required an extra copy to consolidate all the data in the
> end.

Benchmarking a single system will really only explain that system.
Someone may have a disk farm with 2GB/Sec throughput:
http://www.sql-server-performance.com/articles/per/system_storage_configuration_p1.aspx

But such a configuration is very unlikely.

Someone may have 10 Gb/s NICs, but those, too, are rare.

So for any benchmark, we will really just end up with a number for that system.

Typically, disk is the bottleneck.
I found this on the net somewhere, but it's quite a useful table for
capacity planning (to find the weak link in the chain using
back-of-the-envelope calculations):

Interface    Width    Frequency    Bytes/Sec    Bits/Sec
4-way interleaved PC1600 (DDR200) SDRAM    4 x 64bits    100 MHz DDR    6.4 GB/s    51 Gbps
Opteron HyperTransport memory bus    128bits    200 MHz DDR    6.4 GB/s    51 Gbps
Pentium 4 "800 MHz" FSB    64bits    200 MHz QDR    6.4 GB/s    51 Gbps
PC2 6400 (DDR-II 800) SDRAM    64bits    400 MHz DDR    6.4 GB/s    51 Gbps
PC2 5300 (DDR-II 667) SDRAM    64bits    333 MHz DDR    5.3 GB/s    43 Gbps
Pentium 4 "533 MHz" FSB    64bits    133 MHz QDR    4.3 GB/s    34 Gbps
PC2 4300 (DDR-II 533) SDRAM    64bits    266 MHz DDR    4.3 GB/s    34 Gbps
2-channel PC1066 RDRAM    2 x 16bits    533 MHz DDR    4.3 GB/s    34 Gbps
PCI-X 533    64bits    533 MHz    4.3 GB/s    34 Gbps
PCI-Express x16    serial/16lanes    2.5 GHz    4 GB/s    32 Gbps
Pentium 4 "400 MHz" FSB    64bits    100 MHz QDR    3.2 GB/s    25.6 Gbps
2-channel PC800 RDRAM    2 x 16bits    400 MHz DDR    3.2 GB/s    25.6 Gbps
2-way interleaved PC1600 (DDR200) SDRAM    2 x 64bits    100 MHz DDR    3.2 GB/s    25.6 Gbps
PC2 3200 (DDR-II 400) SDRAM    64bits    200 MHz DDR    3.2 GB/s    25.6 Gbps
PC3200 (DDR400) SDRAM    64bits    200 MHz DDR    3.2 GB/s    25.6 Gbps
PC2700 (DDR333) SDRAM    64bits    167 MHz DDR    2.7 GB/s    21 Gbps
PC2100 (DDR266) SDRAM    64bits    133 MHz DDR    2.1 GB/s    17 Gbps
AGP 8x    32bits    533 MHz    2.1 GB/s    17 Gbps
PCI-X 266    64bits    266 MHz    2.1 GB/s    17 Gbps
PCI-Express x8    serial/8lanes    2.5 GHz    2 GB/s    16 Gbps
EV6 bus (Athlon/Duron FSB)    64bits    100 MHz DDR    1.6 GB/s    13 Gbps
PC1600 (DDR200) SDRAM    64bits    100 MHz DDR    1.6 GB/s    13 Gbps
PC800 RDRAM    16bits    400 MHz DDR    1.6 GB/s    13 Gbps
PC150 SDRAM    64bits    150 MHz    1.3 GB/s    10.2 Gbps
10 gigabit ethernet    serial    10 GHz    1.25 GB/s    10 Gbps
OC-192    serial    9.953 GHz    1.24 GB/s    9.953 Gbps
133 MHz FSB    64bits    133 MHz    1.06 GB/s    8.5 Gbps
PC133 SDRAM    64bits    133 MHz    1.06 GB/s    8.5 Gbps
AGP 4x    32bits    266 MHz    1.06 GB/s    8.5 Gbps
PCI-X    64bits    133 MHz    1.06 GB/s    8.5 Gbps
PCI-Express x4    serial/4lanes    2.5 GHz    1 GB/s    8 Gbps
100 MHz FSB    64bits    100 MHz    800 MB/s    6.4 Gbps
PC100 SDRAM    64bits    100 MHz    800 MB/s    6.4 Gbps
PC66 SDRAM    64bits    66 MHz    533 MB/s    4.3 Gbps
fast/wide PCI    64bits    66 MHz    533 MB/s    4.3 Gbps
AGP 2x    32bits    133 MHz    533 MB/s    4.3 Gbps
single-link DVI    12bits    165 MHz DDR    495 MB/s    3.96 Gbps
Ultra-320 SCSI    16bits    160 MHz    320 MB/s    2.6 Gbps
OC-48 network    serial    2.488 GHz    311 MB/s    2.488 Gbps
AGP    32bits    66 MHz    266 MB/s    2.1 Gbps
PCI-Express x1    serial    2.5 GHz    250 MB/s    2 Gbps
Serial ATA/1500 disk    serial    1.5 GHz    187 MB/s    1.5 Gbps
Ultra-160 SCSI    16bits    80 MHz    160 MB/s    1.3 Gbps
OC-24 network    serial    1.244 GHz    155 MB/s    1.244 Gbps
PCI    32bits    33 MHz    133 MB/s    1.06 Gbps
ATA/133 disk    8bits    66 MHz DDR    133 MB/s    1.06 Gbps
gigabit ethernet    serial    1 GHz    125 MB/s    1 Gbps
ATA/100 disk    8bits    50 MHz DDR    100 MB/s    800 Mbps
IEEE 1394b    serial    800 MHz    100 MB/s    800 Mbps
Ultra-2 Wide SCSI    16bits    40 MHz    80 MB/s    640 Mbps
OC-12 network    serial    622.08 MHz    77.7 MB/s    622.08 Mbps
ATA/66 disk    8bits    33 MHz DDR    66 MB/s    533 Mbps
USB-2    serial    480 MHz    60 MB/s    480 Mbps
IEEE 1394    serial    400 MHz    50 MB/s    400 Mbps
Ultra Wide SCSI    16bits    20 MHz    40 MB/s    320 Mbps
ATA/33 disk    8bits    16.6 MHz DDR    33 MB/s    266 Mbps
Fast Wide SCSI    16bits    10 MHz    20 MB/s    160 Mbps
OC-3 network    serial    155.52 MHz    19.4 MB/s    155.52 Mbps
100baseT ethernet    serial    100 MHz    12.5 MB/s    100 Mbps
OC-1 network    serial    51.84 MHz    6.5 MB/s    51.84 Mbps
T-3 network    serial    45 MHz    5.6 MB/s    44.736 Mbps
USB    serial    12 MHz    1.5 MB/s    12 Mbps
10baseT ethernet    serial    10 MHz    1.25 MB/s    10 Mbps
IrDA-2    serial    4 MHz    500 KB/s    4 Mbps
T-1 network    serial    1.5 MHz    193 KB/s    1.544 Mbps

> How long does a simple "explain analyze select * from sorter" take?
>
> And assuming you're doing disk sorts (in disk cache) you're doing quite a lot
> of copying to temporary files (in disk cache) and then back to memory.
>
>
> Note that speeding up a query from 20s to 5s isn't terribly useful. If it's
> OLTP you can't be using all your cores for each user anyways. And if it's DSS
> 20s isn't a problem.

Unless (of course) there are 20,000 users doing the queries that would
take 20 seconds but now they take 5 (when run single-user). They will
still have a bit of a wait, of course.
> Where parallel processing like this becomes attractive is when you're running
> a 2 hour query on a machine sequentially running scheduled batch jobs which
> can be sped up to 30 minutes. But in that case you're almost certainly being
> limited by your disk bandwidth, not your cpu speed.

A linear speedup of 2 or more is always worthwhile[*].  Since sorting
(e.g. for 'group by' and 'order by') and sort joins are a major database
task, I guess that a linear speedup by a factor of 2 might make database
operations on the whole be 10% faster or so {OK, it's a SWAG}.  I guess
it would look good on the benchmarks, if nothing else.

[*] unless it is already fast enough.  If, at peak load, a query takes
1 ms, then making the query take 0.5 ms is not going to win you any
medals, especially if the improvement costs $10,000.



Re: Sorting Improvements for 8.4

From
Greg Smith
Date:
On Wed, 19 Dec 2007, Dann Corbit wrote:

> Benchmarking a single system will really only explain that system.
> Someone may have a disk farm with 2GB/Sec throughput
> But such a configuration is very unlikely.

If you believe comments like those at 
http://www.c0t0d0s0.org/archives/1792-Do-it-yourself-X4500.html it's 
possible to hit >2GB/s total to the 48 disks in one of the Sun X4500 
servers, which start at $24K.  May be unlikely to you, but I was reading 
there after I set one up last night, and that's a boring standard 
configuration for some Sun and Greenplum customers.

Also, that's today--by the time 8.4 is mainstream high-end machines will 
be even faster.  Wanna make a bet on how much disk throughput will be 
available as SSD disks go mainstream in the next two years?

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: Sorting Improvements for 8.4

From
Gregory Stark
Date:
"Dann Corbit" <DCorbit@connx.com> writes:

>> Note that speeding up a query from 20s to 5s isn't terribly useful. If it's
>> OLTP you can't be using all your cores for each user anyways. And if it's
>> DSS 20s isn't a problem.
>
> Unless (of course) there are 20,000 users doing the queries that would take 20
> seconds but now they take 5 (when run single-user). They will still have a bit
> of a wait, of course.

I'm not exactly following. If you have 20,000 users then you're probably using
all the processors already. If you process them one by one on 4 cores in 5s
then you'll get the same throughput as if you ran them four at a time on 1
core each in 20s.

>> Where parallel processing like this becomes attractive is when you're
>> running a 2 hour query on a machine sequentially running scheduled batch
>> jobs which can be sped up to 30 minutes. But in that case you're almost
>> certainly being limited by your disk bandwidth, not your cpu speed.
>
> A linear speedup of 2 or more is always worthwhile[*]. Since sorting (e.g. for
> 'group by' and 'order by') and sort joins are a major database task, I guess
> that a linear speedup by a factor of 2 might make the database operations on
> the whole be 10% faster or so {OK, it's a SWAG}. I guess it would look good on
> the benchmarks, if nothing else.

Except note that you're not getting this linear speedup for free. To get a
linear speedup of 2x you'll be using more than 2x the cpu resources. If there
is nothing else contending for that resource (such as the scenario I described
where you're running a single large batch query on a system and want to use
all available resources to run it as fast as possible), then you'll get a 2x
speedup. 

But if there is more than one query running on the system then you're not
actually gaining anything. Each query will run faster but you won't be able to
run as many simultaneously without having them slow back down. And the
overhead of parallelizing the query will be a net loss.


--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com Ask me about EnterpriseDB's RemoteDBA services!


Re: Sorting Improvements for 8.4

From
Gregory Stark
Date:
"Greg Smith" <gsmith@gregsmith.com> writes:

> On Wed, 19 Dec 2007, Dann Corbit wrote:
>
>> Benchmarking a single system will really only explain that system.
>> Someone may have a disk farm with 2GB/Sec throughput
>> But such a configuration is very unlikely.
>
> If you believe comments like those at
> http://www.c0t0d0s0.org/archives/1792-Do-it-yourself-X4500.html it's possible
> to hit >2GB/s total to the 48 disks in one of the Sun X4500 servers, which
> start at $24K.  May be unlikely to you, but I was reading there after I set one
> up last night, and that's a boring standard configuration for some Sun and
> Greenplum customers.

Surely such machines have kickass memory backplanes too though? How could it
ever be reasonable to have an i/o controller with more bandwidth than your
memory?

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com Ask me about EnterpriseDB's 24x7 Postgres support!


Re: Sorting Improvements for 8.4

From
Ron Mayer
Date:
Tom Lane wrote:
> ...I can believe that suitable test cases would show
> 2X improvement for 2 threads,

One other thing I found interesting is that their test case showed a
near 2X improvement for hyperthreading; whereas I haven't heard of many
other ways to get hyperthreading to show improvements for PostgreSQL.


> but it doesn't follow that you will get
> 10X improvement with 10 threads, or even 4X with 4.

Yeah - unless those 10 cores have additional I/O to the memories
compared to a 1-core system (which I'd hope would be the case, or else
I'd expect many apps to run into memory bottlenecks on such systems,
no?).





Re: Sorting Improvements for 8.4

From
Greg Smith
Date:
On Thu, 20 Dec 2007, Gregory Stark wrote:

> Surely such machines have kickass memory backplanes too though? How could it
> ever be reasonable to have an i/o controller with more bandwidth than your
> memory?

Dann had the right general numbers here--max of 6.4GB/s between processors 
and you might coax an aggregate of double that out of the DDR RAM with 2 
4-way interleaved banks of memory.  Let's call it 12GB/s theoretical max. 
If the theoretical max of the disks is 2GB/s, that's only a 6:1 headroom, 
and with a decent cache rate it's not outrageous to imagine you could 
bottleneck on memory with some things before you run out of disk 
throughput.

Right now I think a lot of the disk bottlenecks are seek-limited more 
than anything, but SSD will knock that one out for apps that are more 
about throughput than maximum storage.  I could already switch to SSD 
usefully today for some of what I do that's in that category, it's just 
a bit too expensive to do yet; soon, though.

Just trying to usefully estimate where the edge of that 
back-of-the-envelope calculation should go.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: Sorting Improvements for 8.4

From
Martijn van Oosterhout
Date:
On Wed, Dec 19, 2007 at 07:17:21PM -0800, Ron Mayer wrote:
> > but it doesn't follow that you will get
> > 10X improvement with 10 threads, or even 4X with 4.
>
> Yeah - unless those 10 cores have additional I/O to the
> memories compared to a 1 core system (which I'd hope
> would be the case or else I'd expect many apps would be
> run into memory bottlenecks on such systems, no?).

I don't suppose you saw the document from Ulrich Drepper, "What Every
Programmer Should Know About Memory". It's a fact that most machines
with multiple cores have less L2 cache per core than a single-core
machine. And having multiple conduits to main memory isn't that common
at all. So having more threads sometimes *decreases* performance
because you cut the L2 cache and memory bandwidth in half.

The document is very useful for getting tips about how to work out
optimal thread/memory/datasize ratios.

The way around this is a NUMA architecture, but that's a whole
other ball of wax.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Those who make peaceful revolution impossible will make violent revolution inevitable.
>  -- John F Kennedy

Re: Sorting Improvements for 8.4

From
Greg Smith
Date:
On Thu, 20 Dec 2007, Martijn van Oosterhout wrote:

> The way around this is a NUMA architecture, but that's a whole
> other ball of wax.

Quick note for those reading Ulrich's paper:  he refers in a couple of 
places to Intel's upcoming CSI approach to NUMA.  This has now been 
renamed QuickPath, and it looks like it will be late 2008 before that even 
makes it to Itanium processors.

The fact that AMD has a good NUMA implementation in their Opteron lines 
while Intel's Xeon processors do not is one area where AMD still has a 
clear competitive lead.  But you need a memory-bandwidth-starved 
application before that matters more than the fact that the current 
Xeons are faster in general.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: Sorting Improvements for 8.4

From
Brian Hurt
Date:
While we're blue skying things, I've had an idea for a sorting algorithm 
kicking around for a couple of years that might be interesting.  It's a 
variation on heapsort to make it significantly more block-friendly.  I 
have no idea if the idea would work, or how well it'd work, but it might 
be worthwhile kicking around.

Now, the core idea of heapsort is that the array is put into heap
order - basically, that a[i] >= a[2i+1] and a[i] >= a[2i+2] (doing the
0-based array version here).  The problem is that, assuming the length
of a is larger than memory, a[2i+1] is likely to be on a different page
or block than a[i].  That means every time you have to bubble down a
new element, you end up reading O(log N) blocks - and this is *per
element*.

The variation is to instead work with blocks, so you have a block of 
entries b[i], and you change the definition of heap order, so that 
min(b[i]) >= max(b[2i+1]) and min(b[i]) >= max(b[2i+2]).  Also, during 
bubble down, you need to be careful to only change the minimum value of 
one of the two child blocks b[2i+1] and b[2i+2].  Other than that, the 
algorithm works as normal.  The advantage of doing it this way is that 
while each bubble down still touches O(log N) blocks, you get an entire 
block worth of results for your effort.  Make your blocks large enough 
(say, 1/4 the size of workmem) and you greatly reduce N, the number of 
blocks you have to deal with, and get much better I/O (when you're 
reading, you're reading megabytes at a shot).

Now, there are boatloads of complexities I'm glossing over here.  This 
is more of a sketch of the idea.  But it's something to consider.
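
To pin down the modified invariant, here is a small C sketch (all names
illustrative) that checks this block-heap order over an array of blocks
whose entries are kept internally sorted:

#include <stdbool.h>
#include <stddef.h>

typedef struct
{
    const int *vals;            /* entries, kept sorted ascending */
    size_t n;
} Block;

/* A sorted block's min is vals[0] and its max is vals[n - 1]. */
static int blk_min(const Block *b) { return b->vals[0]; }
static int blk_max(const Block *b) { return b->vals[b->n - 1]; }

/* Verify min(b[i]) >= max(b[2i+1]) and min(b[i]) >= max(b[2i+2])
 * for every block that has children. */
static bool block_heap_ordered(const Block *b, size_t nblocks)
{
    size_t i;

    for (i = 0; 2 * i + 1 < nblocks; i++)
    {
        if (blk_min(&b[i]) < blk_max(&b[2 * i + 1]))
            return false;
        if (2 * i + 2 < nblocks && blk_min(&b[i]) < blk_max(&b[2 * i + 2]))
            return false;
    }
    return true;
}

int main(void)
{
    static const int top[] = {7, 8, 9}, l[] = {4, 5, 6}, r[] = {1, 2, 3};
    Block heap[] = {{top, 3}, {l, 3}, {r, 3}};

    return block_heap_ordered(heap, 3) ? 0 : 1;     /* ordered: exits 0 */
}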

Brian



Re: Sorting Improvements for 8.4

From
"Dann Corbit"
Date:
> -----Original Message-----
> From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers-
> owner@postgresql.org] On Behalf Of Brian Hurt
> Sent: Thursday, December 20, 2007 6:42 AM
> To: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Sorting Improvements for 8.4
>
> While we're blue skying things, I've had an idea for a sorting algorithm
> kicking around for a couple of years that might be interesting.  It's a
> variation on heapsort to make it significantly more block-friendly.  I
> have no idea if the idea would work, or how well it'd work, but it might
> be worthwhile kicking around.
>
> Now, the core idea of heapsort is that the array is put into heap
> order - basically, that a[i] >= a[2i+1] and a[i] >= a[2i+2] (doing the
> 0-based array version here).  The problem is that, assuming the length
> of a is larger than memory, a[2i+1] is likely to be on a different page
> or block than a[i].  That means every time you have to bubble down a
> new element, you end up reading O(log N) blocks - and this is *per
> element*.
>
> The variation is to instead work with blocks, so you have a block of
> entries b[i], and you change the definition of heap order, so that
> min(b[i]) >= max(b[2i+1]) and min(b[i]) >= max(b[2i+2]).  Also, during
> bubble down, you need to be careful to only change the minimum value of
> one of the two child blocks b[2i+1] and b[2i+2].  Other than that, the
> algorithm works as normal.  The advantage of doing it this way is that
> while each bubble down still touches O(log N) blocks, you get an entire
> block worth of results for your effort.  Make your blocks large enough
> (say, 1/4 the size of workmem) and you greatly reduce N, the number of
> blocks you have to deal with, and get much better I/O (when you're
> reading, you're reading megabytes at a shot).
>
> Now, there are boatloads of complexities I'm glossing over here.  This
> is more of a sketch of the idea.  But it's something to consider.

It's an interesting idea to work with a "heap of heaps" where you try to
keep each heap page-sized.  It reminds me of the B+ tree, where you
collect a whole bunch of nodes into a single page.

I don't know if you have examined weak-heaps, but there are some
interesting results for weak-heap approaches.  As you know, heapsort
variants do not degenerate to O(N^2).

On this link:
http://www.jea.acm.org/2002/EdelkampHeapsort/

I highly recommend all the goodies he has embedded (papers, source,
etc.)



Re: Sorting Improvements for 8.4

From
Ron Mayer
Date:
Gregory Stark wrote:
> Note that speeding up a query from 20s to 5s isn't terribly useful. 

I disagree totally with that.

That is the difference between no chance of someone waiting for a web
page to load and a good chance that they would.  And 2s vs. 0.5s is the
difference between a web site that feels responsive and one that doesn't.

> If it's OLTP you can't be using all your cores for each user anyways.

Even so, I'd much rather keep each response time lower.   If web page
requests are coming in at 1 a second, it's much nicer to respond to
each of them in 1 second than in 4 seconds -- even if the overall
throughput is identical.



Re: Sorting Improvements for 8.4

From
Mark Mielke
Date:
Jeff Davis wrote:
> On Wed, 2007-12-19 at 15:51 -0500, Mark Mielke wrote:
>> That sounds possible, but I still feel myself suspecting that disk
>> reads will be much slower than localized text comparison. Perhaps I am
>> overestimating the performance of the comparison function?
>
> I think this simple test will change your perceptions:

Yes - I received the same results (although my PostgreSQL doesn't have
a built-in cast ::text::bytea... :-) )

> On my machine this table fits easily in memory (so there aren't any disk
> reads at all). Sorting takes 7 seconds for floats, 9 seconds for binary
> data, and 20 seconds for localized text. That's much longer than it
> would take to read that data from disk, since it's only 70MB (which
> takes a fraction of a second on my machine).

Might this mean that PostgreSQL performs too many copy operations? :-)

> I think this disproves your hypothesis that sorting happens at disk
> speed.

Yes.

>> Yep - I started to read up on it. It still sounds like it's a hard-ish
>> problem (to achieve near N times speedup for N CPU cores without
>> degrading performance for existing loads), but that doesn't mean
>> impossible. :-)
>
> You don't even need multiple cores to achieve a speedup, according to
> Ron's reference.

I think Ron's reference actually said that you don't need full cores to
achieve a speedup. It spoke of Intel's HT system. A single CPU with a
single execution pipeline is not going to function better with multiple
threads unless the single-thread case is written wrong. Multiple threads
are always an overall loss without hardware support. The thinking on
this is that multiple threads can sometimes lead to cleaner designs,
which are sometimes more naturally written to perform well. In my
experience, the opposite is usually true.

But, if you do have HT, and the algorithm can be modified to take
advantage of it for an overall increase in speed - great.

Cheers,
mark

-- 
Mark Mielke <mark@mielke.cc>

Re: Sorting Improvements for 8.4

From
Jeff Davis
Date:
On Thu, 2007-12-20 at 01:26 +0000, Gregory Stark wrote:
> I suspect most of that is spent just copying the data around. Which would not
> be helped by having multiple threads doing the copying -- and in fact might be
> exacerbated if it required an extra copy to consolidate all the data in the
> end.

The theory is that it could be helped by multiple threads, because of
the memory latency.

> How long does a simple "explain analyze select * from sorter" take?

2 seconds, but the table is already in cache I'm sure (since it's so
small).

> Note that speeding up a query from 20s to 5s isn't terribly useful. If it's
> OLTP you can't be using all your cores for each user anyways. And if it's DSS
> 20s isn't a problem.

I'm not pushing for parallel sort, I'm just brainstorming. I think Ron's
idea has merit, but I realize it also has limitations.

> Where parallel processing like this becomes attractive is when you're running
> a 2 hour query on a machine sequentially running scheduled batch jobs which
> can be sped up to 30 minutes. But in that case you're almost certainly being
> limited by your disk bandwidth, not your cpu speed.

Are you sure that's always the case? My test seemed to indicate that
sorting took longer than it would to read the file from disk. 

Regards,
Jeff Davis



Re: Sorting Improvements for 8.4

From
Mark Mielke
Date:
Jeff Davis wrote:<br /><blockquote cite="mid:1198193593.10057.72.camel@dogma.ljc.laika.com" type="cite"><blockquote
type="cite"><prewrap="">Where parallel processing like this becomes attractive is when you're running
 
a 2 hour query on a machine sequentially running scheduled batch jobs which
can be sped up to 30 minutes. But in that case you're almost certainly being
limited by your disk bandwidth, not your cpu speed.   </pre></blockquote><pre wrap="">Are you sure that's always the
case?My test seemed to indicate that
 
sorting took longer than it would to read the file from disk.  </pre></blockquote> It's probably not a relevant
scenarioeither, as this discussion has only been about improving the performance of the sort, and I suspect there are
veryfew database loads with performance characteristics completely defined by the efficiency of the sort algorithm?
:-)<br/><br /> So far I am getting:<br /><br />     1) Sort is slower than many people expect. (Jeff's test case
emphasizesthis well)<br />     2) White papers exist that document theoretical, simulated, and in some cases actual
executionwhere parallel sort can be beneficial.<br />     3) White papers exist that document how parallel sort is
difficultto get right, and that characteristics of machines in use today prevent full utilization.<br />     4)
PostgreSQLis not designed to spread a single query across multiple execution units (whether CPUs, cores, or HT).<br
/><br/> It's interesting discussion for me thus far.<br /><br /> Cheers,<br /> mark<br /><br /><pre
class="moz-signature"cols="72">-- 
 
Mark Mielke <a class="moz-txt-link-rfc2396E" href="mailto:mark@mielke.cc"><mark@mielke.cc></a>
</pre>

Re: Sorting Improvements for 8.4

From
Brian Hurt
Date:
Brian Hurt wrote:

> While we're blue skying things, I've had an idea for a sorting 
> algorithm kicking around for a couple of years that might be 
> interesting.  It's a variation on heapsort to make it significantly 
> more block-friendly.  I have no idea if the idea would work, or how 
> well it'd work, but it might be worthwhile kicking around.
>
> Now, the core idea of heapsort is that the array is put into heap 
> order - basically, that a[i] >= a[2i+1] and a[i] >= a[2i+2] (doing the 
> 0-based array version here).  The problem is that, assuming the length 
> of a is larger than memory, a[2i+1] is likely to be on a different 
> page or block than a[i].  That means every time you have to bubble 
> down a new element, you end up reading O(log N) blocks - and this is 
> *per element*.
>
> The variation is to instead work with blocks, so you have a block of 
> entries b[i], and you change the definition of heap order, so that 
> min(b[i]) >= max(b[2i+1]) and min(b[i]) >= max(b[2i+2]).  Also, during 
> bubble down, you need to be careful to only change the minimum value 
> of one of the two child blocks b[2i+1] and b[2i+2].  Other than that, 
> the algorithm works as normal.  The advantage of doing it this way is 
> that while each bubble down still touches O(log N) blocks, you get an 
> entire block worth of results for your effort.  Make your blocks 
> large enough (say, 1/4 the size of workmem) and you greatly reduce N, 
> the number of blocks you have to deal with, and get much better I/O 
> (when you're reading, you're reading megabytes at a shot).
>
> Now, there are boatloads of complexities I'm glossing over here.  This 
> is more of a sketch of the idea.  But it's something to consider.
>
Following up to myself (my apologies), but it has since occurred to me 
that there are three further advantages to this proposal:

1) Of the two child blocks b[2i+1] and b[2i+2], the one with the larger 
minimum element is the one we might replace.  In other words, if 
min(b[2i+1]) > min(b[2i+2]) and min(b[i]) < min(b[2i+1]), then we know 
we're going to want blocks b[4i+3] and b[4i+4] before we're done with 
blocks b[2i+1] and b[2i+2].  The point here is that this would work 
wonders with the posix_fadvise/asyncio ideas kicking around (see the 
sketch after this list).  It'd be easy for the code to keep 2 large 
writes and 2 large reads going pretty constantly.

2) There is some easy parallelization available.  I'm not sure how much 
this is worth, but the bubble-down code is fairly easy to parallelize.  
If we have two bubble-downs going on in parallel, once they go down 
different branches (one thread goes to block b[2i+1] while the other 
goes to b[2i+2]) they no longer interact.  Blocks near the root of the 
heap would be contended over, and multiple threads mean smaller blocks 
to keep the total memory footprint the same.  Personally, I think the 
asyncio idea above is more likely to be worthwhile.

3) It's possible to perform the sort lazily.  You have the initial O(N) 
pass over the list, but then each block is only O(log N) cost.  If it's 
likely that only the first part of the result is needed, then much of 
the work can be avoided.
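
A minimal sketch of the prefetch idea in point 1 (the sketch promised
above), assuming purely for illustration that blocks are a fixed 1MB
and laid out contiguously in one temp file; posix_fadvise with
POSIX_FADV_WILLNEED is the standard POSIX hint:

#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <stddef.h>

#define BLOCK_BYTES (1L << 20)  /* assumed on-disk block size */

/* Having decided to descend into child block 'child', hint the kernel
 * to start reading that child's own children (the b[4i+3] and b[4i+4]
 * of point 1) before bubble-down reaches them. */
void prefetch_grandchildren(int fd, size_t child, size_t nblocks)
{
    size_t g;

    for (g = 2 * child + 1; g <= 2 * child + 2 && g < nblocks; g++)
        (void) posix_fadvise(fd, (off_t) (g * BLOCK_BYTES),
                             BLOCK_BYTES, POSIX_FADV_WILLNEED);
}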

Brian



Re: Sorting Improvements for 8.4

From
Gregory Stark
Date:
"Brian Hurt" <bhurt@janestcapital.com> writes:

> 3) It's possible to perform the sort lazily.  You have the initial O(N) pass
> over the list, but then each block is only O(log N) cost.  If it's likely that
> only the first part of the result is needed, then much of the work can be
> avoided.

Now that's a *fascinating* idea. I'm having trouble coming up with a really
killer use case for it since the bounded heap sort takes care of many cases
where it would seem to apply. But it seems really promising.

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com Ask me about EnterpriseDB's RemoteDBA services!


Re: Sorting Improvements for 8.4

From
"Joris Dobbelsteen"
Date:
>-----Original Message-----
>From: pgsql-hackers-owner@postgresql.org
>[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Ron Mayer
>Sent: Wednesday, 19 December 2007 19:26
>To: Mark Mielke; pgsql-hackers@postgresql.org
>Subject: Re: [HACKERS] Sorting Improvements for 8.4
>
>> Or do you mean being able to perform parts of the query plan fully in
>> parallel? If this, then one would need a lot more than ParallelSort...
>
>I wouldn't recommend that - it seems like a Hard Problem.
>
>My guess is that the best way to use multiple threads in one
>backend would be to find specific algorithms like sorting that
>would be easier to isolate.

To give my view on this problem: if I look at a competing (commercial)
database product, they added operations called "parallelize" and
"combine". Basically they split the data across several threads at one
point and combine them later. This is basically what you are also
implementing for "parallelsort", but as a single step in the query
execution.

In my opinion your starting point is too narrow and specific, especially
since a fairly simple generalization is possible. Instead, the issue
becomes the spill-to-disk code that needs to operate in parallel (which
needs to be tackled sooner or later anyways).

If you can change the sort into three steps - parallelize, sort
(multiple parallel instances) and combine (merge) - you still have the
same base case; a rough sketch of that shape follows below. However, I
believe such a thing is much easier to extend to more operations.
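
As a toy sketch of that three-step shape (illustrative only: two fixed
workers, in-memory data, no spill-to-disk; compile with -lpthread):

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct { int *base; size_t n; } Part;

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *) a, y = *(const int *) b;
    return (x > y) - (x < y);
}

/* "sort" step: each worker sorts its own slice independently */
static void *sort_part(void *arg)
{
    Part *p = arg;
    qsort(p->base, p->n, sizeof(int), cmp_int);
    return NULL;
}

/* parallelize -> sort -> combine, with a simple 2-way merge at the end */
static void parallel_sort2(int *a, size_t n, int *out)
{
    Part part[2] = {{a, n / 2}, {a + n / 2, n - n / 2}};
    pthread_t tid[2];
    size_t i = 0, j = n / 2, k = 0;
    int t;

    for (t = 0; t < 2; t++)     /* parallelize: hand out the slices */
        pthread_create(&tid[t], NULL, sort_part, &part[t]);
    for (t = 0; t < 2; t++)
        pthread_join(tid[t], NULL);

    while (i < n / 2 && j < n)  /* combine: merge the sorted halves */
        out[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
    while (i < n / 2)
        out[k++] = a[i++];
    while (j < n)
        out[k++] = a[j++];
}

int main(void)
{
    int in[] = {9, 1, 8, 2, 7, 3, 6, 4, 5}, out[9];
    size_t i;

    parallel_sort2(in, 9, out);
    for (i = 0; i < 9; i++)
        printf("%d ", out[i]);  /* prints: 1 2 3 4 5 6 7 8 9 */
    printf("\n");
    return 0;
}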

Furthermore, it seems that cache is considered a major problem,
especially the varying sizes. Wouldn't a cache-oblivious algorithm, like
<http://erikdemaine.org/papers/BRICS2002/> or
<http://etd.uwaterloo.ca/etd/afarzan2004.pdf>, be a good starting point
for refinements of the sort algorithm itself?
I believe you can get more consistent performance across varying cache
sizes, but it might be slower than a well-tuned quicksort.

Just my EUR 0,02...

- Joris



Re: Sorting Improvements for 8.4

From
James Mansion
Date:
Ron Mayer wrote:
>> Or do you mean being able to perform parts of the query plan fully in
>> parallel? If this, then one would need a lot more than ParallelSort...
>>     
>
> I wouldn't recommend that - it seems like a Hard Problem.
>
>   
Isn't it the case that the implicit unions from processing partitioned 
data provide a more-or-less-ideal opportunity here?

I certainly have sympathy for parallelising expensive queries to bring 
the best response time down, even if the average under full load goes up 
slightly, since any implied locks (including pinning of read-ahead 
pages) will be released sooner.

And when load is light, users who are online get more of the hardware 
they paid for.

James



Re: Sorting Improvements for 8.4

From
Bruce Momjian
Date:
Added to TODO:

> * Consider being smarter about memory and external files used during
>   sorts
>
>   http://archives.postgresql.org/pgsql-hackers/2007-11/msg01101.php
>   http://archives.postgresql.org/pgsql-hackers/2007-12/msg00045.php


---------------------------------------------------------------------------

Simon Riggs wrote:
> Just wanted to review a few thoughts and ideas around improving external
> sorts, as recently encouraged to do by Jim Nasby. 
> 
> Current issues/opportunities are these:
> 
> ISSUES
> 
> a) Memory is always in short supply, so using what we have more
> effectively is going to be welcome.
> 
> b) Heap sort has a reasonably strong anti-memory effect, meaning that
> there is an optimum amount of memory for any sort. This shows itself
> with the CPU time increasing during run forming, making this stage of
> the sort CPU bound.
> 
> c) Many sorts are performed prior to aggregation. It might be possible
> to aggregate prior to writing to disk, as a way of reducing the overall
> I/O cost. Benefit would occur when the total CPU cost was same no matter
> when aggregation occurred; that would not apply in all cases, so we
> would need to sense when benefit was possible.
> 
> d) Generally reducing the I/O cost of sorting may help the merging
> stages of a sort.
> 
> 
> SOLUTIONS
> 
> The ideas that Greg Stark, Jim Nasby, Heikki and myself have discussed
> to date were the following:
> 
> 1. Sort I/O Compression
> 2. Aggregation during Sort
> 3. Memory Pools
> 4. Dynamic Heap Management
> 5. Dynamic Run Handling
> 
> I've added (5) to the list as well, which hasn't yet been discussed.
> 
> 1. SORT I/O COMPRESSION
> 
> This idea is not dead yet, it just needs a full set of tests to confirm
> that there is benefit in all cases. If there's not benefit in all cases,
> we may be able to work out which cases those are, so we know when to use
> it.
> 
> 
> 2. AGGREGATION DURING SORT
> 
> Many sorts are preliminary steps before aggregation. Aggregation during
> run forming would potentially reduce size of heap and reduce number of
> comparisons. For many types of aggregate this would not theoretically
> increase the number of ops since sum(), avg(), min(), max() are all
> commutative according to their inputs. We would probably need to add
> another option to Aggregate Functions to indicate the possibility of
> calculating the aggregate in this way, since some aggregates might rely
> on the current situation that they expect all their inputs at once in
> sorted order. (Windowed aggregates are unlikely to be this way).
> 
> 
> 3. MEMORY POOLS
> 
> Solving a) could be done by sensible management and allocation of
> resources. Discussed before, so not rehashed here.
> 
> 
> 4. DYNAMIC HEAP MANAGEMENT
> 
> The size of the active heap required to produce the fewest number of
> runs varies as the sort progresses. For example, sorting an already
> sorted input needs a trivial heap size. 
> 
> Larger heap sizes simply avoid forming more runs, which is not
> necessarily a bad thing. More runs only become bad things when we go
> beyond our ability to perform a single final merge (see Dynamic Run
> Handling below).
> 
> Smaller heap sizes reduce the number of comparisons required, plus
> increase the L2+ cache efficiencies. Those two things are the cause of
> the anti-memory effect.
> 
> Because of b), optimising the size of the heap could potentially be a
> good thing. This can make a considerable difference for nearly sorted
> data (measurements required...).
> 
> When we have M amount of memory available to us, we don't start by using
> it all. We start with m memory and only increase up to M if required.
> Runs are built with memory set at m. If a tuple arrives that would force
> the formation of a new run we assess
> 
> i) do we care if another run is formed? Use our knowledge of the likely
> amount of data coming our way, compared with number of runs formed so
> far and see if we really care. If we don't care, allow the new run to be
> formed and carry on with just heap size of m. (see Dynamic Run Handling
> later).
> 
> ii) if we do care about number of runs, then allow the heap to grow by
> increments up to the full size of M. Increments would be at least x2 and
> possibly x4. That way we always have work space to rearrange the heap.
> 
> All of this dances too cleverly around the exact technique and potential
> costs of rearranging the heap. That is not to be ignored and is the next
> task in evaluating and accepting/dismissing this potential technique.
> 
> In combination with memory pooling this technique might also allow
> memory to be better distributed to other users.
> 
> 
> 5. DYNAMIC RUN HANDLING (in Final Merge)
> 
> Another way of addressing a) is to simply make better use of memory
> itself. Let's look at that in more detail:
> 
> Number of runs that can be merged at once is currently fixed, based upon
> available memory. This has the underlying assumption that all runs will
> be concurrently active during final merging, which may not always be
> true.
> 
> If we have random data then almost all runs will overlap with all other
> runs, i.e. the min and max values are sufficiently wide that the runs do
> all overlap. In many cases, data arrives in somewhat sorted order, e.g.
> financial data is fairly regular with some late payers but not many, and
> those trail off with a fairly tight decay. In the somewhat sorted case
> we find that the actual overlap is less than total, so there are many
> later runs that don't overlap the earlier ones. In the best case we
> would find run 1 and 2 overlap, runs 2 and 3 overlap, then 3 and 4
> overlap.
> 
> This is also the point where I suggest breaking away from Knuth
> completely. All of the main algorithms described by Knuth are tape
> sorts. A run is written to a particular tape and then stays there until
> "moved" to another tape. That means we have to get super-clever about
> how runs should be written and formed (see Knuth). If we realise that
> the runs aren't fixed to particular tapes they are all just independent
> runs, we can radically rethink sorting. There is no need to implement
> Cascade Sort, but we do need to rethink merging from the ground up. (All
> of which is a relief, because Knuth et al are definitely smarter than
> me, but I've got disks and lots of memory and those guys had tapes.).
> 
> If we track the min and max values for each run, when run building is
> finished we will be able to build a merging plan that allows us to be
> smart about the runs we should bring together. We start with the run
> with the lowest min value, as well as all runs that overlap that run.
> When that run is exhausted we move to the next lowest and at that point
> start merging all runs that overlap that one. 
> 
> This then means we may be able to begin final merging with more runs
> than the current cut-off. It's possible that we could merge an infinite
> number of runs in final merge with fixed memory. If we *do* need to
> merge we can work out which runs should be our best pre-merge
> candidates, based upon how big they are and which other runs they
> overlap. (That's much better than being forced to merge tapes 2, 7 and
> 17 because some bizarre math says so (see Knuth).)
> 
> Anyway, claiming to have found a better way than Knuth makes me feel a
> little nervous, so some searching questions on this are very welcome.
> 
> Interestingly, if we combine this technique with dynamic heap management
> we may be able to allow a very large number of efficiently written runs
> to form without it causing any merging.
> 
> mac_man recently noted the possibility that some runs don't overlap at
> all and so can be merged for free. That's true, though doesn't actually
> improve the basic idea here which is building a merge plan after runs
> have been formed, with an eye on minimizing and potentially eliminating
> the merge phase.
> 
> There's probably some typos or thinkos above, so go easy on me Greg!
> They aren't there because I want to skim over anything.
> 
> I'm not likely to get a chance to do all of this in the near future, so
> documenting it now should help others to carry things forward.
> 
> -- 
>   Simon Riggs
>   2ndQuadrant  http://www.2ndQuadrant.com
> 
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 3: Have you checked our extensive FAQ?
> 
>                http://www.postgresql.org/docs/faq

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +