Thread: Using quicksort for every external sort run

Using quicksort for every external sort run

From

Peter Geoghegan

Date:

20 August 2015, 02:24:38

I'll start a new thread for this, since my external sorting patch has
now evolved well past the original "quicksort with spillover"
idea...although not quite how I anticipated it would. It seems like
I've reached a good point to get some feedback.

I attach a patch series featuring a new, more comprehensive approach
to quicksorting runs during external sorts. What I have now still
includes "quicksort with spillover", but it's just a part of a larger
project. I am quite happy with the improvements in performance shown
by my testing, which I go into below.

Controversy
=========

A few weeks ago, I did not anticipate that I'd propose that
replacement selection sort be used far less (only somewhat less, since
I was only somewhat doubtful about the algorithm at the time). I had
originally planned on continuing to *always* use it for the first run,
to make "quicksort with spillover" possible (thereby sometimes
avoiding significant I/O by not spilling most tuples), but also to
make cases always considered sympathetic to replacement selection
continue to happen. I thought that second or subsequent runs could
still be quicksorted, but that I must still care about this latter
category, the traditional sympathetic cases. This latter category is
mostly just one important property of replacement selection: even
without a strong logical/physical correlation, the algorithm tends to
produce runs that are about twice the size of work_mem. (It's also
notable that replacement selection only produces one run with mostly
presorted input, even where input far exceeds work_mem, which is a
neat trick.)

I wanted to avoid controversy, but the case for the controversy is too
strong for me to ignore: despite these upsides, replacement selection
is obsolete, and should usually be avoided.

Replacement selection sort still has a role to play in making
"quicksort with spillover" possible (when a sympathetic case is
*anticipated*), but other than that it seems generally inferior to a
simple hybrid sort-merge strategy on modern hardware. By modern
hardware, I mean anything manufactured in at least the last 20 years.
We've already seen that the algorithm's use of a heap works badly with
modern CPU caches, but that is just one factor contributing to its
obsolescence.

The big selling point of replacement selection sort in the 20th
century was that it sometimes avoided multi-pass sorts as compared to
a simple sort-merge strategy (remember when tuplesort.c always used 7
tapes? When you need to use 7 actual magnetic tapes, rewinding is
expensive and in general this matters a lot!). We all know that memory
capacity has grown enormously since then, but we must also consider
another factor: At the same time, a simple hybrid sort-merge
strategy's capacity to more or less get the important detail here
right -- to avoid a multi-pass sort -- has increased quadratically
(relative to work_mem/memory capacity). As an example, testing shows
that for a datum tuplesort that requires about 2300MB of work_mem to
be completed as a simple internal sort this patch only needs 30MB to
just do one pass (see benchmark query below). I've mostly regressed
that particular property of tuplesort (it used to be less than 30MB),
but that's clearly the wrong thing to worry about for all kinds of
reasons, probably even in the unimportant cases now forced to do
multiple passes.

Multi-pass sorts
---------------------

I believe, in general, that we should consider a multi-pass sort to be
a kind of inherently suspect thing these days, in the same way that
checkpoints occurring 5 seconds apart are: not actually abnormal, but
something that we should regard suspiciously. Can you really not
afford enough work_mem to only do one pass? Does it really make sense
to add far more I/O and CPU costs to avoid that other tiny memory
capacity cost?

In theory, the answer could be "yes", but it seems highly unlikely.
Not only is very little memory required to avoid a multi-pass merge
step, but as described above the amount required grows very slowly
relative to linear growth in input. I propose to add a
checkpoint_warning style warning (with a checkpoint_warning style GUC
to control it). ISTM that these days, multi-pass merges are like
saving $2 on replacing a stairwell light bulb, at the expense of
regularly stumbling down the stairs in the dark. It shouldn't matter
if you have a 50 terabyte decision support database or if you're
paying Heroku a small monthly fee to run a database backing your web
app: simply avoiding multi-pass merges is probably always the most
economical solution, and by a wide margin.

Note that I am not skeptical of polyphase merging itself, even though
it is generally considered to be a complimentary technique to
replacement selection (some less formal writing on external sorting
seemingly fails to draw a sharp distinction). Nothing has changed
there.

Patch, performance
===============

Let's focus on a multi-run sort, that does not use "quicksort with
spillover", since that is all new, and is probably the most compelling
case for very large databases with hundreds of gigabytes of data to
sort.

I think that this patch requires a machine with more I/O bandwidth
than my laptop to get a proper sense of the improvement made. I've
been using a tmpfs temp_tablespace for testing, to simulate this. That
may leave me slightly optimistic about I/O costs, but you can usually
get significantly more sequential I/O bandwidth by adding additional
disks, whereas you cannot really buy new hardware to improve the
situation with excessive CPU cache misses.

Benchmark
---------------

-- Setup, 100 million tuple table with high cardinality int4 column (2
billion possible int4 values)
create table big_high_cardinality_int4 as
  select (random() * 2000000000)::int4 s,
  'abcdefghijlmn'::text junk
  from generate_series(1, 100000000);
-- Make cost model hinting accurate:
analyze big_high_cardinality_int4;
checkpoint;

Let's start by comparing an external sort that uses 1/3 the memory of
an internal sort against the master branch.  That's completely unfair
on the patch, of course, but it is a useful indicator of how well
external sorts do overall. Although an external sort surely cannot be
as fast as an internal sort, it might be able to approach an internal
sort's speed when there is plenty of I/O bandwidth. That's a good
thing to aim for, I think.

-- Master (just enough memory for internal sort):
set work_mem = '2300MB';
select count(distinct(s)) from big_high_cardinality;

***** Runtime after stabilization: ~33.6 seconds *****

-- Patch series, but with just over 1/3 the memory:
set work_mem = '800MB';
select count(distinct(s)) from big_high_cardinality;

***** Runtime after stabilization: ~37.1 seconds *****

The patch only takes ~10% more time to execute this query, which seems
very good considering that ~1/3 the work_mem has been put to use.

trace_sort output for patch during execution of this case:

LOG:  begin datum sort: workMem = 819200, randomAccess = f
LOG:  switching to external sort with 2926 tapes: CPU 0.39s/2.66u sec
elapsed 3.06 sec
LOG:  replacement selection avg tuple size 24.00 crossover: 0.85
LOG:  hybrid sort-merge in use from row 34952532 with 100000000.00 total rows
LOG:  finished quicksorting run 1: CPU 0.39s/8.84u sec elapsed 9.24 sec
LOG:  finished writing quicksorted run 1 to tape 0: CPU 0.60s/9.61u
sec elapsed 10.22 sec
LOG:  finished quicksorting run 2: CPU 0.87s/18.61u sec elapsed 19.50 sec
LOG:  finished writing quicksorted run 2 to tape 1: CPU 1.07s/19.38u
sec elapsed 20.46 sec
LOG:  performsort starting: CPU 1.27s/21.79u sec elapsed 23.07 sec
LOG:  finished quicksorting run 3: CPU 1.27s/27.07u sec elapsed 28.35 sec
LOG:  finished writing quicksorted run 3 to tape 2: CPU 1.47s/27.69u
sec elapsed 29.18 sec
LOG:  performsort done (except 3-way final merge): CPU 1.51s/28.54u
sec elapsed 30.07 sec
LOG:  external sort ended, 146625 disk blocks used: CPU 1.76s/35.32u
sec elapsed 37.10 sec

Note that the on-tape runs are small relative to CPU costs, so this
query is a bit sympathetic (consider the time spent writing batches
that trace_sort indicates here). CREATE INDEX would not compare so
well with an internal sort, for example, especially if it was a
composite index or something. I've sized work_mem here in a deliberate
way, to make sure there are 3 runs of similar size by the time the
merge step is reached, which makes a small difference in the patch's
favor. All told, this seems like a very significant overall
improvement.

Now, consider master's performance with the same work_mem setting (a
fair test with comparable resource usage for master and patch):

-- Master
set work_mem = '800MB';
select count(distinct(s)) from big_high_cardinality;

***** Runtime after stabilization: ~120.9 seconds *****

The patch is ~3.25x faster than master here, which also seems like a
significant improvement. That's pretty close to the improvement
previously seen for good "quicksort with spillover" cases, but
suitable for every external sort case that doesn't use "quicksort with
spillover". In other words, no variety of external sort is not
significantly improved by the patch.

I think it's safe to suppose that there are also big benefits when
multiple concurrent sort operations run on the same system. For
example, when pg_restore has multiple jobs.

Worst case
---------------

Even with a traditionally sympathetic case for replacement selection
sort, the patch beats replacement selection with multiple on-tape
runs. When experimenting here, I did not forget to account for our
qsort()'s behavior in the event of *perfectly* presorted input
("Bubble sort best case" behavior [1]). Other than that, I have a hard
time thinking of an unsympathetic case for the patch, and could not
find any actual regressions with a fair amount of effort.

Abbreviated keys are not used when merging, but that doesn't seem to
be something that notably counts against the new approach (which will
have shorter runs on average). After all, the reason why abbreviated
keys aren't saved on disk for merging is that they're probably not
very useful when merging. They would resolve far fewer comparisons if
they were used during merging, and having somewhat smaller runs does
not result in significantly more non-abbreviated comparisons, even
when sorting random noise strings.

Avoiding replacement selection *altogether*
=================================

Assuming you agree with my conclusions on replacement selection sort
mostly not being worth it, we need to avoid replacement selection
except when it'll probably allow a "quicksort with spillover". In my
mind, that's now the *only* reason to use replacement selection.
Callers pass a hint to tuplesort indicating how many tuples it is
estimated will ultimately be passed before a sort is performed.
(Typically, this comes from a scan plan node's row estimate, or more
directly from the relcache for things like CREATE INDEX.)

Cost model -- details
----------------------------

Second or subsequent runs *never* use replacement selection -- it is
only *considered* for the first run, right before the possible point
of initial heapification within inittapes(). The cost model is
contained within the new function useselection(). See the second patch
in the series for full details. That's where this is added.

I have a fairly high bar for even using replacement selection for the
first run -- several factors can result in a simple hybrid sort-merge
strategy being used instead of a "quicksort with spillover", because
in general most of the benefit seems to be around CPU cache misses
rather than savings in I/O. Consider my benchmark query above once
more -- with replacement selection used for the first run in the
benchmark case above (e.g., with just the first patch in the series
applied, or setting the "optimize_avoid_selection" debug GUC to
"off"), I found that it took over twice as long to execute, even
though the second-or-subsequent (now smaller) runs were quicksorted
just the same, and were all merged just the same.

The numbers should make it obvious why I gave in to the temptation of
adding an ad-hoc, tuplesort-private cost model. At this point, I'd
rather scrap "quicksort with spillover" (and the use of replacement
selection under all possible circumstances) than scrap the idea of a
cost model. That would make more sense, even though it would give up
on the idea of saving most I/O where the work_mem threshold is only
crossed by a small amount.

Future work
=========

I anticipate a number of other things within the first patch in the
series, some of which are already worked out to some degree.

Asynchronous I/O
-------------------------

This patch leaves open the possibility of using something like
libaio/librt for sorting. That would probably use half of memtuples as
scratch space, while the other half is quicksorted.

Memory prefetching
---------------------------

To test what role memory prefetching is likely to have here, I attach
a custom version of my tuplesort/tuplestore prefetch patch, with
prefetching added to the "quicksort with spillover" and batch dumping
runs WRITETUP()-calling code. This seems to help performance
measurably. However, I guess it shouldn't really be considered as part
of this patch. It can follow the initial commit of the big, base patch
(or will becomes part of the base patch if and when prefetching is
committed first).

cost_sort() changes
--------------------------

I had every intention of making cost_sort() a continuous cost function
as part of this work. This could be justified by "quicksort with
spillover" allowing tuplesort to "blend" from internal to external
sorting as input size is gradually increased. This seemed like
something that would have significant non-obvious benefits in several
other areas. However, I've put off dealing with making any change to
cost_sort() because of concerns about the complexity of overlaying
such changes on top of the tuplesort-private cost model.

I think that this will need to be discussed in a lot more detail. As a
further matter, materialization of sort nodes will probably also
require tweaks to the costing for "quicksort with spillover". Recall
that "quicksort with spillover" can only work for !randomAccess
tuplesort callers.

Run size
------------

This patch continues to have tuplesort determine run size based on the
availability of work_mem only. It does not entirely fix the problem of
having work_mem sizing impact performance in counter-intuitive ways.
In other words, smaller work_mem sizes can still be faster. It does
make that general situation much better, though, because quicksort is
a cache oblivious algorithm. Smaller work_mem sizes are sometimes a
bit faster, but never dramatically faster.

In general, the whole idea of making run size as big as possible is
bogus, unless that enables or is likely to enable a "quicksort with
spillover". The caller-supplied row count hint I've added may in the
future be extended to determine optimal run size ahead of time, when
it's perfectly clear (leaving aside misestimation) that a fully
internal sort (or "quicksort with spillover") will not occur. This
will result in faster external sorts where additional work_mem cannot
be put to good use. As a side benefit, external sorts will not be
effectively wasting a large amount of memory.

The cost model we eventually come up with to determine optimal run
size ought to balance certain things. Assuming a one-pass merge step,
then we should balance the time lost waiting on the first run and time
quicksorting the last run with the gradual increase in the cost during
the merge step. Maybe the non-use of abbreviated keys during the merge
step should also be considered. Alternatively, the run size may be
determined by a GUC that is typically sized at drive controller cache
size (e.g. 1GB) when any kind of I/O avoidance for the sort appears
impossible.

[1] Commit a3f0b3d6
--
Peter Geoghegan

On 20 August 2015 at 03:24, Peter Geoghegan <pg@heroku.com> wrote:

The patch is ~3.25x faster than master

I've tried to read this post twice and both times my work_mem overflowed. ;-)

Can you summarize what this patch does? I understand clearly what it doesn't do...

Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Using quicksort for every external sort run

From

Peter Geoghegan

Date:

20 August 2015, 17:13:20

On Thu, Aug 20, 2015 at 6:54 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Greg Stark <stark@mit.edu> writes:
>> Alternately, has anyone tested whether Timsort would work well?
>
> I think that was proposed a few years ago and did not look so good
> in simple testing.

I tested it in 2012. I got as far as writing a patch.

Timsort is very good where comparisons are expensive -- that's why
it's especially compelling when your comparator is written in Python.
However, when testing it with text, even though there were
significantly fewer comparisons, it was still slower than quicksort.
Quicksort is cache oblivious, and that's an enormous advantage. This
was before abbreviated keys; these days, the difference must be
larger.

-- 
Peter Geoghegan

Re: Using quicksort for every external sort run

From

Peter Geoghegan

Date:

20 August 2015, 17:41:44

On Thu, Aug 20, 2015 at 8:15 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 20 August 2015 at 03:24, Peter Geoghegan <pg@heroku.com> wrote:
>>
>>
>> The patch is ~3.25x faster than master
>
>
> I've tried to read this post twice and both times my work_mem overflowed.
> ;-)
>
> Can you summarize what this patch does? I understand clearly what it doesn't
> do...

The most important thing that it does is always quicksort runs, that
are formed by simply filling work_mem with tuples in no particular
order, rather than trying to make runs that are twice as large as
work_mem on average. That's what the ~3.25x improvement concerned.
That's actually a significantly simpler algorithm than replacement
selection, and appears to be much faster. You might even say that it's
a dumb algorithm, because it is less sophisticated than replacement
selection. However, replacement selection tends to use CPU caches very
poorly, while its traditional advantages have become dramatically less
important due to large main memory sizes in particular. Also, it hurts
that we don't currently dump tuples in batches, for several reasons.
Better to do memory intense operations in batch, rather than having a
huge inner loop, in order to minimize or prevent instruction cache
misses. And we can better take advantage of asynchronous I/O.

The complicated aspect of considering the patch is whether or not it's
okay to not use replacement selection anymore -- is that an
appropriate trade-off?

The reason that the code has not actually been simplified by this
patch is that I still want to use replacement selection for one
specific case: when it is anticipated that a "quicksort with
spillover" can occur, which is only possible with incremental
spilling. That may avoid most I/O, by spilling just a few tuples using
a heap/priority queue, and quicksorting everything else. That's
compelling when you can manage it, but no reason to always use
replacement selection for the first run in the common case where there
well be several runs in total.

Is that any clearer? To borrow a phrase from the processor
architecture community, from a high level this is a "Brainiac versus
Speed Demon" [1] trade-off. (I wish that there was a widely accepted
name for this trade-off.)

[1] http://www.lighterra.com/papers/modernmicroprocessors/#thebrainiacdebate
-- 
Peter Geoghegan

Re: Using quicksort for every external sort run

From

Feng Tian

Date:

20 August 2015, 19:42:54

On Thu, Aug 20, 2015 at 10:41 AM, Peter Geoghegan <pg@heroku.com> wrote:

On Thu, Aug 20, 2015 at 8:15 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 20 August 2015 at 03:24, Peter Geoghegan <pg@heroku.com> wrote:
>>
>>
>> The patch is ~3.25x faster than master
>
>
> I've tried to read this post twice and both times my work_mem overflowed.
> ;-)
>
> Can you summarize what this patch does? I understand clearly what it doesn't
> do...

The most important thing that it does is always quicksort runs, that
are formed by simply filling work_mem with tuples in no particular
order, rather than trying to make runs that are twice as large as
work_mem on average. That's what the ~3.25x improvement concerned.
That's actually a significantly simpler algorithm than replacement
selection, and appears to be much faster. You might even say that it's
a dumb algorithm, because it is less sophisticated than replacement
selection. However, replacement selection tends to use CPU caches very
poorly, while its traditional advantages have become dramatically less
important due to large main memory sizes in particular. Also, it hurts
that we don't currently dump tuples in batches, for several reasons.
Better to do memory intense operations in batch, rather than having a
huge inner loop, in order to minimize or prevent instruction cache
misses. And we can better take advantage of asynchronous I/O.

The complicated aspect of considering the patch is whether or not it's
okay to not use replacement selection anymore -- is that an
appropriate trade-off?

The reason that the code has not actually been simplified by this
patch is that I still want to use replacement selection for one
specific case: when it is anticipated that a "quicksort with
spillover" can occur, which is only possible with incremental
spilling. That may avoid most I/O, by spilling just a few tuples using
a heap/priority queue, and quicksorting everything else. That's
compelling when you can manage it, but no reason to always use
replacement selection for the first run in the common case where there
well be several runs in total.

Is that any clearer? To borrow a phrase from the processor
architecture community, from a high level this is a "Brainiac versus
Speed Demon" [1] trade-off. (I wish that there was a widely accepted
name for this trade-off.)

[1] http://www.lighterra.com/papers/modernmicroprocessors/#thebrainiacdebate
--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Hi, Peter,

Just a quick anecdotal evidence. I did similar experiment about three years ago. The conclusion was that if you have SSD, just do quick sort and forget the longer runs, but if you are using hard drives, longer runs is the winner (and safer, to avoid cliffs). I did not experiment with RAID0/5 on many spindles though.

Not limited to sort, more generally, SSD is different enough from HDD, therefore it may worth the effort for backend to "guess" what storage device it has, then choose the right thing to do.

Cheers.

Re: Using quicksort for every external sort run

From

Peter Geoghegan

Date:

20 August 2015, 20:16:41

On Thu, Aug 20, 2015 at 12:42 PM, Feng Tian <ftian@vitessedata.com> wrote:
> Just a quick anecdotal evidence.  I did similar experiment about three years
> ago.   The conclusion was that if you have SSD, just do quick sort and
> forget the longer runs, but if you are using hard drives, longer runs is the
> winner (and safer, to avoid cliffs).    I did not experiment with RAID0/5 on
> many spindles though.
>
> Not limited to sort, more generally, SSD is different enough from HDD,
> therefore it may worth the effort for backend to "guess" what storage device
> it has, then choose the right thing to do.

The devil is in the details. I cannot really comment on such a general
statement.

I would be willing to believe that that's true under
unrealistic/unrepresentative conditions. Specifically, when multiple
passes are required with a sort-merge strategy where that isn't the
case with replacement selection. This could happen with a tiny
work_mem setting (tiny in an absolute sense more than a relative
sense). With an HDD, where sequential I/O is so much faster, this
could be enough to make replacement selection win, just as it would
have in the 1970s with magnetic tapes.

As I've said, the solution is to simply avoid multiple passes, which
should be possible in virtually all cases because of the quadratic
growth in a classic hybrid sort-merge strategy's capacity to avoid
multiple passes (growth relative to work_mem's growth). Once you
ensure that, then you probably have a mostly I/O bound workload, which
can be made faster by adding sequential I/O capacity (or, on the
Postgres internals side, adding asynchronous I/O, or with memory
prefetching). You cannot really buy a faster CPU to make a degenerate
heapsort faster.

-- 
Peter Geoghegan

Re: Using quicksort for every external sort run

From

Feng Tian

Date:

20 August 2015, 20:28:37

On Thu, Aug 20, 2015 at 1:16 PM, Peter Geoghegan <pg@heroku.com> wrote:

On Thu, Aug 20, 2015 at 12:42 PM, Feng Tian <ftian@vitessedata.com> wrote:
> Just a quick anecdotal evidence. I did similar experiment about three years
> ago. The conclusion was that if you have SSD, just do quick sort and
> forget the longer runs, but if you are using hard drives, longer runs is the
> winner (and safer, to avoid cliffs). I did not experiment with RAID0/5 on
> many spindles though.
>
> Not limited to sort, more generally, SSD is different enough from HDD,
> therefore it may worth the effort for backend to "guess" what storage device
> it has, then choose the right thing to do.

The devil is in the details. I cannot really comment on such a general
statement.

I would be willing to believe that that's true under
unrealistic/unrepresentative conditions. Specifically, when multiple
passes are required with a sort-merge strategy where that isn't the
case with replacement selection. This could happen with a tiny
work_mem setting (tiny in an absolute sense more than a relative
sense). With an HDD, where sequential I/O is so much faster, this
could be enough to make replacement selection win, just as it would
have in the 1970s with magnetic tapes.

As I've said, the solution is to simply avoid multiple passes, which
should be possible in virtually all cases because of the quadratic
growth in a classic hybrid sort-merge strategy's capacity to avoid
multiple passes (growth relative to work_mem's growth). Once you
ensure that, then you probably have a mostly I/O bound workload, which
can be made faster by adding sequential I/O capacity (or, on the
Postgres internals side, adding asynchronous I/O, or with memory
prefetching). You cannot really buy a faster CPU to make a degenerate
heapsort faster.

--
Peter Geoghegan

Agree everything in principal,except one thing -- no, random IO on HDD in 2010s (relative to CPU/Memory/SSD), is not any faster than tape in 1970s. :-)

Re: Using quicksort for every external sort run

From

Peter Geoghegan

Date:

20 August 2015, 20:31:22

On Thu, Aug 20, 2015 at 1:28 PM, Feng Tian <ftian@vitessedata.com> wrote:
> Agree everything in principal,except one thing -- no, random IO on HDD in
> 2010s (relative to CPU/Memory/SSD), is not any faster than tape in 1970s.
> :-)

Sure. The advantage of replacement selection could be a deciding
factor in unrepresentative cases, as I mentioned, but even then it's
not going to be a dramatic difference as it would have been in the
past.

By the way, please don't top-post.

-- 
Peter Geoghegan

Re: Using quicksort for every external sort run

From

Peter Geoghegan

Date:

20 August 2015, 22:16:55

On Thu, Aug 20, 2015 at 6:05 AM, Greg Stark <stark@mit.edu> wrote:
> On Thu, Aug 20, 2015 at 3:24 AM, Peter Geoghegan <pg@heroku.com> wrote:
>> I believe, in general, that we should consider a multi-pass sort to be
>> a kind of inherently suspect thing these days, in the same way that
>> checkpoints occurring 5 seconds apart are: not actually abnormal, but
>> something that we should regard suspiciously. Can you really not
>> afford enough work_mem to only do one pass? Does it really make sense
>> to add far more I/O and CPU costs to avoid that other tiny memory
>> capacity cost?
>
> I think this is the crux of the argument. And I think you're
> basically, but not entirely, right.

I agree that that's the crux of my argument. I disagree about my not
being entirely right.  :-)

> The key metric there is not how cheap memory has gotten but rather
> what the ratio is between the system's memory and disk storage. The
> use case I think you're leaving out is the classic "data warehouse"
> with huge disk arrays attached to a single host running massive
> queries for hours. In that case reducing run size will reduce I/O
> requirements directly and halving the amount of I/O sort takes will
> halve the time it takes regardless of cpu efficiency. And I have a
> suspicion typical data distributions get much better than a 2x
> speedup.

It could reduce seek time, which might be the dominant cost (but not
I/O as such). I do accept that my argument did not really apply to
this case, but you seem to be making an additional non-conflicting
argument that certain data warehousing cases would be helped in
another way by my patch. My argument was only about multi-gigabyte
cases that I tested that were significantly improved, primarily due to
CPU caching effects. If this helps with extremely large sorts that do
require multiple passes by reducing seek time -- I think that they'd
have to be multi-terabyte sorts, which I am ill-equipped to test --
then so much the better, I suppose.

In any case, as I've said the way we allow run size to be dictated
only by available memory (plus whatever replacement selection can do
to make on-tape runs longer) is bogus. In the future there should be a
cost model for an optimal run size, too.

> But I think you're basically right that this is the wrong use case to
> worry about for most users. Even those users that do have large batch
> queries are probably not processing so much that they should be doing
> multiple passes. The ones that do are are probably more interested in
> parallel query, federated databases, column stores, and so on rather
> than worrying about just how many hours it takes to sort their
> multiple terabytes on a single processor.

I suppose so. If you can afford multiple terabytes of storage, you can
probably still afford gigabytes of memory to do a single pass. My
laptop is almost 3 years old, weighs about 1.5 Kg, and has 16 GiB of
memory. It's usually always that simple, and not really because we
assume that Postgres doesn't have to deal with multi-terabyte sorts.
Maybe I lack perspective, having never really dealt with a real data
warehouse. I didn't mean to imply that in no circumstances could
anyone profit from a multi-pass sort. If you're using Hadoop or
something, I imagine that it still makes sense.

In general, I think you'll agree that we should strongly leverage the
fact that a multi-pass sort just isn't going to be needed when things
are set up correctly under standard operating conditions nowadays.

> I am quite suspicious of quicksort though. It has O(n^2) worst case
> and I think it's only a matter of time before people start worrying
> about DOS attacks from users able to influence the data ordering. It's
> also not very suitable for GPU processing. Quicksort gets most of its
> advantage from cache efficiency, it isn't a super efficient algorithm
> otherwise, are there not other cache efficient algorithms to consider?

I think that high quality quicksort implementations [1] will continue
to be the way to go for sorting integers internally at the very least.
Practically speaking, problems with the worst case performance have
been completely ironed out since the early 1990s. I think it's
possible to DOS Postgres by artificially introducing a worst-case, but
it's very unlikely to be the easiest way of doing that in practice. I
admit that it's probably the coolest way, though.

I think that the benefits of offloading sorting to the GPU are not in
evidence today. This may be especially true of a "street legal"
implementation that takes into account all of the edge cases, as
opposed to a hand customized thing for sorting uniformly distributed
random integers. GPU sorts tend to use radix sort, and I just can't
see that catching on.

[1] https://www.cs.princeton.edu/~rs/talks/QuicksortIsOptimal.pdf
-- 
Peter Geoghegan

Re: Using quicksort for every external sort run

From

Greg Stark

Date:

21 August 2015, 00:03:36

On Thu, Aug 20, 2015 at 11:16 PM, Peter Geoghegan <pg@heroku.com> wrote:
> It could reduce seek time, which might be the dominant cost (but not
> I/O as such).

No I didn't quite follow the argument to completion. Increasing the
run size is a win if it reduces the number of passes. In the
single-pass case it has to read all the data once, write it all out to
tapes, then read it all back in again.So 3x the data. If it's still
not sorted it
needs to write it all back out yet again and read it all back in
again. So 5x the data. If the tapes are larger it can avoid that 66%
increase in total I/O. In large data sets it can need 3, 4, or maybe
more passes through the data and saving one pass would be a smaller
incremental difference. I haven't thought through the exponential
growth carefully enough to tell if doubling the run size should
decrease the number of passes linearly or by a constant number.

But you're right that seems to be less and less a realistic scenario.
Times when users are really processing data sets that large nowadays
they'll just throw it into Hadoop or Biigquery or whatever to get the
parallelism of many cpus. Or maybe Citus and the like.

The main case where I expect people actually run into this is in
building indexes, especially for larger data types (which come to
think of it might be exactly where the comparison is expensive enough
that quicksort's cache efficiency isn't helpful).

But to do fair tests I would suggest you configure work_mem smaller
(since running tests on multi-terabyte data sets is a pain) and sort
some slower data types that don't fit in memory. Maybe arrays of text
or json?

-- 
greg

Re: Using quicksort for every external sort run

From

Peter Geoghegan

Date:

21 August 2015, 01:43:04

On Thu, Aug 20, 2015 at 5:02 PM, Greg Stark <stark@mit.edu> wrote:
> I haven't thought through the exponential
> growth carefully enough to tell if doubling the run size should
> decrease the number of passes linearly or by a constant number.

It seems that with 5 times the data that previously required ~30MB to
avoid a multi-pass sort (where ~2300MB is required for an internal
sort -- the benchmark query), it took ~60MB to avoid a multi-pass
sort. I guess I just didn't exactly determine either threshold due to
that taking too long, and that as predicted, every time the input size
quadruples, the required amount of work_mem to avoid multiple passes
only doubles. That will need to be verified more vigorously, but it
looks that way.

> But you're right that seems to be less and less a realistic scenario.
> Times when users are really processing data sets that large nowadays
> they'll just throw it into Hadoop or Biigquery or whatever to get the
> parallelism of many cpus. Or maybe Citus and the like.

I'm not sure that even that's generally true, simply because sorting a
huge amount of data is very expensive -- it's not really a "big data"
thing, so to speak. Look at recent results on this site:

http://sortbenchmark.org

Last year's winning "Gray" entrant, TritonSort, uses a huge parallel
cluster of 186 machines, but only sorts 100TB. That's just over 500GB
per node. Each node is a 32 core Intel Xeon EC2 instance with 244GB
memory, and lots of SSDs. It seems like the point of the 100TB minimum
rule in the "Gray" contest category is that that's practically
impossible to fit entirely in memory (to avoid merging).

Eventually, linearithmic growth becomes extremely painful, not matter
how much processing power you have. It takes a while, though.

-- 
Peter Geoghegan

Re: Using quicksort for every external sort run

From

Simon Riggs

Date:

21 August 2015, 06:56:30

On 20 August 2015 at 18:41, Peter Geoghegan <pg@heroku.com> wrote:

On Thu, Aug 20, 2015 at 8:15 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 20 August 2015 at 03:24, Peter Geoghegan <pg@heroku.com> wrote:
>>
>>
>> The patch is ~3.25x faster than master
>
>
> I've tried to read this post twice and both times my work_mem overflowed.
> ;-)
>
> Can you summarize what this patch does? I understand clearly what it doesn't
> do...

The most important thing that it does is always quicksort runs, that
are formed by simply filling work_mem with tuples in no particular
order, rather than trying to make runs that are twice as large as
work_mem on average. That's what the ~3.25x improvement concerned.
That's actually a significantly simpler algorithm than replacement
selection, and appears to be much faster.

Then I think this is fine, not least because it seems like a first step towards parallel sort.

This will give more runs, so merging those needs some thought. It will also give a more predictable number of runs, so we'll be able to predict any merging issues ahead of time. We can more easily find out the min/max tuple in each run, so we only merge overlapping runs.

You might even say that it's
a dumb algorithm, because it is less sophisticated than replacement
selection. However, replacement selection tends to use CPU caches very
poorly, while its traditional advantages have become dramatically less
important due to large main memory sizes in particular. Also, it hurts
that we don't currently dump tuples in batches, for several reasons.
Better to do memory intense operations in batch, rather than having a
huge inner loop, in order to minimize or prevent instruction cache
misses. And we can better take advantage of asynchronous I/O.

The complicated aspect of considering the patch is whether or not it's
okay to not use replacement selection anymore -- is that an
appropriate trade-off?

Using a heapsort is known to be poor for large heaps. We previously discussed the idea of quicksorting the first chunk of memory, then reallocating the heap as a smaller chunk for the rest of the sort. That would solve the cache miss problem.

I'd like to see some discussion of how we might integrate aggregation and sorting. A heap might work quite well for that, whereas quicksort doesn't sound like it would work as well.

The reason that the code has not actually been simplified by this
patch is that I still want to use replacement selection for one
specific case: when it is anticipated that a "quicksort with
spillover" can occur, which is only possible with incremental
spilling. That may avoid most I/O, by spilling just a few tuples using
a heap/priority queue, and quicksorting everything else. That's
compelling when you can manage it, but no reason to always use
replacement selection for the first run in the common case where there
well be several runs in total.

I think its premature to retire that algorithm - I think we should keep it for a while yet. I suspect it may serve well in cases where we have low memory, though I accept that is no longer the case for larger servers that we would now call typical.

This could cause particular issues in optimization, since heap sort is wonderfully predictable. We'd need a cost_sort() that was slightly pessimistic to cover the risk that a quicksort might not be as fast as we hope.

Is that any clearer?

Yes, thank you.

I'd like to see a more general and concise plan for how sorting evolves. We are close to having the infrastructure to perform intermediate aggregation, which would allow that to happen during sorting when required (aggregation, sort distinct). We also agreed some time back that parallel sorting would be the first incarnation of parallel operations, so we need to consider that also.

Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Using quicksort for every external sort run

From

Peter Geoghegan

Date:

21 August 2015, 17:55:12

On Thu, Aug 20, 2015 at 11:56 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> This will give more runs, so merging those needs some thought. It will also
> give a more predictable number of runs, so we'll be able to predict any
> merging issues ahead of time. We can more easily find out the min/max tuple
> in each run, so we only merge overlapping runs.

I think that merging runs can be optimized to reduce the number of
cache misses. Poul-Henning Kamp, the FreeBSD guy, has described
problems with binary heaps and cache misses [1], and I think we could
use his solution for merging. But we should definitely still quicksort
runs.

> Using a heapsort is known to be poor for large heaps. We previously
> discussed the idea of quicksorting the first chunk of memory, then
> reallocating the heap as a smaller chunk for the rest of the sort. That
> would solve the cache miss problem.
>
> I'd like to see some discussion of how we might integrate aggregation and
> sorting. A heap might work quite well for that, whereas quicksort doesn't
> sound like it would work as well.

If you're talking about deduplicating within tuplesort, then there are
techniques. I don't know that that needs to be an up-front priority of
this work.

> I think its premature to retire that algorithm - I think we should keep it
> for a while yet. I suspect it may serve well in cases where we have low
> memory, though I accept that is no longer the case for larger servers that
> we would now call typical.

I have given one case where I think the first run should still use
replacement selection: where that enables a "quicksort with
spillover". For that reason, I would consider that I have not actually
proposed to retire the algorithm. In principle, I agree with also
using it under any other circumstances where it is likely to be
appreciably faster, but it's just not in evidence that there is any
other such case. I did look at all the traditionally sympathetic
cases, as I went into, and it still seemed to not be worth it at all.
But by all means, if you think I missed something, please show me a
test case.

> This could cause particular issues in optimization, since heap sort is
> wonderfully predictable. We'd need a cost_sort() that was slightly
> pessimistic to cover the risk that a quicksort might not be as fast as we
> hope.

Wonderfully predictable? Really? It's totally sensitive to CPU cache
characteristics. I wouldn't say that at all. If you're alluding to the
quicksort worst case, that seems like the wrong thing to worry about.
The risk around that is often overstated, or based on experience from
third-rate implementations that don't follow various widely accepted
recommendations from the research community.

> I'd like to see a more general and concise plan for how sorting evolves. We
> are close to having the infrastructure to perform intermediate aggregation,
> which would allow that to happen during sorting when required (aggregation,
> sort distinct). We also agreed some time back that parallel sorting would be
> the first incarnation of parallel operations, so we need to consider that
> also.

I agree with everything you say here, I think. I think it's
appropriate that this work anticipate adding a number of other
optimizations in the future, at least including:

* Parallel sort using worker processes.

* Memory prefetching.

* Offset-value coding of runs, a compression technique that was used
in System R, IIRC. This can speed up merging a lot, and will save I/O
bandwidth on dumping out runs.

* Asynchronous I/O.

There should be an integrated approach to applying every possible
optimization, or at least leaving the possibility open. A lot of these
techniques are complementary. For example, there are significant
benefits where the "onlyKey" optimization is now used with external
sorts, which you get for free by using quicksort for runs. In short, I
am absolutely on-board with the idea that these things need to be
anticipated at the very least. For another speculative example, offset
coding makes the merge step cheaper, but the work of doing the offset
coding can be offloaded to worker processes, whereas the merge step
proper cannot really be effectively parallelized -- those two
techniques together are greater than the sum of their parts. One big
problem that I see with replacement selection is that it makes most of
these things impossible.

In general, I think that parallel sort should be an external sort
technique first and foremost. If you can only parallelize an internal
sort, then running out of road when there isn't enough memory to do
the sort in memory becomes a serious issue. Besides, you need to
partition the input anyway, and external sorting naturally needs to do
that while not precluding runs not actually being dumped to disk.

[1] http://queue.acm.org/detail.cfm?id=1814327
-- 
Peter Geoghegan

Re: Using quicksort for every external sort run

From

Peter Geoghegan

Date:

06 September 2015, 00:48:30

On Wed, Aug 19, 2015 at 7:24 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Let's start by comparing an external sort that uses 1/3 the memory of
> an internal sort against the master branch.  That's completely unfair
> on the patch, of course, but it is a useful indicator of how well
> external sorts do overall. Although an external sort surely cannot be
> as fast as an internal sort, it might be able to approach an internal
> sort's speed when there is plenty of I/O bandwidth. That's a good
> thing to aim for, I think.

> The patch only takes ~10% more time to execute this query, which seems
> very good considering that ~1/3 the work_mem has been put to use.

> Note that the on-tape runs are small relative to CPU costs, so this
> query is a bit sympathetic (consider the time spent writing batches
> that trace_sort indicates here). CREATE INDEX would not compare so
> well with an internal sort, for example, especially if it was a
> composite index or something.

This is something that I've made great progress on (see "concrete
example" below for numbers). The differences in the amount of I/O
required between these two cases (due to per-case variability in the
width of tuples written to tape for datum sorts and index sorts) did
not significantly factor in to the differences in performance, it
turns out. The big issue was that while a pass-by-value datum sort
accidentally has good cache characteristics during the merge step,
that is not generally true. I figured out a way of making it generally
true, though. I attach a revised patch series with a new commit that
adds an optimization to the merge step, relieving what was a big
remaining bottleneck in the CREATE INDEX case (and *every* external
sort case that isn't a pass-by-value datum sort, which is most
things). There are a few tweaks to earlier commits including, but
nothing very interesting.

All of my benchmarks suggests that this most recent revision puts
external sorting within a fairly small margin of a fully internal sort
on the master branch in many common cases. This difference is seen
when the implementation only makes use of a fraction of the memory
required for an internal sort, provided the system is reasonably well
balanced. For a single backend, there is an overhead of about 5% - 20%
against master's internal sort performance. This speedup appears to be
fairly robust across a variety of different cases.

I particularly care about CREATE INDEX, since that is where most pain
is felt in the real world, and I'm happy that I found a way to make
CREATE INDEX external sort reasonably comparable in run time to
internal sorts that consume much more memory. I think it's time to
stop talking about this as performance work, and start talking about
it as scalability work. With that in mind, I'm mostly going to compare
the performance of the new, optimized external sort implementation
with the existing internal sort implementation from now on.

New patch -- Sequential memory access
===============================

The trick I hit upon for relieving the merge bottleneck was fairly simple.

Prefetching works for internal sorts, but isn't practical for external
sorts while merging. OTOH, I can arrange to have runs allocate their
"tuple proper" contents into a memory pool, partitioned by final
on-the-fly tape number. Today, runs/tapes are slurped from disk
sequentially in a staggered fashion, based on the availability of
in-memory tuples from each tape while merging. The new patch is very
effective in reducing cache misses by simply making sure that each
tape's "tuple proper" (e.g. each IndexTuple) is accessed in memory in
the natural, predictable order (the sorted order that runs on tape
always have). Unlike with internal sorts (where explicit memory
prefetching of each "tuple proper" may be advisable), the final order
in which the caller must consume a tape's "tuple proper" is
predictable well in advance.

A little rearrangement is required to make what were previously retail
palloc() calls during prereading (a palloc() for each "tuple proper",
within each READTUP() routine) consume space from the memory pool
instead. The pool (a big, once-off memory allocation) is reused in a
circular fashion per tape partition. This saves a lot of palloc()
overhead.

Under this scheme, each tape's next few IndexTuples are all in one
cacheline. This patch has the merge step make better use of available
memory bandwidth, rather than attempting to conceal memory latency.
Explicit prefetch instructions (that we may independently end up using
to do something similar with internal sorts when fetching tuples
following sorting proper) are all about hiding latency.

Concrete example -- performance
---------------------------------------------

I attach a text file describing a practical, reproducible example
CREATE INDEX. It shows how CREATE INDEX now compares fairly well with
an equivalent operation that has enough maintenance_work_mem to
complete its sort internally. I'll just summarize it here:

A CREATE INDEX on a single int4 attribute on an unlogged table takes
only ~18% longer. This is a 100 million row table that is 4977 MB on
disk. On master, CREATE INDEX takes 66.6 seconds in total with an
*internal* sort. With the patch series applied, an *external* sort
involving a final on-the-fly merge of 6 runs takes 78.5 seconds.
Obviously, since there are 6 runs to merge, work_mem is only
approximately 1/6 of what is required for a fully internal sort.

High watermark memory usage
------------------------------------------

One concern about the patch may be that it increases the high
watermark memory usage by any on-the-fly final merge step. It takes
full advantage of the availMem allowance at a point where every "tuple
proper" is freed, and availMem has only had SortTuple/memtuples array
"slot" memory subtracted (plus overhead). Memory is allocated in bulk
once, and partitioned among active tapes, with no particular effort
towards limiting memory usage beyond enforcing that we always
!LACKMEM().

A lot of the overhead of many retail palloc() calls is removed by
simply using one big memory allocation. In practice, LACKMEM() will
rarely become true, because the availability of slots now tends to be
the limiting factor. This is partially explained by the number of
slots being established when palloc() overhead was in play, prior to
the final merge step. However, I have concerns about the memory usage
of this new approach.

With the int4 CREATE INDEX case above, which has a uniform
distribution, I noticed that about 40% of each tape's memory space
remains unused when slots are exhausted. Ideally, we'd only have
allocated enough memory to run out at about the same time that slots
are exhausted, since the two would be balanced. This might be possible
for fixed-sized tuples. I have not allocated each final on-the-fly
merge step's active tape's pool individually, because while this waste
of memory is large enough to be annoying, it's not large enough to be
significantly helped by managing a bunch of per-tape buffers and
enlarging them as needed geometrically (e.g. starting small, and
doubling each time the buffer size is hit until the per-tape limit is
finally reached).

The main reason that the high watermark is increased is not because of
this, though. It's mostly just that "tuple proper" memory is not freed
until the sort is done, whereas before there were many small pfree()
calls to match the many palloc() calls -- calls that occurred early
and often. Note that the availability of "slots" (i.e. the size of the
memtuples array, minus one element for each tape's heap item) is
currently determined by whatever size it happened to be at when
memtuples stopped growing, which isn't particularly well principled
(hopefully this is no worse now).

Optimal memory usage
-------------------------------

In the absence of any clear thing to care about most beyond making
sorting faster while still enforcing !LACKMEM(), for now I've kept it
simple. I am saving a lot of memory by clawing back palloc() overhead,
but may be wasting more than that in another way now, to say nothing
of the new high watermark itself. If we're entirely I/O bound, maybe
we should not waste memory by simply not allocating as much anyway
(i.e. the extra memory may only theoretically help even when it is
written to). But what does it really mean to be I/O bound? The OS
cache probably consumes plenty of memory, too.

Finally, let us not forget that it's clearly still the case that even
following this work, run size needs to be optimized using a cost
model, rather than simply being determined by how much memory can be
made available (work_mem). If we get a faster sort using far less
work_mem, then the DBA is probably accidentally wasting huge amounts
of memory due to failing to do that. As an implementor, it's really
hard to balance all of these concerns, or to say that one in
particular is most urgent.

Parallel sorting
===========

Simon rightly emphasized the need for joined-up thinking in relation
to applying important tuplesort optimizations. We must at least
consider parallelism as part of this work.

I'm glad that the first consumer of parallel infrastructure is set to
be parallel sequential scans, not internal parallel sorts. That's
because it seems that overall, a significant cost is actually reading
tuples into memtuples to sort -- heap scanning and related costs in
the buffer manager (even assuming everything is in shared_buffers),
COPYTUP() palloc() calls, and so on. Taken together, they can be a
bigger overall cost than sorting proper, even assuming abbreviated
keys are not used. The third bucket that I tend to categorize costs
into, "time spent actually writing out finished runs", is small on a
well balanced system. Surprisingly small, I would say.

I will sketch a simple implementation of parallel sorting based on the
patch series that may be workable, and requires relatively little
implementation effort compare to other ideas that were raised at
various times:

* Establish an optimal run size ahead of time using a cost model. We
need this for serial external sorts anyway, to relieve the DBA of
having to worry about sizing maintenance_work_mem according to obscure
considerations around cache efficiency within tuplesort. Parallelism
probably doesn't add much complexity to the cost model, which is not
especially complicated to begin with. Note that I have not added this
cost model yet (just the ad-hoc, tuplesort-private cost model for
using replacement selection to get a "quicksort with spillover"). It
may be best if this cost model lives in the optimizer.

* Have parallel workers do a parallel heap scan of the relation until
they fill this optimal run size. Use local memory to sort within
workers. Write runs out in the usual way. Then, the worker picks up
the next run scheduled. If there are no more runs to build, there is
no more work for the parallel workers.

* Shut down workers. Do an on-the-fly merge in the parent process.
This is the same as with a serial merge, but with a little
coordination with worker processes to make sure every run is
available, etc. In general, coordination is kept to an absolute
minimum.

I tend to think that this really simple approach would get much of the
gain of something more complicated -- no need to write shared memory
management code, minimal need to handle coordination between workers,
and no real changes to the algorithms used for each sub-problem. This
makes merging more of a bottleneck again, but that is a bottleneck on
I/O and especially memory bandwidth. Parallelism cannot help much with
that anyway (except by compressing runs with offset coding, perhaps,
but that isn't specific to parallelism and won't always help). Writing
out runs in bulk is very fast here -- certainly much faster than I
thought it would be when I started thinking about external sorting.
And if that turns out to be a problem for cases that have sufficient
memory to do everything internally, that can later be worked on
non-invasively.

As I've said in the past, I think parallel sorting only makes sense
when memory latency and bandwidth are not huge bottlenecks, which we
should bend over backwards to avoid. In a sense, you can't really make
use of parallel workers for sorting until you fix that problem first.

I am not suggesting that we do this because it's easier than other
approaches. I think it's actually most effective to not make parallel
sorting too divergent from serial sorting, because making things
cumulative makes speed-ups from localized optimizations cumulative,
while at the same time, AFAICT there isn't anything to recommend
extensive specialization for parallel sort. If what I've sketched is
also a significantly easier approach, then that's a bonus.

--
Peter Geoghegan

On Wed, Aug 19, 2015 at 7:24 PM, Peter Geoghegan <pg@heroku.com> wrote:
> I'll start a new thread for this, since my external sorting patch has
> now evolved well past the original "quicksort with spillover"
> idea...although not quite how I anticipated it would. It seems like
> I've reached a good point to get some feedback.

Corey Huinker has once again assisted me with this work, by doing some
benchmarking on an AWS instance of his:

32 cores (c3.8xlarge, I suppose)
MemTotal: 251902912 kB

I believe it had one EBS volume.

This testing included 2 data sets:

* A data set that he happens to have that is representative of his
production use-case. Corey had some complaints about the sort
performance of PostgreSQL, particularly prior to 9.5, and I like to
link any particular performance optimization to an improvement in an
actual production workload, if at all possible.

* A tool that I wrote, that works on top of sortbenchmark.org's
"gensort" [1] data generation tool. It seems reasonable to me to drive
this work in part with a benchmark devised by Jim Gray. He did after
all receive a Turing award for this contribution to transaction
processing. I'm certainly a fan of his work. A key practical advantage
of that is that is has reasonable guarantees about determinism, making
these results relatively easy to recreate independently.

The modified "gensort" is available from
https://github.com/petergeoghegan/gensort

The python script postgres_load.py, which performs bulk-loading for
Postgres using COPY FREEZE. It ought to be fairly self-documenting:

$:~/gensort$ ./postgres_load.py --help
usage: postgres_load.py [-h] [-w WORKERS] [-m MILLION] [-s] [-l] [-c]

optional arguments: -h, --help show this help message and exit -w WORKERS, --workers WORKERS
Number of gensort workers (default: 4) -m MILLION, --million MILLION Generate n million
tuples(default: 100) -s, --skew Skew distribution of output keys (default: False) -l, --logged Use
loggedPostgreSQL table (default: False) -c, --collate Use default collation rather than C collation
(default: False)

For this initial report to the list, I'm going to focus on a case
involving 16 billion non-skewed tuples generated using the gensort
tool. I wanted to see how a sort of a ~1TB table (1017GB as reported
by psql, actually) could be improved, as compared to relatively small
volumes of data (in the multiple gigabyte range) that were so improved
by sorts on my laptop, which has enough memory to avoid blocking on
physical I/O much of the time. How the new approach deals with
hundreds of runs that are actually reasonably sized is also of
interest. This server does have a lot of memory, and many CPU cores.
It was kind of underpowered on I/O, though.

The initial load of 16 billion tuples (with a sortkey that is "C"
locale text) took about 10 hours. My tool supports parallel generation
of COPY format files, but serial performance of that stage isn't
especially fast. Further, in order to support COPY FREEZE, and in
order to ensure perfect determinism, the COPY operations occur
serially in a single transaction that creates the table that we
performed a CREATE INDEX on.

Patch, with 3GB maintenance_work_mem:

...
LOG: performsort done (except 411-way final merge): CPU
1017.95s/17615.74u sec elapsed 23910.99 sec
STATEMENT: create index on sort_test (sortkey );
LOG: external sort ended, 54740802 disk blocks used: CPU
2001.81s/31395.96u sec elapsed 41648.05 sec
STATEMENT: create index on sort_test (sortkey );

So just over 11 hours (11:34:08), then. The initial sorting for 411
runs took 06:38:30.99, as you can see.

Master branch:

...
LOG: finished writing run 202 to tape 201: CPU 1224.68s/31060.15u sec
elapsed 34409.16 sec
LOG: finished writing run 203 to tape 202: CPU 1230.48s/31213.55u sec
elapsed 34580.41 sec
LOG: finished writing run 204 to tape 203: CPU 1236.74s/31366.63u sec
elapsed 34750.28 sec
LOG: performsort starting: CPU 1241.70s/31501.61u sec elapsed 34898.63 sec
LOG: finished writing run 205 to tape 204: CPU 1242.19s/31516.52u sec
elapsed 34914.17 sec
LOG: finished writing final run 206 to tape 205: CPU
1243.23s/31564.23u sec elapsed 34963.03 sec
LOG: performsort done (except 206-way final merge): CPU
1243.86s/31570.58u sec elapsed 34974.08 sec
LOG: external sort ended, 54740731 disk blocks used: CPU
2026.98s/48448.13u sec elapsed 55299.24 sec
CREATE INDEX
Time: 55299315.220 ms

So 15:21:39 for master -- it's much improved, but this was still
disappointing given the huge improvements on relatively small cases.

Finished index was fairly large, which can be seen here by working
back from "total relation size":

postgres=# select pg_size_pretty(pg_total_relation_size('sort_test'));pg_size_pretty
----------------1487 GB
(1 row)

I think that this is probably due to the relatively slow I/O on this
server, and because the merge step is more of a bottleneck. As we
increase maintenance_work_mem, we're likely to then suffer from the
lack of explicit asynchronous I/O here. It helps, still, but not
dramatically. With with maintenance_work_mem = 30GB, patch is somewhat
faster (no reason to think that this would help master at all, so that
was untested):

...
LOG: starting quicksort of run 40: CPU 1815.99s/19339.80u sec elapsed
24910.38 sec
LOG: finished quicksorting run 40: CPU 1820.09s/19565.94u sec elapsed
25140.69 sec
LOG: finished writing run 40 to tape 39: CPU 1833.76s/19642.11u sec
elapsed 25234.44 sec
LOG: performsort starting: CPU 1849.46s/19803.28u sec elapsed 25499.98 sec
LOG: starting quicksort of run 41: CPU 1849.46s/19803.28u sec elapsed
25499.98 sec
LOG: finished quicksorting run 41: CPU 1852.37s/20000.73u sec elapsed
25700.43 sec
LOG: finished writing run 41 to tape 40: CPU 1864.89s/20069.09u sec
elapsed 25782.93 sec
LOG: performsort done (except 41-way final merge): CPU
1965.43s/20086.28u sec elapsed 25980.80 sec
LOG: external sort ended, 54740909 disk blocks used: CPU
3270.57s/31595.37u sec elapsed 40376.43 sec
CREATE INDEX
Time: 40383174.977 ms

So that takes 11:13:03 in total -- we only managed to shave about 20
minutes off the total time taken, despite a 10x increase in
maintenance_work_mem. Still, at least it gets moderately better, not
worse, which is certainly what I'd expect from the master branch. 60GB
was half way between 3GB and 30GB in terms of performance, so it
doesn't continue to help, but, again, at least things don't get much
worse.

Thoughts on these results:

* I'd really like to know the role of I/O here. Better, low-overhead
instrumentation is required to see when and how we are I/O bound. I've
been doing much of that on a more-or-less ad hoc basis so far, using
iotop. I'm looking into a way to usefully graph the I/O activity over
many hours, to correlate with the trace_sort output that I'll also
show. I'm open to suggestions on the easiest way of doing that. Having
used the "perf" tool for instrumenting I/O at all in the past.

* Parallelism would probably help us here *a lot*.

* As I said, I think we suffer from the lack of asynchronous I/O much
more at this scale. Will need to confirm that theory.

* It seems kind of ill-advised to make run size (which is always in
linear proportion to maintenance_work_mem with this new approach to
sorting) larger, because it probably will hurt writing runs more than
it will help in making merging cheaper (perhaps mostly due to the lack
of asynchronous I/O to hide the latency of writes -- Linux might not
do so well at this scale).

* Maybe adding actual I/O bandwidth is the way to go to get a better
picture. I wouldn't be surprised if we were very bottlenecked on I/O
here. Might be worth using many parallel EBS volumes here, for
example.

[1] http://sortbenchmark.org/FAQ-2015.html
--
Peter Geoghegan

Re: Using quicksort for every external sort run

From

Corey Huinker

Date:

09 November 2015, 17:27:44

<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">On Fri, Nov 6, 2015 at 8:08 PM, Peter Geoghegan <span
dir="ltr"><<ahref="mailto:pg@heroku.com" target="_blank">pg@heroku.com</a>></span> wrote:<br /><blockquote
class="gmail_quote"style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">On Wed, Aug 19,
2015at 7:24 PM, Peter Geoghegan <<a href="mailto:pg@heroku.com">pg@heroku.com</a>> wrote:<br /></span><span
class="">>I'll start a new thread for this, since my external sorting patch has<br /> > now evolved well past the
original"quicksort with spillover"<br /> > idea...although not quite how I anticipated it would. It seems like<br />
>I've reached a good point to get some feedback.<br /><br /></span>Corey Huinker has once again assisted me with
thiswork, by doing some<br /> benchmarking on an AWS instance of his:<br /><br /> 32 cores (c3.8xlarge, I suppose)<br
/>MemTotal:       251902912 kB<br /><br /> I believe it had one EBS volume.<br /><br /> This testing included 2 data
sets:<br/><br /> * A data set that he happens to have that is representative of his<br /> production use-case. Corey
hadsome complaints about the sort<br /> performance of PostgreSQL, particularly prior to 9.5, and I like to<br /> link
anyparticular performance optimization to an improvement in an<br /> actual production workload, if at all possible.<br
/><br/> * A tool that I wrote, that works on top of <a href="http://sortbenchmark.org" rel="noreferrer"
target="_blank">sortbenchmark.org</a>'s<br/> "gensort" [1] data generation tool. It seems reasonable to me to drive<br
/>this work in part with a benchmark devised by Jim Gray. He did after<br /> all receive a Turing award for this
contributionto transaction<br /> processing. I'm certainly a fan of his work. A key practical advantage<br /> of that
isthat is has reasonable guarantees about determinism, making<br /> these results relatively easy to recreate
independently.<br/><br /> The modified "gensort" is available from<br /><a
href="https://github.com/petergeoghegan/gensort"rel="noreferrer"
target="_blank">https://github.com/petergeoghegan/gensort</a><br/><br /> The python script postgres_load.py, which
performsbulk-loading for<br /> Postgres using COPY FREEZE. It ought to be fairly self-documenting:<br /><br />
$:~/gensort$./postgres_load.py --help<br /> usage: postgres_load.py [-h] [-w WORKERS] [-m MILLION] [-s] [-l] [-c]<br
/><br/> optional arguments:<br />   -h, --help            show this help message and exit<br />   -w WORKERS, --workers
WORKERS<br/>                         Number of gensort workers (default: 4)<br />   -m MILLION, --million MILLION<br />
                       Generate n million tuples (default: 100)<br />   -s, --skew            Skew distribution of
outputkeys (default: False)<br />   -l, --logged          Use logged PostgreSQL table (default: False)<br />   -c,
--collate        Use default collation rather than C collation<br />                         (default: False)<br /><br
/>For this initial report to the list, I'm going to focus on a case<br /> involving 16 billion non-skewed tuples
generatedusing the gensort<br /> tool. I wanted to see how a sort of a ~1TB table (1017GB as reported<br /> by psql,
actually)could be improved, as compared to relatively small<br /> volumes of data (in the multiple gigabyte range) that
wereso improved<br /> by sorts on my laptop, which has enough memory to avoid blocking on<br /> physical I/O much of
thetime. How the new approach deals with<br /> hundreds of runs that are actually reasonably sized is also of<br />
interest.This server does have a lot of memory, and many CPU cores.<br /> It was kind of underpowered on I/O,
though.<br/><br /> The initial load of 16 billion tuples (with a sortkey that is "C"<br /> locale text) took about 10
hours.My tool supports parallel generation<br /> of COPY format files, but serial performance of that stage isn't<br />
especiallyfast. Further, in order to support COPY FREEZE, and in<br /> order to ensure perfect determinism, the COPY
operationsoccur<br /> serially in a single transaction that creates the table that we<br /> performed a CREATE INDEX
on.<br/><br /> Patch, with 3GB maintenance_work_mem:<br /><br /> ...<br /> LOG:  performsort done (except 411-way final
merge):CPU<br /> 1017.95s/17615.74u sec elapsed 23910.99 sec<br /> STATEMENT:  create index on sort_test (sortkey );<br
/>LOG:  external sort ended, 54740802 disk blocks used: CPU<br /> 2001.81s/31395.96u sec elapsed 41648.05 sec<br />
STATEMENT: create index on sort_test (sortkey );<br /><br /> So just over 11 hours (11:34:08), then. The initial
sortingfor 411<br /> runs took 06:38:30.99, as you can see.<br /><br /> Master branch:<br /><br /> ...<br /> LOG: 
finishedwriting run 202 to tape 201: CPU 1224.68s/31060.15u sec<br /> elapsed 34409.16 sec<br /> LOG:  finished writing
run203 to tape 202: CPU 1230.48s/31213.55u sec<br /> elapsed 34580.41 sec<br /> LOG:  finished writing run 204 to tape
203:CPU 1236.74s/31366.63u sec<br /> elapsed 34750.28 sec<br /> LOG:  performsort starting: CPU 1241.70s/31501.61u sec
elapsed34898.63 sec<br /> LOG:  finished writing run 205 to tape 204: CPU 1242.19s/31516.52u sec<br /> elapsed 34914.17
sec<br/> LOG:  finished writing final run 206 to tape 205: CPU<br /> 1243.23s/31564.23u sec elapsed 34963.03 sec<br />
LOG: performsort done (except 206-way final merge): CPU<br /> 1243.86s/31570.58u sec elapsed 34974.08 sec<br /> LOG: 
externalsort ended, 54740731 disk blocks used: CPU<br /> 2026.98s/48448.13u sec elapsed 55299.24 sec<br /> CREATE
INDEX<br/> Time: 55299315.220 ms<br /><br /> So 15:21:39 for master -- it's much improved, but this was still<br />
disappointinggiven the huge improvements on relatively small cases.<br /><br /> Finished index was fairly large, which
canbe seen here by working<br /> back from "total relation size":<br /><br /> postgres=# select
pg_size_pretty(pg_total_relation_size('sort_test'));<br/>  pg_size_pretty<br /> ----------------<br />  1487 GB<br />
(1row)<br /><br /> I think that this is probably due to the relatively slow I/O on this<br /> server, and because the
mergestep is more of a bottleneck. As we<br /> increase maintenance_work_mem, we're likely to then suffer from the<br
/>lack of explicit asynchronous I/O here. It helps, still, but not<br /> dramatically. With with maintenance_work_mem =
30GB,patch is somewhat<br /> faster (no reason to think that this would help master at all, so that<br /> was
untested):<br/><br /> ...<br /> LOG:  starting quicksort of run 40: CPU 1815.99s/19339.80u sec elapsed<br /> 24910.38
sec<br/> LOG:  finished quicksorting run 40: CPU 1820.09s/19565.94u sec elapsed<br /> 25140.69 sec<br /> LOG:  finished
writingrun 40 to tape 39: CPU 1833.76s/19642.11u sec<br /> elapsed 25234.44 sec<br /> LOG:  performsort starting: CPU
1849.46s/19803.28usec elapsed 25499.98 sec<br /> LOG:  starting quicksort of run 41: CPU 1849.46s/19803.28u sec
elapsed<br/> 25499.98 sec<br /> LOG:  finished quicksorting run 41: CPU 1852.37s/20000.73u sec elapsed<br /> 25700.43
sec<br/> LOG:  finished writing run 41 to tape 40: CPU 1864.89s/20069.09u sec<br /> elapsed 25782.93 sec<br /> LOG: 
performsortdone (except 41-way final merge): CPU<br /> 1965.43s/20086.28u sec elapsed 25980.80 sec<br /> LOG:  external
sortended, 54740909 disk blocks used: CPU<br /> 3270.57s/31595.37u sec elapsed 40376.43 sec<br /> CREATE INDEX<br />
Time:40383174.977 ms<br /><br /> So that takes 11:13:03 in total -- we only managed to shave about 20<br /> minutes off
thetotal time taken, despite a 10x increase in<br /> maintenance_work_mem. Still, at least it gets moderately better,
not<br/> worse, which is certainly what I'd expect from the master branch. 60GB<br /> was half way between 3GB and 30GB
interms of performance, so it<br /> doesn't continue to help, but, again, at least things don't get much<br />
worse.<br/><br /> Thoughts on these results:<br /><br /> * I'd really like to know the role of I/O here. Better,
low-overhead<br/> instrumentation is required to see when and how we are I/O bound. I've<br /> been doing much of that
ona more-or-less ad hoc basis so far, using<br /> iotop. I'm looking into a way to usefully graph the I/O activity
over<br/> many hours, to correlate with the trace_sort output that I'll also<br /> show. I'm open to suggestions on the
easiestway of doing that. Having<br /> used the "perf" tool for instrumenting I/O at all in the past.<br /><br /> *
Parallelismwould probably help us here *a lot*.<br /><br /> * As I said, I think we suffer from the lack of
asynchronousI/O much<br /> more at this scale. Will need to confirm that theory.<br /><br /> * It seems kind of
ill-advisedto make run size (which is always in<br /> linear proportion to maintenance_work_mem with this new approach
to<br/> sorting) larger, because it probably will hurt writing runs more than<br /> it will help in making merging
cheaper(perhaps mostly due to the lack<br /> of asynchronous I/O to hide the latency of writes -- Linux might not<br />
doso well at this scale).<br /><br /> * Maybe adding actual I/O bandwidth is the way to go to get a better<br />
picture.I wouldn't be surprised if we were very bottlenecked on I/O<br /> here. Might be worth using many parallel EBS
volumeshere, for<br /> example.<br /><br /> [1] <a href="http://sortbenchmark.org/FAQ-2015.html" rel="noreferrer"
target="_blank">http://sortbenchmark.org/FAQ-2015.html</a><br/><span class="HOEnZb"><font color="#888888">--<br />
PeterGeoghegan<br /></font></span></blockquote></div><br /></div><div class="gmail_extra">The machine in question still
exists,so if you have questions about it, commands you'd like me to run to give you insight as to the I/O capabilities
ofthe machine, let me know. I can't guarantee we'll keep the machine much longer.</div><div class="gmail_extra"><br
/></div><divclass="gmail_extra"><br /></div><div class="gmail_extra"><br /></div><div class="gmail_extra"><br
/></div></div>

Re: Using quicksort for every external sort run

From

Jeff Janes

Date:

18 November 2015, 18:31:52

On Wed, Aug 19, 2015 at 7:24 PM, Peter Geoghegan <pg@heroku.com> wrote:

Hi Peter,

Your most recent versions of this patch series (not the ones on the
email I am replying to) give a compiler warning:

tuplesort.c: In function 'mergeruns':
tuplesort.c:2741: warning: unused variable 'memNowUsed'

> Multi-pass sorts
> ---------------------
>
> I believe, in general, that we should consider a multi-pass sort to be
> a kind of inherently suspect thing these days, in the same way that
> checkpoints occurring 5 seconds apart are: not actually abnormal, but
> something that we should regard suspiciously. Can you really not
> afford enough work_mem to only do one pass?

I don't think it is really about the cost of RAM.  What people can't
afford is spending all of their time personally supervising all the
sorts on the system. It is pretty easy for a transient excursion in
workload to make a server swap itself to death and fall over. Not just
the PostgreSQL server, but the entire OS. Since we can't let that
happen, we have to be defensive about work_mem. Yes, we have far more
RAM than we used to. We also have far more things demanding access to
it at the same time.

I agree we don't want to optimize for low memory, but I don't think we
should throw it under the bus, either.  Right now we are effectively
saying the CPU-cache problems with the heap start exceeding the larger
run size benefits at 64kb (the smallest allowed setting for work_mem).
While any number we pick is going to be a guess that won't apply to
all hardware, surely we can come up with a guess better than 64kb.
Like, 8 MB, say.  If available memory for the sort is 8MB or smaller
and the predicted size anticipates a multipass merge, then we can use
the heap method rather than the quicksort method.  Would a rule like
that complicate things much?

It doesn't matter to me personally at the moment, because the smallest
work_mem I run on a production system is 24MB.  But if for some reason
I had to increase max_connections, or had to worry about plans with
many more possible concurrent work_mem allocations (like some
partitioning), then I might not need to rethink that setting downward.

>
> In theory, the answer could be "yes", but it seems highly unlikely.
> Not only is very little memory required to avoid a multi-pass merge
> step, but as described above the amount required grows very slowly
> relative to linear growth in input. I propose to add a
> checkpoint_warning style warning (with a checkpoint_warning style GUC
> to control it).

I'm skeptical about a warning for this.  I think it is rather unlike
checkpointing, because checkpointing is done in a background process,
which greatly limits its visibility, while sorting is a foreground
thing.  I know if my sorts are slow, without having to go look in the
log file.  If we do have the warning, shouldn't it use a log-level
that gets sent to the front end where the person running the sort can
see it and locally change work_mem?  And if we have a GUC, I think it
should be a dial, not a binary.  If I have a sort that takes a 2-way
merge and then a final 29-way merge, I don't think that that is worth
reporting.  So maybe, if the maximum number of runs on a tape exceeds
2 (rather than exceeds 1, which is the current behavior with the
patch) would be the setting I would want to use, if I were to use it
at all.

...

> This patch continues to have tuplesort determine run size based on the
> availability of work_mem only. It does not entirely fix the problem of
> having work_mem sizing impact performance in counter-intuitive ways.
> In other words, smaller work_mem sizes can still be faster. It does
> make that general situation much better, though, because quicksort is
> a cache oblivious algorithm. Smaller work_mem sizes are sometimes a
> bit faster, but never dramatically faster.

Yes, that is what I found as well.   I think the main reason it is
even that small bit slower at large memory is because writing and
sorting are not finely interleaved, like they are with heap selection.
Once you sit down to qsort 3GB of data, you are not going to write any
more tuples until that qsort is entirely done.  I didn't do any
testing beyond 3GB of maintenance_work_mem, but I imagine this could
get more important if people used dozens or hundreds of GB.

One idea would be to stop and write out a just-sorted partition
whenever that partition is contiguous to the already-written portion.
If the qsort is tweaked to recurse preferentially into the left
partition first, this would result in tuples being written out at a
pretty study pace.  If the qsort was unbalanced and the left partition
was always the larger of the two, then that approach would have to be
abandoned at some point.  But I think there are already defenses
against that, and at worst you would give up and revert to the
sort-them-all then write-them-all behavior.

Overall this is very nice.  Doing some real world index builds of
short text (~20 bytes ascii) identifiers, I could easily get speed ups
of 40% with your patch if I followed the philosophy of "give it as
much maintenance_work_mem as I can afford".  If I fine-tuned the
maintenance_work_mem so that it was optimal for each sort method, then
the speed up quite a bit less, only 22%.  But 22% is still very
worthwhile, and who wants to spend their time fine-tuning the memory
use for every index build?

Cheers,

Jeff

Re: Using quicksort for every external sort run

From

Peter Geoghegan

Date:

18 November 2015, 23:29:58

Hi Jeff,

On Wed, Nov 18, 2015 at 10:31 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> tuplesort.c: In function 'mergeruns':
> tuplesort.c:2741: warning: unused variable 'memNowUsed'

That was caused by a last-minute change to the mulitpass warning
message. I forgot to build at -O2, and missed this.

>> I believe, in general, that we should consider a multi-pass sort to be
>> a kind of inherently suspect thing these days, in the same way that
>> checkpoints occurring 5 seconds apart are: not actually abnormal, but
>> something that we should regard suspiciously. Can you really not
>> afford enough work_mem to only do one pass?
>
> I don't think it is really about the cost of RAM.  What people can't
> afford is spending all of their time personally supervising all the
> sorts on the system. It is pretty easy for a transient excursion in
> workload to make a server swap itself to death and fall over. Not just
> the PostgreSQL server, but the entire OS. Since we can't let that
> happen, we have to be defensive about work_mem. Yes, we have far more
> RAM than we used to. We also have far more things demanding access to
> it at the same time.

I agree with you, but I'm not sure that I've been completely clear on
what I mean. Even as the demand on memory has grown, the competitive
advantage of replacement selection in avoiding a multi-pass merge has
diminished far faster. You should simply not allow it to happen as a
DBA -- that's the advice that other systems' documentation makes.
Avoiding a multi-pass merge was always the appeal of replacement
selection, even in the 1970s, but it will rarely if ever make that
critical difference these days.

As I said, as the volume of data to be sorted in memory increases
linearly, the point at which a multi-pass merge phase happens
increases quadratically with my patch. The advantage of replacement
selection is therefore almost irrelevant. That is why, in general,
interest in replacement selection is far far lower today than it was
in the past.

The poor CPU cache characteristics of the heap (priority queue) are
only half the story about why replacement selection is more or less
obsolete these days.

> I agree we don't want to optimize for low memory, but I don't think we
> should throw it under the bus, either.  Right now we are effectively
> saying the CPU-cache problems with the heap start exceeding the larger
> run size benefits at 64kb (the smallest allowed setting for work_mem).
> While any number we pick is going to be a guess that won't apply to
> all hardware, surely we can come up with a guess better than 64kb.
> Like, 8 MB, say.  If available memory for the sort is 8MB or smaller
> and the predicted size anticipates a multipass merge, then we can use
> the heap method rather than the quicksort method.  Would a rule like
> that complicate things much?

I'm already using replacement selection for the first run when it is
predicted by my new ad-hoc cost model that we can get away with a
"quicksort with spillover", avoiding almost all I/O. We only
incrementally spill as many tuples as needed right now, but it would
be pretty easy to not quicksort the remaining tuples, but continue to
incrementally spill everything. So no, it wouldn't be too hard to hang
on to the old behavior sometimes, if it looked worthwhile.

In principle, I have no problem with doing that. Through testing, I
cannot see any actual upside, though. Perhaps I just missed something.
Even 8MB is enough to avoid the multipass merge in the event of a
surprisingly high volume of data (my work laptop is elsewhere, so I
don't have my notes on this in front of me, but I figured out the
crossover point for a couple of cases).

>> In theory, the answer could be "yes", but it seems highly unlikely.
>> Not only is very little memory required to avoid a multi-pass merge
>> step, but as described above the amount required grows very slowly
>> relative to linear growth in input. I propose to add a
>> checkpoint_warning style warning (with a checkpoint_warning style GUC
>> to control it).
>
> I'm skeptical about a warning for this.

Other systems expose this explicitly, and, as I said, say in an
unqualified way that a multi-pass merge should be avoided. Maybe the
warning isn't the right way of communicating that message to the DBA
in detail, but I am confident that it ought to be communicated to the
DBA fairly clearly.

> One idea would be to stop and write out a just-sorted partition
> whenever that partition is contiguous to the already-written portion.
> If the qsort is tweaked to recurse preferentially into the left
> partition first, this would result in tuples being written out at a
> pretty study pace.  If the qsort was unbalanced and the left partition
> was always the larger of the two, then that approach would have to be
> abandoned at some point.  But I think there are already defenses
> against that, and at worst you would give up and revert to the
> sort-them-all then write-them-all behavior.

Seems kind of invasive.

> Overall this is very nice.  Doing some real world index builds of
> short text (~20 bytes ascii) identifiers, I could easily get speed ups
> of 40% with your patch if I followed the philosophy of "give it as
> much maintenance_work_mem as I can afford".  If I fine-tuned the
> maintenance_work_mem so that it was optimal for each sort method, then
> the speed up quite a bit less, only 22%.  But 22% is still very
> worthwhile, and who wants to spend their time fine-tuning the memory
> use for every index build?

Thanks, but I expected better than that. Was it a collated text
column? The C collation will put the patch in a much better light
(more strcoll() calls are needed with this new approach -- it's still
well worth it, but it is a downside that makes collated text not
especially sympathetic). Just sorting on an integer attribute is also
a good sympathetic case, FWIW.

How much time did the sort take in each case? How many runs? How much
time was spent merging? trace_sort output is very interesting here.

-- 
Peter Geoghegan

Re: Using quicksort for every external sort run

From

Greg Stark

Date:

19 November 2015, 01:23:40

On Wed, Nov 18, 2015 at 11:29 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Other systems expose this explicitly, and, as I said, say in an
> unqualified way that a multi-pass merge should be avoided. Maybe the
> warning isn't the right way of communicating that message to the DBA
> in detail, but I am confident that it ought to be communicated to the
> DBA fairly clearly.

I'm pretty convinced warnings from DML are a categorically bad idea.
In any OLTP load they're effectively fatal errors since they'll fill
up log files or client output or cause other havoc. Or they'll cause
no problem because nothing is reading them. Neither behaviour is
useful.

Perhaps the right thing to do is report a statistic to pg_stats so
DBAs can see how often sorts are in memory, how often they're on disk,
and how often the on disk sort requires n passes. That would put them
in the same category as "sequential scans" for DBAs that expect the
application to only run index-based OLTP queries for example. The
problem with this is that sorts are not tied to a particular relation
and without something to group on the stat will be pretty hard to act
on.

-- 
greg

Re: Using quicksort for every external sort run

From

Robert Haas

Date:

19 November 2015, 02:19:48

On Wed, Nov 18, 2015 at 6:29 PM, Peter Geoghegan <pg@heroku.com> wrote:
> In principle, I have no problem with doing that. Through testing, I
> cannot see any actual upside, though. Perhaps I just missed something.
> Even 8MB is enough to avoid the multipass merge in the event of a
> surprisingly high volume of data (my work laptop is elsewhere, so I
> don't have my notes on this in front of me, but I figured out the
> crossover point for a couple of cases).

I'd be interested in seeing this analysis in some detail.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Using quicksort for every external sort run

From

Peter Geoghegan

Date:

19 November 2015, 18:56:33

On Wed, Nov 18, 2015 at 5:22 PM, Greg Stark <stark@mit.edu> wrote:
> On Wed, Nov 18, 2015 at 11:29 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> Other systems expose this explicitly, and, as I said, say in an
>> unqualified way that a multi-pass merge should be avoided. Maybe the
>> warning isn't the right way of communicating that message to the DBA
>> in detail, but I am confident that it ought to be communicated to the
>> DBA fairly clearly.
>
> I'm pretty convinced warnings from DML are a categorically bad idea.
> In any OLTP load they're effectively fatal errors since they'll fill
> up log files or client output or cause other havoc. Or they'll cause
> no problem because nothing is reading them. Neither behaviour is
> useful.

To be clear, this is a LOG level message, not a WARNING.

I think that if the DBA ever sees the multipass_warning message, he or
she does not have an OLTP workload. If you experience what might be
considered log spam due to multipass_warning, then the log spam is the
least of your problems. Besides, log_temp_files is a very similar
setting (albeit one that is not enabled by default), so I tend to
doubt that your view that that style of log message is categorically
bad is widely shared. Having said that, I'm not especially attached to
the idea of communicating the concern to the DBA using the mechanism
of a checkpoint_warning-style LOG message (multipass_warning).

Yes, I really do mean it when I say that the DBA is not supposed to
see this message, no matter how much or how little memory or data is
involved. There is no nuance intended here; it isn't sensible to allow
a multi-pass sort, just as it isn't sensible to allow checkpoints
every 5 seconds. Both of those things can be thought of as thrashing.

> Perhaps the right thing to do is report a statistic to pg_stats so
> DBAs can see how often sorts are in memory, how often they're on disk,
> and how often the on disk sort requires n passes.

That might be better than what I came up with, but I hesitate to track
more things using the statistics collector in the absence of a clear
consensus to do so. I'd be more worried about the overhead of what you
suggest than the overhead of a LOG message, seen only in the case of
something that's really not supposed to happen.

-- 
Peter Geoghegan

Re: Using quicksort for every external sort run

From

Simon Riggs

Date:

19 November 2015, 20:13:04

On 19 November 2015 at 01:22, Greg Stark <stark@mit.edu> wrote:

Perhaps the right thing to do is report a statistic to pg_stats so
DBAs can see how often sorts are in memory, how often they're on disk,
and how often the on disk sort requires n passes. That would put them
in the same category as "sequential scans" for DBAs that expect the
application to only run index-based OLTP queries for example. The
problem with this is that sorts are not tied to a particular relation
and without something to group on the stat will be pretty hard to act
on.

We don't have a message appear when hash joins use go weird, and we definitely don't want anything like that for sorts either.

Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Using quicksort for every external sort run

From

Greg Stark

Date:

19 November 2015, 20:36:09

On Thu, Nov 19, 2015 at 6:56 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Yes, I really do mean it when I say that the DBA is not supposed to
> see this message, no matter how much or how little memory or data is
> involved. There is no nuance intended here; it isn't sensible to allow
> a multi-pass sort, just as it isn't sensible to allow checkpoints
> every 5 seconds. Both of those things can be thought of as thrashing.

Hm. So a bit of back-of-envelope calculation. If we have want to
buffer at least 1MB for each run -- I think we currently do more
actually -- and say that a 1GB work_mem ought to be enough to run
reasonably (that's per sort after all and there might be multiple
sorts to say nothing of other users on the system). That means we can
merge about 1,000 runs in the final merge. Each run will be about 2GB
currently but 1GB if we quicksort the runs. So the largest table we
can sort in a single pass is 1-2 TB.

If we go above those limits we have the choice of buffering less per
run or doing a whole second pass through the data. I suspect we would
get more horsepower out of buffering less though I'm not sure where
the break-even point is. Certainly if we did random I/O for every I/O
that's much more expensive than a factor of 2 over sequential I/O. We
could probably do the math based on random_page_cost and
sequential_page_cost to calculate the minimum amount of buffering
before it's worth doing an extra pass.

So I think you're kind of right and kind of wrong. The vast majority
of use cases are either sub 1TB or are in work environments designed
specifically for data warehouse queries where a user can obtain much
more memory for their queries. However I think it's within the
intended use cases that Postgres should be able to handle a few
terabytes of data on a moderately sized machine in a shared
environment too.

Our current defaults are particularly bad for this though. If you
initdb a new Postgres database today and create a table even a few
gigabytes and try to build an index on it it takes forever. The last
time I did a test I canceled it after it had run for hours, raised
maintenance_work_mem and built the index in a few minutes. The problem
is that if we just raise those limits then people will use more
resources when they don't need it. If it were safer for to have those
limits be much higher then we could make the defaults reflect what
people want when they do bigger jobs rather than just what they want
for normal queries or indexes.

> I think that if the DBA ever sees the multipass_warning message, he or she does not have an OLTP workload.

Hm, that's pretty convincing. I guess this isn't the usual sort of
warning due to the time it would take to trigger.

-- 
greg

Re: Using quicksort for every external sort run

From

Peter Geoghegan

Date:

19 November 2015, 20:43:58

On Wed, Nov 18, 2015 at 6:19 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Nov 18, 2015 at 6:29 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> In principle, I have no problem with doing that. Through testing, I
>> cannot see any actual upside, though. Perhaps I just missed something.
>> Even 8MB is enough to avoid the multipass merge in the event of a
>> surprisingly high volume of data (my work laptop is elsewhere, so I
>> don't have my notes on this in front of me, but I figured out the
>> crossover point for a couple of cases).
>
> I'd be interested in seeing this analysis in some detail.

Sure. Jeff mentioned 8MB as a work_mem setting, so let's examine a
case where that's the work_mem setting, and see experimentally where
the crossover point for a multi-pass sort ends up.

If this table is created:

postgres=# create unlogged table bar as select (random() * 1e9)::int4
idx, 'payload xyz'::text payload from generate_series(1, 10100000) i;
SELECT 10100000

Then, on my system, a work_mem setting of 8MB *just about* avoids
seeing the multipass_warning message with this query:

postgres=# select count(distinct idx) from bar ;
  count
------------10,047,433
(1 row)

A work_mem setting of 235MB is just enough to make the query's sort
fully internal.

Let's see how things change with a higher work_mem setting of 16MB. I
mentioned quadratic growth: Having doubled work_mem, let's *quadruple*
the number of tuples, to see where this leaves a 16MB setting WRT a
multi-pass merge:

postgres=# drop table bar ;
DROP TABLE
postgres=# create unlogged table bar as select (random() * 1e9)::int4
idx, 'payload xyz'::text payload from generate_series(1, 10100000 * 4)
i;
SELECT 40400000

Further experiments show that this is the exact point at which the
16MB work_mem setting similarly narrowly avoids a multi-pass warning.
This should be the dominant consideration, because now a fully
internal sort requires 4X the work_mem of my original 16MB work_mem
example table/query.

The quadratic growth in a simple hybrid sort-merge strategy's ability
to avoid a multi-pass merge phase (growth relative to linear increases
in work_mem) can be demonstrated with simple experiments.

-- 
Peter Geoghegan

Re: Using quicksort for every external sort run

From

Greg Stark

Date:

19 November 2015, 20:52:54

On Thu, Nov 19, 2015 at 8:35 PM, Greg Stark <stark@mit.edu> wrote:
> Hm. So a bit of back-of-envelope calculation. If we have want to
> buffer at least 1MB for each run -- I think we currently do more
> actually -- and say that a 1GB work_mem ought to be enough to run
> reasonably (that's per sort after all and there might be multiple
> sorts to say nothing of other users on the system). That means we can
> merge about 1,000 runs in the final merge. Each run will be about 2GB
> currently but 1GB if we quicksort the runs. So the largest table we
> can sort in a single pass is 1-2 TB.


For the sake of pedantry I fact checked myself. We calculate the
number of tapes based on wanting to buffer 32 blocks plus overhead so
about 256kB. So the actual maximum you can handle with 1GB of sort_mem
without multiple merges is on the order 4-8TB.

-- 
greg

Re: Using quicksort for every external sort run

From

Robert Haas

Date:

19 November 2015, 22:35:21

On Thu, Nov 19, 2015 at 3:43 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> I'd be interested in seeing this analysis in some detail.
>
> Sure. Jeff mentioned 8MB as a work_mem setting, so let's examine a
> case where that's the work_mem setting, and see experimentally where
> the crossover point for a multi-pass sort ends up.
>
> If this table is created:
>
> postgres=# create unlogged table bar as select (random() * 1e9)::int4
> idx, 'payload xyz'::text payload from generate_series(1, 10100000) i;
> SELECT 10100000
>
> Then, on my system, a work_mem setting of 8MB *just about* avoids
> seeing the multipass_warning message with this query:
>
> postgres=# select count(distinct idx) from bar ;
>
>    count
> ------------
>  10,047,433
> (1 row)
>
> A work_mem setting of 235MB is just enough to make the query's sort
> fully internal.
>
> Let's see how things change with a higher work_mem setting of 16MB. I
> mentioned quadratic growth: Having doubled work_mem, let's *quadruple*
> the number of tuples, to see where this leaves a 16MB setting WRT a
> multi-pass merge:
>
> postgres=# drop table bar ;
> DROP TABLE
> postgres=# create unlogged table bar as select (random() * 1e9)::int4
> idx, 'payload xyz'::text payload from generate_series(1, 10100000 * 4)
> i;
> SELECT 40400000
>
> Further experiments show that this is the exact point at which the
> 16MB work_mem setting similarly narrowly avoids a multi-pass warning.
> This should be the dominant consideration, because now a fully
> internal sort requires 4X the work_mem of my original 16MB work_mem
> example table/query.
>
> The quadratic growth in a simple hybrid sort-merge strategy's ability
> to avoid a multi-pass merge phase (growth relative to linear increases
> in work_mem) can be demonstrated with simple experiments.

OK, so reversing this analysis, with the default work_mem of 4MB, we'd
need a multi-pass merge for more than 235MB/4 = 58MB of data.  That is
very, very far from being a can't-happen scenario, and I would not at
all think it would be acceptable to ignore such a case.  Even ignoring
the possibility that someone with work_mem = 8MB will try to sort
235MB of data strikes me as out of the question.  Those seem like
entirely reasonable things for users to do.  Greg's example of someone
with work_mem = 1GB trying to sort 4TB does not seem like a crazy
thing to me.  Yeah, in all of those cases you might think that users
should set work_mem higher, but that doesn't mean that they actually
do.  Most systems have to set work_mem very conservatively to make
sure they don't start swapping under heavily load.

I think you need to revisit your assumptions here.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Using quicksort for every external sort run

From

Peter Geoghegan

Date:

19 November 2015, 22:43:00

On Thu, Nov 19, 2015 at 12:35 PM, Greg Stark <stark@mit.edu> wrote:
> So I think you're kind of right and kind of wrong. The vast majority
> of use cases are either sub 1TB or are in work environments designed
> specifically for data warehouse queries where a user can obtain much
> more memory for their queries. However I think it's within the
> intended use cases that Postgres should be able to handle a few
> terabytes of data on a moderately sized machine in a shared
> environment too.

Maybe I've made this more complicated than it needs to be. The fact is
that my recent 16MB example is still faster than the master branch
when a multiple pass merge is performed (e.g. when work_mem is 15MB,
or even 12MB). More on that later.

> Our current defaults are particularly bad for this though. If you
> initdb a new Postgres database today and create a table even a few
> gigabytes and try to build an index on it it takes forever. The last
> time I did a test I canceled it after it had run for hours, raised
> maintenance_work_mem and built the index in a few minutes. The problem
> is that if we just raise those limits then people will use more
> resources when they don't need it.

I think that the bigger problems are:

* There is a harsh discontinuity in the cost function -- performance
suddenly falls off a cliff when a sort must be performed externally.

* Replacement selection is obsolete. It's very slow on machines from
the last 20 years.

> If it were safer for to have those
> limits be much higher then we could make the defaults reflect what
> people want when they do bigger jobs rather than just what they want
> for normal queries or indexes.

Or better yet, make it so that it doesn't really matter that much,
even while you're still using the same amount of memory as before.

If you're saying that the whole work_mem model isn't a very good one,
then I happen to agree. It would be very nice to have some fancy
admission control feature, but I'd still appreciate a cost model that
dynamically sets work_mem. The model avoids an excessively high
setting where there is only about half the memory available for a 10GB
sort. You should probably have 5 runs sized 2GB, rather than 2 runs
sized 5GB, even if you can afford the memory for the latter. It would
still make sense to have very high work_mem settings when you can
dynamically set it so high that the sort does complete internally,
though.

>> I think that if the DBA ever sees the multipass_warning message, he or she does not have an OLTP workload.
>
> Hm, that's pretty convincing. I guess this isn't the usual sort of
> warning due to the time it would take to trigger.

I would like more opinions on the multipass_warning message. I can
write a patch that creates a new system view, detailing how sort were
completed, if there is demand.

-- 
Peter Geoghegan

Re: Using quicksort for every external sort run

From

Peter Geoghegan

Date:

19 November 2015, 22:53:54

On Thu, Nov 19, 2015 at 2:35 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> OK, so reversing this analysis, with the default work_mem of 4MB, we'd
> need a multi-pass merge for more than 235MB/4 = 58MB of data.  That is
> very, very far from being a can't-happen scenario, and I would not at
> all think it would be acceptable to ignore such a case.

> I think you need to revisit your assumptions here.

Which assumption? Are we talking about multipass_warning, or my patch
series in general? Obviously those are two very differently things. As
I've said, we could address the visibility aspect of this differently.
I'm fine with that.

I'll now talk about my patch series in general -- the actual
consequences of not avoiding a single pass merge phase when the master
branch would have done so.

The latter 16MB work_mem example query/table is still faster with a
12MB work_mem than master, even with multiple passes. Quite a bit
faster, in fact: about 37 seconds on master, to about 24.7 seconds
with the patches (same for higher settings short of 16MB).

Now, that's probably slightly unfair on the master branch, because the
patches still have the benefit of the memory pooling during the merge
phase, which is nothing to do with what we're talking about, and
because my laptop still has plenty of ram.

I should point out that there is no evidence that any case has been
regressed, let alone written off entirely or ignored. I looked. I
probably have not been completely exhaustive, and I'd be willing to
believe there is something that I've missed, but it's still quite
possible that there is no downside to any of this.

-- 
Peter Geoghegan

Re: Using quicksort for every external sort run

From

Peter Geoghegan

Date:

20 November 2015, 00:54:43

On Thu, Nov 19, 2015 at 2:53 PM, Peter Geoghegan <pg@heroku.com> wrote:
> The latter 16MB work_mem example query/table is still faster with a
> 12MB work_mem than master, even with multiple passes. Quite a bit
> faster, in fact: about 37 seconds on master, to about 24.7 seconds
> with the patches (same for higher settings short of 16MB).

I made the same comparison with work_mem sizes of 2MB and 6MB for
master/patch, and the patch *still* came out ahead, often by over 10%.
This was more than fair, though, because sometimes the final
on-the-fly merge for the master branch starting at a point at which
the patch series has already completed its sort. (Of course, I don't
believe that any user would ever be well served with such a low
work_mem setting for these queries -- I'm looking for a bad case,
though).

I guess this is a theoretical downside of my approach, that is more
than made up for elsewhere (even leaving aside the final, unrelated
patch in the series, addressing the merge bottleneck directly). So, to
summarize such downsides (downsides of a hybrid sort-merge strategy as
compared to replacement selection):

* As mentioned just now, the fact that there are more runs -- merging
can be slower (although tuples can be returned earlier, which could
also help with CREATE INDEX). This is more of a problem when random
I/O is expensive, and less of a problem when the OS cache buffers
things nicely.

* One run can be created with replacement selection, where a
hyrbid-sort merge strategy needs to create and then merge many runs.
When I started work on this patch, I was pretty sure that case would
be noticeably regressed. I was wrong.

* Abbreviated key comparisons are used less because runs are smaller.
This is why sorts of types like numeric are not especially sympathetic
to the patch. Still, we manage to come out well ahead overall.

You can perhaps show the patch to be almost as slow as the master
branch with a very unsympathetic case involving the union of all these
3. I couldn't regress a case with integers with just the first two,
though.

-- 
Peter Geoghegan

Re: Using quicksort for every external sort run

From

Greg Stark

Date:

20 November 2015, 01:33:25

On Fri, Nov 20, 2015 at 12:54 AM, Peter Geoghegan <pg@heroku.com> wrote:
> * One run can be created with replacement selection, where a
> hyrbid-sort merge strategy needs to create and then merge many runs.
> When I started work on this patch, I was pretty sure that case would
> be noticeably regressed. I was wrong.

Hm. Have you tested a nearly-sorted input set around 1.5x the size of
work_mem? That should produce a single run using the heap to generate
runs but generate two runs if, AIUI, you're just filling work_mem,
running quicksort, dumping that run entirely and starting fresh.

I don't mean to say it's representative but if you're looking for a
worst case...

-- 
greg

Re: Using quicksort for every external sort run

From

Peter Geoghegan

Date:

20 November 2015, 01:42:56

On Thu, Nov 19, 2015 at 5:32 PM, Greg Stark <stark@mit.edu> wrote:
> Hm. Have you tested a nearly-sorted input set around 1.5x the size of
> work_mem? That should produce a single run using the heap to generate
> runs but generate two runs if, AIUI, you're just filling work_mem,
> running quicksort, dumping that run entirely and starting fresh.

Yes. Actually, even with a random ordering, on average replacement
selection sort will produce runs twice as long as the patch series.
With nearly ordered input, there is no limit to how log runs can be --
you could definitely have cases where *no* merge step is required. We
just return tuples from one long run. And yet, it isn't worth it in
cases that I tested.

Please don't take my word for it -- try yourself.
-- 
Peter Geoghegan

Re: Using quicksort for every external sort run

From

Robert Haas

Date:

20 November 2015, 20:50:17

On Thu, Nov 19, 2015 at 5:42 PM, Peter Geoghegan <pg@heroku.com> wrote:
> I would like more opinions on the multipass_warning message. I can
> write a patch that creates a new system view, detailing how sort were
> completed, if there is demand.

I think a warning message is a terrible idea, and a system view is a
needless complication.  If the patch is as fast or faster than what we
have now in all cases, then we should adopt it (assuming it's also
correct and well-commented and all that other good stuff).  If it's
not, then we need to analyze the cases where it's slower and decide
whether they are significant enough to care about.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Using quicksort for every external sort run

From

Robert Haas

Date:

20 November 2015, 20:52:23

On Thu, Nov 19, 2015 at 5:53 PM, Peter Geoghegan <pg@heroku.com> wrote:
> I'll now talk about my patch series in general -- the actual
> consequences of not avoiding a single pass merge phase when the master
> branch would have done so.

That's what I was asking about.  It seemed to me that you were saying
we could ignore those cases, which doesn't seem to me to be true.

> The latter 16MB work_mem example query/table is still faster with a
> 12MB work_mem than master, even with multiple passes. Quite a bit
> faster, in fact: about 37 seconds on master, to about 24.7 seconds
> with the patches (same for higher settings short of 16MB).

Is this because we save enough by quicksorting rather than heapsorting
to cover the cost of the additional merge phase?

If not, then why is it happening like this?

> I should point out that there is no evidence that any case has been
> regressed, let alone written off entirely or ignored. I looked. I
> probably have not been completely exhaustive, and I'd be willing to
> believe there is something that I've missed, but it's still quite
> possible that there is no downside to any of this.

If that's so, it's excellent news.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Using quicksort for every external sort run

From

Peter Geoghegan

Date:

20 November 2015, 22:44:59

On Fri, Nov 20, 2015 at 12:50 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Nov 19, 2015 at 5:42 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> I would like more opinions on the multipass_warning message. I can
>> write a patch that creates a new system view, detailing how sort were
>> completed, if there is demand.
>
> I think a warning message is a terrible idea, and a system view is a
> needless complication.  If the patch is as fast or faster than what we
> have now in all cases, then we should adopt it (assuming it's also
> correct and well-commented and all that other good stuff).  If it's
> not, then we need to analyze the cases where it's slower and decide
> whether they are significant enough to care about.

Maybe I was mistaken to link the idea to this patch, but I think it
(or something involving a view) is a good idea. I linked it to the
patch because the patch makes it slightly more important than before.

-- 
Peter Geoghegan

Re: Using quicksort for every external sort run

From

Peter Geoghegan

Date:

20 November 2015, 22:59:01

On Fri, Nov 20, 2015 at 12:52 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> That's what I was asking about.  It seemed to me that you were saying
> we could ignore those cases, which doesn't seem to me to be true.

I've been around for long enough to know that there are very few cases
that can be ignored.  :-)

>> The latter 16MB work_mem example query/table is still faster with a
>> 12MB work_mem than master, even with multiple passes. Quite a bit
>> faster, in fact: about 37 seconds on master, to about 24.7 seconds
>> with the patches (same for higher settings short of 16MB).
>
> Is this because we save enough by quicksorting rather than heapsorting
> to cover the cost of the additional merge phase?
>
> If not, then why is it happening like this?

I think it's because of caching effects alone, but I am not 100% sure
of that. I concede that it might not be enough to make up for the
additional I/O on some systems or platforms. The fact remains,
however, that the patch was faster on the unsympathetic case I ran on
the machine I had available (which has an SSD), and that I really have
not managed to find a case that is regressed after some effort.

>> I should point out that there is no evidence that any case has been
>> regressed, let alone written off entirely or ignored. I looked. I
>> probably have not been completely exhaustive, and I'd be willing to
>> believe there is something that I've missed, but it's still quite
>> possible that there is no downside to any of this.
>
> If that's so, it's excellent news.

As I mentioned up-thread, maybe I shouldn't have brought all the
theoretical justifications for killing replacement selection into the
discussion so early. Those observations on replacement selection
(which are not my own original insights) happen to be what spurred
this work. I spent so much time talking about how irrelevant
multi-pass merging was that people imagined that that was severely
regressed, when it really was not. That just happened to be the way I
came at the problem.

The numbers speak for themselves here. I just want to be clear about
the disadvantages of what I propose, even if it's well worth it
overall in most (all?) cases.

-- 
Peter Geoghegan

Re: Using quicksort for every external sort run

From

Peter Geoghegan

Date:

22 November 2015, 23:40:48

On Fri, Nov 20, 2015 at 2:58 PM, Peter Geoghegan <pg@heroku.com> wrote:
> The numbers speak for themselves here. I just want to be clear about
> the disadvantages of what I propose, even if it's well worth it
> overall in most (all?) cases.

There is a paper called "Critical Evaluation of Existing External
Sorting Methods in the Perspective of Modern Hardware":

http://ceur-ws.org/Vol-1343/paper8.pdf

This paper was not especially influential, and I don't agree with
every detail, or I at least don't think that every recommendation
should be adopted to Postgres. Even still, the paper is the best
summary I have seen so far. It clearly explains why there is plenty to
recommend a simple hybrid sort-merge strategy over replacement
selection, despite the fact that replacement selection is faster when
using 1970s hardware.

-- 
Peter Geoghegan

Re: Using quicksort for every external sort run

From

Peter Geoghegan

Date:

24 November 2015, 23:02:11

On Wed, Nov 18, 2015 at 3:29 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> Overall this is very nice.  Doing some real world index builds of
>> short text (~20 bytes ascii) identifiers, I could easily get speed ups
>> of 40% with your patch if I followed the philosophy of "give it as
>> much maintenance_work_mem as I can afford".  If I fine-tuned the
>> maintenance_work_mem so that it was optimal for each sort method, then
>> the speed up quite a bit less, only 22%.  But 22% is still very
>> worthwhile, and who wants to spend their time fine-tuning the memory
>> use for every index build?
>
> Thanks, but I expected better than that.

It also might have been that you used a "quicksort with spillover".
That still uses a heap to some degree, in order to avoid most I/O, but
with a single backend sorting that can often be slower than the
(greatly overhauled) "external merge" sort method (both of these
algorithms are what you'll see in EXPLAIN ANALYZE, which can be a
little confusing because it isn't clear what the distinction is in
some cases).

You might also very occasionally see an "external sort" (this is also
a description from EXPLAIN ANALYZE), which is generally slower (it's a
case where we were unable to do a final on-the-fly merge, either
because random access is requested by the caller, or because multiple
passes were required -- thankfully this doesn't happen most of the
time).

-- 
Peter Geoghegan

Re: Using quicksort for every external sort run

From

Simon Riggs

Date:

24 November 2015, 23:32:27

On 20 November 2015 at 22:58, Peter Geoghegan <pg@heroku.com> wrote:

The numbers speak for themselves here. I just want to be clear about
the disadvantages of what I propose, even if it's well worth it
overall in most (all?) cases.

My feeling is that numbers rarely speak for themselves, without LSD. (Which numbers?)

How are we doing here? Keen to see this work get committed, so we can move onto parallel sort. What's the summary?

How about we commit it with a sort_algorithm = 'foo' parameter so we can compare things before release of 9.6?

Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Using quicksort for every external sort run

From

Peter Geoghegan

Date:

25 November 2015, 00:33:24

On Tue, Nov 24, 2015 at 3:32 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> My feeling is that numbers rarely speak for themselves, without LSD. (Which
> numbers?)

Guffaw.

> How are we doing here? Keen to see this work get committed, so we can move
> onto parallel sort. What's the summary?

I showed a test case where a CREATE INDEX sort involving 5 runs and a
merge only took about 18% longer than an equivalent fully internal
sort [1] using over 5 times the memory. That's about 2.5X faster than
the 9.5 performance on the same system with the same amount of memory.

Overall, the best cases I saw were the original "quicksort with
spillover" cases [2]. They were just under 4X faster. I care about
that less, though, because that will happen way less often, and won't
help with larger sorts that are even more CPU bound.

There is a theoretical possibility that this is slower on systems
where multiple merge passes are required as a consequence of not
having runs as long as possible (due to not using replacement
selection heap). That will happen very infrequently [3], and is very
probably still worth it.

So, the bottom line is: This patch seems very good, is unlikely to
have any notable downside (no case has been shown to be regressed),
but has yet to receive code review. I am working on a new version with
the first two commits consolidated, and better comments, but that will
have the same code, unless I find bugs or am dissatisfied. It mostly
needs thorough code review, and to a lesser extent some more
performance testing.

Parallel sort is very important. Robert, Amit and I had a call about
this earlier today. We're all in agreement that this should be
extended in that direction, and have a rough idea about how it ought
to fit together with the parallelism primitives. Parallel sort in 9.6
could certainly happen -- that's what I'm aiming for. I haven't really
done preliminary research yet; I'll know more in a little while.

> How about we commit it with a sort_algorithm = 'foo' parameter so we can
> compare things before release of 9.6?

I had a debug GUC (like the existing one to disable top-N heapsorts)
that disabled "quicksort with spillover". That's almost the opposite
of what you're asking for, though, because that makes us never use a
heap. You're asking for me to write a GUC to always use a heap.

That's not a good way of testing this patch, because it's inconvenient
to consider the need to use a heap beyond the first run (something
that now exists solely for the benefit of "quicksort with spillover";
a heap will often never be used even for the first run). Besides, the
merge optimization is a big though independent part of this, and
doesn't make sense to control with the same GUC.

If I haven't gotten this right, we should not commit the patch. If the
patch isn't superior to the existing approach in virtually every way,
then there is no point in making it possible for end-users to disable
with messy GUCs -- it should be reverted.

[1] Message: http://www.postgresql.org/message-id/CAM3SWZRiHaF7jdf923ZZ2qhDJiErqP5uU_+JPuMvUmeD0z9fFA@mail.gmail.com
Attachment:
http://www.postgresql.org/message-id/attachment/39660/quicksort_external_test.txt

[2] http://www.postgresql.org/message-id/CAM3SWZTzLT5Y=VY320NznAyz2z_em3us6x=7rXMEUma9Z9yN6Q@mail.gmail.com

[3] http://www.postgresql.org/message-id/CAM3SWZTX5=nHxPpogPirQsH4cR+BpQS6r7Ktax0HMQiNLf-1qA@mail.gmail.com
-- 
Peter Geoghegan

Re: Using quicksort for every external sort run

From

Simon Riggs

Date:

25 November 2015, 00:47:03

On 25 November 2015 at 00:33, Peter Geoghegan <pg@heroku.com> wrote:

Parallel sort is very important. Robert, Amit and I had a call about
this earlier today. We're all in agreement that this should be
extended in that direction, and have a rough idea about how it ought
to fit together with the parallelism primitives. Parallel sort in 9.6
could certainly happen -- that's what I'm aiming for. I haven't really
done preliminary research yet; I'll know more in a little while.

Glad to hear it, I was hoping to see that.

> How about we commit it with a sort_algorithm = 'foo' parameter so we can
> compare things before release of 9.6?

I had a debug GUC (like the existing one to disable top-N heapsorts)
that disabled "quicksort with spillover". That's almost the opposite
of what you're asking for, though, because that makes us never use a
heap. You're asking for me to write a GUC to always use a heap.

I'm asking for a parameter to confirm results from various algorithms, so we can get many eyeballs to confirm your work across its breadth. This is similar to the original trace_sort parameter which we used to confirm earlier sort improvements. I trust it will show this is good and can be removed prior to release of 9.6.

Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Using quicksort for every external sort run

From

Peter Geoghegan

Date:

25 November 2015, 01:24:28

On Tue, Nov 24, 2015 at 4:46 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> I had a debug GUC (like the existing one to disable top-N heapsorts)
>> that disabled "quicksort with spillover". That's almost the opposite
>> of what you're asking for, though, because that makes us never use a
>> heap. You're asking for me to write a GUC to always use a heap.
>
>
> I'm asking for a parameter to confirm results from various algorithms, so we
> can get many eyeballs to confirm your work across its breadth. This is
> similar to the original trace_sort parameter which we used to confirm
> earlier sort improvements. I trust it will show this is good and can be
> removed prior to release of 9.6.
My patch updates trace_sort messages. trace_sort doesn't change the
behavior of anything. The only time we've ever done anything like this
was for Top-N heap sorts.

This is significantly more inconvenient than you think. See the
comments in the new dumpbatch() function.

-- 
Peter Geoghegan

Re: Using quicksort for every external sort run

From

Greg Stark

Date:

25 November 2015, 01:42:57

On Wed, Nov 25, 2015 at 12:33 AM, Peter Geoghegan <pg@heroku.com> wrote:
> On Tue, Nov 24, 2015 at 3:32 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> My feeling is that numbers rarely speak for themselves, without LSD. (Which
>> numbers?)
>
> Guffaw.

Actually I kind of agree. What I would like to see is a series of
numbers for increasing sizes of sorts plotted against the same series
for the existing algorithm. Specifically with the sort size varying to
significantly more than the physical memory on the machine. For
example on a 16GB machine sorting data ranging from 1GB to 128GB.

There's a lot more information in a series of numbers than individual
numbers. We'll be able to see whether all our pontificating about the
rates of growth of costs of different algorithms or which costs
dominate at which scales are actually borne out in reality. And see
where the break points are where I/O overtakes memory costs. And it'll
be clearer where to look for problematic cases where the new algorithm
might not dominate the old one.

-- 
greg

Re: Using quicksort for every external sort run

From

Peter Geoghegan

Date:

25 November 2015, 02:31:47

On Tue, Nov 24, 2015 at 5:42 PM, Greg Stark <stark@mit.edu> wrote:
> Actually I kind of agree. What I would like to see is a series of
> numbers for increasing sizes of sorts plotted against the same series
> for the existing algorithm. Specifically with the sort size varying to
> significantly more than the physical memory on the machine. For
> example on a 16GB machine sorting data ranging from 1GB to 128GB.

There already was a test case involving a 1TB/16 billion tuple sort
[1] (well, a 1TB gensort Postgres table [2]). Granted, I don't have a
large number of similar test cases across a variety of scales, but
there are only so many hours in the day. Disappointingly, the results
at that scale were merely good, not great, but there was probably
various flaws in how representative the hardware used was.

> There's a lot more information in a series of numbers than individual
> numbers. We'll be able to see whether all our pontificating about the
> rates of growth of costs of different algorithms or which costs
> dominate at which scales are actually borne out in reality.

You yourself said that 1GB is sufficient to get a single-pass merge
phase for a sort of about 4TB - 8TB, so I think the discussion of the
growth in costs tells us plenty about what can happen at the high end.
My approach might help less overall, but it certainly won't falter.

See the 1TB test case -- output from trace_sort is all there.

> And see
> where the break points are where I/O overtakes memory costs. And it'll
> be clearer where to look for problematic cases where the new algorithm
> might not dominate the old one.

I/O doesn't really overtake memory cost -- if it does, then it should
be worthwhile to throw more sequential I/O bandwidth at the problem,
which is a realistic, economical solution with a mature implementation
(unlike buying more memory bandwidth). I didn't do that with the 1TB
test case.

If you assume, as cost_sort() does, that it takes N log2(N)
comparisons to sort some tuples, then it breaks down like this:

10 items require 33 comparisons, ratio 3.32192809489
100 items require 664 comparisons, ratio 6.64385618977
1,000 items require 9,965 comparisons, ratio 9.96578428466
1,000,000 items require 19,931,568 comparisons, ratio 19.9315685693
1,000,000,000 items require 29,897,352,853 comparisons, ratio 29.897352854
16,000,000,000 items require 542,357,645,663 comparisons, ratio 33.897352854

The cost of writing out and reading runs should be more or less in
linear proportion to their size, which is a totally different story.
That's the main reason why "quicksort with spillover" is aimed at
relatively small sorts, which we expect more of overall.

I think the big issue is that a non-parallel sort is significantly
under-powered when you go to sort 16 billion tuples. It's probably not
very sensible to do so if you have a choice of parallelizing the sort.
There is no plausible way to do replacement selection in parallel,
since you cannot know ahead of time with any accuracy where to
partition workers, as runs can end up arbitrarily larger than memory
with presorted inputs. That might be the single best argument for what
I propose to do here.

This is what Corey's case showed for the final run with 30GB
maintenance_work_mem:

LOG:  starting quicksort of run 40: CPU 1815.99s/19339.80u sec elapsed
24910.38 sec
LOG:  finished quicksorting run 40: CPU 1820.09s/19565.94u sec elapsed
25140.69 sec
LOG:  finished writing run 40 to tape 39: CPU 1833.76s/19642.11u sec
elapsed 25234.44 sec

(Note that the time taken to copy tuples comprising the final run is
not displayed or accounted for)

This is the second last run, run 40, so it uses the full 30GB of
maintenance_work_mem. We spend 00:01:33.75 writing the run. However,
we spent 00:03:50.31 just sorting the run. That's roughly the same
ratio that I see on my laptop with far smaller runs. I think the
difference isn't wider because the server is quite I/O bound -- but we
could fix that by adding more disks.

[1] http://www.postgresql.org/message-id/CAM3SWZQtdd=Q+EF1xSZaYG1CiOYQJ7sZFcL08GYqChpJtGnKMg@mail.gmail.com
[2] https://github.com/petergeoghegan/gensort
-- 
Peter Geoghegan

Re: Using quicksort for every external sort run

From

Peter Geoghegan

Date:

25 November 2015, 02:32:21

On Tue, Nov 24, 2015 at 6:31 PM, Peter Geoghegan <pg@heroku.com> wrote:
> (Note that the time taken to copy tuples comprising the final run is
> not displayed or accounted for)

I mean, comprising the second last run, the run shown, run 40.


-- 
Peter Geoghegan

Re: Using quicksort for every external sort run

From

Greg Stark

Date:

25 November 2015, 12:11:35

On Wed, Nov 25, 2015 at 2:31 AM, Peter Geoghegan <pg@heroku.com> wrote:
>
> There already was a test case involving a 1TB/16 billion tuple sort
> [1] (well, a 1TB gensort Postgres table [2]). Granted, I don't have a
> large number of similar test cases across a variety of scales, but
> there are only so many hours in the day. Disappointingly, the results
> at that scale were merely good, not great, but there was probably
> various flaws in how representative the hardware used was.

That's precisely why it's valuable to see a whole series of data
points rather than just one. Often when you see the shape of the
curve, especially any breaks or changes in the behaviour that helps
understand the limitations of the model. Perhaps it would be handy to
find a machine with a very small amount of physical memory so you
could run more reasonably sized tests on it. A VM would be fine if you
could be sure the storage layer isn't caching.

In short, I think you're right in theory and I want to make sure
you're right in practice. I'm afraid if we just look at a few data
points we'll miss out on a bug or a factor we didn't anticipate that
could have been addressed.

Just to double check though. My understanding is that your quicksort
algorithm is to fill work_mem with tuples, quicksort them, write out a
run, and repeat. When the inputs are done read work_mem/runs worth of
tuples from each run into memory and run a merge (using a heap?) like
we do currently. Is that right?

Incidentally one of the reasons abandoning the heap to generate runs
is attractive is that it opens up other sorting algorithms for us.
Instead of quicksort we might be able to plug in a GPU sort for
example.

-- 
greg

Re: Using quicksort for every external sort run

From

Peter Geoghegan

Date:

25 November 2015, 19:15:30

On Wed, Nov 25, 2015 at 4:10 AM, Greg Stark <stark@mit.edu> wrote:
> That's precisely why it's valuable to see a whole series of data
> points rather than just one. Often when you see the shape of the
> curve, especially any breaks or changes in the behaviour that helps
> understand the limitations of the model. Perhaps it would be handy to
> find a machine with a very small amount of physical memory so you
> could run more reasonably sized tests on it. A VM would be fine if you
> could be sure the storage layer isn't caching.

I have access to the Power7 system that Robert and others sometimes
use for this stuff. I'll try to come up a variety of tests.

> In short, I think you're right in theory and I want to make sure
> you're right in practice. I'm afraid if we just look at a few data
> points we'll miss out on a bug or a factor we didn't anticipate that
> could have been addressed.

I am in favor of being comprehensive.

> Just to double check though. My understanding is that your quicksort
> algorithm is to fill work_mem with tuples, quicksort them, write out a
> run, and repeat. When the inputs are done read work_mem/runs worth of
> tuples from each run into memory and run a merge (using a heap?) like
> we do currently. Is that right?

Yes, that's basically what I'm doing.

There are basically two extra bits:

* Without changing how merging actually works, I am clever about
allocating memory for the final on-the-fly merge. Allocation is done
once, in one huge batch. Importantly, I exploit locality by having
every "tuple proper" (e.g. IndexTuple) in contiguous memory, in sorted
(tape) order, per tape. This also greatly reduces palloc() overhead
for the final on-the-fly merge step.

* We do something special when we're just over work_mem, to avoid most
I/O -- "quicksort with spillover". This is a nice trick, but it's
certain way less important than the basic idea of simply always
quicksorting runs. I could easily not do this. This is why the heap
code was not significantly simplified to only cover the merge cases,
though -- this uses essentially the same replacement selection style
heap to incrementally spill to get us enough memory to mostly complete
the sort internally.

> Incidentally one of the reasons abandoning the heap to generate runs
> is attractive is that it opens up other sorting algorithms for us.
> Instead of quicksort we might be able to plug in a GPU sort for
> example.

Yes, it's true that we automatically benefit from optimizations for
the internal sort case now. That's already happening with the patch,
actually -- the "onlyKey" optimization (a more specialized quicksort
specialization, used in the one attribute heap case, and datum case)
is now automatically used. That was where the best 2012 numbers for
SortSupport were seen, so that makes a significant difference. As you
say, something like that could easily happen again.

-- 
Peter Geoghegan

Re: Using quicksort for every external sort run

From

Jeff Janes

Date:

28 November 2015, 22:04:30

On Wed, Nov 18, 2015 at 3:29 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Wed, Nov 18, 2015 at 10:31 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
>
>> I agree we don't want to optimize for low memory, but I don't think we
>> should throw it under the bus, either.  Right now we are effectively
>> saying the CPU-cache problems with the heap start exceeding the larger
>> run size benefits at 64kb (the smallest allowed setting for work_mem).
>> While any number we pick is going to be a guess that won't apply to
>> all hardware, surely we can come up with a guess better than 64kb.
>> Like, 8 MB, say.  If available memory for the sort is 8MB or smaller
>> and the predicted size anticipates a multipass merge, then we can use
>> the heap method rather than the quicksort method.  Would a rule like
>> that complicate things much?
>
> I'm already using replacement selection for the first run when it is
> predicted by my new ad-hoc cost model that we can get away with a
> "quicksort with spillover", avoiding almost all I/O. We only
> incrementally spill as many tuples as needed right now, but it would
> be pretty easy to not quicksort the remaining tuples, but continue to
> incrementally spill everything. So no, it wouldn't be too hard to hang
> on to the old behavior sometimes, if it looked worthwhile.
>
> In principle, I have no problem with doing that. Through testing, I
> cannot see any actual upside, though. Perhaps I just missed something.
> Even 8MB is enough to avoid the multipass merge in the event of a
> surprisingly high volume of data (my work laptop is elsewhere, so I
> don't have my notes on this in front of me, but I figured out the
> crossover point for a couple of cases).

For me very large sorts (100,000,000 ints) with work_mem below 4MB do
better with unpatched than with your patch series, by about 5%.  Not a
big deal, but also if it is easy to keep the old behavior then I think
we should.  Yes, it is dumb to do large sorts with work_mem below 4MB,
but if you have canned apps which do a mixture of workloads it is not
so easy to micromanage their work_mem.  Especially as there are no
easy tools that let me as the DBA say "if you connect from this IP
address, you get this work_mem".

I didn't collect trace_sort on those ones because of the high volume
it would generate.

>
>>> In theory, the answer could be "yes", but it seems highly unlikely.
>>> Not only is very little memory required to avoid a multi-pass merge
>>> step, but as described above the amount required grows very slowly
>>> relative to linear growth in input. I propose to add a
>>> checkpoint_warning style warning (with a checkpoint_warning style GUC
>>> to control it).
>>
>> I'm skeptical about a warning for this.
>
> Other systems expose this explicitly, and, as I said, say in an
> unqualified way that a multi-pass merge should be avoided. Maybe the
> warning isn't the right way of communicating that message to the DBA
> in detail, but I am confident that it ought to be communicated to the
> DBA fairly clearly.

I thinking about how many other places in the code could justify a
similar type of warning "If you just gave me 15% more memory, this
hash join would be much faster", and what that would make the logs
look like if future work went along with this precedence.  If there
were some mechanism to put the warning in a system view counter
instead of the log file, that would be much cleaner.  Or a way to
separate the server log file into streams.  But since we don't have
those, I guess I can't really object much to the proposed behavior.

>
>> One idea would be to stop and write out a just-sorted partition
>> whenever that partition is contiguous to the already-written portion.
>> If the qsort is tweaked to recurse preferentially into the left
>> partition first, this would result in tuples being written out at a
>> pretty study pace.  If the qsort was unbalanced and the left partition
>> was always the larger of the two, then that approach would have to be
>> abandoned at some point.  But I think there are already defenses
>> against that, and at worst you would give up and revert to the
>> sort-them-all then write-them-all behavior.
>
> Seems kind of invasive.

I agree, but I wonder if it won't become much more important at 30GB
of work_mem.  Of course if there is no reason to ever set work_mem
that high, then it wouldn't matter--but there is always a reason to do
so, if you have so much memory to spare.  So better than that invasive
work, I guess would be to make sort use less than work_mem if it gets
no benefit from using all of it.  Anyway, ideas for future work,
either way.

>
>> Overall this is very nice.  Doing some real world index builds of
>> short text (~20 bytes ascii) identifiers, I could easily get speed ups
>> of 40% with your patch if I followed the philosophy of "give it as
>> much maintenance_work_mem as I can afford".  If I fine-tuned the
>> maintenance_work_mem so that it was optimal for each sort method, then
>> the speed up quite a bit less, only 22%.  But 22% is still very
>> worthwhile, and who wants to spend their time fine-tuning the memory
>> use for every index build?
>
> Thanks, but I expected better than that. Was it a collated text
> column? The C collation will put the patch in a much better light
> (more strcoll() calls are needed with this new approach -- it's still
> well worth it, but it is a downside that makes collated text not
> especially sympathetic). Just sorting on an integer attribute is also
> a good sympathetic case, FWIW.

It was UTF8 encoded (although all characters were actually ASCII), but
C collated.

I've never seen improvements of 3 fold or more like you saw, under any
conditions, so I wonder if your test machine doesn't have unusually
slow main memory.

>
> How much time did the sort take in each case? How many runs? How much
> time was spent merging? trace_sort output is very interesting here.

My largest test, which took my true table and extrapolated it out for
a few years growth, had about 500,000,000 rows.

At 3GB maintainance_work_mem, it took 13 runs patched and 7 runs
unpatched to build the index, with timings of 3168.66 sec and 5713.07
sec.

The final merging is intermixed with whatever other work goes on to
build the actual index files out of the sorted data, so I don't know
exactly what the timing of just the merge part was.  But it was
certainly a minority of the time, even if you assume the actual index
build were free.  For the patched code, the majority of the time goes
to the quick sorting stages.

When I test each version of the code at its own most efficient
maintenance_work_mem, I get
3007.2 seconds at 1GB for patched and 3836.46 seconds at 64MB for unpatched.

I'm attaching the trace_sort output from the client log for all 4 of
those scenarios.  "sort_0005" means all 5 of your patches were
applied, "origin" means none of them were.

Cheers,

Jeff

On Mon, Nov 30, 2015 at 12:29 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> I'm kind of curious as to why the optimal for the patched code appears
>> at 1GB and not lower.  If I get a chance to rebuild the test, I will
>> look into that more.
>
> I think that the availability of abbreviated keys (or something that
> allows most comparisons made by quicksort/the heap to be resolved at
> the SortTuple level) could make a big difference for things like this.

Using the Hydra POWER7 server [1] + the gensort benchmark [2], which
uses the C collation, and has abbreviated keys that have lots of
entropy, I see benefits with higher and higher maintenance_work_mem
settings.

I will present a variety of cases, which seemed like something Greg
Stark is particularly interested in. On the whole, I am quite pleased
with how things are shown to be improved in a variety of different
scenarios.

Looking at CREATE INDEX build times on an (unlogged) gensort table
with 50 million, 100 million, 250 million, and 500 million tuples,
with maintenance_work_mem settings of 512MB, 1GB, 10GB, and 15GB,
there are sustained improvements as more memory is made available. I'm
not saying that that would be the case with low cardinality leading
attribute tuples -- probably not -- but it seems pretty nice that this
case can sustain improvements as more memory is made available. The
server used here has reasonably good disks (Robert goes into this in
his blogpost), but nothing spectacular.

This is what a 500 million tuple gensort table looks like:

postgres=# \dt+
                    List of relations
 Schema |   Name    | Type  | Owner | Size  | Description
--------+-----------+-------+-------+-------+-------------
 public | sort_test | table | pg    | 32 GB |
(1 row)

Results:

50 million tuple table (best of 3):
------------------------------------------

512MB: (8-way final merge) external sort ended, 171058 disk blocks
used: CPU 4.11s/79.30u sec elapsed 83.60 sec
1GB: (4-way final merge) external sort ended, 171063 disk blocks used:
CPU 4.29s/71.34u sec elapsed 75.69 sec
10GB: N/A
15GB: N/A
1GB (master branch): (3-way final merge) external sort ended, 171064
disk blocks used: CPU 6.19s/163.00u sec elapsed 170.84 sec

100 million tuple table (best of 3):
--------------------------------------------

512MB: (16-way final merge) external sort ended, 342114 disk blocks
used: CPU 8.61s/177.77u sec elapsed 187.03 sec
1GB: (8-way final merge) external sort ended, 342124 disk blocks used:
CPU 8.07s/165.15u sec elapsed 173.70 sec
10GB: N/A
15GB: N/A
1GB (master branch): (5-way final merge) external sort ended, 342129
disk blocks used: CPU 11.68s/358.17u sec elapsed 376.41 sec

250 million tuple table (best of 3):
--------------------------------------------

512MB:  (39-way final merge) external sort ended, 855284 disk blocks
used: CPU 19.96s/486.57u sec elapsed 507.89 sec
1GB: (20-way final merge) external sort ended, 855306 disk blocks
used: CPU 22.63s/475.33u sec elapsed 499.09 sec
10GB: (2-way final merge) external sort ended, 855326 disk blocks
used: CPU 21.99s/341.34u sec elapsed 366.15 sec
15GB: (2-way final merge) external sort ended, 855326 disk blocks
used: CPU 23.23s/322.18u sec elapsed 346.97 sec
1GB (master branch): (11-way final merge) external sort ended, 855315
disk blocks used: CPU 30.56s/973.00u sec elapsed 1015.63 sec

500 million tuple table (best of 3):
--------------------------------------------

512MB: (77-way final merge) external sort ended, 1710566 disk blocks
used: CPU 45.70s/1016.70u sec elapsed 1069.02 sec
1GB: (39-way final merge) external sort ended, 1710613 disk blocks
used: CPU 44.34s/1013.26u sec elapsed 1067.16 sec
10GB: (4-way final merge) external sort ended, 1710649 disk blocks
used: CPU 46.46s/772.97u sec elapsed 841.35 sec
15GB: (3-way final merge) external sort ended, 1710652 disk blocks
used: CPU 51.55s/729.88u sec elapsed 809.68 sec
1GB (master branch): (20-way final merge) external sort ended, 1710632
disk blocks used: CPU 69.35s/2013.21u sec elapsed 2113.82 sec

I attached a detailed account of these benchmarks, for those that
really want to see the nitty-gritty. This includes a 1GB case for
patch without memory prefetching (which is not described in this
message).

[1] http://rhaas.blogspot.com/2012/03/performance-and-scalability-on-ibm.html
[2] https://github.com/petergeoghegan/gensort
--
Peter Geoghegan

Attachment

sort-benchmarks.tar.gz

Re: Using quicksort for every external sort run

From

Greg Stark

Date:

01 December 2015, 01:12:59

Hm. Here is a log-log chart of those results (sorry for html mail). I'm not really sure if log-log is the right tool to use for a O(nlog(n)) curve though.

I think the take-away is that this is outside the domain where any interesting break points occur. Maybe run more tests on the low end to find where the tapesort can generate a single tape and avoid the merge and see where the discontinuity is with quicksort for the various work_mem sizes.

And can you calculate an estimate where the domain would be where multiple passes would be needed for this table at these work_mem sizes? Is it feasible to test around there?

greg

Attachment

image.png

Re: Using quicksort for every external sort run

From

Peter Geoghegan

Date:

01 December 2015, 01:21:47

On Mon, Nov 30, 2015 at 5:12 PM, Greg Stark <stark@mit.edu> wrote:
> I think the take-away is that this is outside the domain where any interesting break points occur.

I think that these are representative of what people want to do with
external sorts. We have already had Jeff look for a regression. He
found one only with less than 4MB of work_mem (the default), with over
100 million tuples.

What exactly are we looking for?

> And can you calculate an estimate where the domain would be where multiple passes would be needed for this table at
thesework_mem sizes? Is it feasible to test around there?
 

Well, you said that 1GB of work_mem was enough to avoid that within
about 4TB - 8TB of data. So, I believe the answer is "no":

[pg@hydra ~]$ df -h
Filesystem                 Size  Used Avail Use% Mounted on
rootfs                      20G   19G  519M  98% /
devtmpfs                    31G  128K   31G   1% /dev
tmpfs                       31G  384K   31G   1% /dev/shm
/dev/mapper/vg_hydra-root   20G   19G  519M  98% /
tmpfs                       31G  127M   31G   1% /run
tmpfs                       31G     0   31G   0% /sys/fs/cgroup
tmpfs                       31G     0   31G   0% /media
/dev/md0                   497M  145M  328M  31% /boot
/dev/mapper/vg_hydra-data 1023G  322G  651G  34% /data

-- 
Peter Geoghegan

Re: Using quicksort for every external sort run

From

Robert Haas

Date:

02 December 2015, 18:03:07

On Sat, Nov 28, 2015 at 7:05 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Sat, Nov 28, 2015 at 2:04 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> For me very large sorts (100,000,000 ints) with work_mem below 4MB do
>> better with unpatched than with your patch series, by about 5%.  Not a
>> big deal, but also if it is easy to keep the old behavior then I think
>> we should.  Yes, it is dumb to do large sorts with work_mem below 4MB,
>> but if you have canned apps which do a mixture of workloads it is not
>> so easy to micromanage their work_mem.  Especially as there are no
>> easy tools that let me as the DBA say "if you connect from this IP
>> address, you get this work_mem".
>
> I'm not very concerned about a regression that is only seen when
> work_mem is set below the (very conservative) postgresql.conf default
> value of 4MB when sorting 100 million integers.

Perhaps surprisingly, I tend to agree.  I'm cautious of regressions
here, but large sorts in queries are relatively uncommon.  You're
certainly not going to want to return a 100 million tuples to the
client.  If you're trying to do a merge join with 100 million tuples,
well, 100 million integers @ 32 bytes per tuple is 3.2GB, and that's
the size of a tuple with a 4 byte integer and at most 4 bytes of other
data being carried along with it.  So in practice you'd probably need
to have at least 5-10GB of data, which means you are trying to sort
data over a million times larger than the amount of memory you allowed
for the sort.   With or without that patch, you should really consider
raising work_mem.  And maybe create some indexes so that the planner
doesn't choose a merge join any more.  The aggregate case is perhaps
with a little more thought: maybe you are sorting 100 million tuples
so that you can GroupAggregate them.  But, there again, the benefits
of raising work_mem are quite large with or without this patch.  Heck,
if you're lucky, a little more work_mem might switch you to a
HashAggregate.  I'm not sure it's worth complicating the code to cater
to those cases.

While large sorts are uncommon in queries, they are much more common
in index builds.  Therefore, I think we ought to be worrying more
about regressions at 64MB than at 4MB, because we ship with
maintenance_work_mem = 64MB and a lot of people probably don't change
it before trying to build an index.  If we make those index builds go
faster, users will be happy.  If we make them go slower, users will be
sad.  So I think it's worth asking the question "are there any CREATE
INDEX commands that someone might type on a system on which they've
done no other configuration that will be slower with this patch"?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Using quicksort for every external sort run

From

Peter Geoghegan

Date:

02 December 2015, 19:26:05

On Wed, Dec 2, 2015 at 10:03 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I'm not very concerned about a regression that is only seen when
>> work_mem is set below the (very conservative) postgresql.conf default
>> value of 4MB when sorting 100 million integers.
>
> Perhaps surprisingly, I tend to agree.  I'm cautious of regressions
> here, but large sorts in queries are relatively uncommon.  You're
> certainly not going to want to return a 100 million tuples to the
> client.

Right. The fact that it was only a 5% regression is also a big part of
what made me unconcerned. I am glad that we've characterized the
regression that I assumed was there, though -- I certainly knew that
Knuth and so on were not wrong to emphasize increasing run size in the
1970s. Volume 3 of The Art of Computer Programming literally has a
pull-out chart showing the timing of external sorts. This includes the
time it takes for a human operator to switch magnetic tapes, and
rewind those tapes. The underlying technology has changed rather a lot
since, of course.

> While large sorts are uncommon in queries, they are much more common
> in index builds.  Therefore, I think we ought to be worrying more
> about regressions at 64MB than at 4MB, because we ship with
> maintenance_work_mem = 64MB and a lot of people probably don't change
> it before trying to build an index.  If we make those index builds go
> faster, users will be happy.  If we make them go slower, users will be
> sad.  So I think it's worth asking the question "are there any CREATE
> INDEX commands that someone might type on a system on which they've
> done no other configuration that will be slower with this patch"?

I certainly agree that that's a good place to focus. I think that it's
far, far less likely that anything will be slowed down when you take
this as a cut-off point. I don't want to overemphasize it, but the
analysis of how many more passes are needed because of lack of a
replacement selection heap (the "quadratic growth" thing) gives me
confidence. A case with less than 4MB of work_mem is where we actually
saw *some* regression.

-- 
Peter Geoghegan

Re: Using quicksort for every external sort run

From

Jeff Janes

Date:

06 December 2015, 23:59:32

On Sun, Nov 29, 2015 at 2:01 AM, Peter Geoghegan <pg@heroku.com> wrote:
> On Sat, Nov 28, 2015 at 4:05 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> So there was 27.76 seconds spent copying tuples into local memory
>> ahead of the quicksort, 2 minutes 56.68 seconds spent actually
>> quicksorting, and a trifling 10.32 seconds actually writing the run! I
>> bet that the quicksort really didn't use up too much memory bandwidth
>> on the system as a whole, since abbreviated keys are used with a cache
>> oblivious internal sorting algorithm.
>
> Uh, actually, that isn't so:
>
> LOG:  begin index sort: unique = f, workMem = 1048576, randomAccess = f
> LOG:  bttext_abbrev: abbrev_distinct after 160: 1.000489
> (key_distinct: 40.802210, norm_abbrev_card: 0.006253, prop_card:
> 0.200000)
> LOG:  bttext_abbrev: aborted abbreviation at 160 (abbrev_distinct:
> 1.000489, key_distinct: 40.802210, prop_card: 0.200000)
>
> Abbreviation is aborted in all cases that you tested. Arguably this
> should happen significantly less frequently with the "C" locale,
> possibly almost never, but it makes this case less than representative
> of most people's workloads. I think that at least the first several
> hundred leading attribute tuples are duplicates.

I guess I wasn't paying sufficient attention to that part of
trace_sort, I was not familiar enough with the abbreviation feature to
interpret what it meant.    I had thought we used 16 bytes for
abbreviation, but now I see it is only 8 bytes.

My column has the format of ABC-123-456-789-0

The name-space identifier ("ABC-")  is the same in 99.99% of the
cases.  And to date, as well as in my extrapolation, the first two
digits of the numeric part are leading zeros and the third one is
mostly 0,1,2.  So the first 8 bytes really have less than 2 bits worth
of information.  So yeah, not surprising abbreviation was not useful.

(When I created the system, I did tests that showed it doesn't make
much difference whether I used the format natively, or stripped it to
something more compact on input and reformatted it on output.  That
was before abbreviation features existed)

>
> BTW, roughly what does this CREATE INDEX look like? Is it a composite
> index, for example?

Nope, just a single column index.  In the extrapolated data set, each
distinct value shows up a couple hundred times on average.  I'm
thinking of converting it to a btree_gin index once I've tested them a
bit more, as the compression benefits are substantial.

Cheers,

Jeff

Re: Using quicksort for every external sort run

From

Peter Geoghegan

Date:

07 December 2015, 00:04:15

On Sun, Dec 6, 2015 at 3:59 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> My column has the format of ABC-123-456-789-0
>
> The name-space identifier ("ABC-")  is the same in 99.99% of the
> cases.  And to date, as well as in my extrapolation, the first two
> digits of the numeric part are leading zeros and the third one is
> mostly 0,1,2.  So the first 8 bytes really have less than 2 bits worth
> of information.  So yeah, not surprising abbreviation was not useful.

I think that given you're using the "C" collation, abbreviation should
still go ahead. I posted a patch to do that, which I need to further
justify per Robert's request (currently, we do nothing special based
on collation). Abbreviation should help in surprisingly marginal
cases, since far fewer memory accesses will be required in the early
stages of the sort with only (say) 5 distinct abbreviated keys. Once
abbreviated comparisons start to not help at all (with quicksort, at
some partition), there's a good chance that the full keys can be
reused to some extent, before being evicted from CPU caches.

>> BTW, roughly what does this CREATE INDEX look like? Is it a composite
>> index, for example?
>
> Nope, just a single column index.  In the extrapolated data set, each
> distinct value shows up a couple hundred times on average.  I'm
> thinking of converting it to a btree_gin index once I've tested them a
> bit more, as the compression benefits are substantial.

Unfortunately, that cannot use tuplesort.c at all.

-- 
Peter Geoghegan

Re: Using quicksort for every external sort run

From

Peter Geoghegan

Date:

07 December 2015, 00:25:32

On Tue, Nov 24, 2015 at 4:33 PM, Peter Geoghegan <pg@heroku.com> wrote:
> So, the bottom line is: This patch seems very good, is unlikely to
> have any notable downside (no case has been shown to be regressed),
> but has yet to receive code review. I am working on a new version with
> the first two commits consolidated, and better comments, but that will
> have the same code, unless I find bugs or am dissatisfied. It mostly
> needs thorough code review, and to a lesser extent some more
> performance testing.

I'm currently spending a lot of time working on parallel CREATE INDEX.
I should not delay posting a new version of my patch series any
further, though. I hope to polish up parallel CREATE INDEX to be able
to show people something in a couple of weeks.

This version features consolidated commits, the removal of the
multipass_warning parameter, and improved comments and commit
messages. It has almost entirely unchanged functionality.

The only functional changes are:

* The function useselection() is taught to distrust an obviously bogus
caller reltuples hint (when it's already less than half of what we
know to be the minimum number of tuples that the sort must sort,
immediately after LACKMEM() first becomes true -- this is probably a
generic estimate).

* Prefetching only occurs when writing tuples. Explicit prefetching
appears to hurt in some cases, as David Rowley has shown over on the
dedicated thread. But it might still be that writing tuples is a case
that is simple enough to benefit consistently, due to the relatively
uniform processing that memory latency can hide behind for that case
(before, the same prefetching instructions were used for CREATE INDEX
and for aggregates, for example).

Maybe we should consider trying to get patch 0002 (the memory
pool/merge patch) committed first, something Greg Stark suggested
privately. That might actually be an easier way of integrating this
work, since it changes nothing about the algorithm we use for merging
(it only improves memory locality), and so is really an independent
piece of work (albeit one that makes a huge overall difference due to
the other patches increasing the time spent merging in absolute terms,
and especially as a proportion of the total).

--
Peter Geoghegan

Attachment

Re: Using quicksort for every external sort run

From

Peter Geoghegan

Date:

07 December 2015, 00:45:13

On Tue, Nov 24, 2015 at 4:46 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> Parallel sort is very important. Robert, Amit and I had a call about
>> this earlier today. We're all in agreement that this should be
>> extended in that direction, and have a rough idea about how it ought
>> to fit together with the parallelism primitives. Parallel sort in 9.6
>> could certainly happen -- that's what I'm aiming for. I haven't really
>> done preliminary research yet; I'll know more in a little while.
>
> Glad to hear it, I was hoping to see that.

As I mentioned just now, I'm working on parallel CREATE INDEX
currently, which seems like a good proving ground for parallel sort,
as its where the majority of really expensive sorts occur. It would be
nice to get parallel-aware sort nodes in 9.6, but I don't think I'll
be able to make that happen in time. The required work in the
optimizer is just too complicated.

The basic idea is that we use the parallel heapam interface, and have
backends sort and write runs as with an external sort (if those runs
are would-be internal sorts, we still write them to tape in the manner
of external sorts). When done, worker processes release memory, but
not tapes, initially. The leader reassembles an in-memory
representation of the tapes that is basically consistent with it
having generated those runs itself (using my new approach to external
sorting). Then, it performs an on-the-fly merge, as before.

At the moment, I have the sorting of runs within workers using the
parallel heapam interface more or less working, with workers dumping
out the runs to tape. I'll work on reassembling the state of the tapes
within the leader in the coming week. It's all still rather rough, but
I think I'll have benchmarks before people start taking time off later
in the month, and possibly even code. Cutting the scope of parallel
sort in 9.6 to only cover parallel CREATE INDEX will make it likely
that I'll be able to deliver something acceptable for that release.

-- 
Peter Geoghegan

Re: Using quicksort for every external sort run

From

Greg Stark

Date:

07 December 2015, 13:18:15

So incidentally I've been running some benchmarks myself. Mostly to understand the current scaling behaviour of sorting to better judge whether Peter's analysis of where the pain points are and why we should not worry about optimizing for the multiple merge pass case were on target. I haven't actually benchmarked his patch at all, just stock head so far.

The really surprising result (for me) so far is that apparently merge passes spent actually very little time doing I/O. I had always assumed most of the time was spent waiting on I/O and that's why we spend so much effort ensuring sequential I/O and trying to maximize run lengths. I was expecting to see a huge step increase in the total time whenever there was an increase in merge passes. However I see hardly any increase, sometimes even a decrease despite the extra pass. The time generally increases as work_mem decreases but the slope is pretty moderate and gradual with no big steps due to extra passes.

On further analysis I'm less surprised by this than previously. The larger benchmarks I'm running are on a 7GB table which only actually generates 2.6GB of sort data so even writing all that out and then reading it all back in on a 100MB/s disk would only take an extra 50s. That won't make a big dent when the whole sort takes about 30 minutes. Even if you assume there's a substantial amount of random I/O it'll only be a 10% difference or so which is more or less in line with what I'm seeing.

I haven't actually got to benchmarking Peter's patch at all but this is reinforcing his argument dramatically. If the worst case for using quicksort is that the shorter runs might push us into doing an extra merge and that might add an extra 10% to the run-time that will be easily counter-balanced by the faster quicksort and in any case it only affects people who for some reason can't just increase work_mem to allow the single merge mode.

Table Size	Sort Size	128MB	64MB	32MB	16MB	8MB	4MB
6914MB	2672 MB	3392.29	3102.13	3343.53	4081.23	4727.74	5620.77
3457MB	1336 MB	1669.16	1593.85	1444.22	1654.27	2076.74	2266.84
2765MB	1069 MB	1368.92	1250.44	1117.2	1293.45	1431.64	1772.18
1383MB	535 MB	716.48	625.06	557.14	575.67	644.2	721.68
691MB	267 MB	301.08	295.87	266.84	256.29	283.82	292.24
346MB	134 MB	145.48	149.48	133.23	130.69	127.67	137.74
35MB	13 MB	3.58	16.77	11.23	11.93	13.97	3.17

The colours are to give an idea of the number of merge passes. Grey, is an internal sort. White is a single merge. Yellow and red are successively more merges (though the exact boundary between yellow and red may not be exactly meaningful due to my misunderstanding polyphase merge).

The numbers here are seconds taken from the "elapsed" in the following log statements when running queries like the following with trace_sort enabled:

LOG: external sort ended, 342138 disk blocks used: CPU 276.04s/3173.04u sec elapsed 5620.77 sec

STATEMENT: select count(*) from (select * from n200000000 order by r offset 99999999999) AS x;

This was run on the smallest size VM on Google Compute Engine with 600MB of virtual RAM and a 100GB virtual network block device.

Re: Using quicksort for every external sort run

From

Jeff Janes

Date:

07 December 2015, 17:01:59

On Wed, Dec 2, 2015 at 10:03 AM, Robert Haas <robertmhaas@gmail.com> wrote:

>
> While large sorts are uncommon in queries, they are much more common
> in index builds.  Therefore, I think we ought to be worrying more
> about regressions at 64MB than at 4MB, because we ship with
> maintenance_work_mem = 64MB and a lot of people probably don't change
> it before trying to build an index.

You have more sympathy for people who don't tune their settings than I do.

Especially now that auovacuum_work_mem exists, there is much less
constraint on increasing maintance_work_mem than there is on work_mem.
Unless, perhaps, you have a lot of user-driven temp tables which get
indexes created on them.

> If we make those index builds go
> faster, users will be happy.  If we make them go slower, users will be
> sad.  So I think it's worth asking the question "are there any CREATE
> INDEX commands that someone might type on a system on which they've
> done no other configuration that will be slower with this patch"?

I found a regression on my 2nd attempt.  I am indexing random md5
hashes (so they should get the full benefit of key abbreviation), and
in this case 400,000,000 of them:

create table foobar as select md5(random()::text) as x, random() as y
from generate_series(1,100000000);
insert into foobar select * from foobar ;
insert into foobar select * from foobar ;

Gives a 29GB table.

with the index:

create index on foobar (x);

With 64MB maintenance_work_mem, I get (best time of 7 or 8):

unpatched 2,436,483.834 ms
allpatches 3,964,875.570 ms    62% slower
not_0005   3,794,716.331 ms

The unpatched sort ends with a 118-way merge followed by a 233-way merge:

LOG:  finished 118-way merge step: CPU 98.65s/835.67u sec elapsed 1270.61 sec
LOG:  performsort done (except 233-way final merge): CPU
98.75s/835.88u sec elapsed 1276.14 sec
LOG:  external sort ended, 2541465 disk blocks used: CPU
194.02s/1635.12u sec elapsed 2435.46 sec

The patched one ends with a 2-way, two sequential 233-way merges, and
a final 233-way merge:

LOG:  finished 2-way merge step: CPU 62.08s/435.70u sec elapsed 587.52 sec
LOG:  finished 233-way merge step: CPU 77.94s/660.11u sec elapsed 897.51 sec
LOG:  a multi-pass external merge sort is required (234 tape maximum)
HINT:  Consider increasing the configuration parameter "maintenance_work_mem".
LOG:  finished 233-way merge step: CPU 94.55s/884.63u sec elapsed 1185.17 sec
LOG:  performsort done (except 233-way final merge): CPU
94.76s/884.69u sec elapsed 1192.01 sec
LOG:  external sort ended, 2541656 disk blocks used: CPU
202.65s/1771.50u sec elapsed 3963.90 sec

If you just look at the final merges of each, they should have the
same number of tuples going through them (i.e. all of the tuples) but
the patched one took well over twice as long, and all that time was IO
time, not CPU time.

I reversed out the memory pooling patch, and that shaved some time
off, but nowhere near bringing it back to parity.

I think what is going on here is that the different numbers of runs
with the patched code just makes it land in an anti-sweat spot in the
tape emulation and buffering algorithm.

Each tape gets 256kB of buffer.  But two tapes have one third of the
tuples each other third are spread over all the other tapes almost
equally (or maybe one tape has 2/3 of the tuples, if the output of one
233-way nonfinal merge was selected as the input of the other one).
Once the large tape(s) has depleted its buffer, the others have had
only slightly more than 1kB each depleted.  Yet when it goes to fill
the large tape, it also tops off every other tape while it is there,
which is not going to get much read-ahead performance on them, leading
to a lot of random IO.

Now, I'm not sure why this same logic wouldn't apply to the unpatched
code with 118-way merge too.  So maybe I am all wet here.  It seems
like that imbalance would be enough to also cause the problem.

I have seen this same type if things years ago, but was never able to
analyze it to my satisfaction (as I haven't been able to do now,
either).

So if this patch with this exact workload just happens to land on a
pre-existing infelicity, how big of a deal is that?  It wouldn't be
creating a regression, just shoving the region that experiences the
problem around in such a way that it affects a different group of use
cases.

And perhaps more importantly, can anyone else reproduce this, or understand it?

Cheers,

Jeff

Re: Using quicksort for every external sort run

From

Peter Geoghegan

Date:

07 December 2015, 18:18:26

On Mon, Dec 7, 2015 at 9:01 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> So if this patch with this exact workload just happens to land on a
> pre-existing infelicity, how big of a deal is that?  It wouldn't be
> creating a regression, just shoving the region that experiences the
> problem around in such a way that it affects a different group of use
> cases.
>
> And perhaps more importantly, can anyone else reproduce this, or understand it?

That's odd. I've never seen anything like that in the field myself,
but then I've never really been a professional DBA.

If possible, could you try using the ioreplay tool to correlate I/O
with a point in the trace_sort timeline? For both master, and the
patch, for comparison? The tool is available from here:

https://code.google.com/p/ioapps/

There is also a tool available to graph the recorded I/O requests over
time called ioprofiler.

This is the only way that I've been able to graph I/O over time
successfully before. Maybe there is a better way, using perf blockio
or something like that, but this is the way I know to work.

While I'm quite willing to believe that there are oddities about our
polyphase merge implementation that can result in what you call
anti-sweetspots (sourspots?), I have a much harder time imagining why
reverting my merge patch could make things better, unless the system
was experiencing some kind of memory pressure. I mean, it doesn't
change the algorithm at all, except to make more memory available from
the merge by avoiding palloc() fragmentation. How could that possibly
hurt?

-- 
Peter Geoghegan

Re: Using quicksort for every external sort run

From

Jeff Janes

Date:

09 December 2015, 00:02:49

On Mon, Dec 7, 2015 at 9:01 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
>
> The patched one ends with a 2-way, two sequential 233-way merges, and
> a final 233-way merge:
>
> LOG:  finished 2-way merge step: CPU 62.08s/435.70u sec elapsed 587.52 sec
> LOG:  finished 233-way merge step: CPU 77.94s/660.11u sec elapsed 897.51 sec
> LOG:  a multi-pass external merge sort is required (234 tape maximum)
> HINT:  Consider increasing the configuration parameter "maintenance_work_mem".
> LOG:  finished 233-way merge step: CPU 94.55s/884.63u sec elapsed 1185.17 sec
> LOG:  performsort done (except 233-way final merge): CPU
> 94.76s/884.69u sec elapsed 1192.01 sec
> LOG:  external sort ended, 2541656 disk blocks used: CPU
> 202.65s/1771.50u sec elapsed 3963.90 sec
>
>
> If you just look at the final merges of each, they should have the
> same number of tuples going through them (i.e. all of the tuples) but
> the patched one took well over twice as long, and all that time was IO
> time, not CPU time.
>
> I reversed out the memory pooling patch, and that shaved some time
> off, but nowhere near bringing it back to parity.
>
> I think what is going on here is that the different numbers of runs
> with the patched code just makes it land in an anti-sweat spot in the
> tape emulation and buffering algorithm.
>
> Each tape gets 256kB of buffer.  But two tapes have one third of the
> tuples each other third are spread over all the other tapes almost
> equally (or maybe one tape has 2/3 of the tuples, if the output of one
> 233-way nonfinal merge was selected as the input of the other one).
> Once the large tape(s) has depleted its buffer, the others have had
> only slightly more than 1kB each depleted.  Yet when it goes to fill
> the large tape, it also tops off every other tape while it is there,
> which is not going to get much read-ahead performance on them, leading
> to a lot of random IO.

The final merge only refills each tape buffer as that buffer gets
depleted, rather than refilling all of them whenever any is depleted,
so my explanation doesn't work. But move it back one layer.  There are
3 sequential 233-way merges.  The first one produces a giant tape run.
The second one consumes that giant tape run along with 232 small tape
runs.  At this point, the logic I describe above does come into play,
refilling each of the buffers for the small runs much too often,
freeing blocks on the tape emulation for those runs in dribs and
drabs.  Those free blocks get re-used by the giant output tape run, in
a scattered fashion.

Then in the next (final) merge, it is has to read in this huge
fragmented tape run emulation, generating a lot of random IO to read
it.

With the patched code, the average length of reads on files in
pgsql_tmp between lseeks or changing to a different file descriptor is
8, while in the unpatched code it is 14.

>
> Now, I'm not sure why this same logic wouldn't apply to the unpatched
> code with 118-way merge too.  So maybe I am all wet here.  It seems
> like that imbalance would be enough to also cause the problem.

So my current theory is that it takes one large merge to generate an
unbalanced tape, one large merge where that large unbalanced tape
leads to fragmenting the output tape, and one final merge to be slowed
down by this fragmentation.

I looked at https://code.google.com/p/ioapps/ as Peter recommended,
but couldn't figure out what do with it.  The only conclusion I got
from ioprofiler was that it spend a lot of time reading files in
pgsql_tmp.   I found just doing
strace -y -ttt -T -p <pid>
And then analyzing with perl one liners to work better, but it could
just be the learning curve.

Re: Using quicksort for every external sort run

From

Greg Stark

Date:

09 December 2015, 02:40:01

On Wed, Dec 9, 2015 at 12:02 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
>
>
> Then in the next (final) merge, it is has to read in this huge
> fragmented tape run emulation, generating a lot of random IO to read
> it.

This seems fairly plausible. Logtape.c is basically implementing a
small filesystem and doesn't really make any attempt to avoid
fragmentation. The reason it does this is so that we can reuse blocks
and avoid needing to store 2x disk space for the temporary space. I
wonder if we're no longer concerned about keeping the number of tapes
down if it makes sense to give up on this goal too and just write out
separate files for each tape letting the filesystem avoid
fragmentation. I suspect it would also be better for filesystems like
ZFS and SSDs where rewriting blocks can be expensive.

> With the patched code, the average length of reads on files in
> pgsql_tmp between lseeks or changing to a different file descriptor is
> 8, while in the unpatched code it is 14.

I don't think Peter did anything to the scheduling of the merges so I
don't see how this would be different. It might just have hit a
preexisting case by changing the number and size of tapes.

I also don't think the tapes really ought to be so unbalanced. I've
noticed some odd things myself -- like what does a 1-way merge mean
here?

LOG:  finished writing run 56 to tape 2 (9101313 blocks): CPU
0.19s/10.97u sec elapsed 16.68 sec
LOG:  finished writing run 57 to tape 3 (9084929 blocks): CPU
0.19s/11.14u sec elapsed 19.08 sec
LOG:  finished writing run 58 to tape 4 (9101313 blocks): CPU
0.20s/11.31u sec elapsed 19.26 sec
LOG:  performsort starting: CPU 0.20s/11.48u sec elapsed 19.44 sec
LOG:  finished writing run 59 to tape 5 (9109505 blocks): CPU
0.20s/11.49u sec elapsed 19.44 sec
LOG:  finished writing final run 60 to tape 6 (8151041 blocks): CPU
0.20s/11.55u sec elapsed 19.50 sec
LOG:  finished 1-way merge step (1810433 blocks): CPU 0.20s/11.58u sec
elapsed 19.54 sec   <-------------------------=========
LOG:  finished 10-way merge step (19742721 blocks): CPU 0.20s/12.23u
sec elapsed 20.19 sec
LOG:  finished 13-way merge step (23666689 blocks): CPU 0.20s/13.15u
sec elapsed 21.11 sec
LOG:  finished 13-way merge step (47333377 blocks): CPU 0.22s/14.07u
sec elapsed 23.13 sec
LOG:  finished 14-way merge step (47333377 blocks): CPU 0.24s/15.65u
sec elapsed 24.74 sec
LOG:  performsort done (except 14-way final merge): CPU 0.24s/15.66u
sec elapsed 24.75 sec

I wonder if something's wrong with the merge scheduling.

Fwiw attached are two patches for perusal. One is a trivial patch to
add the size of the tape to trace_sort output. I guess I'll just apply
that without discussion. The other replaces the selection sort with an
open coded sort network for cases up to 8 elements. (Only in the perl
generated qsort for the moment). I don't have the bandwidth to
benchmark this for the moment but if anyone's interested in trying I
suspect it'll make a small but noticeable difference. I'm guessing
2-5%.

--
greg

On Sat, Dec 12, 2015 at 7:42 PM, Ants Aasma <ants.aasma@eesti.ee> wrote:
> As the open coding doesn't help with eliminating control flow
> dependencies, so my idea is to encode the sort network comparison
> order in an array and use that to drive a simple loop. The code size
> would be pretty similar to insertion sort and the loop overhead should
> mostly be hidden by the CPU OoO machinery. Probably won't help much,
> but would be interesting and simple enough to try out. Can you share
> you code for the benchmark so I can try it out?

I can. But the further results showing the number of comparisons is
higher than for insertion sort have dampened my enthusiasm for the
change. I'm assuming that even if it's faster for a simple integer or
sort it'll be much slower for anything that requires calling out to
the datatype comparator. I also hadn't actually measured what
percentage of the sort was being spent in the insertion sort. I had
guessed it would be higher.

The test is attached. qsort_tuple.c is copied from tuplesort (with the
ifdef for NOPRESORT added, but you could skip that if you want.).
Compile with something like:

gcc -DNOPRESORT -O3 -DCOUNTS -Wall -Wno-unused-function simd-sort-test.c

--
greg

I ran sorts with various parameters on my small NAS server. This is a fairly slow CPU and limited memory machine with lots of disk so I thought it would actually make a good test case for smaller servers. The following is the speedup (for values < 100%) or slowdown (values > 100%) for the first patch only, the "quicksort all runs" without the extra memory optimizations.

At first glance it's a clear pattern that the extra runs does cause a slowdown whenever it causes more polyphase merges which is bad news. But on further inspection look just how low work_mem had to be to have a significant effect. Only the 4MB and 8MB work_mem cases were significantly impacted and only when sorting over a GB of data (which was 2.7 - 7GB with the tuple overhead). The savings when work_mem was 64MB or 128MB was substantial.

Table Size	Sort Size	128MB	64MB	32MB	16MB	8MB	4MB
6914MB	2672 MB	64%	70%	93%	110%	133%	137%
3457MB	1336 MB	64%	67%	90%	92%	137%	120%
2765MB	1069 MB	68%	66%	84%	95%	111%	137%
1383MB	535 MB	66%	70%	72%	92%	99%	96%
691MB	267 MB	65%	69%	70%	86%	99%	98%
346MB	134 MB	65%	69%	73%	67%	90%	87%

The raw numbers in seconds. I've only run the test once so far on the NAS and there are some other things running on it so I really should rerun it a few more times at least.

HEAD:

Table Size	Sort Size	128MB	64MB	32MB	16MB	8MB	4MB
6914MB	2672 MB	1068.07	963.23	1041.94	1246.54	1654.35	2472.79
3457MB	1336 MB	529.34	482.3	450.77	555.76	657.34	1027.57
2765MB	1069 MB	404.02	394.36	348.31	414.48	507.38	657.17
1383MB	535 MB	196.48	194.26	173.48	182.57	214.42	258.05
691MB	267 MB	95.93	93.79	87.73	80.4	93.67	105.24
346MB	134 MB	45.6	44.24	42.39	44.22	46.17	49.85

With the quicksort patch:

Table Size	Sort Size	128MB	64MB	32MB	16MB	8MB	4MB
6914MB	2672 MB	683.6	679.0	969.4	1366.2	2193.6	3379.3
3457MB	1336 MB	339.1	325.1	404.9	509.8	902.2	1229.1
2765MB	1069 MB	275.3	260.1	292.4	395.4	561.9	898.7
1383MB	535 MB	129.9	136.4	124.6	167.5	213.2	247.1
691MB	267 MB	62.3	64.3	61.4	69.2	92.3	103.2
346MB	134 MB	29.8	30.7	30.9	29.4	41.6	43.4

Re: Using quicksort for every external sort run

From

Peter Geoghegan

Date:

15 December 2015, 03:22:44

On Mon, Dec 14, 2015 at 6:58 PM, Greg Stark <stark@mit.edu> wrote:
> I ran sorts with various parameters on my small NAS server.

...

> without the extra memory optimizations.

Thanks for taking the time to benchmark the patch!

While I think it's perfectly fair that you didn't apply the final
on-the-fly merge "memory pool" patch, I also think that it's quite
possible that the regression you see at the very low end would be
significantly ameliorated or even eliminated by applying that patch,
too. After all, Jeff Janes had a much harder time finding a
regression, probably because he benchmarked all patches together.

-- 
Peter Geoghegan

Re: Using quicksort for every external sort run

From

Peter Geoghegan

Date:

15 December 2015, 03:33:15

On Mon, Dec 14, 2015 at 7:22 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Thanks for taking the time to benchmark the patch!

Also, I should point out that you didn't add work_mem past the point
where the master branch will get slower, while the patch continues to
get faster. This seems to happen fairly reliably, certainly if
work_mem is sized at about 1GB, and often at lower settings. With the
POWER7 "Hydra" server, external sorting for a CREATE INDEX operation
could put any possible maintenance_work_mem setting to good use -- my
test case got faster with a 15GB maintenance_work_mem setting (the
server has 64GB of ram). I think I tried 25GB as a
maintenance_work_mem setting next, but started to get OOM errors at
that point.

Again, I point this out because I want to account for why my numbers
were better (for the benefit of other people -- I think you get this,
and are being fair).

-- 
Peter Geoghegan

Re: Using quicksort for every external sort run

From

Robert Haas

Date:

18 December 2015, 18:12:45

On Sat, Dec 12, 2015 at 5:28 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Sat, Dec 12, 2015 at 12:10 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> I have a question about the terminology used in this patch.  What is a
>> tuple proper?  What is it in contradistinction to?  I would think that
>> a tuple which is located in its own palloc'ed space is the "proper"
>> one, leaving a tuple allocated in the bulk memory pool to be
>> called...something else.  I don't know what the
>> non-judgmental-sounding antonym of postpositive "proper" is.
>
> "Tuple proper" is a term that appears 5 times in tuplesort.c today. As
> it says at the top of that file:
>
> /*
>  * The objects we actually sort are SortTuple structs.  These contain
>  * a pointer to the tuple proper (might be a MinimalTuple or IndexTuple),
>  * which is a separate palloc chunk --- we assume it is just one chunk and
>  * can be freed by a simple pfree().  SortTuples also contain the tuple's
>  * first key column in Datum/nullflag format, and an index integer.

I see only three.  In each case, "the tuple proper" could be replaced
by "the tuple itself" or "the actual tuple" without changing the
meaning, at least according to my understanding of the meaning.  If
that's causing confusion, perhaps we should just change the existing
wording.

Anyway, I agree with Jeff that this terminology shouldn't creep into
function and structure member names.

I don't really like the term "memory pool" either.  We're growing a
bunch of little special-purpose allocators all over the code base
because of palloc's somewhat dubious performance and memory usage
characteristics, but if any of those are referred to as memory pools
it has thus far escaped my notice.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Using quicksort for every external sort run

From

Peter Geoghegan

Date:

18 December 2015, 18:55:30

On Fri, Dec 18, 2015 at 10:12 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> Anyway, I agree with Jeff that this terminology shouldn't creep into
> function and structure member names.

Okay.

> I don't really like the term "memory pool" either.  We're growing a
> bunch of little special-purpose allocators all over the code base
> because of palloc's somewhat dubious performance and memory usage
> characteristics, but if any of those are referred to as memory pools
> it has thus far escaped my notice.

It's a widely accepted term: https://en.wikipedia.org/wiki/Memory_pool

But, sure, I'm not attached to it.

-- 
Peter Geoghegan

Re: Using quicksort for every external sort run

From

Peter Geoghegan

Date:

18 December 2015, 19:58:00

On Fri, Dec 18, 2015 at 10:12 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> I don't really like the term "memory pool" either.  We're growing a
> bunch of little special-purpose allocators all over the code base
> because of palloc's somewhat dubious performance and memory usage
> characteristics, but if any of those are referred to as memory pools
> it has thus far escaped my notice.

BTW, I'm not necessarily determined to make the new special-purpose
allocator work exactly as proposed. It seemed useful to prioritize
simplicity, and currently so there is one big "huge palloc()" with
which we blow our memory budget, and that's it. However, I could
probably be more clever about "freeing ranges" initially preserved for
a now-exhausted tape. That kind of thing.

With the on-the-fly merge memory patch, I'm improving locality of
access (for each "tuple proper"/"tuple itself"). If I also happen to
improve the situation around palloc() fragmentation at the same time,
then so much the better, but that's clearly secondary.

-- 
Peter Geoghegan

Re: Using quicksort for every external sort run

From

Robert Haas

Date:

18 December 2015, 20:50:23

On Fri, Dec 18, 2015 at 2:57 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Fri, Dec 18, 2015 at 10:12 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I don't really like the term "memory pool" either.  We're growing a
>> bunch of little special-purpose allocators all over the code base
>> because of palloc's somewhat dubious performance and memory usage
>> characteristics, but if any of those are referred to as memory pools
>> it has thus far escaped my notice.
>
> BTW, I'm not necessarily determined to make the new special-purpose
> allocator work exactly as proposed. It seemed useful to prioritize
> simplicity, and currently so there is one big "huge palloc()" with
> which we blow our memory budget, and that's it. However, I could
> probably be more clever about "freeing ranges" initially preserved for
> a now-exhausted tape. That kind of thing.

What about the case where we think that there will be a lot of data
and have a lot of work_mem available, but then the user sends us 4
rows because of some mis-estimation?

> With the on-the-fly merge memory patch, I'm improving locality of
> access (for each "tuple proper"/"tuple itself"). If I also happen to
> improve the situation around palloc() fragmentation at the same time,
> then so much the better, but that's clearly secondary.

I don't really understand this comment.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Using quicksort for every external sort run

From

Peter Geoghegan

Date:

18 December 2015, 21:10:53

On Fri, Dec 18, 2015 at 12:50 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> BTW, I'm not necessarily determined to make the new special-purpose
>> allocator work exactly as proposed. It seemed useful to prioritize
>> simplicity, and currently so there is one big "huge palloc()" with
>> which we blow our memory budget, and that's it. However, I could
>> probably be more clever about "freeing ranges" initially preserved for
>> a now-exhausted tape. That kind of thing.
>
> What about the case where we think that there will be a lot of data
> and have a lot of work_mem available, but then the user sends us 4
> rows because of some mis-estimation?

The memory patch only changes the final on-the-fly merge phase. There
is no estimate involved there.

I continue to use whatever "slots" (memtuples) are available for the
final on-the-fly merge. However, I allocate all remaining memory that
I have budget for at once. My remarks about the efficient use of that
memory was only really about each tape's use of their part of that
over time.

Again, to emphasize, this is only for the final on-the-fly merge phase.

>> With the on-the-fly merge memory patch, I'm improving locality of
>> access (for each "tuple proper"/"tuple itself"). If I also happen to
>> improve the situation around palloc() fragmentation at the same time,
>> then so much the better, but that's clearly secondary.
>
> I don't really understand this comment.

I just mean that I wrote the memory patch with memory locality in
mind, not palloc() fragmentation or other overhead.

-- 
Peter Geoghegan

Re: Using quicksort for every external sort run

From

Robert Haas

Date:

22 December 2015, 17:10:11

On Sun, Dec 6, 2015 at 7:25 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Tue, Nov 24, 2015 at 4:33 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> So, the bottom line is: This patch seems very good, is unlikely to
>> have any notable downside (no case has been shown to be regressed),
>> but has yet to receive code review. I am working on a new version with
>> the first two commits consolidated, and better comments, but that will
>> have the same code, unless I find bugs or am dissatisfied. It mostly
>> needs thorough code review, and to a lesser extent some more
>> performance testing.
>
> I'm currently spending a lot of time working on parallel CREATE INDEX.
> I should not delay posting a new version of my patch series any
> further, though. I hope to polish up parallel CREATE INDEX to be able
> to show people something in a couple of weeks.
>
> This version features consolidated commits, the removal of the
> multipass_warning parameter, and improved comments and commit
> messages. It has almost entirely unchanged functionality.
>
> The only functional changes are:
>
> * The function useselection() is taught to distrust an obviously bogus
> caller reltuples hint (when it's already less than half of what we
> know to be the minimum number of tuples that the sort must sort,
> immediately after LACKMEM() first becomes true -- this is probably a
> generic estimate).
>
> * Prefetching only occurs when writing tuples. Explicit prefetching
> appears to hurt in some cases, as David Rowley has shown over on the
> dedicated thread. But it might still be that writing tuples is a case
> that is simple enough to benefit consistently, due to the relatively
> uniform processing that memory latency can hide behind for that case
> (before, the same prefetching instructions were used for CREATE INDEX
> and for aggregates, for example).
>
> Maybe we should consider trying to get patch 0002 (the memory
> pool/merge patch) committed first, something Greg Stark suggested
> privately. That might actually be an easier way of integrating this
> work, since it changes nothing about the algorithm we use for merging
> (it only improves memory locality), and so is really an independent
> piece of work (albeit one that makes a huge overall difference due to
> the other patches increasing the time spent merging in absolute terms,
> and especially as a proportion of the total).

So I was looking at the 0001 patch and came across this code:

+    /*
+     * Crossover point is somewhere between where memtuples is between 40%
+     * and all-but-one of total tuples to sort.  This weighs approximate
+     * savings in I/O, against generic heap sorting cost.
+     */
+    avgTupleSize = (double) memNowUsed / (double) state->memtupsize;
+
+    /*
+     * Starting from a threshold of 90%, refund 7.5% per 32 byte
+     * average-size-increment.
+     */
+    increments = MAXALIGN_DOWN((int) avgTupleSize) / 32;
+    crossover = 0.90 - (increments * 0.075);
+
+    /*
+     * Clamp, making either outcome possible regardless of average size.
+     *
+     * 40% is about the minimum point at which "quicksort with spillover"
+     * can still occur without a logical/physical correlation.
+     */
+    crossover = Max(0.40, Min(crossover, 0.85));
+
+    /*
+     * The point where the overhead of maintaining the heap invariant is
+     * likely to dominate over any saving in I/O is somewhat arbitrarily
+     * assumed to be the point where memtuples' size exceeds MaxAllocSize
+     * (note that overall memory consumption may be far greater).  Past
+     * this point, only the most compelling cases use replacement selection
+     * for their first run.
+     *
+     * This is not about cache characteristics so much as the O(n log n)
+     * cost of sorting larger runs dominating over the O(n) cost of
+     * writing/reading tuples.
+     */
+    if (sizeof(SortTuple) * state->memtupcount > MaxAllocSize)
+        crossover = avgTupleSize > 32 ? 0.90 : 0.95;

This looks like voodoo to me.  I assume you tested it and maybe it
gives correct answers, but it's got to be some kind of world record
for number of arbitrary constants per SLOC, and there's no real
justification for any of it.  The comments say, essentially, well, we
do this because it works.  But suppose I try it on some new piece of
hardware and it doesn't work well.  What do I do?  Email the author
and ask him to tweak the arbitrary constants?

The dependency on MaxAllocSize seems utterly bizarre to me.  If we
decide to modify our TOAST infrastructure so that we support datums up
to 2GB in size, or alternatively datums of up to only 512MB in size,
do you expect that to change the behavior of tuplesort.c?  I bet not,
but that's a major reason why MaxAllocSize is defined the way it is.

I wonder if there's a way to accomplish what you're trying to do here
that avoids the need to have a cost model at all.  As I understand it,
and please correct me wherever I go off the rails, the situation is:

1. If we're sorting a large amount of data, such that we can't fit it
all in memory, we will need to produce a number of sorted runs and
then merge those runs.  If we generate each run using a heap with
replacement selection, rather than quicksort, we will produce runs
that are, on the average, about twice as long, which means that we
will have fewer runs to merge at the end.

2. Replacement selection is slower than quicksort on a per-tuple
basis.  Furthermore, merging more runs isn't necessarily any slower
than merging fewer runs.  Therefore, building runs via replacement
selection tends to lose even though it tends to reduce the number of
runs to merge.  Even when having a larger number of runs results in an
increase in the number merge passes, we save so much time building the
runs that we often (maybe not always) still come out ahead.

3. However, when replacement selection would result in a single run,
and quicksort results in multiple runs, using quicksort loses.  This
is especially true when we the amount of data we have is between one
and two times work_mem.  If we fit everything into one run, we do not
need to write any data to tape, but if we overflow by even a single
tuple, we have to write a lot of data to tape.

If this is correct so far, then I wonder if we could do this: Forget
replacement selection.  Always build runs by quicksorting.  However,
when dumping the first run to tape, dump it a little at a time rather
than all at once.  If the input ends before we've completely written
the run, then we've got all of run 1 in memory and run 0 split between
memory and tape.  So we don't need to do any extra I/O; we can do a
merge between run 1 and the portion of run 0 which is on tape.  When
the tape is exhausted, we only need to finish merging the in-memory
tails of the two runs.

I also wonder if you've thought about the case where we are asked to
sort an enormous amount of data that is already in order, or very
nearly in order (2,1,4,3,6,5,8,7,...).  It seems worth including a
check to see whether the low value of run N+1 is higher than the high
value of run N, and if so, append it to the existing run rather than
starting a new one.  In some cases this could completely eliminate the
final merge pass at very low cost, which seems likely to be
worthwhile.

Unfortunately, it's possible to fool this algorithm pretty easily -
suppose the data is as in the parenthetical note in the previous
paragraph, but the number of tuples that fits in work_mem is odd.  I
wonder if we can find instances where such cases regress significantly
as compared with the replacement selection approach, which might be
able to produce a single run out of an arbitrary amount of data.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Using quicksort for every external sort run

From

Peter Geoghegan

Date:

22 December 2015, 21:37:54

On Tue, Dec 22, 2015 at 9:10 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> So I was looking at the 0001 patch

Thanks. I'm going to produce a revision of 0002 shortly, so perhaps
hold off on that one. The big change there will be to call
grow_memtuples() to allow us to increase the number of slots without
palloc() overhead spuriously being weighed (since the memory for the
final on-the-fly merge phase doesn't have palloc() overhead). Also,
will incorporate what Jeff and you wanted around terminology.

> This looks like voodoo to me.  I assume you tested it and maybe it
> gives correct answers, but it's got to be some kind of world record
> for number of arbitrary constants per SLOC, and there's no real
> justification for any of it.  The comments say, essentially, well, we
> do this because it works.  But suppose I try it on some new piece of
> hardware and it doesn't work well.  What do I do?  Email the author
> and ask him to tweak the arbitrary constants?

That's not fair. DEFAULT_EQ_SEL, DEFAULT_RANGE_INEQ_SEL, and
DEFAULT_NUM_DISTINCT are each about as arbitrary. We have to do
something, though.

MaxAllocHugeSize is used fairly arbitrarily in pg_stat_statements.c.
And that part (the MaxAllocSize part of my patch) only defines a point
after which we require a really favorable case for replacement
selection/quicksort with spillover to proceed. It's a safety valve. We
try to err on the side of not using replacement selection.

> I wonder if there's a way to accomplish what you're trying to do here
> that avoids the need to have a cost model at all.  As I understand it,
> and please correct me wherever I go off the rails, the situation is:
>
> 1. If we're sorting a large amount of data, such that we can't fit it
> all in memory, we will need to produce a number of sorted runs and
> then merge those runs.  If we generate each run using a heap with
> replacement selection, rather than quicksort, we will produce runs
> that are, on the average, about twice as long, which means that we
> will have fewer runs to merge at the end.
>
> 2. Replacement selection is slower than quicksort on a per-tuple
> basis.  Furthermore, merging more runs isn't necessarily any slower
> than merging fewer runs.  Therefore, building runs via replacement
> selection tends to lose even though it tends to reduce the number of
> runs to merge.  Even when having a larger number of runs results in an
> increase in the number merge passes, we save so much time building the
> runs that we often (maybe not always) still come out ahead.

I'm with you so far. I'll only add: doing multiple passes ought to be
very rare anyway.

> 3. However, when replacement selection would result in a single run,
> and quicksort results in multiple runs, using quicksort loses.  This
> is especially true when we the amount of data we have is between one
> and two times work_mem.  If we fit everything into one run, we do not
> need to write any data to tape, but if we overflow by even a single
> tuple, we have to write a lot of data to tape.

No, this is where you lose me. I think that it's basically not true
that replacement selection can ever be faster than quicksort, even in
the cases where the conventional wisdom would have you believe so
(e.g. what you say here). Unless you have very little memory relative
to data size, or something along those lines. The conventional wisdom
obviously needs some revision, but it was perfectly correct in the
1970s and 1980s.

However, where replacement selection can still help is avoiding I/O
*entirely*. If we can avoid spilling 95% of tuples in the first place,
and quicksort the remaining (heapified) tuples that were not spilled,
and merge an in-memory run with an on-tape run, then we can win big.
Quicksort is not amenable to incremental spilling at all. I call this
"quicksort with spillover" (it is a secondary optimization that the
patch adds). This shows up in EXPLAIN ANALYZE, and avoids a stark
discontinuity in the cost function of sorts. That could really help
with admission control, and simplifying the optimizer, making merge
joins less scary. So with the patch, "quicksort with spillover" and
"replacement selection" are almost synonymous, except that we
acknowledge the historic importance of replacement selection to some
degree. The patch completely discards the conventional use of
replacement selection -- it just preserves its priority queue (heap)
implementation where incrementalism is thought to be particularly
useful (avoiding I/O entirely).

But this comparison has nothing to do with comparing the master branch
with my patch, since the master branch never attempts to avoid I/O
having committed to an external sort. It uses replacement selection in
a way that is consistent with the conventional wisdom, wisdom which
has now been shown to be obsolete.

BTW, I think that abandoning incrementalism (replacement selection)
will have future benefits for memory management. I bet we can get away
with one big palloc() for second or subsequent runs that are
quicksorted, greatly reducing palloc() overhead and waste there, too.

> If this is correct so far, then I wonder if we could do this: Forget
> replacement selection.  Always build runs by quicksorting.  However,
> when dumping the first run to tape, dump it a little at a time rather
> than all at once.  If the input ends before we've completely written
> the run, then we've got all of run 1 in memory and run 0 split between
> memory and tape.  So we don't need to do any extra I/O; we can do a
> merge between run 1 and the portion of run 0 which is on tape.  When
> the tape is exhausted, we only need to finish merging the in-memory
> tails of the two runs.

My first attempt at this -- before I realized that replacement
selection was just not a very good algorithm, due to the upsides not
remotely offsetting the downsides on modern hardware -- was a hybrid
between quicksort and replacement selection.

The problem is that there is too much repeated work. If you spill like
this, you have to quicksort everything again. The replacement
selection queue keeps track of a currentRun and nextRun, to avoid
this, but quicksort can't really do that well.

In general, the replacement selection heap will create a new run that
cannot be spilled (nextRun -- there won't be one initially) if there
is a value less than any of those values already spilled to tape. So
it is built to avoid redundant work in a way that quicksort really
cannot be.

> I also wonder if you've thought about the case where we are asked to
> sort an enormous amount of data that is already in order, or very
> nearly in order (2,1,4,3,6,5,8,7,...).  It seems worth including a
> check to see whether the low value of run N+1 is higher than the high
> value of run N, and if so, append it to the existing run rather than
> starting a new one.  In some cases this could completely eliminate the
> final merge pass at very low cost, which seems likely to be
> worthwhile.

While I initially shared this intuition -- that replacement selection
could hardly be beaten by a simple hybrid sort-merge strategy for
almost sorted input -- I changed my mind. I simply did not see any
evidence for it. I may have missed something, but it really does not
appear to be worth while. The quicksort fallback to insertion sort
also does well with presorted input. The merge is very cheap (over and
above reading one big run off disk) for presorted input under most
circumstances. A cost model adds a lot of complexity, which I hesitate
to add without clear benefits.

-- 
Peter Geoghegan

Re: Using quicksort for every external sort run

From

Robert Haas

Date:

22 December 2015, 22:57:15

On Tue, Dec 22, 2015 at 4:37 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> This looks like voodoo to me.  I assume you tested it and maybe it
>> gives correct answers, but it's got to be some kind of world record
>> for number of arbitrary constants per SLOC, and there's no real
>> justification for any of it.  The comments say, essentially, well, we
>> do this because it works.  But suppose I try it on some new piece of
>> hardware and it doesn't work well.  What do I do?  Email the author
>> and ask him to tweak the arbitrary constants?
>
> That's not fair. DEFAULT_EQ_SEL, DEFAULT_RANGE_INEQ_SEL, and
> DEFAULT_NUM_DISTINCT are each about as arbitrary. We have to do
> something, though.
>
> MaxAllocHugeSize is used fairly arbitrarily in pg_stat_statements.c.
> And that part (the MaxAllocSize part of my patch) only defines a point
> after which we require a really favorable case for replacement
> selection/quicksort with spillover to proceed. It's a safety valve. We
> try to err on the side of not using replacement selection.

Sure, there are arbitrary numbers all over the code, driven by
empirical observations about what factors are important to model.  But
this is not that.  You don't have a thing called seq_page_cost and a
thing called cpu_tuple_cost and then say, well, empirically the ratio
is about 100:1, so let's make the former 1 and the latter 0.01.  You
just have some numbers, and it's not clear what, if anything, they
actually represent.  In the space of 7 lines of code, you introduce 9
nameless constants:

The crossover point is clamped to a minimum of 40% [constant #1] and a
maximum of 85% [constant #2] when the size of the SortTuple array is
no more than MaxAllocSize.  Between those bounds, the crossover point
is 90% [constant #3] minus 7.5% [constant #4] per 32-byte increment
[constant #5] of estimated average tuple size.  On the other hand,
when the estimated average tuple size exceeds MaxAllocSize, the
crossover point is either 90% [constant #6] or 95% [constant #7]
depending on whether the average tuple size is greater than 32 bytes
[constant #8].  But if the row count hit is less than 50% [constant
#9] of the rows we've already seen, then we ignore it and do not use
selection.

You make no attempt to justify why any of these numbers are correct,
or what underlying physical reality they represent.  The comment which
describes the manner in which crossover point is computed for
SortTuple volumes under 1GB says "Starting from a threshold of 90%,
refund 7.5% per 32 byte average-size-increment."  That is a precise
restatement of what the code does, but it doesn't attempt to explain
why it's a good idea.  Perhaps the reader should infer that the
crossover point drops as the tuples get bigger, except that in the
over-1GB case, a larger tuple size causes the crossover point to go
*up* while in the under-1GB case, a larger tuple size causes the
crossover point to go *down*.  Concretely, if we're sorting 44,739,242
224-byte tuples, the estimated crossover point is 40%.  If we're
sorting 44,739,243 244-byte tuples, the estimated crossover point is
95%.  That's an extremely sharp discontinuity, and it seems very
unlikely that any real system behaves that way.

I'm prepared to concede that constant #9 - ignoring the input row
estimate if we've already seen twice that many rows - probably doesn't
need a whole lot of justification here, and what justification it does
need is provided by the fact that (we think) replacement selection
only wins when there are going to be less than 2 quicksorted runs.
But the other 8 constants here have to have reasons why they exist,
what they represent, and why they have the values they do, and that
explanation needs to be something that can be understood by people
besides you.  The overall cost model needs some explanation of the
theory of operation, too.

In my opinion, reasoning in terms of a crossover point is a strange
way of approaching the problem.  What would be more typical at least
in our code, and I suspect in general, is do a cost estimate of using
selection and a cost estimate of not using selection and compare them.
Replacement selection has a CPU cost and an I/O cost, each of which is
estimable based on the tuple count, chosen comparator, and expected
I/O volume.  Quicksort has those same costs, in different amounts.  If
those respective costs are accurately estimated, then you can pick the
strategy with the lower cost and expect to win.

>> I wonder if there's a way to accomplish what you're trying to do here
>> that avoids the need to have a cost model at all.  As I understand it,
>> and please correct me wherever I go off the rails, the situation is:
>>
>> 1. If we're sorting a large amount of data, such that we can't fit it
>> all in memory, we will need to produce a number of sorted runs and
>> then merge those runs.  If we generate each run using a heap with
>> replacement selection, rather than quicksort, we will produce runs
>> that are, on the average, about twice as long, which means that we
>> will have fewer runs to merge at the end.
>>
>> 2. Replacement selection is slower than quicksort on a per-tuple
>> basis.  Furthermore, merging more runs isn't necessarily any slower
>> than merging fewer runs.  Therefore, building runs via replacement
>> selection tends to lose even though it tends to reduce the number of
>> runs to merge.  Even when having a larger number of runs results in an
>> increase in the number merge passes, we save so much time building the
>> runs that we often (maybe not always) still come out ahead.
>
> I'm with you so far. I'll only add: doing multiple passes ought to be
> very rare anyway.
>
>> 3. However, when replacement selection would result in a single run,
>> and quicksort results in multiple runs, using quicksort loses.  This
>> is especially true when we the amount of data we have is between one
>> and two times work_mem.  If we fit everything into one run, we do not
>> need to write any data to tape, but if we overflow by even a single
>> tuple, we have to write a lot of data to tape.
>
> No, this is where you lose me. I think that it's basically not true
> that replacement selection can ever be faster than quicksort, even in
> the cases where the conventional wisdom would have you believe so
> (e.g. what you say here). Unless you have very little memory relative
> to data size, or something along those lines. The conventional wisdom
> obviously needs some revision, but it was perfectly correct in the
> 1970s and 1980s.
>
> However, where replacement selection can still help is avoiding I/O
> *entirely*. If we can avoid spilling 95% of tuples in the first place,
> and quicksort the remaining (heapified) tuples that were not spilled,
> and merge an in-memory run with an on-tape run, then we can win big.

That's pretty much what I was trying to say, except that I'm curious
to know whether replacement selection can win when it manages to
generate a vastly longer run than what we get from quicksorting.  Say
quicksorting produces 10, or 100, or 1000 tapes, and replacement
selection produces 1 due to a favorable data distribution.

> Quicksort is not amenable to incremental spilling at all. I call this
> "quicksort with spillover" (it is a secondary optimization that the
> patch adds). This shows up in EXPLAIN ANALYZE, and avoids a stark
> discontinuity in the cost function of sorts. That could really help
> with admission control, and simplifying the optimizer, making merge
> joins less scary. So with the patch, "quicksort with spillover" and
> "replacement selection" are almost synonymous, except that we
> acknowledge the historic importance of replacement selection to some
> degree. The patch completely discards the conventional use of
> replacement selection -- it just preserves its priority queue (heap)
> implementation where incrementalism is thought to be particularly
> useful (avoiding I/O entirely).
>
> But this comparison has nothing to do with comparing the master branch
> with my patch, since the master branch never attempts to avoid I/O
> having committed to an external sort. It uses replacement selection in
> a way that is consistent with the conventional wisdom, wisdom which
> has now been shown to be obsolete.
>
> BTW, I think that abandoning incrementalism (replacement selection)
> will have future benefits for memory management. I bet we can get away
> with one big palloc() for second or subsequent runs that are
> quicksorted, greatly reducing palloc() overhead and waste there, too.
>
>> If this is correct so far, then I wonder if we could do this: Forget
>> replacement selection.  Always build runs by quicksorting.  However,
>> when dumping the first run to tape, dump it a little at a time rather
>> than all at once.  If the input ends before we've completely written
>> the run, then we've got all of run 1 in memory and run 0 split between
>> memory and tape.  So we don't need to do any extra I/O; we can do a
>> merge between run 1 and the portion of run 0 which is on tape.  When
>> the tape is exhausted, we only need to finish merging the in-memory
>> tails of the two runs.
>
> My first attempt at this -- before I realized that replacement
> selection was just not a very good algorithm, due to the upsides not
> remotely offsetting the downsides on modern hardware -- was a hybrid
> between quicksort and replacement selection.
>
> The problem is that there is too much repeated work. If you spill like
> this, you have to quicksort everything again. The replacement
> selection queue keeps track of a currentRun and nextRun, to avoid
> this, but quicksort can't really do that well.

I agree, but that's not what I proposed.  You don't want to keep
re-sorting to incorporate new tuples into the run, but if you've got
1010 tuples and you can fit 1000 tuples in, you can (a) quicksort the
first 1000 tuples, (b) read in 10 more tuples, dumping the first 10
tuples from run 0 to disk, (c) quicksort the last 10 tuples to create
run 1, and then (d) merge run 0 [which is mostly in memory] with run 1
[which is entirely in memory].  In other words, yes, quicksorting
doesn't let you add things to the sort incrementally, but you can
still write out the run incrementally, writing only as many tuples as
you need to dump to get the rest of the input data into memory.

>> I also wonder if you've thought about the case where we are asked to
>> sort an enormous amount of data that is already in order, or very
>> nearly in order (2,1,4,3,6,5,8,7,...).  It seems worth including a
>> check to see whether the low value of run N+1 is higher than the high
>> value of run N, and if so, append it to the existing run rather than
>> starting a new one.  In some cases this could completely eliminate the
>> final merge pass at very low cost, which seems likely to be
>> worthwhile.
>
> While I initially shared this intuition -- that replacement selection
> could hardly be beaten by a simple hybrid sort-merge strategy for
> almost sorted input -- I changed my mind. I simply did not see any
> evidence for it. I may have missed something, but it really does not
> appear to be worth while. The quicksort fallback to insertion sort
> also does well with presorted input. The merge is very cheap (over and
> above reading one big run off disk) for presorted input under most
> circumstances. A cost model adds a lot of complexity, which I hesitate
> to add without clear benefits.

I don't think you need any kind of cost model to implement the
approach of appending to an existing run when the values in the new
run are strictly greater.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Using quicksort for every external sort run

From

Peter Geoghegan

Date:

23 December 2015, 01:10:15

On Tue, Dec 22, 2015 at 2:57 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Dec 22, 2015 at 4:37 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> That's not fair. DEFAULT_EQ_SEL, DEFAULT_RANGE_INEQ_SEL, and
>> DEFAULT_NUM_DISTINCT are each about as arbitrary. We have to do
>> something, though.
>>

> Sure, there are arbitrary numbers all over the code, driven by
> empirical observations about what factors are important to model.  But
> this is not that.  You don't have a thing called seq_page_cost and a
> thing called cpu_tuple_cost and then say, well, empirically the ratio
> is about 100:1, so let's make the former 1 and the latter 0.01.  You
> just have some numbers, and it's not clear what, if anything, they
> actually represent.

What I find difficult to accept about what you say here is that at
*this* level, something like cost_sort() has little to recommend it.
It costs a sort of a text attribute at the same level as the cost of
sorting the same tuples using an int4 attribute (based on the default
cpu_operator_cost for C functions -- without any attempt to
differentiate text and int4).

Prior to 9.5, sorting text took about 5 - 10 times longer that this
similar int4 sort. That's a pretty big difference, and yet I recall no
complaints. The cost of a comparison in a sort can hardly be
considered in isolation, anyway -- cache efficiency is at least as
important.

Of course, the point is that the goal of a cost model is not to
simulate reality as closely as possible -- it's to produce a good
outcome for performance purposes under realistic assumptions.
Realistic assumptions include that you can't hope to account for
certain differences in cost. Avoiding a terrible outcome is very
important, but the worst case for useselection() is no worse than
today's behavior (or a lost opportunity to do better than today's
behavior).

Recently, the paper that was posted to the list about the Postgres
optimizer stated formally what I know I had a good intuitive sense of
for a long time: that better selectivity estimates are much more
important than better cost models in practice. The "empirical
observations" driving something like DEFAULT_EQ_SEL are very weak --
but what are you gonna do?

> The crossover point is clamped to a minimum of 40% [constant #1] and a
> maximum of 85% [constant #2] when the size of the SortTuple array is
> no more than MaxAllocSize.  Between those bounds, the crossover point
> is 90% [constant #3] minus 7.5% [constant #4] per 32-byte increment
> [constant #5] of estimated average tuple size.  On the other hand,
> when the estimated average tuple size exceeds MaxAllocSize, the
> crossover point is either 90% [constant #6] or 95% [constant #7]
> depending on whether the average tuple size is greater than 32 bytes
> [constant #8].  But if the row count hit is less than 50% [constant
> #9] of the rows we've already seen, then we ignore it and do not use
> selection.
>
> You make no attempt to justify why any of these numbers are correct,
> or what underlying physical reality they represent.

Just like selfuncs.h for the most part, then.

> The comment which
> describes the manner in which crossover point is computed for
> SortTuple volumes under 1GB says "Starting from a threshold of 90%,
> refund 7.5% per 32 byte average-size-increment."  That is a precise
> restatement of what the code does, but it doesn't attempt to explain
> why it's a good idea.  Perhaps the reader should infer that the
> crossover point drops as the tuples get bigger, except that in the
> over-1GB case, a larger tuple size causes the crossover point to go
> *up* while in the under-1GB case, a larger tuple size causes the
> crossover point to go *down*.  Concretely, if we're sorting 44,739,242
> 224-byte tuples, the estimated crossover point is 40%.  If we're
> sorting 44,739,243 244-byte tuples, the estimated crossover point is
> 95%.  That's an extremely sharp discontinuity, and it seems very
> unlikely that any real system behaves that way.

Again, the goal of the cost model is not to model reality as such.
This cost model is conservative about using replacement selection. It
makes sense when you consider that there tends to be a lot fewer
external sorts on a realistic workload -- if we can cut that number in
half, which seems quite possible, that's pretty good, especially from
a DBA's practical perspective. I want to buffer DBAs against suddenly
incurring more I/O, but not at the risk of having a far longer sort
for the first run. Or with minimal exposure to that risk. The cost
model weighs the cost of the hint being wrong to some degree (which is
indeed novel). I think it makes sense in light of the cost and
benefits in this case, although I will add that I'm not entirely
comfortable with it. I just don't imagine that there is a solution
that I will be fully comfortable with. There may be one that
superficially looks correct, but I see little point in that.

> I'm prepared to concede that constant #9 - ignoring the input row
> estimate if we've already seen twice that many rows - probably doesn't
> need a whole lot of justification here, and what justification it does
> need is provided by the fact that (we think) replacement selection
> only wins when there are going to be less than 2 quicksorted runs.
> But the other 8 constants here have to have reasons why they exist,
> what they represent, and why they have the values they do, and that
> explanation needs to be something that can be understood by people
> besides you.  The overall cost model needs some explanation of the
> theory of operation, too.

The cost model is extremely fudged. I think that the greatest problem
that it has is that it isn't explicit enough about that.

But yes, let me concede more clearly: the cost model is based on
frobbing. But at least it's relatively honest about that, and is
relatively simple. I think it might be possible to make it simpler,
but I have a feeling that anything we can come up with will basically
have the same quality that you so dislike. I don't know how to do
better. Frankly, I'd rather be roughly correct than exactly wrong.

> In my opinion, reasoning in terms of a crossover point is a strange
> way of approaching the problem.  What would be more typical at least
> in our code, and I suspect in general, is do a cost estimate of using
> selection and a cost estimate of not using selection and compare them.
> Replacement selection has a CPU cost and an I/O cost, each of which is
> estimable based on the tuple count, chosen comparator, and expected
> I/O volume.  Quicksort has those same costs, in different amounts.  If
> those respective costs are accurately estimated, then you can pick the
> strategy with the lower cost and expect to win.

If you instrument the number of comparisons, I expect you'll find that
master is very competitive with the patch in terms of number of
comparisons performed in total. I think it might even win (Knuth
specifically addresses this, actually). Where does that leave your
theory of how to build a cost model? Also, the disadvantage of
replacement selection's heap is smaller with smaller work_mem settings
-- this has been shown many times to make a *huge* difference. Can the
alternative cost model be reasonably expected to incorporate that,
too? Heap sort isn't cache oblivious, which is why we see these weird
effects, so don't forget to have CPU cache size as an input into your
cost model (or maybe use a magic value based on something like
MaxAllocSize!). How do you propose to way the distributed cost of a
lost opportunity to reduce I/O against the distributed cost of
heapsort wasting system memory bandwidth?

And so on, and so on...believe me, I could go on.

By the way, I think that there needs to be a little work done to
cost_sort() too, which so far I've avoided.

>> However, where replacement selection can still help is avoiding I/O
>> *entirely*. If we can avoid spilling 95% of tuples in the first place,
>> and quicksort the remaining (heapified) tuples that were not spilled,
>> and merge an in-memory run with an on-tape run, then we can win big.
>
> That's pretty much what I was trying to say, except that I'm curious
> to know whether replacement selection can win when it manages to
> generate a vastly longer run than what we get from quicksorting.  Say
> quicksorting produces 10, or 100, or 1000 tapes, and replacement
> selection produces 1 due to a favorable data distribution.

I believe the answer is probably no, but if there is a counter
example, it probably isn't worth pursuing. To repeat myself, I started
out with exactly the same intuition as you on that question, but
changed my mind when my efforts to experimentally verify the intuition
were not successful.

> I agree, but that's not what I proposed.  You don't want to keep
> re-sorting to incorporate new tuples into the run, but if you've got
> 1010 tuples and you can fit 1000 tuples in, you can (a) quicksort the
> first 1000 tuples, (b) read in 10 more tuples, dumping the first 10
> tuples from run 0 to disk, (c) quicksort the last 10 tuples to create
> run 1, and then (d) merge run 0 [which is mostly in memory] with run 1
> [which is entirely in memory].  In other words, yes, quicksorting
> doesn't let you add things to the sort incrementally, but you can
> still write out the run incrementally, writing only as many tuples as
> you need to dump to get the rest of the input data into memory.

Merging is still sorting. The 10 tuples are not very cheap to merge
against the 1000 tuples, because you'll probably still end up reading
most of the 1000 tuples to do so. Perhaps you anticipate that there
will be roughly disjoint ranges of values in each run due to a
logical/physical correlation, and so you won't have to read that many
of the 1000 tuples, but this approach has no ability to buffer even
one outlier value (unlike replacement selection, in particular my
approach within mergememruns()).

The cost of heapification of 1.01 million tuples to spill 0.01 million
tuples is pretty low (relative to the cost of sorting them in
particular). The only difference between what you say here and what I
actually do is that the remaining tuples are heapified rather than
sorted, and I quicksort everything together to "merge run 1 and run 0"
rather than doing two quicksorts and a merge. I believe that this can
be demonstrated to be cheaper.

Another factor is that the heap could be useful for other stuff in the
future. As Simon Riggs pointed out, for deduplicating values as
they're read in by tuplesort. (Okay, that's really the only other
thing, but it's a good one).

-- 
Peter Geoghegan

Re: Using quicksort for every external sort run

From

Robert Haas

Date:

23 December 2015, 17:37:47

On Tue, Dec 22, 2015 at 8:10 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> Sure, there are arbitrary numbers all over the code, driven by
>> empirical observations about what factors are important to model.  But
>> this is not that.  You don't have a thing called seq_page_cost and a
>> thing called cpu_tuple_cost and then say, well, empirically the ratio
>> is about 100:1, so let's make the former 1 and the latter 0.01.  You
>> just have some numbers, and it's not clear what, if anything, they
>> actually represent.
>
> What I find difficult to accept about what you say here is that at
> *this* level, something like cost_sort() has little to recommend it.
> It costs a sort of a text attribute at the same level as the cost of
> sorting the same tuples using an int4 attribute (based on the default
> cpu_operator_cost for C functions -- without any attempt to
> differentiate text and int4).
>
> Prior to 9.5, sorting text took about 5 - 10 times longer that this
> similar int4 sort. That's a pretty big difference, and yet I recall no
> complaints. The cost of a comparison in a sort can hardly be
> considered in isolation, anyway -- cache efficiency is at least as
> important.
>
> Of course, the point is that the goal of a cost model is not to
> simulate reality as closely as possible -- it's to produce a good
> outcome for performance purposes under realistic assumptions.
> Realistic assumptions include that you can't hope to account for
> certain differences in cost. Avoiding a terrible outcome is very
> important, but the worst case for useselection() is no worse than
> today's behavior (or a lost opportunity to do better than today's
> behavior).

I agree with that.  So, the question for any given cost model is: does
it model the effects that matter?

If you think that the cost of sorting integers vs. sorting text
matters to the crossover point, then that should be modeled here.  If
it doesn't matter, then don't include it.

The point is, nobody can tell WHAT effects this is modeling.
Increasing the tuple size makes the crossover go up.  Or down.

> Recently, the paper that was posted to the list about the Postgres
> optimizer stated formally what I know I had a good intuitive sense of
> for a long time: that better selectivity estimates are much more
> important than better cost models in practice. The "empirical
> observations" driving something like DEFAULT_EQ_SEL are very weak --
> but what are you gonna do?

This analogy is faulty.  It's true that when we run across a qual
whose selectivity we cannot estimate in any meaningful way, we have to
just take a stab in the dark and hope for the best.  Similarly, if we
have no information about what the crossover point for a given sort
is, we'd have to take some arbitrary estimate, like 75%, and hope for
the best.  But in this case, we DO have information.  We have an
estimated row count and an estimated row width.  And those values are
not being ignored, they are getting used.  The problem is that they
are being used in an arbitrary way that is not justified by any chain
of reasoning.

> But yes, let me concede more clearly: the cost model is based on
> frobbing. But at least it's relatively honest about that, and is
> relatively simple. I think it might be possible to make it simpler,
> but I have a feeling that anything we can come up with will basically
> have the same quality that you so dislike. I don't know how to do
> better. Frankly, I'd rather be roughly correct than exactly wrong.

Sure, but the fact that the model has huge discontinuities - perhaps
most notably a case where adding a single tuple to the estimated
cardinality changes the crossover point by a factor of two - suggests
that you are probably wrong.  The actual behavior does not change
sharply when the size of the SortTuple array crosses 1GB, but the
estimates do.  That means that either the estimates are wrong for
44,739,242 tuples or they are wrong for 44,739,243 tuples.  The
behavior cannot be right in both cases unless that one extra tuple
changes the behavior radically, or unless the estimate doesn't matter
in the first place.

> By the way, I think that there needs to be a little work done to
> cost_sort() too, which so far I've avoided.

Yeah, I agree, but that can be a separate topic.

>> I agree, but that's not what I proposed.  You don't want to keep
>> re-sorting to incorporate new tuples into the run, but if you've got
>> 1010 tuples and you can fit 1000 tuples in, you can (a) quicksort the
>> first 1000 tuples, (b) read in 10 more tuples, dumping the first 10
>> tuples from run 0 to disk, (c) quicksort the last 10 tuples to create
>> run 1, and then (d) merge run 0 [which is mostly in memory] with run 1
>> [which is entirely in memory].  In other words, yes, quicksorting
>> doesn't let you add things to the sort incrementally, but you can
>> still write out the run incrementally, writing only as many tuples as
>> you need to dump to get the rest of the input data into memory.
>
> Merging is still sorting. The 10 tuples are not very cheap to merge
> against the 1000 tuples, because you'll probably still end up reading
> most of the 1000 tuples to do so.

You're going to read all of the 1000 tuples no matter what, because
you need to return them, but you will also need to make comparisons on
most of them, unless the data distribution is favorable.   Assuming no
special good luck, it'll take something close to X + Y - 1 comparisons
to do the merge, so something around 1009 comparisons here.
Maintaining the heap property is not free either, but it might be
cheaper.

> The cost of heapification of 1.01 million tuples to spill 0.01 million
> tuples is pretty low (relative to the cost of sorting them in
> particular). The only difference between what you say here and what I
> actually do is that the remaining tuples are heapified rather than
> sorted, and I quicksort everything together to "merge run 1 and run 0"
> rather than doing two quicksorts and a merge. I believe that this can
> be demonstrated to be cheaper.
>
> Another factor is that the heap could be useful for other stuff in the
> future. As Simon Riggs pointed out, for deduplicating values as
> they're read in by tuplesort. (Okay, that's really the only other
> thing, but it's a good one).

Not sure how that would work?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Using quicksort for every external sort run

From

Peter Geoghegan

Date:

23 December 2015, 20:31:39

On Wed, Dec 23, 2015 at 9:37 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> The point is, nobody can tell WHAT effects this is modeling.
> Increasing the tuple size makes the crossover go up.  Or down.

There are multiple, competing considerations.

> This analogy is faulty.  It's true that when we run across a qual
> whose selectivity we cannot estimate in any meaningful way, we have to
> just take a stab in the dark and hope for the best.  Similarly, if we
> have no information about what the crossover point for a given sort
> is, we'd have to take some arbitrary estimate, like 75%, and hope for
> the best.  But in this case, we DO have information.  We have an
> estimated row count and an estimated row width.  And those values are
> not being ignored, they are getting used.  The problem is that they
> are being used in an arbitrary way that is not justified by any chain
> of reasoning.

There is a chain of reasoning. It's not particularly satisfactory that
it's so fuzzy, certainly, but the competing considerations here are
substantive (and include erring towards not proceeding with
replacement selection/"quicksort with spillover" when the benefits are
low relative to the costs, which, to repeat myself, is itself novel).

I am more than open to suggestions on alternatives. As I said, I don't
particularly care for my current approach, either. But doing something
analogous to cost_sort() for our private "Do we quicksort with
spillover?"/useselection() model is going to be strictly worse than
what I have proposed.

Any cost model will have to be sensitive to different types of CPU
costs at the level that matters here -- such as the size of the heap,
and its cache efficiency. That's really important, but very
complicated, and variable enough that erring against using replacement
selection seems like a good idea with bigger heaps especially. That
(cache efficiency) is theoretically the only difference that matters
here (other than I/O, of course, but avoiding I/O is only the upside
of proceeding, and if we only weigh that then the cost model always
gives the same answer).

Perhaps you can suggest an alternative model that weighs these
factors. Most sorts are less than 1GB, and it seems worthwhile to
avoid I/O at the level where an internal sort is just out of reach.
Really big CREATE INDEX sorts are not really what I have in mind with
"quicksort with spillover".

This cost_sort() code seems pretty bogus to me, FWIW:
       /* Assume 3/4ths of accesses are sequential, 1/4th are not */       startup_cost += npageaccesses *
(seq_page_cost* 0.75 + random_page_cost * 0.25);

I think we can afford to be a lot more optimistic about the proportion
of sequential accesses.

>> Merging is still sorting. The 10 tuples are not very cheap to merge
>> against the 1000 tuples, because you'll probably still end up reading
>> most of the 1000 tuples to do so.
>
> You're going to read all of the 1000 tuples no matter what, because
> you need to return them, but you will also need to make comparisons on
> most of them, unless the data distribution is favorable.   Assuming no
> special good luck, it'll take something close to X + Y - 1 comparisons
> to do the merge, so something around 1009 comparisons here.
> Maintaining the heap property is not free either, but it might be
> cheaper.

I'm pretty sure that it's cheaper. Some of the really good cases for
"quicksort with spillover" where only a little bit slower than a fully
internal sort when the work_mem threshold was just crossed.

>> Another factor is that the heap could be useful for other stuff in the
>> future. As Simon Riggs pointed out, for deduplicating values as
>> they're read in by tuplesort. (Okay, that's really the only other
>> thing, but it's a good one).
>
> Not sure how that would work?

Tuplesort would have license to discard tuples with matching existing
values, because the caller gave it permission to. This is something
that you can easily imagine occurring with ordered set aggregates, for
example. It would work in a way not unlike a top-N heapsort does
today. This would work well when it can substantially lower the use of
memory (initially heapification when the threshold is crossed would
probably measure the number of duplicates, and proceed only when it
looked like a promising strategy).

By the way, I think the heap currently does quite badly with many
duplicated values. That case seemed significantly slower than a
similar case with high cardinality tuples.

-- 
Peter Geoghegan

Re: Using quicksort for every external sort run

From

Jeff Janes

Date:

23 December 2015, 21:03:32

On Mon, Dec 14, 2015 at 7:22 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Mon, Dec 14, 2015 at 6:58 PM, Greg Stark <stark@mit.edu> wrote:
>> I ran sorts with various parameters on my small NAS server.
>
> ...
>
>> without the extra memory optimizations.
>
> Thanks for taking the time to benchmark the patch!
>
> While I think it's perfectly fair that you didn't apply the final
> on-the-fly merge "memory pool" patch, I also think that it's quite
> possible that the regression you see at the very low end would be
> significantly ameliorated or even eliminated by applying that patch,
> too. After all, Jeff Janes had a much harder time finding a
> regression, probably because he benchmarked all patches together.

The regression I found when building an index on a column of
400,000,000 md5(random()::text) with 64MB maintenance_work_mem was not
hard to find at all.  I still don't understand what is going on with
it, but it is reproducible.  Perhaps it is very unlikely and I just
got very lucky in finding it immediately after switching to that
data-type for my tests, but I wouldn't assume that on current
evidence.

If we do think it is important to almost never cause regressions at
the default maintenance_work_mem (I am agnostic on the importance of
that), then I think we have more work to do here.  I just don't know
what that work is.

Cheers,

Jeff

Re: Using quicksort for every external sort run

From

Peter Geoghegan

Date:

23 December 2015, 21:08:42

On Wed, Dec 23, 2015 at 1:03 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> If we do think it is important to almost never cause regressions at
> the default maintenance_work_mem (I am agnostic on the importance of
> that), then I think we have more work to do here.  I just don't know
> what that work is.

My next revision will use grow_memtuples() in advance of the final
on-the-fly merge step, in a way that considers that we won't be losing
out to palloc() overhead (so it'll mostly be the memory patch that is
revised). This can make a large difference to the number of slots
(memtuples) available. I think I measured a 6% or 7% additional
improvement for a case with a fairly small number of runs to merge. It
might help significantly more when there are more runs to merge.

-- 
Peter Geoghegan

Re: Using quicksort for every external sort run

From

Robert Haas

Date:

23 December 2015, 21:16:16

On Wed, Dec 23, 2015 at 3:31 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Wed, Dec 23, 2015 at 9:37 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> The point is, nobody can tell WHAT effects this is modeling.
>> Increasing the tuple size makes the crossover go up.  Or down.
>
> There are multiple, competing considerations.

Please explain what they are and how they lead you to believe that the
cost factors you have chosen are good ones.

My point here is: even if I were to concede that your cost model
yields perfect answers in every case, the patch needs to give at least
some hint as to why.  Right now, it really doesn't.

>>> Another factor is that the heap could be useful for other stuff in the
>>> future. As Simon Riggs pointed out, for deduplicating values as
>>> they're read in by tuplesort. (Okay, that's really the only other
>>> thing, but it's a good one).
>>
>> Not sure how that would work?
>
> Tuplesort would have license to discard tuples with matching existing
> values, because the caller gave it permission to. This is something
> that you can easily imagine occurring with ordered set aggregates, for
> example. It would work in a way not unlike a top-N heapsort does
> today. This would work well when it can substantially lower the use of
> memory (initially heapification when the threshold is crossed would
> probably measure the number of duplicates, and proceed only when it
> looked like a promising strategy).

It's not clear to me how having a heap helps with that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Using quicksort for every external sort run

From

Peter Geoghegan

Date:

23 December 2015, 23:44:21

On Wed, Dec 23, 2015 at 1:16 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Dec 23, 2015 at 3:31 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> On Wed, Dec 23, 2015 at 9:37 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> The point is, nobody can tell WHAT effects this is modeling.
>>> Increasing the tuple size makes the crossover go up.  Or down.
>>
>> There are multiple, competing considerations.
>
> Please explain what they are and how they lead you to believe that the
> cost factors you have chosen are good ones.

Alright.

I've gone on at length about how I'm blurring the distinction between
internal and external sorting, or about how modern hardware
characteristics allow that. There are several reasons for that. Now,
we all know that main memory sizes have increased dramatically since
the 1970s, and storage characteristics are very different, and that
CPU caching effects have become very important, and that everyone has
lots more data.

There is one thing that hasn't really become bigger in all that time,
though: the width of tuples. So, as I go into in comments within
useselection(), that's the main reason why avoiding I/O isn't all that
impressive, especially at the high end. It's just not that big of a
cost at the high end. Beyond that, as linear costs go, palloc() is a
much bigger concern to me at this point. I think we can waste a lot
less time by amortizing that more extensively (to say nothing of the
saving in memory). This is really obvious by just looking at
trace_sort output with my patch applied when dealing with many runs,
sorting millions of tuples: There just isn't that much time spent on
I/O at all, and it's well hidden by foreground processing that is CPU
bound. With smaller work_mem sizes and far fewer tuples, a case much
more common within sort nodes (as opposed to utility statements), this
is less true. Sorting 1,000 or 10,000 tuples is an entirely different
thing to sorting 1,000,000 tuples.

So, first of all, the main consideration is that saving I/O turns out
to not matter that much at the high end. That's why we get very
conservative past the fairly arbitrary MaxAllocSize memtuples
threshold (which has a linear relationship to the number of tuples --
*not* the amount of memory used or disk space that may be used).

A second consideration is how much I/O we can save -- one would hope
it would be a lot, certainly the majority, to make up for the downside
of using a cache inefficient technique. That is a different thing to
the number of memtuples. If you had really huge tuples, there would be
a really big saving in I/O, often without a corresponding degradation
in cache performance (since there still many not be that many
memtuples, which is more the problem for the heap than anything else).
This distinction is especially likely to matter for the CLUSTER case,
where wide heap tuples (including heap tuple headers, visibility info)
are kind of along for the ride, which is less true elsewhere,
particularly for the CREATE INDEX case.

The cache inefficiency of spilling incrementally from a heap isn't so
bad if we only end up sorting a small number of tuples that way. So as
the number of tuples that we end up actually sorting that way
increases, the cache inefficiency becomes worse, while at the same
time, we save less I/O. The former is a bigger problem than the
latter, by a wide margin, I believe.

This code is an attempt to credit cases with really wide tuples:
   /*    * Starting from a threshold of 90%, refund 7.5% per 32 byte    * average-size-increment.    */   increments =
MAXALIGN_DOWN((int)avgTupleSize) / 32;   crossover = 0.90 - (increments * 0.075);

Most cases won't get too many "increments" of credit (although CLUSTER
sorts will probably get relatively many).

A third consideration is that we should be stingy about giving too
much credit to wider tuples because the cache inefficiency hurts more
as we achieve mere linear savings in I/O. So, most of the savings off
a 99.99% theoretical baseline threshold are fixed (you usually save
9.99% off that up-front).

A forth consideration is that the heap seems to do really badly past
1GB in general, due to cache characteristics. This is certainly not
something that I know how to model well.

I don't blame you for calling this voodoo, because to some extent it
is. But I remind you that the consequences of making the wrong
decision here are still better than the status quo today -- probably
far better, overall. I also remind you that voodoo code is something
you'll find in well regarded code bases at times. Have you ever
written networking code? Packet switching is based on some handwavy
observations about the real world. Practical implementations often
contain voodoo magic numbers. So, to answer your earlier question:
Yes, maybe it wouldn't be so bad, all things considered, to let
someone complain about this if they have a real-world problem with it.
The complexity of what we're talking about makes me modest about my
ability to get it exactly right. At the same time, the consequences of
getting it somewhat wrong are really not that bad. This is basically
the same tension that you get with more rigorous cost models anyway
(where greater rigor happens to be possible).

I will abandon this cost model at the first sign of a better
alternative -- I'm really not the least bit attached to it. I had
hoped that we'd be able to do a bit better than this through
discussion on list, but not far better. In any case, "quicksort with
spillover" is of secondary importance here (even though it just so
happens that I started with it).

>>>> Another factor is that the heap could be useful for other stuff in the
>>>> future. As Simon Riggs pointed out, for deduplicating values as
>>>> they're read in by tuplesort. (Okay, that's really the only other
>>>> thing, but it's a good one).
>>>
>>> Not sure how that would work?
>>
>> Tuplesort would have license to discard tuples with matching existing
>> values, because the caller gave it permission to. This is something
>> that you can easily imagine occurring with ordered set aggregates, for
>> example. It would work in a way not unlike a top-N heapsort does
>> today. This would work well when it can substantially lower the use of
>> memory (initially heapification when the threshold is crossed would
>> probably measure the number of duplicates, and proceed only when it
>> looked like a promising strategy).
>
> It's not clear to me how having a heap helps with that.

The immediacy of detecting a duplicate could be valuable. We could
avoid allocating tuplesort-owned memory entirely much of the time.
Basically, this is another example (quicksort with spillover being the
first) where incrementalism helps rather than hurts. Another
consideration is that we could thrash if we misjudge the frequency at
which to eliminate duplicates if we quicksort + periodically dedup.
This is especially of concern in the common case where there are big
clusters of the same value, and big clusters of heterogeneous values.

-- 
Peter Geoghegan

Re: Using quicksort for every external sort run

From

Michael Paquier

Date:

24 December 2015, 02:39:32

On Thu, Dec 24, 2015 at 8:44 AM, Peter Geoghegan <pg@heroku.com> wrote:
> [long blahblah]

(Patch moved to next CF, work is going on. Thanks to people here to be active)
-- 
Michael

Re: Using quicksort for every external sort run

From

Peter Geoghegan

Date:

24 December 2015, 03:48:24

On Wed, Dec 23, 2015 at 1:03 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> The regression I found when building an index on a column of
> 400,000,000 md5(random()::text) with 64MB maintenance_work_mem was not
> hard to find at all.  I still don't understand what is going on with
> it, but it is reproducible.  Perhaps it is very unlikely and I just
> got very lucky in finding it immediately after switching to that
> data-type for my tests, but I wouldn't assume that on current
> evidence.

Well, that is a lot of tuples to sort with such a small amount of memory.

I have a new theory. Maybe part of the problem here is that in very
low memory conditions, the tape overhead really is kind of wasteful,
and we're back to having to worry about per-tape overhead (6 tapes may
have been far too miserly as a universal number back before that was
fixed [1], but that doesn't mean that the per-tape overhead is
literally zero). You get a kind of thrashing, perhaps. Also, more
tapes results in more random I/O, and that's an added cost, too; the
cure may be worse than the disease.

I also think that this might be a problem in your case:

 * In this calculation we assume that each tape will cost us about 3 blocks
 * worth of buffer space (which is an underestimate for very large data
 * volumes, but it's probably close enough --- see logtape.c).

I wonder, what's the situation here like with the attached patch
applied on top of what you were testing? I think that we might be
better off with more merge steps when under enormous memory pressure
at the low end, in order to be able to store more tuples per tape (and
do more sorting using quicksort). I also think that under conditions
such as you describe, this code may play havoc with memory accounting:

    /*
     * Decrease availMem to reflect the space needed for tape buffers; but
     * don't decrease it to the point that we have no room for tuples. (That
     * case is only likely to occur if sorting pass-by-value Datums; in all
     * other scenarios the memtuples[] array is unlikely to occupy more than
     * half of allowedMem.  In the pass-by-value case it's not important to
     * account for tuple space, so we don't care if LACKMEM becomes
     * inaccurate.)
     */
    tapeSpace = (int64) maxTapes *TAPE_BUFFER_OVERHEAD;

    if (tapeSpace + GetMemoryChunkSpace(state->memtuples) < state->allowedMem)
        USEMEM(state, tapeSpace);

Remember, this is after the final grow_memtuples() call that uses your
intelligent resizing logic [2], so we'll USEMEM() in a way that
effectively makes some non-trivial proportion of our optimal memtuples
sizing unusable. Again, that could be really bad for cases like yours,
with very little memory relatively to data volume.

Thanks

[1] Commit df700e6b4
[2] Commit 8ae35e918
--
Peter Geoghegan

Attachment

maxorder_theory.patch

Re: Using quicksort for every external sort run

From

Peter Geoghegan

Date:

24 December 2015, 05:27:04

On Wed, Dec 23, 2015 at 7:48 PM, Peter Geoghegan <pg@heroku.com> wrote:
> I wonder, what's the situation here like with the attached patch
> applied on top of what you were testing? I think that we might be
> better off with more merge steps when under enormous memory pressure
> at the low end, in order to be able to store more tuples per tape (and
> do more sorting using quicksort).

Actually, now that I look into it, I think your 64MB work_mem setting
would have 234 tapes in total, so my patch won't do anything for your
case. Maybe change MAXORDER to 100 within the patch, to see where that
leaves things? I want to see if there is any improvement.

234 tapes means that approximately 5.7MB of memory would go to just
using tapes (for accounting purposes, which is mostly my concern
here). However, for a case like this, where you're well short of being
able to do everything in one pass, there is no benefit to having more
than about 6 tapes (I guess that's probably still true these days).
That 5.7MB of tape space for accounting purposes (and also in reality)
may not only increase the amount of random I/O required, and not only
throw off the memtuples estimate within grow_memtuples() (its balance
against everything else), but also decrease the cache efficiency in
the final on-the-fly merge (the efficiency in accessing tuples).

-- 
Peter Geoghegan

Re: Using quicksort for every external sort run

From

Peter Geoghegan

Date:

28 December 2015, 23:03:38

On Fri, Dec 18, 2015 at 11:57 AM, Peter Geoghegan <pg@heroku.com> wrote:
> BTW, I'm not necessarily determined to make the new special-purpose
> allocator work exactly as proposed. It seemed useful to prioritize
> simplicity, and currently so there is one big "huge palloc()" with
> which we blow our memory budget, and that's it. However, I could
> probably be more clever about "freeing ranges" initially preserved for
> a now-exhausted tape. That kind of thing.

Attached is a revision that significantly overhauls the memory patch,
with several smaller changes.

We can now grow memtuples to rebalance the size of the array
(memtupsize) against the need for memory for tuples. Doing this makes
a big difference with a 500MB work_mem setting in this datum sort
case, as my newly expanded trace_sort instrumentation shows:

LOG:  grew memtuples 1.40x from 9362286 (219429 KB) to 13107200
(307200 KB) for final merge
LOG:  tape 0 initially used 34110 KB of 34110 KB batch (1.000) and
13107200 slots remaining
LOG:  tape 1 initially used 34110 KB of 34110 KB batch (1.000) and has
1534 slots remaining
LOG:  tape 2 initially used 34110 KB of 34110 KB batch (1.000) and has
1535 slots remaining
LOG:  tape 3 initially used 34110 KB of 34110 KB batch (1.000) and has
1533 slots remaining
LOG:  tape 4 initially used 34110 KB of 34110 KB batch (1.000) and has
1534 slots remaining
LOG:  tape 5 initially used 34110 KB of 34110 KB batch (1.000) and has
1535 slots remaining

This is a big improvement. With the new batchmemtuples() call
commented out (i.e. no new grow_memtuples() call), the LOG output
around the same point is:

LOG:  tape 0 initially used 24381 KB of 48738 KB batch (0.500) and has
1 slots remaining
LOG:  tape 1 initially used 24381 KB of 48738 KB batch (0.500) and has
1 slots remaining
LOG:  tape 2 initially used 24381 KB of 48738 KB batch (0.500) and has
1 slots remaining
LOG:  tape 3 initially used 24381 KB of 48738 KB batch (0.500) and has
1 slots remaining
LOG:  tape 4 initially used 24381 KB of 48738 KB batch (0.500) and has
1 slots remaining
LOG:  tape 5 initially used 24381 KB of 48738 KB batch (0.500) and has
1 slots remaining

(I actually added a bit more detail to what you see here during final clean-up)

Obviously we're using memory a lot more efficiently here as compared
to my last revision (or the master branch -- it always has palloc()
overhead, of course). With no grow_memtuples, we're not wasting ~1530
slots per tape anymore (which is a tiny fraction of 1% of the total),
but we are wasting 50% of all batch memory, or almost 30% of all
work_mem.

Note that this improvement is possible despite the fact that memory is
still MAXALIGN()'d -- I'm mostly just clawing back what I can, having
avoided much STANDARDCHUNKHEADERSIZE overhead for the final on-the-fly
merge. I tend to think that the bigger problem here is that we use so
many memtuples when merging in the first place though (e.g. 60% in the
above case), because memtuples are much less useful than something
like a simple array of pointers when merging; I can certainly see why
you'd need 6 memtuples here, for the merge heap, but the other ~13
million seem mostly unnecessary. Anyway, what I have now is as far as
I want to go to accelerate merging for 9.6, since parallel CREATE
INDEX is where the next big win will come from. As wasteful as this
can be, I think it's of secondary importance.

With this revision, I've given up on the idea of trying to map
USEMEM()/FREEMEM() to "logical" allocations and deallocations that
consume from each tape's batch. The existing merge code in the master
branch is concerned exclusively with making each tape's use of memory
fair; each tape only gets so many "slots" (memtuples), and so much
memory, and that's it (there is never any shuffling of those resource
budgets between tapes). I get the same outcome from simply only
allowing tapes to get memory from their own batch allocation, which
isn't much complexity, because only READTUP() routines regularly need
memory. We detect when memory has been exhausted within
mergeprereadone() in a special way, not using LACKMEM() at all -- this
seems simpler. (Specifically, we use something called overflow
allocations for this purpose. This means that there are still a very
limited number of retail palloc() calls.)

This new version somewhat formalizes the idea that batch allocation
may one day have uses beyond the final on-the-fly merge phase, which
makes a lot of sense. We should really be saving a significant amount
of memory when initially sorting runs, too. This revision also
pfree()s tape memory early if the tape is exhausted early, which will
help a lot when there is a logical/physical correlation.

Overall, I'm far happier with how memory is managed in this revision,
mostly because it's easier to reason about. trace_sort now closely
monitors where memory goes, and I think that's a good idea in general.
That makes production performance problems a lot easier to reason
about -- the accounting should be available to expert users (that
enable trace_sort). I'll have little sympathy for the suggestion that
this will overwhelm users, because trace_sort is already only suitable
for experts. Besides, it isn't that complicated to figure this stuff
out, or at least gain an intuition for what might be going on based on
differences seen in a problematic case. Getting a better picture of
what "bad" looks like can guide an investigation without the DBA
necessarily understanding the underlying algorithms. At worst, it
gives them something specific to complain about here.

Other changes:

* No longer use "tuple proper" terminology. Also, memory pools are now
referred to as batch memory allocations. This is at the request of
Jeff and Robert.

* Fixed silly bug in useselection() cost model that causes "quicksort
with spillover" to never be used. The cost model is otherwise
unchanged, because I didn't come up with any bright ideas about how to
do better there. Ideas from other people are very much welcome.

* Cap the maximum number of tapes to 500. I think it's silly that the
number of tapes is currently a function of work_mem, without any
further consideration of the details of the sort, but capping is a
simpler solution than making tuplesort_merge_order() smarter. I
previously saw quite a lot of waste with high work_mem settings, with
tens of thousands of tapes that will never be used, precisely because
we have lots of memory (the justification for having, say, 40k tapes
seems to be almost an oxymoron). Tapes (or the accounting for
never-allocated tapes) could take almost 10% of all memory. Also, less
importantly, we now refund/FREEMEM() unallocated tape memory ahead of
final on-the-fly merge preallocation of batch memory.

Note that we contemplated bounding the number of tapes in the past
several times. See the commit message of c65ab0bfa9, a commit from
almost a decade ago, for an example of this. That message also
describes how "slots" (memtuples) and memory for tuples must be kept
in balance while merging, which is very much relevant to my new
grow_memtuples() call.

--
Peter Geoghegan

Attachment

Re: Using quicksort for every external sort run

From

Peter Geoghegan

Date:

27 January 2016, 13:20:10

On Wed, Dec 23, 2015 at 9:37 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> But yes, let me concede more clearly: the cost model is based on
>> frobbing. But at least it's relatively honest about that, and is
>> relatively simple. I think it might be possible to make it simpler,
>> but I have a feeling that anything we can come up with will basically
>> have the same quality that you so dislike. I don't know how to do
>> better. Frankly, I'd rather be roughly correct than exactly wrong.
>
> Sure, but the fact that the model has huge discontinuities - perhaps
> most notably a case where adding a single tuple to the estimated
> cardinality changes the crossover point by a factor of two - suggests
> that you are probably wrong.  The actual behavior does not change
> sharply when the size of the SortTuple array crosses 1GB, but the
> estimates do.

Here is some fairly interesting analysis of Quicksort vs. Heapsort,
from Bentley, coauthor of our own Quicksort implementation:

https://youtu.be/QvgYAQzg1z8?t=16m15s

(This link picks up at the right point to see the comparison, complete
with an interesting graph).

It probably doesn't tell you much that you didn't already know, at
least at this exact point, but it's nice to see Bentley's graph. This
perhaps gives you some idea of why my "quicksort with spillover" cost
model had a cap of MaxAllocSize of SortTuples, past which we always
needed a very compelling case. That was my rough guess of where the
Heapsort graph takes a sharp upward turn. Before then, Bentley shows
that it's close enough to a straight line.

Correct me if I'm wrong, but I think that the only outstanding issue
with all patches posted here so far is the "quicksort with spillover"
cost model. Hopefully this can be cleared up soon. As I've said, I am
very receptive to other people's suggestions about how that should
work.

-- 
Peter Geoghegan

Re: Using quicksort for every external sort run

From

Mithun Cy

Date:

29 January 2016, 11:41:38

On Tue, Dec 29, 2015 at 4:33 AM, Peter Geoghegan <pg@heroku.com> wrote:

>Attached is a revision that significantly overhauls the memory patch,
>with several smaller changes.

I just ran some tests on above patch. Mainly to compare

how "longer sort keys" would behave with new(Qsort) and old Algo(RS) for sorting.

I have 8GB of ram and ssd storage.

Settings and Results.

----------------------------

Work_mem= DEFAULT (4mb).

key width = 520.

CASE 1. Data is pre-sorted as per sort key order.

CASE 2. Data is sorted in opposite order of sort key.

CASE 3. Data is randomly distributed.

Key length 520
Number of records	3200000	6400000	12800000	25600000
	1.7 GB	3.5GB	7 GB	14GB

CASE 1
RS	23654.677	35172.811	44965.442	106420.155
Qsort	14100.362	40612.829	101068.107	334893.391

CASE 2
RS	13427.378	36882.898	98492.644	310670.15
Qsort	12475.133	32559.074	100772.531	322080.602

CASE 3
RS	17202.966	45163.234	122323.299	337058.856
Qsort	12530.726	23343.753	59431.315	152862.837

If data is sorted as same as sort key order then current code performs better than proposed patch

as sort size increases.

It appears new algo do not seem have any major impact if rows are presorted in opposite order.

For randomly distributed order quick sort performs well when compared to current sort method (RS).

======================================================

Now Increase the work_mem to 64MB and for 14 GB of data to sort.

CASE 1: We can see Qsort is able to catchup with current sort method(RS).

CASE 2: No impact.

CASE 3: RS is able to catchup with Qsort.

CASE 1	RS	128822.735
	Qsort	90857.496

CSAE 2	RS	105631.775
	Qsort	105938.334

CASE 3	RS	152301.054
	Qsort	149649.347

I think for long keys both old (RS) and new (Qsort) sort method has its own characteristics

based on data distribution. I think work_mem is the key If properly set new method(Qsort) will

be able to fit most of the cases. If work_mem is not tuned right it, there are cases it can regress.

Thanks and Regards

Mithun C Y

EnterpriseDB: http://www.enterprisedb.com

Attachment

Test Queries.sql

Re: Using quicksort for every external sort run

From

Mithun Cy

Date:

29 January 2016, 12:03:37

On Fri, Jan 29, 2016 at 5:11 PM, Mithun Cy <mithun.cy@enterprisedb.com> wrote

>I just ran some tests on above patch. Mainly to compare
>how "longer sort keys" would behave with new(Qsort) and old Algo(RS) for sorting.
>I have 8GB of ram and ssd storage.

Key length 520

Number of records 3200000 6400000 12800000 25600000

1.7 GB 3.5GB 7 GB 14GB

CASE 1

RS 23654.677 35172.811 44965.442 106420.155
Qsort 14100.362 40612.829 101068.107 334893.391

CASE 2

RS 13427.378 36882.898 98492.644 310670.15
Qsort 12475.133 32559.074 100772.531 322080.602

CASE 3

RS 17202.966 45163.234 122323.299 337058.856
Qsort 12530.726 23343.753 59431.315 152862.837

CASE 1 RS 128822.735

Qsort 90857.496

CSAE 2 RS 105631.775

Qsort 105938.334

CASE 3 RS 152301.054

Qsort 149649.347

Sorry forgot to mention above data in table is in unit of ms, returned by psql client.

--

Thanks and Regards

Mithun C Y

EnterpriseDB: http://www.enterprisedb.com

Re: Using quicksort for every external sort run

From

Peter Geoghegan

Date:

29 January 2016, 16:53:53

On Fri, Jan 29, 2016 at 3:41 AM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:
> I just ran some tests on above patch. Mainly to compare
> how "longer sort keys" would behave with new(Qsort) and old Algo(RS) for sorting.
> I have 8GB of ram and ssd storage.
>
> Settings and Results.
> ----------------------------
> Work_mem= DEFAULT (4mb).
> key width = 520.

> If data is sorted as same as sort key order then current code performs better than proposed patch
> as sort size increases.
>
> It appears new algo do not seem have any major impact if rows are presorted in opposite order.
>
> For randomly distributed order quick sort performs well when compared to current sort method (RS).
>
>
> ======================================================
> Now Increase the work_mem to 64MB and for 14 GB of data to sort.
>
> CASE 1: We can see Qsort is able to catchup with current sort method(RS).
> CASE 2:  No impact.
> CASE 3: RS is able to catchup with Qsort.

I think that the basic method you're using to do these tests may have
additional overhead:

-- sort in ascending order.
CREATE FUNCTION test_orderby_asc( ) RETURNS int
AS $$
#print_strict_params on
DECLARE
gs int;
jk text;
BEGIN
SELECT string_4k, generate_series INTO  jk, gs       FROM so order by string_4k, generate_series;
   RETURN gs;
END
$$ LANGUAGE plpgsql;

Anyway, these test cases all remove much of the advantage of increased
cache efficiency.  No comparisons are *ever* resolved using the
leading attribute, which calls into question why anyone would sort on
that. It's 512 bytes, so artificially makes the comparisons themselves
the bottleneck, as opposed to cache efficiency. You can't even fit the
second attribute in the same cacheline as the first in the "tuple
proper" (MinimalTuple).

You are using a 4MB work_mem setting, but you almost certainly have a
CPU with an L3 cache size that's a multiple of that, even with cheap
consumer grade hardware. You have 8GB of ram; a 4MB work_mem setting
is very small setting (I mean in an absolute sense, less so than
relative to the size of data, although especially relative to the
data).

You mentioned "CASE 3: RS is able to catchup with Qsort", which
doesn't make much sense to me. The only way I think that is possible
is by making the increased work_mem sufficient to have much longer
runs, because there is in fact somewhat of a correlation in the data,
and an increased work_mem makes the critical difference, allowing
perhaps one long run to be used -- there is now enough memory to
"juggle" tuples without ever needing to start a new run. But, how
could that be? You said case 3 was totally random data, so I'd only
expect incremental improvement. It could also be some weird effect
from polyphase merge. A discontinuity.

I also don't understand why the patch ("Qsort") can be so much slower
between case 1 and case 3 on 3.5GB+ sizes, but not the 1.7GB size.
Even leaving aside the differences between "RS" and "Qsort", it makes
no sense to me that *both* are faster with random data ("CASE 3") than
with presorted data ("CASE 1").

Another weird thing is that the traditional best case for replacement
selection ("RS") is a strong correlation, and a traditional worst case
is an inverse correlation, where run size is bound strictly by memory.
But you show just the opposite here -- the inverse correlation is
faster with RS in the 1.7 GB data case. So, I have no idea what's
going on here, and find it all very confusing.

In order for these numbers to be useful, they need more detail --
"trace_sort" output. There are enough confounding factors in general,
and especially here, that not having that information makes raw
numbers very difficult to interpret.

> I think for long keys both old (RS) and new (Qsort) sort method has its own characteristics
> based on data distribution. I think work_mem is the key If properly set new method(Qsort) will
> be able to fit most of the cases. If work_mem is not tuned right it, there are cases it can regress.

work_mem is impossible to tune right with replacement selection.
That's a key advantage of the proposed new approach.

-- 
Peter Geoghegan

Re: Using quicksort for every external sort run

From

Robert Haas

Date:

29 January 2016, 17:24:28

On Wed, Jan 27, 2016 at 8:20 AM, Peter Geoghegan <pg@heroku.com> wrote:
> Correct me if I'm wrong, but I think that the only outstanding issue
> with all patches posted here so far is the "quicksort with spillover"
> cost model. Hopefully this can be cleared up soon. As I've said, I am
> very receptive to other people's suggestions about how that should
> work.

I feel like this could be data driven.  I mean, the cost model is
based mainly on the tuple width and the size of the SortTuple array.
So, it should be possible to tests of both algorithms on 32, 64, 96,
128, ... byte tuples with a SortTuple array that is 256MB, 512MB,
768MB, 1GB, ...  Then we can judge how closely the cost model comes to
mimicking the actual behavior.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Using quicksort for every external sort run

From

Peter Geoghegan

Date:

29 January 2016, 17:46:53

On Fri, Jan 29, 2016 at 9:24 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> I feel like this could be data driven.  I mean, the cost model is
> based mainly on the tuple width and the size of the SortTuple array.
> So, it should be possible to tests of both algorithms on 32, 64, 96,
> 128, ... byte tuples with a SortTuple array that is 256MB, 512MB,
> 768MB, 1GB, ...  Then we can judge how closely the cost model comes to
> mimicking the actual behavior.

You would also need to represent how much of the input actually ended
up being sorted with the heap in each case. Maybe that could be tested
at 50% (bad for "quicksort with spillover"), 25% (better), and 5%
(good).

An alternative approach that might be acceptable is to add a generic,
conservative 90% threshold (so 10% of tuples sorted by heap).

-- 
Peter Geoghegan

Re: Using quicksort for every external sort run

From

Robert Haas

Date:

29 January 2016, 22:58:30

On Fri, Jan 29, 2016 at 12:46 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Fri, Jan 29, 2016 at 9:24 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I feel like this could be data driven.  I mean, the cost model is
>> based mainly on the tuple width and the size of the SortTuple array.
>> So, it should be possible to tests of both algorithms on 32, 64, 96,
>> 128, ... byte tuples with a SortTuple array that is 256MB, 512MB,
>> 768MB, 1GB, ...  Then we can judge how closely the cost model comes to
>> mimicking the actual behavior.
>
> You would also need to represent how much of the input actually ended
> up being sorted with the heap in each case. Maybe that could be tested
> at 50% (bad for "quicksort with spillover"), 25% (better), and 5%
> (good).
>
> An alternative approach that might be acceptable is to add a generic,
> conservative 90% threshold (so 10% of tuples sorted by heap).

I don't quite know what you mean by these numbers.  Add a generic,
conservative threshold to what?

Thinking about this some more, I really think we should think hard
about going back to the strategy which you proposed and discarded in
your original post: always generate the first run using replacement
selection, and every subsequent run by quicksorting.  In that post you
mention powerful advantages of this method: "even without a strong
logical/physical correlation, the algorithm tends to produce runs that
are about twice the size of work_mem. (It's also notable that
replacement selection only produces one run with mostly presorted
input, even where input far exceeds work_mem, which is a neat trick.)"You went on to dismiss that strategy, saying that
"despitethese

upsides, replacement selection is obsolete, and should usually be
avoided."  But I don't see that you've justified that statement.  It
seems pretty easy to construct cases where this technique regresses,
and a large percentage of those cases are precisely those where
replacement selection would have produced a single run, avoiding the
merge step altogether.  I think those cases are extremely important.
I'm quite willing to run somewhat more slowly than in other cases to
be certain of not regressing the case of completely or
almost-completely ordered input.  Even if that didn't seem like a
sufficient reason unto itself, I'd be willing to go that way just so
we don't have to depend on a cost model that might easily go wrong due
to bad input even if it were theoretically perfect in every other
respect (which I'm pretty sure is not true here anyway).

I also have another idea that might help squeeze more performance out
of your approach and avoid regressions.  Suppose that we add a new GUC
with a name like sort_mem_stretch_multiplier or something like that,
with a default value of 2.0 or 4.0 or whatever we think is reasonable.
When we've written enough runs that a polyphase merge will be
required, or when we're actually performing a polyphase merge, the
amount of memory we're allowed to use increases by this multiple.  The
idea is: we hope that people will set work_mem appropriately and
consequently won't experience polyphase merges at all, but it might.
However, it's almost certain not to happen very frequently.
Therefore, using extra memory in such cases should be acceptable,
because while you might have every backend in the system using 1 or
more copies of work_mem for something if the system is very busy, it
is extremely unlikely that you will have more than a handful of
processes doing polyphase merges.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Using quicksort for every external sort run

From

Peter Geoghegan

Date:

30 January 2016, 07:25:42

On Fri, Jan 29, 2016 at 2:58 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I don't quite know what you mean by these numbers.  Add a generic,
> conservative threshold to what?

I meant use "quicksort with spillover" simply because an estimated
90%+ of all tuples have already been consumed. Don't consider the
tuple width, etc.

> Thinking about this some more, I really think we should think hard
> about going back to the strategy which you proposed and discarded in
> your original post: always generate the first run using replacement
> selection, and every subsequent run by quicksorting.  In that post you
> mention powerful advantages of this method: "even without a strong
> logical/physical correlation, the algorithm tends to produce runs that
> are about twice the size of work_mem. (It's also notable that
> replacement selection only produces one run with mostly presorted
> input, even where input far exceeds work_mem, which is a neat trick.)"
>  You went on to dismiss that strategy, saying that "despite these
> upsides, replacement selection is obsolete, and should usually be
> avoided."  But I don't see that you've justified that statement.

Really? Just try it with a heap that is not tiny. Performance tanks.
The fact that replacement selection can produce one long run then
becomes a liability, not a strength. With a work_mem of something like
1GB, it's *extremely* painful.

> It seems pretty easy to construct cases where this technique regresses,
> and a large percentage of those cases are precisely those where
> replacement selection would have produced a single run, avoiding the
> merge step altogether.

...*and* where many passes are otherwise required (otherwise, the
merge is still cheap enough to leave us ahead). Typically with very
small work_mem settings, like 4MB, and far larger data volumes. It's
easy to construct those cases, but that doesn't mean that they
particularly matter. Using 4MB of work_mem to sort 10GB of data is
penny wise and pound foolish. The cases we've seen regressed are
mostly a concern because misconfiguration happens.

A compromise that may be acceptable is to always do a "quicksort with
spillover" when there is a very low work_mem setting and the estimate
of the number of input tuples is less than 10x of what we've seen so
far. Maybe less than 20MB. That will achieve the same thing.

> I'm quite willing to run somewhat more slowly than in other cases to
> be certain of not regressing the case of completely or
> almost-completely ordered input.  Even if that didn't seem like a
> sufficient reason unto itself, I'd be willing to go that way just so
> we don't have to depend on a cost model that might easily go wrong due
> to bad input even if it were theoretically perfect in every other
> respect (which I'm pretty sure is not true here anyway).

The consequences of being wrong either way are not severe (note that
making one long run isn't a goal of the cost model currently).

> I also have another idea that might help squeeze more performance out
> of your approach and avoid regressions.  Suppose that we add a new GUC
> with a name like sort_mem_stretch_multiplier or something like that,
> with a default value of 2.0 or 4.0 or whatever we think is reasonable.
> When we've written enough runs that a polyphase merge will be
> required, or when we're actually performing a polyphase merge, the
> amount of memory we're allowed to use increases by this multiple.  The
> idea is: we hope that people will set work_mem appropriately and
> consequently won't experience polyphase merges at all, but it might.
> However, it's almost certain not to happen very frequently.
> Therefore, using extra memory in such cases should be acceptable,
> because while you might have every backend in the system using 1 or
> more copies of work_mem for something if the system is very busy, it
> is extremely unlikely that you will have more than a handful of
> processes doing polyphase merges.

I'm not sure that that's practical. Currently, tuplesort decides on a
number of tapes ahead of time. When we're constrained on those, the
stretch multiplier would apply, but I think that that could be
invasive because the number of tapes ("merge order" + 1) was a
function of non-stretched work_mem.

-- 
Peter Geoghegan

Re: Using quicksort for every external sort run

From

Greg Stark

Date:

30 January 2016, 07:27:26

<p dir="ltr"><br /> On 29 Jan 2016 11:58 pm, "Robert Haas" <<a
href="mailto:robertmhaas@gmail.com">robertmhaas@gmail.com</a>>wrote:<br /> > It<br /> > seems pretty easy to
constructcases where this technique regresses,<br /> > and a large percentage of those cases are precisely those
where<br/> > replacement selection would have produced a single run, avoiding the<br /> > merge step altogether. 
<pdir="ltr">Now that avoiding the merge phase altogether didn't necessarily represent any actual advantage.<p
dir="ltr">Wedon't find out we've avoided the merge phase until the entire run has been spiked to disk. Then we need to
readit back in from disk to serve up those tuples.<p dir="ltr">If we have tapes to merge but can do then in a single
passwe do that lazily and merge as needed when we serve up the tuples. I doubt there's any speed difference in reading
twosequential streams with our buffering over one especially in the midst of a quiet doing other i/o. And N extra
comparisonsis less than the quicksort advantage.<p dir="ltr">If we could somehow predict that it'll be a single output
runthat would be a huge advantage. But having to spill all the tuples and then find out isn't really helpful.

Re: Using quicksort for every external sort run

From

Greg Stark

Date:

30 January 2016, 07:29:19

<p dir="ltr"><br /> On 30 Jan 2016 8:27 am, "Greg Stark" <<a href="mailto:stark@mit.edu">stark@mit.edu</a>>
wrote:<br/> ><br /> ><br /> > On 29 Jan 2016 11:58 pm, "Robert Haas" <<a
href="mailto:robertmhaas@gmail.com">robertmhaas@gmail.com</a>>wrote:<br /> > > It<br /> > > seems pretty
easyto construct cases where this technique regresses,<br /> > > and a large percentage of those cases are
preciselythose where<br /> > > replacement selection would have produced a single run, avoiding the<br /> >
>merge step altogether. <br /> ><br /> > Now that avoiding the merge phase altogether didn't necessarily
representany actual advantage.<br /> ><br /> > We don't find out we've avoided the merge phase until the entire
runhas been spiked to disk. <p dir="ltr">Hm, sorry about the phone typos. I thought I proofread it as I went but
obviouslynot that effectively...

Re: Using quicksort for every external sort run

From

Robert Haas

Date:

30 January 2016, 13:29:45

On Sat, Jan 30, 2016 at 2:25 AM, Peter Geoghegan <pg@heroku.com> wrote:
> On Fri, Jan 29, 2016 at 2:58 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I don't quite know what you mean by these numbers.  Add a generic,
>> conservative threshold to what?
>
> I meant use "quicksort with spillover" simply because an estimated
> 90%+ of all tuples have already been consumed. Don't consider the
> tuple width, etc.

Hmm, it's a thought.

>> Thinking about this some more, I really think we should think hard
>> about going back to the strategy which you proposed and discarded in
>> your original post: always generate the first run using replacement
>> selection, and every subsequent run by quicksorting.  In that post you
>> mention powerful advantages of this method: "even without a strong
>> logical/physical correlation, the algorithm tends to produce runs that
>> are about twice the size of work_mem. (It's also notable that
>> replacement selection only produces one run with mostly presorted
>> input, even where input far exceeds work_mem, which is a neat trick.)"
>>  You went on to dismiss that strategy, saying that "despite these
>> upsides, replacement selection is obsolete, and should usually be
>> avoided."  But I don't see that you've justified that statement.
>
> Really? Just try it with a heap that is not tiny. Performance tanks.
> The fact that replacement selection can produce one long run then
> becomes a liability, not a strength. With a work_mem of something like
> 1GB, it's *extremely* painful.

I'm not sure exactly what you think I should try.  I think a couple of
people have expressed the concern that your patch might regress things
on data that is all in order, but I'm not sure if you think I should
try that case or some case that is not-quite-in-order.  "I don't see
that you've justified that statement" is referring to the fact that
you presented no evidence in your original post that it's important to
sometimes use quicksorting even for run #1.  If you've provided some
test data illustrating that point somewhere, I'd appreciate a pointer
back to it.

> A compromise that may be acceptable is to always do a "quicksort with
> spillover" when there is a very low work_mem setting and the estimate
> of the number of input tuples is less than 10x of what we've seen so
> far. Maybe less than 20MB. That will achieve the same thing.

How about always starting with replacement selection, but limiting the
amount of memory that can be used with replacement selection to some
small value?  It could be a separate GUC, or a hard-coded constant
like 20MB if we're fairly confident that the same value will be good
for everyone.  If the tuples aren't in order, then we'll pretty
quickly come to the end of the first run and switch to quicksort.  If
we do end up using replacement selection for the whole sort, the
smaller heap is an advantage.  What I like about this sort of thing is
that it adds no reliance on any estimate; it's fully self-tuning.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Using quicksort for every external sort run

From

Peter Geoghegan

Date:

04 February 2016, 09:46:20

On Sat, Jan 30, 2016 at 5:29 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I meant use "quicksort with spillover" simply because an estimated
>> 90%+ of all tuples have already been consumed. Don't consider the
>> tuple width, etc.
>
> Hmm, it's a thought.

To be honest, it's a bit annoying that this is one issue we're stuck
on, because "quicksort with spillover" is clearly of less importance
overall. (This is a distinct issue from the issue of not using a
replacement selection style heap for the first run much of the time,
which seems to be a discussion about whether and to what extent the
*traditional* advantages of replacement selection hold today, as
opposed to a discussion about a very specific crossover point in my
patch.)

>> Really? Just try it with a heap that is not tiny. Performance tanks.
>> The fact that replacement selection can produce one long run then
>> becomes a liability, not a strength. With a work_mem of something like
>> 1GB, it's *extremely* painful.
>
> I'm not sure exactly what you think I should try.  I think a couple of
> people have expressed the concern that your patch might regress things
> on data that is all in order, but I'm not sure if you think I should
> try that case or some case that is not-quite-in-order.  "I don't see
> that you've justified that statement" is referring to the fact that
> you presented no evidence in your original post that it's important to
> sometimes use quicksorting even for run #1.  If you've provided some
> test data illustrating that point somewhere, I'd appreciate a pointer
> back to it.

I think that the answer to what you should try is simple: Any case
involving a large heap (say, a work_mem of 1GB). No other factor like
correlation seems to change the conclusion about that being generally
bad.

If you have a correlation, then that is *worse* if "quicksort with
spillover" always has us use a heap for the first run, because it
prolongs the pain of using the cache inefficient heap (note that this
is an observation about "quicksort with spillover" in particular, and
not replacement selection in general). The problem you'll see is that
there is a large heap which is __slow__ to spill from, and that's
pretty obvious with or without a correlation. In general it seems
unlikely that having one long run during the merge (i.e. no merge --
seen by having the heap build one long run because we got "lucky" and
"quicksort with spillover" encountered a correlation) can ever hope to
make up for this.

It *could* still make up for it if:

1. There isn't much to make up for in the first place, because the
heap is CPU cache resident. Testing this with a work_mem that is the
same size as CPU L3 cache seems a bit pointless to me, and I think
we've seen that a few times.

and:

2. There are many passes required without a replacement selection
heap, because the volume of data is just so much greater than the low
work_mem setting. Replacement selection makes the critical difference
because there is a correlation, perhaps strong enough to make it one
or two runs rather than, say, 10 or 20 or 100.

I've already mentioned many times that linear growth in the size of
work_mem sharply reduces the need for additional passes during the
merge phase (the observation about quadratic growth that I won't
repeat). These days, it's hard to recommend anything other than "use
more memory" to someone trying to use 4MB to sort 10GB of data. Yeah,
it would also be faster to use replacement selection for the first run
in the hope of getting lucky (actually lucky this time; no quotes),
but it's hard to imagine that that's going to be a better option, no
matter how frugal the user is. Helping users recognize when they could
use more memory effectively seems like the best strategy. That was the
idea behind multipass_warning, but you didn't like that (Greg Stark
was won over on the multipass_warning warning, though). I hope we can
offer something roughly like that at some point (a view?), because it
makes sense.

> How about always starting with replacement selection, but limiting the
> amount of memory that can be used with replacement selection to some
> small value?  It could be a separate GUC, or a hard-coded constant
> like 20MB if we're fairly confident that the same value will be good
> for everyone.  If the tuples aren't in order, then we'll pretty
> quickly come to the end of the first run and switch to quicksort.

This seems acceptable, although note that we don't have to decide
until we reach the work_mem limit, and not before.

If you want to use a heap for the first run, I'm not excited about the
idea, but if you insist then I'm glad that you at least propose to
limit it to the kind of cases that we *actually* saw regressed (i.e.
low work_mem settings -- like the default work_mem setting, 4MB).
We've seen no actual case with a larger work_mem that is advantaged by
using a heap, even *with* a strong correlation (this is actually
*worst of all*); that's where I am determined to avoid using a heap
automatically.

It wasn't my original insight that replacement selection has become
all but obsolete. It took me a while to come around to that point of
view. One 2014 SIGMOD paper says of replacement selection sort:

"Finally, there has been very little interest in replacement selection
sort and its variants over the last 15 years. This is easy to
understand when one considers that the previous goal of replacement
selection sort was to reduce the number of external memory passes to
2."

> If we do end up using replacement selection for the whole sort, the
> smaller heap is an advantage.  What I like about this sort of thing is
> that it adds no reliance on any estimate; it's fully self-tuning.

Fine, but the point of "quicksort with spillover" is that it avoids
I/O entirely. I'm not promoting it as useful for any of the reasons
that replacement selection was traditionally useful (on 1970s
hardware). So, we aren't much closer to working out a better cost
model for "quicksort with spillover" (I guess you weren't really
talking about that, though), an annoying sticking point (as already
mentioned).

-- 
Peter Geoghegan

Re: Using quicksort for every external sort run

From

Peter Geoghegan

Date:

04 February 2016, 11:14:12

On Thu, Feb 4, 2016 at 1:46 AM, Peter Geoghegan <pg@heroku.com> wrote:
> It wasn't my original insight that replacement selection has become
> all but obsolete. It took me a while to come around to that point of
> view.

Nyberg et al may have said it best in 1994, in the Alphasort Paper [1]:

"By comparison, OpenVMS sort uses a pure replacement-selection sort to
generate runs (Knuth, 1973). Replacement-selection is best for a
memory-constrained environment. On average, replacement-selection
generates runs that are twice as large as available memory, while the
QuickSort runs are typically less than half of available memory.
However, in a memory-rich environment, QuickSort is faster because it
is simpler, makes fewer exchanges on average, and has superior address
locality to exploit processor caching. "

(I believe that the authors state that "QuickSort runs are typically
less than half of available memory" because of the use of explicit
asynchronous I/O in each thread, which doesn't apply to us).

The paper also has very good analysis of the economics of sorting:

"Even for surprisingly large sorts, it is economical to perform the
sort in one pass."

Of course, memory capacities have scaled enormously in the 20 years
since this analysis was performed, so the analysis applies even at the
very low end these days. The high capacity memory system that they
advocate to get a one pass sort (instead of having faster disks) had
100MB of memory, which is of course tiny by contemporary standards. If
you pay Heroku $7 a month, you get a "Hobby Tier" database with 512MB
of memory. The smallest EC2 instance size, the t2.nano, costs about
$1.10 to run for one week, and has 0.5GB of memory.

The economics of using 4MB or even 20MB to sort 10GB of data are
already preposterously bad for everyone that runs a database server,
no matter how budget conscious they may be. I can reluctantly accept
that we need to still use a heap with very low work_mem settings to
avoid the risk of a regression (in the event of a strong correlation)
on general principle, but I'm well justified in proposing "just don't
do that" as the best practical advice.

I thought I had your agreement on that point, Robert; is that actually the case?

[1] http://www.cs.berkeley.edu/~rxin/db-papers/alphasort.pdf

-- 
Peter Geoghegan

Re: Using quicksort for every external sort run

From

Robert Haas

Date:

05 February 2016, 17:31:21

On Thu, Feb 4, 2016 at 6:14 AM, Peter Geoghegan <pg@heroku.com> wrote:
> The economics of using 4MB or even 20MB to sort 10GB of data are
> already preposterously bad for everyone that runs a database server,
> no matter how budget conscious they may be. I can reluctantly accept
> that we need to still use a heap with very low work_mem settings to
> avoid the risk of a regression (in the event of a strong correlation)
> on general principle, but I'm well justified in proposing "just don't
> do that" as the best practical advice.
>
> I thought I had your agreement on that point, Robert; is that actually the case?

Peter and I spent a few hours talking on Skype this morning about this
point and I believe we have agreed on an algorithm that I think will
address all of my concerns and hopefully also be acceptable to him.
Peter, please weigh in and let me know if I've gotten anything
incorrect here or if you think of other concerns afterwards.

The basic idea is that we will add a new GUC with a name like
replacement_sort_mem that will have a default value in the range of
20-30MB; or possibly we will hardcode this value, but for purposes of
this email I'm going to assume it's a GUC.  If the value of work_mem
or maintenance_work_mem, whichever applies, is smaller than the value
of replacement_sort_mem, then the latter has no effect.  However, if
replacement_sort_mem is the smaller value, then the amount of memory
that can be used for a heap with replacement selection is limited to
replacement_sort_mem: we can use more memory than that in total for
the sort, but the amount that can be used for a heap is restricted to
that value.  The way we do this is explained in more detail below.
One thing I just thought of (after the call) is that it might be
better for this GUC to be in units of tuples rather than in units of
memory; it's not clear to me why the optimal heap size should be
dependent on the tuple size, so we could have a threshold like 300,000
tuples or whatever.   But that's a secondary issue and I might be
wrong about it: the point is that in order to have a chance of
winning, a heap used for replacement selection needs to be not very
big at all by the standards of modern hardware, so the plan is to
limit it to a size at which it may have a chance.

Here's how that will work, assuming Peter and I understand each other:

1. We start reading the input data.  If we reach the end of the input
data before (maintenance_)work_mem is exhausted, then we can simply
quicksort the data and we're done.  This is no different than what we
already do today.

2. If (maintenance_)work_mem fills up completely, we will quicksort
all of the data we have in memory.  We will then regard the tail end
of that sorted data, in an amount governed by replacement_sort_mem, as
a heap, and use it to perform replacement selection until no tuples
remain for the current run.  Meanwhile, the rest of the sorted data
remains in memory untouched.  Logically, we're constructing a run of
tuples which is split between memory and disk: the head of the run
(what fits in all of (maintenance_)work_mem except for
replacement_sort_mem) is in memory, and the tail of the run is on
disk.

3. If we reach the end of input before replacement selection runs out
of tuples for the current run, and if it finds no tuples for the next
run prior to that time, then we are done.  All of the tuples form a
single run and we can return the tuples in memory first followed by
the tuples on disk.  This case is highly likely to be a huge win over
what we have today, because (a) some portion of the tuples were sorted
via quicksort rather than heapsort and that's faster, (b) the tuples
that were sorted using a heap were sorted using a small heap rather
than a big one, and (c) we only wrote out the minimal number of tuples
to tape instead of, as we would have done today, all of them.

4. If we reach this step, then replacement selection with a small heap
wasn't able to sort the input in a single run.  We have a bunch of
sorted data in memory which is the head of the same run whose tail is
already on disk; we now spill all of these tuples to disk.  That
leaves only the heapified tuples in memory.  We just ignore the fact
that they are a heap and treat them as unsorted.  We repeatedly do the
following: read tuples until work_mem is full, sort them, and dump the
result to disk as a run.  When all runs have been created, we merge
runs just as we do today.

This algorithm seems very likely to beat what we do today in
practically all cases.  The benchmarking Peter and others have already
done shows that building runs with quicksort rather than replacement
selection can often win even if the larger number of tapes requires a
multi-pass merge.  The only cases where it didn't seem to be a clear
win involved data that was already in sorted order, or very close to
it.  But with this algorithm, presorted input is fine: we'll quicksort
some of it (which is faster than replacement selection because
quicksort checks for presorted input) and sort the rest with a *small*
heap (which is faster than our current approach of sorting it with a
big heap when the data is already in order).  On top of that, we'll
only write out the minimal amount of data to disk rather than all of
it.  So we should still win.  On the other hand, if the data is out of
order, then we will do only a little bit of replacement selection
before switching over to building runs by quicksorting, which should
also win.

The worst case I was able to think of for this algorithm is an input
stream that is larger than work_mem and almost sorted: the only
exception is that the record that should be exactly in the middle is
all the way at the end.  In that case, today's code will use a large
heap and will consequently produce only a single run.  The algorithm
above will end up producing two runs, the second containing only that
one tuple.  That means we're going to incur the additional cost of a
merge pass.  On the other hand, we're also going to have substantial
savings to offset that - the building-runs stage will save by using
quicksort for some of the data and a small heap for the rest.  So the
cost to merge the runs will be at least partially, maybe completely,
offset by reduced time spent building them.  Furthermore, Peter has
got other improvements in the patch which also make merging faster, so
if we don't buy enough building the runs to completely counterbalance
the cost of the merge, well, we may still win for that reason.  Even
if not, this is so much faster overall that a regression in some sort
of constructed worst case isn't really important.  I feel that
presorted input is a sufficiently common case that we should try hard
not to regress it - but presorted input with the middle value moved to
the end is not.  We need to not be horrible in that case, but there's
absolutely no reason to believe that we will be.  We may even be
faster, but we certainly shouldn't be abysmally slower.

Doing it this way also avoids the need to have a cost model that makes
decisions on how to sort based on the anticipated size of the input.
I'm really very happy about that, because I feel that any such cost
model, no matter how good, is a risk: estimation errors are not
uncommon.  Maybe a really sturdy cost model would be OK in the end,
but not needing one is better.  We don't need to fear burning a lot of
time on replacement selection, because the heap is small - any
significant amount of out-of-order data will cause us to switch to the
main algorithm, which is building runs by quicksorting.  The decision
is made based on the actual data we see rather than any estimate.
There's only one potentially tunable parameter - replacement_sort_mem
- but it probably won't hurt you very much even if it's wrong by a
factor of two - and there's no reason to believe that value is going
to be very different on one machine than another.  So this seems like
it should be pretty robust.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Using quicksort for every external sort run

From

Peter Geoghegan

Date:

07 February 2016, 16:00:36

On Fri, Feb 5, 2016 at 9:31 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> Peter, please weigh in and let me know if I've gotten anything
> incorrect here or if you think of other concerns afterwards.

Right. Let me give you the executive summary first: I continue to
believe, following thinking about the matter in detail, that this is a
sensible compromise, that weighs everyone's concerns. It is pretty
close to a win-win. I just need you to confirm what I say here in
turn, so we're sure that we understand each other perfectly.

> The basic idea is that we will add a new GUC with a name like
> replacement_sort_mem that will have a default value in the range of
> 20-30MB; or possibly we will hardcode this value, but for purposes of
> this email I'm going to assume it's a GUC.  If the value of work_mem
> or maintenance_work_mem, whichever applies, is smaller than the value
> of replacement_sort_mem, then the latter has no effect.

By "no effect", you must mean that we always use a heap for the entire
first run (albeit for the tail, with a hybrid quicksort/heap
approach), but still use quicksort for every subsequent run, when it's
clearly established that we aren't going to get one huge run. Is that
correct?

It was my understanding, based on your emphasis on producing only a
single run, as well as your recent remarks on this thread about the
first run being special, that you are really only interested in the
presorted case, where one run is produced. That is, you are basically
not interested in preserving the general ability of replacement
selection to double run size in the event of a uniform distribution.
(That particular doubling property of replacement selection is now
technically lost by virtue of using this new hybrid model *anyway*,
although it will still make runs longer in general).

You don't want to change the behavior of the current patch for the
second or subsequent run; that should remain a quicksort, pure and
simple. Do I have that right?

BTW, parallel sort should probably never use a heap anyway (ISTM that
that will almost certainly be based on external sorts in the end). A
heap is not really compatible with the parallel heap scan model.

> One thing I just thought of (after the call) is that it might be
> better for this GUC to be in units of tuples rather than in units of
> memory; it's not clear to me why the optimal heap size should be
> dependent on the tuple size, so we could have a threshold like 300,000
> tuples or whatever.

I think you're right that a number of tuples is the logical way to
express the heap size (as a GUC unit). I think that the ideal setting
for the GUC is large enough to recognize significant correlations in
input data, which may be clustered, but no larger (at least while
things don't all fit in L1 cache, or maybe L2 cache). We should "go
for broke" with replacement selection -- we don't aim for anything
less than ending up with 1 run by using the heap (merging 2 or 3 runs
rather than 4 or 6 is far less useful, maybe harmful, when one of them
is much larger). Therefore, I don't expect that we'll be practically
disadvantaged by having fewer "hands to juggle" tuples here (we'll
simply almost always have enough in practice -- more on that later).
FWIW I don't think that any benchmark we've seen so far justifies
doing less than "going for broke" with RS, even if you happen to have
a very conservative perspective.

One advantage of a GUC is that you can set it to zero, and always get
a simple hybrid sort-merge strategy if that's desirable. I think that
it might not matter much with multi-gigabyte work_mem settings anyway,
though; you'll just see a small blip. Big (maintenance_)work_mem was
by far my greatest concern in relation to using a heap in general, so
I'm left pretty happy by this plan, I think. Lots of people can afford
a multi-GB maintenance_work_mem these days, and CREATE INDEX is gonna
be the most important case overall, by far.

> 2. If (maintenance_)work_mem fills up completely, we will quicksort
> all of the data we have in memory.  We will then regard the tail end
> of that sorted data, in an amount governed by replacement_sort_mem, as
> a heap, and use it to perform replacement selection until no tuples
> remain for the current run.  Meanwhile, the rest of the sorted data
> remains in memory untouched.  Logically, we're constructing a run of
> tuples which is split between memory and disk: the head of the run
> (what fits in all of (maintenance_)work_mem except for
> replacement_sort_mem) is in memory, and the tail of the run is on
> disk.

I went back and forth on this during our call, but I now think that I
was right that there will need to be changes in order to make the tail
of the run a heap (*not* the quicksorted head), because routines like
tuplesort_heap_siftup() assume that state->memtuples[0] is the head of
the heap. This is currently assumed by the master branch for both the
currentRun/nextRun replacement selection heap, as well as the heap
used for merging. Changing this is probably fairly manageable, though
(probably still not going to use memmove() for this, contrary to my
remarks on the call).

> 3. If we reach the end of input before replacement selection runs out
> of tuples for the current run, and if it finds no tuples for the next
> run prior to that time, then we are done.  All of the tuples form a
> single run and we can return the tuples in memory first followed by
> the tuples on disk.  This case is highly likely to be a huge win over
> what we have today, because (a) some portion of the tuples were sorted
> via quicksort rather than heapsort and that's faster, (b) the tuples
> that were sorted using a heap were sorted using a small heap rather
> than a big one, and (c) we only wrote out the minimal number of tuples
> to tape instead of, as we would have done today, all of them.

Agreed.

> 4. If we reach this step, then replacement selection with a small heap
> wasn't able to sort the input in a single run.  We have a bunch of
> sorted data in memory which is the head of the same run whose tail is
> already on disk; we now spill all of these tuples to disk.  That
> leaves only the heapified tuples in memory.  We just ignore the fact
> that they are a heap and treat them as unsorted.  We repeatedly do the
> following: read tuples until work_mem is full, sort them, and dump the
> result to disk as a run.  When all runs have been created, we merge
> runs just as we do today.

Right, so: having read this far, I'm almost sure that you intend that
replacement selection is only ever used for the first run (we "go for
broke" with RS). Good.

> This algorithm seems very likely to beat what we do today in
> practically all cases.  The benchmarking Peter and others have already
> done shows that building runs with quicksort rather than replacement
> selection can often win even if the larger number of tapes requires a
> multi-pass merge.  The only cases where it didn't seem to be a clear
> win involved data that was already in sorted order, or very close to
> it.

...*and* where there was an awful lot of data, *and* where there was
very little memory in an absolute sense (e.g. work_mem = 4MB).

> But with this algorithm, presorted input is fine: we'll quicksort
> some of it (which is faster than replacement selection because
> quicksort checks for presorted input) and sort the rest with a *small*
> heap (which is faster than our current approach of sorting it with a
> big heap when the data is already in order).

I'm not going to defend the precheck in our quicksort implementation.
It's unadulterated nonsense. The B&M quicksort implementation's use of
insertion sort does accomplish this pretty well, though.

> On top of that, we'll
> only write out the minimal amount of data to disk rather than all of
> it.  So we should still win.  On the other hand, if the data is out of
> order, then we will do only a little bit of replacement selection
> before switching over to building runs by quicksorting, which should
> also win.

Yeah -- we retain much of the benefit of "quicksort with spillover",
too, without any cost model. This is also better than "quicksort with
spillover" in that it limits the size of the heap, and so limits the
extent to which the algorithm can "helpfully" spend ages spilling from
an enormous heap. The new GUC can be explained to users as a kind of
minimum burst capacity for getting a "half internal, half external"
sort, which seems intuitive enough.

> The worst case I was able to think of for this algorithm is an input
> stream that is larger than work_mem and almost sorted: the only
> exception is that the record that should be exactly in the middle is
> all the way at the end.

> We need to not be horrible in that case, but there's
> absolutely no reason to believe that we will be.  We may even be
> faster, but we certainly shouldn't be abysmally slower.

Agreed.

If we take a historical perspective, a 10MB or 30MB heap will still
have a huge "juggling capacity" -- in practice it will almost
certainly store enough tuples to make the "plate spinning circus
trick" of replacement selection make the critical difference to run
size. This new GUC is a delta between tuples for RS reordering. You
can perhaps construct a "strategically placed banana skin" case to
make this look bad before caching effects start to weigh us down, but
I think you agree that it doesn't matter. "Juggling capacity" has
nothing to do with modern hardware characteristics, except that modern
machines are where the cost of excessive "juggling capacity" really
hurts, so this is simple. It is simple *especially* because we can
throw out the idea of a cost model that cares about caching effects in
particular, but that's just one specific thing.

BTW, you probably know this, but to be clear: When I talk about
correlation, I refer specifically to what would appear within
pg_stats.correlation as 1.0 -- I am not referring to a
pg_stats.correlation of -1.0. The latter case is traditionally
considered a worst case for RS.

-- 
Peter Geoghegan

Re: Using quicksort for every external sort run

From

Robert Haas

Date:

07 February 2016, 17:21:27

On Sun, Feb 7, 2016 at 11:00 AM, Peter Geoghegan <pg@heroku.com> wrote:
> Right. Let me give you the executive summary first: I continue to
> believe, following thinking about the matter in detail, that this is a
> sensible compromise, that weighs everyone's concerns. It is pretty
> close to a win-win. I just need you to confirm what I say here in
> turn, so we're sure that we understand each other perfectly.

Makes sense to me.

>> The basic idea is that we will add a new GUC with a name like
>> replacement_sort_mem that will have a default value in the range of
>> 20-30MB; or possibly we will hardcode this value, but for purposes of
>> this email I'm going to assume it's a GUC.  If the value of work_mem
>> or maintenance_work_mem, whichever applies, is smaller than the value
>> of replacement_sort_mem, then the latter has no effect.
>
> By "no effect", you must mean that we always use a heap for the entire
> first run (albeit for the tail, with a hybrid quicksort/heap
> approach), but still use quicksort for every subsequent run, when it's
> clearly established that we aren't going to get one huge run. Is that
> correct?

Yes.

> It was my understanding, based on your emphasis on producing only a
> single run, as well as your recent remarks on this thread about the
> first run being special, that you are really only interested in the
> presorted case, where one run is produced. That is, you are basically
> not interested in preserving the general ability of replacement
> selection to double run size in the event of a uniform distribution.
> (That particular doubling property of replacement selection is now
> technically lost by virtue of using this new hybrid model *anyway*,
> although it will still make runs longer in general).
>
> You don't want to change the behavior of the current patch for the
> second or subsequent run; that should remain a quicksort, pure and
> simple. Do I have that right?

Yes.

> BTW, parallel sort should probably never use a heap anyway (ISTM that
> that will almost certainly be based on external sorts in the end). A
> heap is not really compatible with the parallel heap scan model.

I don't think I agree with this part, though I think it's unimportant
as far as the current patch is concerned.  My initial thought is that
parallel sort should work like this:

1. Each worker reads and sorts its input tuples just as it would in
non-parallel mode.

2. If, at the conclusion of the sort, the input tuples are still in
memory (quicksort) or partially in memory (quicksort with spillover),
then write them all to a tape.  If they are on multiple tapes, merge
those to a single tape.  If they are on a single tape, do nothing else
at this step.

3. At this point, we have one sorted tape per worker.  Perform a final
merge pass to get the final result.

The major disadvantage of this is that if the input hasn't been
relatively evenly partitioned across the workers, the work of sorting
will fall disproportionately on those that got more input.  We could,
in the future, make the logic more sophisticated.  For example, if
worker A is still reading the input and dumping sorted runs, worker B
could start merging those runs.  Or worker A could read tuples into a
DSM instead of backend-private memory, and worker B could then sort
them to produce a run.  While such optimizations are clearly
beneficial, I would not try to put them into a first parallel sort
patch.  It's too complicated.

>> One thing I just thought of (after the call) is that it might be
>> better for this GUC to be in units of tuples rather than in units of
>> memory; it's not clear to me why the optimal heap size should be
>> dependent on the tuple size, so we could have a threshold like 300,000
>> tuples or whatever.
>
> I think you're right that a number of tuples is the logical way to
> express the heap size (as a GUC unit). I think that the ideal setting
> for the GUC is large enough to recognize significant correlations in
> input data, which may be clustered, but no larger (at least while
> things don't all fit in L1 cache, or maybe L2 cache). We should "go
> for broke" with replacement selection -- we don't aim for anything
> less than ending up with 1 run by using the heap (merging 2 or 3 runs
> rather than 4 or 6 is far less useful, maybe harmful, when one of them
> is much larger). Therefore, I don't expect that we'll be practically
> disadvantaged by having fewer "hands to juggle" tuples here (we'll
> simply almost always have enough in practice -- more on that later).
> FWIW I don't think that any benchmark we've seen so far justifies
> doing less than "going for broke" with RS, even if you happen to have
> a very conservative perspective.
>
> One advantage of a GUC is that you can set it to zero, and always get
> a simple hybrid sort-merge strategy if that's desirable. I think that
> it might not matter much with multi-gigabyte work_mem settings anyway,
> though; you'll just see a small blip. Big (maintenance_)work_mem was
> by far my greatest concern in relation to using a heap in general, so
> I'm left pretty happy by this plan, I think. Lots of people can afford
> a multi-GB maintenance_work_mem these days, and CREATE INDEX is gonna
> be the most important case overall, by far.

Agreed.  I suspect that a default setting that is relatively small but
not zero will be good for most people, but if some people find
advantage in changing it to a smaller value, or zero, or a larger
value, that's fine with me.

>> 2. If (maintenance_)work_mem fills up completely, we will quicksort
>> all of the data we have in memory.  We will then regard the tail end
>> of that sorted data, in an amount governed by replacement_sort_mem, as
>> a heap, and use it to perform replacement selection until no tuples
>> remain for the current run.  Meanwhile, the rest of the sorted data
>> remains in memory untouched.  Logically, we're constructing a run of
>> tuples which is split between memory and disk: the head of the run
>> (what fits in all of (maintenance_)work_mem except for
>> replacement_sort_mem) is in memory, and the tail of the run is on
>> disk.
>
> I went back and forth on this during our call, but I now think that I
> was right that there will need to be changes in order to make the tail
> of the run a heap (*not* the quicksorted head), because routines like
> tuplesort_heap_siftup() assume that state->memtuples[0] is the head of
> the heap. This is currently assumed by the master branch for both the
> currentRun/nextRun replacement selection heap, as well as the heap
> used for merging. Changing this is probably fairly manageable, though
> (probably still not going to use memmove() for this, contrary to my
> remarks on the call).

OK.  I think if possible we want to try to do this by changing the
Tuplesortstate to identify where the heap is, rather than by using
memmove() to put it where we want it to be.

>> 3. If we reach the end of input before replacement selection runs out
>> of tuples for the current run, and if it finds no tuples for the next
>> run prior to that time, then we are done.  All of the tuples form a
>> single run and we can return the tuples in memory first followed by
>> the tuples on disk.  This case is highly likely to be a huge win over
>> what we have today, because (a) some portion of the tuples were sorted
>> via quicksort rather than heapsort and that's faster, (b) the tuples
>> that were sorted using a heap were sorted using a small heap rather
>> than a big one, and (c) we only wrote out the minimal number of tuples
>> to tape instead of, as we would have done today, all of them.
>
> Agreed.

Cool.

>> 4. If we reach this step, then replacement selection with a small heap
>> wasn't able to sort the input in a single run.  We have a bunch of
>> sorted data in memory which is the head of the same run whose tail is
>> already on disk; we now spill all of these tuples to disk.  That
>> leaves only the heapified tuples in memory.  We just ignore the fact
>> that they are a heap and treat them as unsorted.  We repeatedly do the
>> following: read tuples until work_mem is full, sort them, and dump the
>> result to disk as a run.  When all runs have been created, we merge
>> runs just as we do today.
>
> Right, so: having read this far, I'm almost sure that you intend that
> replacement selection is only ever used for the first run (we "go for
> broke" with RS). Good.

Yes, absolutely.

>> This algorithm seems very likely to beat what we do today in
>> practically all cases.  The benchmarking Peter and others have already
>> done shows that building runs with quicksort rather than replacement
>> selection can often win even if the larger number of tapes requires a
>> multi-pass merge.  The only cases where it didn't seem to be a clear
>> win involved data that was already in sorted order, or very close to
>> it.
>
> ...*and* where there was an awful lot of data, *and* where there was
> very little memory in an absolute sense (e.g. work_mem = 4MB).
>
>> But with this algorithm, presorted input is fine: we'll quicksort
>> some of it (which is faster than replacement selection because
>> quicksort checks for presorted input) and sort the rest with a *small*
>> heap (which is faster than our current approach of sorting it with a
>> big heap when the data is already in order).
>
> I'm not going to defend the precheck in our quicksort implementation.
> It's unadulterated nonsense. The B&M quicksort implementation's use of
> insertion sort does accomplish this pretty well, though.

We'll leave that discussion for another day so as not to argue about it now.

>> On top of that, we'll
>> only write out the minimal amount of data to disk rather than all of
>> it.  So we should still win.  On the other hand, if the data is out of
>> order, then we will do only a little bit of replacement selection
>> before switching over to building runs by quicksorting, which should
>> also win.
>
> Yeah -- we retain much of the benefit of "quicksort with spillover",
> too, without any cost model. This is also better than "quicksort with
> spillover" in that it limits the size of the heap, and so limits the
> extent to which the algorithm can "helpfully" spend ages spilling from
> an enormous heap. The new GUC can be explained to users as a kind of
> minimum burst capacity for getting a "half internal, half external"
> sort, which seems intuitive enough.

Right.  I really like the idea of limiting the heap size - I'm quite
hopeful that will let us hang onto the limited number of cases where
RS is better while giving up on it pretty quickly when it's a loser.
But even better, if you've got a case where RS is a win, limiting the
heap size has an excellent chance of making it a bigger win.  That's
quite appealing, too.

>> The worst case I was able to think of for this algorithm is an input
>> stream that is larger than work_mem and almost sorted: the only
>> exception is that the record that should be exactly in the middle is
>> all the way at the end.
>
>> We need to not be horrible in that case, but there's
>> absolutely no reason to believe that we will be.  We may even be
>> faster, but we certainly shouldn't be abysmally slower.
>
> Agreed.
>
> If we take a historical perspective, a 10MB or 30MB heap will still
> have a huge "juggling capacity" -- in practice it will almost
> certainly store enough tuples to make the "plate spinning circus
> trick" of replacement selection make the critical difference to run
> size. This new GUC is a delta between tuples for RS reordering. You
> can perhaps construct a "strategically placed banana skin" case to
> make this look bad before caching effects start to weigh us down, but
> I think you agree that it doesn't matter. "Juggling capacity" has
> nothing to do with modern hardware characteristics, except that modern
> machines are where the cost of excessive "juggling capacity" really
> hurts, so this is simple. It is simple *especially* because we can
> throw out the idea of a cost model that cares about caching effects in
> particular, but that's just one specific thing.

Yep.  I'm mostly relying on you to be correct about the actual
performance characteristics of replacement selection here.  If the
cutover point when we go from RS to QS to build runs turns out to be
wildly wrong, I plan to look sidelong in your direction.  I don't
think that's going to happen, though.

> BTW, you probably know this, but to be clear: When I talk about
> correlation, I refer specifically to what would appear within
> pg_stats.correlation as 1.0 -- I am not referring to a
> pg_stats.correlation of -1.0. The latter case is traditionally
> considered a worst case for RS.

Makes sense.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Using quicksort for every external sort run

From

Greg Stark

Date:

07 February 2016, 18:52:40

On Sun, Feb 7, 2016 at 8:21 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sun, Feb 7, 2016 at 11:00 AM, Peter Geoghegan <pg@heroku.com> wrote:
> > It was my understanding, based on your emphasis on producing only a
> > single run, as well as your recent remarks on this thread about the
> > first run being special, that you are really only interested in the
> > presorted case, where one run is produced. That is, you are basically
> > not interested in preserving the general ability of replacement
> > selection to double run size in the event of a uniform distribution.
>...
> > You don't want to change the behavior of the current patch for the
> > second or subsequent run; that should remain a quicksort, pure and
> > simple. Do I have that right?
>
> Yes.

I'm not even sure this is necessary. The idea of missing out on
producing a single sorted run sounds bad but in practice since we
normally do the final merge on the fly there doesn't seem like there's
really any difference between reading one tape or reading two or three
tapes when outputing the final results. There will be the same amount
of I/O happening and a 2-way or 3-way merge for most data types should
be basically free.

On Sun, Feb 7, 2016 at 8:21 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> 3. At this point, we have one sorted tape per worker.  Perform a final
> merge pass to get the final result.

I don't even think you have to merge until you get one tape per
worker. You can statically decide how many tapes you can buffer in
memory based on work_mem and merge until you get N/workers tapes so
that a single merge in the gather node suffices. I would expect that
to nearly always mean the workers are only responsible for generating
the initial sorted runs and the single merge pass is done in the
gather node on the fly as the tuples are read.

-- 
greg

Re: Using quicksort for every external sort run

From

Peter Geoghegan

Date:

08 February 2016, 00:50:54

On Sun, Feb 7, 2016 at 10:51 AM, Greg Stark <stark@mit.edu> wrote:
>> > You don't want to change the behavior of the current patch for the
>> > second or subsequent run; that should remain a quicksort, pure and
>> > simple. Do I have that right?
>>
>> Yes.
>
> I'm not even sure this is necessary. The idea of missing out on
> producing a single sorted run sounds bad but in practice since we
> normally do the final merge on the fly there doesn't seem like there's
> really any difference between reading one tape or reading two or three
> tapes when outputing the final results. There will be the same amount
> of I/O happening and a 2-way or 3-way merge for most data types should
> be basically free.

I basically agree with you, but it seems possible to fix the
regression (generally misguided though those regressed cases are).
It's probably easiest to just fix it.

-- 
Peter Geoghegan

Re: Using quicksort for every external sort run

From

Peter Geoghegan

Date:

08 February 2016, 02:57:10

On Sun, Feb 7, 2016 at 4:50 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> I'm not even sure this is necessary. The idea of missing out on
>> producing a single sorted run sounds bad but in practice since we
>> normally do the final merge on the fly there doesn't seem like there's
>> really any difference between reading one tape or reading two or three
>> tapes when outputing the final results. There will be the same amount
>> of I/O happening and a 2-way or 3-way merge for most data types should
>> be basically free.
>
> I basically agree with you, but it seems possible to fix the
> regression (generally misguided though those regressed cases are).
> It's probably easiest to just fix it.

On a related note, we should probably come up with a way of totally
supplanting the work_mem model with something smarter in the next
couple of years. Something that treats memory as a shared resource
even when it's allocated privately, per-process. This external sort
stuff really smooths out the cost function of sorts. ISTM that that
makes the idea of dynamic memory budgets (in place of a one size fits
all work_mem) seem desirable for the first time. That said, I really
don't have a good sense of how to go about moving in that direction at
this point. It seems less than ideal that DBAs have to be so
conservative in sizing work_mem.

-- 
Peter Geoghegan

Re: Using quicksort for every external sort run

From

Peter Geoghegan

Date:

15 February 2016, 04:01:39

On Sun, Feb 7, 2016 at 4:50 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> I'm not even sure this is necessary. The idea of missing out on
>> producing a single sorted run sounds bad but in practice since we
>> normally do the final merge on the fly there doesn't seem like there's
>> really any difference between reading one tape or reading two or three
>> tapes when outputing the final results. There will be the same amount
>> of I/O happening and a 2-way or 3-way merge for most data types should
>> be basically free.
>
> I basically agree with you, but it seems possible to fix the
> regression (generally misguided though those regressed cases are).
> It's probably easiest to just fix it.

Here is a benchmark on my laptop:

$ pgbench -i -s 500 --unlogged

This results in a ~1GB accounts PK:

postgres=# \di+ pgbench_accounts_pkey
List of relations
─[ RECORD 1 ]──────────────────────
Schema      │ public
Name        │ pgbench_accounts_pkey
Type        │ index
Owner       │ pg
Table       │ pgbench_accounts
Size        │ 1071 MB
Description │

The query I'm testing is: "reindex index pgbench_accounts_pkey;"

Now, with a maintenance_work_mem of 5MB, the most recent revision of
my patch takes about 54.2 seconds to complete this, as compared to
master's 44.4 seconds. So, clearly a noticeable regression there of
just under 20%. I did not see a regression with a 5MB
maintenance_work_mem when pgbench scale was 100, though. And, with the
default maintenance_work_mem of 64MB, it's a totally different story
-- my patch takes about 28.3 seconds, whereas master takes 48.5
seconds (i.e. longer than with 5MB). My patch needs a 56-way final
merge with the 64MB maintenance_work_mem case, and 47 distinct merge
steps, plus a final on-the-fly merge for the 5MB maintenance_work_mem
case. So, a huge amount of merging, but RS still hardly pays for
itself. With the regressed case for my patch, we finish sorting *runs*
about 15 seconds in to a 54.2 second operation -- very early. So it
isn't "quicksort vs replacement selection", so much as "polyphase
merge vs replacement selection". There is a good reason to think that
we can make progress on fixing that regression by doubling down on the
general strategy of improving cache characteristics, and being
cleverer about memory use during non-final merging, too.

I looked at what it would take to make the heap a smaller part of
memtuples, along the lines Robert and I talked about, and I think it's
non-trivial because it needs to make the top of the heap something
other than memtuples[0]. I'd need to change the heap code, which
already has 3 reasons for existing (RS, merging, and top-N heap). I'll
find it really hard to justify the effort, and especially the risk of
adding bugs, for a benefit that there is *scant* evidence for. My
guess is that the easiest, and most sensible way to fix the ~20%
regression seen here is to introduce batch memory allocation to
non-final merge steps, which is where most time was spent. (For
simplicity, that currently only happens during the final merge phase,
but I could revisit that -- seems not that hard).

Now, I accept that the cost model has to go. So, what I think would be
best is if we still added a GUC, like the replacement_sort_mem
suggestion that Robert made. This would be a threshold for using what
is currently called "quicksort with spillover". There'd be no cost
model. Jeff Janes also suggested something like this.

The only regression that I find concerning is the one reported by Jeff
Janes [1]. That didn't even involve a correlation, though, so no
reason to think that it would be at all helped by what Robert and I
talked about. It seemed like the patch happened to have the effect of
tickling a pre-existing problem with polyphase merge -- what Jeff
called an "anti-sweetspot". Jeff had a plausible theory for why that
is.

So, what if we try to fix polyphase merge? That would be easier. We
could look at the tape buffer size, and the number of tapes, as well
as memory access patterns. We might even make more fundamental changes
to polyphase merge, since we don't use the more advanced variant that
I think correlation is a red herring. Knuth suggests that his
algorithm 5.4.3, cascade merge, has more efficient distribution of
runs.

The bottom line is that there will always be some regression
somewhere. I'm not sure what the guiding principle for when that
becomes unacceptable is, but you did seem sympathetic to the idea that
really low work_mem settings (e.g. 4MB) with really big inputs were
not too concerning [2]. I'm emphasizing Jeff's case now because I,
like you [2], am much more worried about maintenance_work_mem default
cases with regressions than anything else, and that *was* such a case.

Like Jeff Janes, I don't care about his other regression of about 5%
[3], which involved a 4MB work_mem + 100 million tuples. The other
case (the one I do care about) was 64MB +  400 million tuples, and was
a much bigger regression, which is suggestive of the unpredictable
nature of problems with polyphase merge scheduling that Jeff talked
about. Maybe we just got unlucky there, but that risk should not blind
us to the fact that overwhelmingly, replacement selection is the wrong
thing.

I'm sorry that I've reversed myself like this, Robert, but I'm just
not seeing a benefit to what we talked about, but I do see a cost.

[1] http://www.postgresql.org/message-id/CAMkU=1zKBOzkX-nqE-kJFFMyNm2hMGYL9AsKDEUHhwXASsJEbg@mail.gmail.com
[2] http://www.postgresql.org/message-id/CA+TgmoZGFt6BAxW9fYOn82VAf1u=V0ZZx3bXMs79phjg_9NYjQ@mail.gmail.com
[3] http://www.postgresql.org/message-id/CAM3SWZTYneCG1oZiPwRU=J6ks+VpRxt2Da1ZMmqFBrd5jaSJSA@mail.gmail.com

--
Peter Geoghegan

Re: Using quicksort for every external sort run

From

Jim Nasby

Date:

15 February 2016, 20:43:55

On 2/7/16 8:57 PM, Peter Geoghegan wrote:
> It seems less than ideal that DBAs have to be so
> conservative in sizing work_mem.

+10
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com

Re: Using quicksort for every external sort run

From

Greg Stark

Date:

15 February 2016, 23:46:11

On Mon, Feb 15, 2016 at 8:43 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
> On 2/7/16 8:57 PM, Peter Geoghegan wrote:
>>
>> It seems less than ideal that DBAs have to be so
>> conservative in sizing work_mem.
>
>
> +10

I was thinking about this over the past couple weeks. I'm starting to
think the quicksort runs gives at least the beginnings of a way
forward on this front. Keeping in mind that we know how many tapes we
can buffer in memory and the number is likely to be relatively large
-- on the order of 100+ is typical, what if do something like the
following rough sketch:

Give users two knobs, a lower bound "sort in memory using quicksort"
memory size and an upper bound "absolutely never use more than this"
which they can set to a substantial fraction of physical memory. Then
when we overflow the lower bound we start generating runs, the first
one being of that length. Each run we generate we double (or increase
by 50% or something) until we hit the maximum. That means that the
first few runs may be shorter than necessary but we have enough tapes
available that that doesn't hurt much and we'll eventually get to a
large enough run size that we won't run out of tapes and can still do
a single final (on the fly) merge.

In fact what's really attractive about this idea is that it might give
us a reasonable spot to do some global system resource management.
Each time we want to increase the run size we check some shared memory
counter of how much memory is in use and refuse to increase if there's
too much in use (or if we're using too large a fraction of it or some
other heuristic). The key point is that since we don't need to decide
up front at the beginning of the sort and we don't need to track it
continuously there is neither too little nor too much contention on
this shared memory variable. Also the behaviour would be not too
chaotic if there's a user-tunable minimum and the other activity in
the system only controls how more memory it can steal from the global
pool on top of that.

-- 
greg

Re: Using quicksort for every external sort run

From

Peter Geoghegan

Date:

08 March 2016, 06:57:58

On Mon, Feb 15, 2016 at 3:45 PM, Greg Stark <stark@mit.edu> wrote:
> I was thinking about this over the past couple weeks. I'm starting to
> think the quicksort runs gives at least the beginnings of a way
> forward on this front.

As I've already pointed out several times, I wrote a tool that makes
it easy to load sortbenchmark.org data into a PostgreSQL table:

https://github.com/petergeoghegan/gensort

(You should use the Python script that invokes the "gensort" utility
-- see its "--help" display for details).

This seems useful as a standard benchmark, since it's perfectly
deterministic, allowing the user to create arbitrarily large tables to
use for sort benchmarks. Still, it doesn't produce data that is any
way organic; sort data is uniformly distributed. Also, it produces a
table that really only has one attribute to sort on, a text attribute.

I suggest looking at real world data, too. I have downloaded UK land
registry data, which is a freely available dataset about property
sales in the UK since the 1990s, of which there have apparently been
about 20 million (I started with a 20 million line CSV file). I've
used COPY to load the data into one PostgreSQL table.

I attach instructions on how to recreate this, and some suggested
CREATE INDEX statements that seemed representative to me. There are a
variety of Postgres data types in use, including UUID, numeric, and
text. The final Postgres table is just under 3GB. I will privately
make available a URL that those CC'd here can use to download a custom
format dump of the table, which comes in at 1.1GB (ask me off-list if
you'd like to get that URL, but weren't CC'd here). This URL is
provided as a convenience for reviewers, who can skip my detailed
instructions.

An expensive rollup() query on the "land_registry_price_paid_uk" table
is interesting. Example:

select date_trunc('year', transfer_date), county, district, city,
sum(price) from land_registry_price_paid_uk group by rollup (1,
county, district, city);

Performance is within ~5% of an *internal* sort with the patch series
applied, even though ~80% of time is spent copying and sorting
SortTuples overall in the internal sort case (the internal case cannot
overlap sorting and aggregate processing, since it has no final merge
step). This is a nice demonstration of how this work has significantly
blurred the line between internal and external sorts.

--
Peter Geoghegan

Attachment

land-registry-data.txt

Re: Using quicksort for every external sort run

From

Tomas Vondra

Date:

10 March 2016, 13:40:44

Hi,

On Mon, 2015-12-28 at 15:03 -0800, Peter Geoghegan wrote:
> On Fri, Dec 18, 2015 at 11:57 AM, Peter Geoghegan <pg@heroku.com> wrote:
> > BTW, I'm not necessarily determined to make the new special-purpose
> > allocator work exactly as proposed. It seemed useful to prioritize
> > simplicity, and currently so there is one big "huge palloc()" with
> > which we blow our memory budget, and that's it. However, I could
> > probably be more clever about "freeing ranges" initially preserved for
> > a now-exhausted tape. That kind of thing.
> 
> Attached is a revision that significantly overhauls the memory patch,
> with several smaller changes.

I was thinking about running some benchmarks on this patch, but the
thread is pretty huge so I want to make sure I'm not missing something
and this is indeed the most recent version.

Is that the case?

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Using quicksort for every external sort run

From

Peter Geoghegan

Date:

10 March 2016, 18:36:01

On Thu, Mar 10, 2016 at 5:40 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> I was thinking about running some benchmarks on this patch, but the
> thread is pretty huge so I want to make sure I'm not missing something
> and this is indeed the most recent version.

Wait 24 - 48 hours, please. Big update coming.


-- 
Peter Geoghegan

Re: Using quicksort for every external sort run

From

Greg Stark

Date:

10 March 2016, 18:40:32

On Thu, Mar 10, 2016 at 1:40 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> I was thinking about running some benchmarks on this patch, but the
> thread is pretty huge so I want to make sure I'm not missing something
> and this is indeed the most recent version.

I also ran some preliminary benchmarks just before FOSDEM and intend to get back to in after running different benchmarks. These are preliminary because it was only a single run and on a machine that wasn't dedicated for benchmarks. These were comparing the quicksort-all-runs patch against HEAD at the time without the memory management optimizations which I think are independent of the sort algorithm.

It looks to me like the interesting space to test is on fairly small work_mem compared to the data size. There's a general slowdown on 4MB-8MB work_mem when the data set is more than a gigabyte but but even in the worst case it's only a 30% slowdown and the speedup in the more realistic scenarios looks at least as big.

I want to rerun these on a dedicated machine and with trace_sort enabled so that we can see how many merge passes were actually happening and how much I/O was actually happening.

--
greg

Attachment

2016-01-24.png

Re: Using quicksort for every external sort run

From

Peter Geoghegan

Date:

10 March 2016, 19:34:24

On Thu, Mar 10, 2016 at 10:39 AM, Greg Stark <stark@mit.edu> wrote:
> I want to rerun these on a dedicated machine and with trace_sort
> enabled so that we can see how many merge passes were actually
> happening and how much I/O was actually happening.

Putting the results in context, by keeping trace_sort output with the
results is definitely a good idea here. Otherwise, it's almost
impossible to determine what happened after the fact. I have had
"trace_sort = on" in my dev postgresql.conf for some time now. :-)

When I produce my next revision, we should focus on regressions at the
low end, like the 4MB work_mem for multiple GB table size cases you
show here. So, I ask that any benchmarks that you or Tomas do look at
that first and foremost. It's clear that in high memory environments
the patch significantly improves performance, often by as much as
2.5x, and so that isn't really a concern anymore. I think we may be
able to comprehensively address Robert's concerns about regressions
with very little work_mem and lots of data by fixing a problem with
polyphase merge. More to come soon.

-- 
Peter Geoghegan

Re: Using quicksort for every external sort run

From

Peter Geoghegan

Date:

11 March 2016, 02:54:24

On Sun, Feb 14, 2016 at 8:01 PM, Peter Geoghegan <pg@heroku.com> wrote:
> The query I'm testing is: "reindex index pgbench_accounts_pkey;"
>
> Now, with a maintenance_work_mem of 5MB, the most recent revision of
> my patch takes about 54.2 seconds to complete this, as compared to
> master's 44.4 seconds. So, clearly a noticeable regression there of
> just under 20%. I did not see a regression with a 5MB
> maintenance_work_mem when pgbench scale was 100, though.

I've fixed this regression, and possibly all regressions where workMem
> 4MB. I've done so without resorting to making the heap structure
more complicated, or using a heap more often than when
replacement_sort_mem is exceeded by work_mem or maintenance_work_mem
(so replacement_sort_mem becomes something a bit different to what we
discussed, Robert -- more on that later). This seems like an
"everybody wins" situation, because in this revision the patch series
is now appreciably *faster* where the amount of memory available is
only a tiny fraction of the total input size.

Jeff Janes deserves a lot of credit for helping me to figure out how
to do this. I couldn't get over his complaint about the regression he
saw a few months back. He spoke of an "anti-sweetspot" in polyphase
merge, and how he suspected that to be the real culprit (after all,
most of his time was spent merging, with or without the patch
applied). He also said that reverting the memory batch/pool patch made
things go a bit faster, somewhat ameliorating his regression (when
just the quicksort patch was applied). This made no sense to me, since
I understood the memory batching patch to be orthogonal to the
quicksort thing, capable of being applied independently, and more or
less a strict improvement on master, no matter what the variables of
the sort are. Jeff's regressed case especially made no sense to me
(and, I gather, to him) given that the regression involved no
correlation, and so clearly wasn't reliant on generating far
fewer/longer runs than the patch (that's the issue we've discussed
more than any other now -- it's a red herring, it seems). As I
suspected out loud on February 14th, replacement selection mostly just
*masked* the real problem: the problem of palloc() fragmentation.
There doesn't seem to be much of an issue with the scheduling of
polyphase merging, once you fix palloc() fragmentation. I've created a
new revision, incorporating this new insight.

New Revision
============

Attached revision of patch series:

1. Creates a separate memory context for tuplesort's copies of
caller's tuples, which can be reset at key points, avoiding
fragmentation. Every SortTuple.tuple is allocated there (with trivial
exception); *everything else*, including the memtuples array, is
allocated in the existing tuplesort context, which becomes the parent
of this new "caller's tuples" context. Roughly speaking, that means
that about half of total memory for the sort is managed by each
context in common cases. Even with a high work_mem memory budget,
memory fragmentation could previously get so bad that tuplesort would
in effect claim a share of memory from the OS that is *significantly*
higher than the work_mem budget allotted to its sort. And with low
work_mem settings, fragmentation previously made palloc() thrash the
sort, especially during non-final merging. In this latest revision,
tuplesort now almost gets to use 100% of the memory that was requested
from the OS by palloc() is cases tested.

2. Loses the "quicksort with spillover" case entirely, making the
quicksort patch significantly simpler. A *lot* of code was thrown out.

This change is especially significant because it allowed me to remove
the cost model that Robert took issue with so vocally. "Quicksort with
spillover" was always far less important than the basic idea of using
quicksort for external sorts, so I'm not sad to see it go. And, I
thought that the cost model was pretty bad myself.

3. Fixes cost_sort(), making optimizer account for the fact that runs
are now about sort_mem-sized, not (sort_mem * 2)-sized.

While I was at it, I made cost_sort() more optimistic about the amount
of random I/O required relative to sequential I/O. This additional
change to cost_sort() was probably overdue.

4. Restores the ability of replacement selection to generate one run
and avoid any merging (previously, only one really long run and one
short run was possible, because at the time I conceptualized
replacement selection as being all about enabling "quicksort with
spillover", which quicksorted that second run in memory). This
only-one-run case is the case that Robert particularly cared about,
and it's fully restored when RS is in use (which can still only happen
for the first run, just never for the benefit of the now-axed
"quicksort with spillover" case).

5. Adds a new GUC, "replacement_sort_mem". The default setting is
16MB. Docs are included. If work_mem/maintenance_work_mem is less than
or equal to this, the first (and hopefully only) run uses replacement
selection.

"replacement_sort_mem" is a different thing to the concept for a GUC
Robert and I discussed (only the name is the same). That other concept
for a GUC related to the hybrid heap/quicksort idea (it controlled how
big the heap portion of memtuples was supposed to be, in a speculative
world where the patch took that "hybrid" approach [1] at all). In
light of this new information about palloc() fragmentation, and given
the risk to tuplesort's stability posed by implementing this "hybrid"
algorithm, this seems like a good compromise. I cannot see an upside
to pursuing the "hybrid" approach now. I regret reversing my position
on that, but that's just how it happened. Since Robert was seemingly
only worried about regressions, which are fixed now for a variety of
cases that I tested, I'm optimistic that this will be acceptable to
him. I believe that replacement_sort_mem as implemented here is quite
useful, although mostly because I see some further opportunities for
it.

Replacement Selection uses
--------------------------

What opportunities, you ask? Maybe CREATE INDEX can be made to accept
a "presorted" parameter, letting the user promise that the input is
more or less presorted. This allows tuplesort to only use a tiny heap,
while having it throw an error if it cannot produce one long run (i.e.
CREATE INDEX is documented as raising an error if the input is not
more or less presorted). The nice thing about that idea is that we can
be very optimistic about the data actually being more or less
presorted, so the implementation doesn't *actually* produce one long
run -- it produces one long *index*, with IndexTuples passed back to
nbtsort.c as soon as the heap fills for the first time, a bit like an
on-the-fly merge. Unlike an on-the-fly merge, no tapes or temp files
are actually involved; we write out IndexTuples by actually writing
out the index optimistically. There is a significant saving by using a
heap *because there is no need for a TSS_SORTEDONTAPE pass over the
data*. We succeed at doing it all at once with a tiny heap, or die
trying. So, in a later version of Postgres (9.7?),
replacement_sort_mem becomes more important because of this
"presorted" CREATE INDEX parameter. That's a relatively easy patch to
write, but it's not 9.6 material.

Commits
-------

Note that the attached revision makes the batch memory patch the first
commit in the patch series. It might be useful to get that one out of
the way first, since I imagine it is now considered the least
controversial, and is perhaps the simplest of the two big patches in
the series. I'm not very optimistic about the memory prefetch patch
0003-* getting committed, but so far I've only seen it help, and all
of my testing is based on having it applied. In any case, it's clearly
way way less important than the other two patches.

Testing
-------

Re: Using quicksort for every external sort run

From

Tomas Vondra

Date:

22 March 2016, 21:28:16

Hi,

I've finally managed to do some benchmarks on the patches. I haven't
really studied the details of the patch, so I simply collected a bunch
of queries relying on sorting - various forms of SELECT and a few CREATE
INDEX commands). It's likely some of the queries can't really benefit
from the patch - those should not be positively or negatively affected,
though.

I've executed the queries on a few basic synthetic data sets with
different cardinality

   1) unique data
   2) hight cardinality (rows/100)
   3) low cardinality (rows/1000)

initial ordering

   1) random
   2) sorted
   3) almost sorted

and different data types

   1) int
   2) numeric
   3) text

Tables with and without additional data (padding) were created.

So there are quite a few combinations. Attached is a shell script I've
used for testing, and also results for 1M and 10M rows on two different
machines (one with i5-2500k CPU, the other one with Xeon E5450).

Each query was executed 5x for each work_mem value (between 8MB and
1GB), and then a median of the runs was computed and that's what's on
the "comparison". This compares a414d96ad2b without (master) and with
the patches applied (patched). The last set of columns is simply a
"speedup" where "<1.0" means the patched code is faster, while >1.0
means it's slower. Values below 0.9 or 1.1 are using green or red
background, to make the most significant improvements or regressions
clearly visible.

For the smaller data set (1M rows), things works pretty fine. There are
pretty much no red cells (so no significant regressions), but quite a
few green ones (with duration reduced by up to 50%). There are some
results in the 1.0-1.05 range, but considering how short the queries
are, I don't think this is a problem. Overall the total duration was
reduced by ~20%, which is nice.

For the 10M data sets, total speedup is also almost ~20%, and the
speedups for most queries are also very nice (often ~50%). But the
number of regressions is considerably higher - there's a small number of
queries that got significantly slower for multiple data sets,
particularly for smaller work_mem values.

For example these two queries got almost 2x as slow for some data sets:

SELECT a FROM numeric_test UNION SELECT a FROM numeric_test_padding
SELECT a FROM text_test UNION SELECT a FROM text_test_padding

I assume the slowdown is related to the batching (as it's only happening
for low work_mem values), so perhaps there's an internal heuristics that
we could tune?

I also find it quite interesting that on the i5 machine the CREATE INDEX
commands are pretty much not impacted, while on the Xeon machine there's
an obvious significant improvement.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Hi,

On 03/30/2016 04:53 AM, Peter Geoghegan wrote:
> On Tue, Mar 29, 2016 at 6:02 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
...
>>> If there is ever a regression, it is only really sensible to talk
>>> about it while looking at trace_sort output (and, I guess, the query
>>> plan). I've asked Tomas for trace_sort output in all relevant cases.
>>> There is no point in "flying blind" and speculating what the problem
>>> was from a distance.
>>
>>
>> The updated benchmarks are currently running. I'm out of office until
>> Friday, and I'd like to process the results over the weekend. FWIW I'll have
>> results for these cases:
>>
>> 1) unpatched (a414d96a)
>> 2) patched, default settings
>> 3) patched, replacement_sort_mem=64
>>
>> Also, I'll have trace_sort=on output for all the queries, so we can
>> investigate further.
>
> Thanks! That will tell us a lot more.

So, I do have the results from both machines - I've attached the basic
comparison spreadsheets, the complete summary is available here:

    https://github.com/tvondra/sort-benchmark

The database log also includes the logs for trace_sort=on for each query
(use the timestamp logged for each query in the spreadsheet to locate
the right section of the log).

The benchmark was slightly modified, based on the previous feedback:

* fix the maintenance_work_mem thinko (affects CREATE INDEX cases)

* use "SELECT * FROM (... OFFSET 1e10)" pattern instead of the original
approach (copy to /dev/null)

* change the data generation for "low cardinality" data sets (by mistake
it generated mostly the same stuff as "high cardinality")

I have not collected explain plans. I guess we'll need explain analyze
in most cases anyway, and collecting those would increase the duration
of the benchmark. So I plan to collect this info for the interesting
cases on request.

While it might look like I'm somehow opposed to this patch series,
that's mostly because we tend to look only at the few cases that behave
poorly.

So let me be clear: I do think the patch seems to be a significant
performance improvement for most of the queries, and I'm OK with
accepting a few regressions (particularly if we agree those are
pathological cases, unlikely to happen in real-world workloads).

It's quite rare that a patch is a universal win without regressions, so
it's important to consider how likely those regressions are and what's
the net effect of the patch - and the patch seems to be a significant
improvement in most cases (and regressions limited to pathological or
rare corner cases).

I don't think those are reasons not to push this into 9.6.

Following is a rudimentary analysis of the results, a bit about how the
benchmark was constructed (and it's representativeness).

rudimentary analysis
--------------------

I haven't done any thorough investigation of the results yet, but in
general it seems the results from both machines are quite similar - the
numbers are different, but the speedup/slowdown patterns are mostly the
same (with some exceptions that I'd guess are due to HW differences).

The slowdown/speedup patterns (red/green cells in the spreadheets) are
also similar to those collected originally. Some timings are much lower,
presumably thanks to using the "OFFSET 1e10" pattern, but the patterns
are the same. CREATE INDEX statements are an obvious exception, of
course, due to the thinko in the previous benchmark.

The one thing that surprised me a bit is that

     replacement_sort_mem=64

actually often made the results considerably worse in many cases. A
common pattern is that the slowdown "spreads" to nearby cells - the are
many queries where the 8MB case is 1:1 with master and 32MB is 1.5:1
(i.e. takes 1.5x more time), and setting replacement_sort_mem=64 just
slows down the 8MB case.

In general, replacement_sort_mem=64 seems to only affect the 8MB case,
and in most cases it results in 100% slowdown (so 2x as long queries).

That being said, I do think the results are quite impressive - there are
far many queries with significant speedups (usually ~2x or more) than
slowdowns (and less significant than speedups).

I mostly agree with Peter that we probably don't need to worry about the
slowdown cases with low work_mem settings - if you do sorts with
millions of rows, you really need to give the database enough RAM.

But there are multiple slowdown cases with work_mem=128MB, and I'd dare
to say 128MB is not quite low-end work_mem value. So perhaps we should
look at least at those cases.

It's also interesting that setting replacement_sort_mem=64 makes this
much worse - i.e. the number of slowdowns with higher work_mem values
increases, and the difference is often quite huge.

So I'm really not sure what to do with this GUC ...

L2/L3 cache
-----------

I think we're overly optimistic when it comes to the size of the CPU
cache - while it's certainly true that modern CPUs have quite a bit of
it (the modern Xeon E5 have up to ~45MB per socket), there are two
important factors here:

1) The cache is shared by all cores on the socket (and on average
there's ~2-3 MB/s per physical core), and thus by all processes running
on the CPU. It's possible to run a single process on the CPU (thus
getting all the cache), but that's a bit expensive 1-core CPU.

2) The cache is shared by all nodes in the query plan, and we do have
executor that interleaves the nodes (so while an implementation of the
node may be very efficient when executed in isolation, that may not be
true when executed as part of a larger plan). The sort may be immune to
this to some degree, though.

I'm not sure how much this is considered in the 1994 VLDB paper, but I'd
be very careful about making claims about how much CPU cache is
available today (even on the best server CPUs).

benchmark discussion
--------------------

1) representativeness

Let me explain how I constructed the benchmark - I simply compiled a
list of queries executing sorts, and ran them on synthetic datasets with
different characteristics (cardinality and initial ordering). And I've
done that with different work_mem values, to see how that affects the
behavior.

I've done it this way for a few reasons - firstly, I'm extremely lazy
and did not want to study the internals of the patch as I'm not too much
into sorting details. Secondly, I did not want to tailor the benchmark
too tightly to the patch - it's quite possible some of the queries are
not executing the modified code at all, in which case they should be
unaffected (no slowdown, no speedup).

So while the benchmark might certainly include additional queries or
data sets with different characteristics, I'd dare to claim it's not
entirely misguided.

Some of the tested combinations may certainly be seen as implausible or
pathological, although intentional and not constructed on purpose. I'm
perfectly fine with identifying such cases and ignoring them.

2) TOAST overhead

Peter also mentioned that some of the cases have quite a bit of padding,
and that the TOAST overhead distorts the results. It's true there's
quite a bit of padding (~320B), but I don't quite see why this would
makes the results bogus - I've intentionally constructed it like this to
see how the sort behaves with wide rows, because:

* many BI queries actually fetch quite a lot of columns, and while 320B
may seem a bit high, it's not that difficult to reach with a few NUMERIC
columns

* we're getting parallel aggregate in 9.6, which relies on serializing
the aggregate state (and the combine phase may then need to do a sort again)

Moreover, while there certainly is TOAST overhead, I don't quite see why
it should change with the patch (as the padding columns are not used as
a sort key). Perhaps the patch results in "moving the tuples around
more" (deemphasizing comparison), but I don't see why that shouldn't be
an important metric in general - memory bandwidth seems to be a quite
important bottleneck these days. Of course, if this only affects the
pathological cases, we may ignore that.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

On Thu, Apr 7, 2016 at 11:10 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> I prefer units of tuples, with the GUC itself therefore being
> unitless.  I suggest we call the parameter replacement_sort_threshold
> and document that (1) the ideal value may depend on the amount of CPU
> cache available to running processes, with more cache implying higher
> values; and (2) the ideal value may depend somewhat on the input data,
> with more correlation implying higher values.  And then pick some
> value that you think is likely to work well for most people and call
> it good.
>
> If you could prepare a new patch with those changes and also making
> the changes requested in my other email, I will try to commit that
> before the deadline.  Thanks.

Attached revision of patch series:

* Breaks out the parts you don't want to commit right now, as agreed.

These separate patches in the rebased patch series are included here
for completeness, but will probably be submitted separately to 9.7. I
do still think you should commit 0002-* alongside 0001-*, though,
because it's useful to be able to enable the memory context dumps on
dev builds to debug external sorting. I won't insist on it, but that
is my recommendation.

* Fixes "over-zealous assertion" that I pointed out recently.

* Replaces replacement_sort_mem GUC with replacement_sort_tuples GUC,
since, as discussed, effective cut-off points for using replacement
selection for the first run are easier to derive from the size of
memtuples (the might-be heap) than from work_mem/maintenance_work_mem
(the fraction of all tuplesort memory used that is used for memtuples
could be very low in cases with what Tomas called "padding").

Since you didn't get back to me on the name of the GUC, I just ran
with the name replacement_sort_tuples, but that's something I'm
totally unattached to. Feel free to change it to whatever you prefer,
including your original suggestion of replacement_sort_threshold if
you still think that works.

The new default value that I came up with for replacement_sort_tuples
is 150,000 tuples, which is intended as a rough generic break-even
point. Note that trace_sort reports how many tuples were in the heap
should replacement selection actually be chosen for the first run.
150,000 seems like a high enough generic delta between an out-of-order
tuple, and its optimal in-order position; if *that* amount of buffer
space to "juggle" tuples isn't enough, it seems unlikely that
*anything* will be (anything that is less than 1/2 of the total number
of input tuples, at least).

Note that I use the term "cache oblivious" in the documentation now,
per your suggestion that CPU cache characteristics be addressed. We
have traditionally avoided using jargon like that, but I think it
works well here. The reader is not required to know the definition.
Dropping that term provides bread-crumbs for advance users to put all
this together in more detail, which I believe has value. It suggests
that increasing work_mem or maintenance_work_mem can have almost no
downside provided you don't need that memory for anything else, which
is true.

I will be glad to see this through. Thanks for your help with this, Robert.

--
Peter Geoghegan