Re: Using quicksort for every external sort run - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: Using quicksort for every external sort run
Date
Msg-id CAM3SWZQG4jJHH7AXCXTN7iHKdEh4eqgn+dB30_qA_yLtwR2-4A@mail.gmail.com
Whole thread Raw
In response to Re: Using quicksort for every external sort run  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: Using quicksort for every external sort run
List pgsql-hackers
On Tue, Dec 22, 2015 at 2:57 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Dec 22, 2015 at 4:37 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> That's not fair. DEFAULT_EQ_SEL, DEFAULT_RANGE_INEQ_SEL, and
>> DEFAULT_NUM_DISTINCT are each about as arbitrary. We have to do
>> something, though.
>>

> Sure, there are arbitrary numbers all over the code, driven by
> empirical observations about what factors are important to model.  But
> this is not that.  You don't have a thing called seq_page_cost and a
> thing called cpu_tuple_cost and then say, well, empirically the ratio
> is about 100:1, so let's make the former 1 and the latter 0.01.  You
> just have some numbers, and it's not clear what, if anything, they
> actually represent.

What I find difficult to accept about what you say here is that at
*this* level, something like cost_sort() has little to recommend it.
It costs a sort of a text attribute at the same level as the cost of
sorting the same tuples using an int4 attribute (based on the default
cpu_operator_cost for C functions -- without any attempt to
differentiate text and int4).

Prior to 9.5, sorting text took about 5 - 10 times longer that this
similar int4 sort. That's a pretty big difference, and yet I recall no
complaints. The cost of a comparison in a sort can hardly be
considered in isolation, anyway -- cache efficiency is at least as
important.

Of course, the point is that the goal of a cost model is not to
simulate reality as closely as possible -- it's to produce a good
outcome for performance purposes under realistic assumptions.
Realistic assumptions include that you can't hope to account for
certain differences in cost. Avoiding a terrible outcome is very
important, but the worst case for useselection() is no worse than
today's behavior (or a lost opportunity to do better than today's
behavior).

Recently, the paper that was posted to the list about the Postgres
optimizer stated formally what I know I had a good intuitive sense of
for a long time: that better selectivity estimates are much more
important than better cost models in practice. The "empirical
observations" driving something like DEFAULT_EQ_SEL are very weak --
but what are you gonna do?

> The crossover point is clamped to a minimum of 40% [constant #1] and a
> maximum of 85% [constant #2] when the size of the SortTuple array is
> no more than MaxAllocSize.  Between those bounds, the crossover point
> is 90% [constant #3] minus 7.5% [constant #4] per 32-byte increment
> [constant #5] of estimated average tuple size.  On the other hand,
> when the estimated average tuple size exceeds MaxAllocSize, the
> crossover point is either 90% [constant #6] or 95% [constant #7]
> depending on whether the average tuple size is greater than 32 bytes
> [constant #8].  But if the row count hit is less than 50% [constant
> #9] of the rows we've already seen, then we ignore it and do not use
> selection.
>
> You make no attempt to justify why any of these numbers are correct,
> or what underlying physical reality they represent.

Just like selfuncs.h for the most part, then.

> The comment which
> describes the manner in which crossover point is computed for
> SortTuple volumes under 1GB says "Starting from a threshold of 90%,
> refund 7.5% per 32 byte average-size-increment."  That is a precise
> restatement of what the code does, but it doesn't attempt to explain
> why it's a good idea.  Perhaps the reader should infer that the
> crossover point drops as the tuples get bigger, except that in the
> over-1GB case, a larger tuple size causes the crossover point to go
> *up* while in the under-1GB case, a larger tuple size causes the
> crossover point to go *down*.  Concretely, if we're sorting 44,739,242
> 224-byte tuples, the estimated crossover point is 40%.  If we're
> sorting 44,739,243 244-byte tuples, the estimated crossover point is
> 95%.  That's an extremely sharp discontinuity, and it seems very
> unlikely that any real system behaves that way.

Again, the goal of the cost model is not to model reality as such.
This cost model is conservative about using replacement selection. It
makes sense when you consider that there tends to be a lot fewer
external sorts on a realistic workload -- if we can cut that number in
half, which seems quite possible, that's pretty good, especially from
a DBA's practical perspective. I want to buffer DBAs against suddenly
incurring more I/O, but not at the risk of having a far longer sort
for the first run. Or with minimal exposure to that risk. The cost
model weighs the cost of the hint being wrong to some degree (which is
indeed novel). I think it makes sense in light of the cost and
benefits in this case, although I will add that I'm not entirely
comfortable with it. I just don't imagine that there is a solution
that I will be fully comfortable with. There may be one that
superficially looks correct, but I see little point in that.

> I'm prepared to concede that constant #9 - ignoring the input row
> estimate if we've already seen twice that many rows - probably doesn't
> need a whole lot of justification here, and what justification it does
> need is provided by the fact that (we think) replacement selection
> only wins when there are going to be less than 2 quicksorted runs.
> But the other 8 constants here have to have reasons why they exist,
> what they represent, and why they have the values they do, and that
> explanation needs to be something that can be understood by people
> besides you.  The overall cost model needs some explanation of the
> theory of operation, too.

The cost model is extremely fudged. I think that the greatest problem
that it has is that it isn't explicit enough about that.

But yes, let me concede more clearly: the cost model is based on
frobbing. But at least it's relatively honest about that, and is
relatively simple. I think it might be possible to make it simpler,
but I have a feeling that anything we can come up with will basically
have the same quality that you so dislike. I don't know how to do
better. Frankly, I'd rather be roughly correct than exactly wrong.

> In my opinion, reasoning in terms of a crossover point is a strange
> way of approaching the problem.  What would be more typical at least
> in our code, and I suspect in general, is do a cost estimate of using
> selection and a cost estimate of not using selection and compare them.
> Replacement selection has a CPU cost and an I/O cost, each of which is
> estimable based on the tuple count, chosen comparator, and expected
> I/O volume.  Quicksort has those same costs, in different amounts.  If
> those respective costs are accurately estimated, then you can pick the
> strategy with the lower cost and expect to win.

If you instrument the number of comparisons, I expect you'll find that
master is very competitive with the patch in terms of number of
comparisons performed in total. I think it might even win (Knuth
specifically addresses this, actually). Where does that leave your
theory of how to build a cost model? Also, the disadvantage of
replacement selection's heap is smaller with smaller work_mem settings
-- this has been shown many times to make a *huge* difference. Can the
alternative cost model be reasonably expected to incorporate that,
too? Heap sort isn't cache oblivious, which is why we see these weird
effects, so don't forget to have CPU cache size as an input into your
cost model (or maybe use a magic value based on something like
MaxAllocSize!). How do you propose to way the distributed cost of a
lost opportunity to reduce I/O against the distributed cost of
heapsort wasting system memory bandwidth?

And so on, and so on...believe me, I could go on.

By the way, I think that there needs to be a little work done to
cost_sort() too, which so far I've avoided.

>> However, where replacement selection can still help is avoiding I/O
>> *entirely*. If we can avoid spilling 95% of tuples in the first place,
>> and quicksort the remaining (heapified) tuples that were not spilled,
>> and merge an in-memory run with an on-tape run, then we can win big.
>
> That's pretty much what I was trying to say, except that I'm curious
> to know whether replacement selection can win when it manages to
> generate a vastly longer run than what we get from quicksorting.  Say
> quicksorting produces 10, or 100, or 1000 tapes, and replacement
> selection produces 1 due to a favorable data distribution.

I believe the answer is probably no, but if there is a counter
example, it probably isn't worth pursuing. To repeat myself, I started
out with exactly the same intuition as you on that question, but
changed my mind when my efforts to experimentally verify the intuition
were not successful.

> I agree, but that's not what I proposed.  You don't want to keep
> re-sorting to incorporate new tuples into the run, but if you've got
> 1010 tuples and you can fit 1000 tuples in, you can (a) quicksort the
> first 1000 tuples, (b) read in 10 more tuples, dumping the first 10
> tuples from run 0 to disk, (c) quicksort the last 10 tuples to create
> run 1, and then (d) merge run 0 [which is mostly in memory] with run 1
> [which is entirely in memory].  In other words, yes, quicksorting
> doesn't let you add things to the sort incrementally, but you can
> still write out the run incrementally, writing only as many tuples as
> you need to dump to get the rest of the input data into memory.

Merging is still sorting. The 10 tuples are not very cheap to merge
against the 1000 tuples, because you'll probably still end up reading
most of the 1000 tuples to do so. Perhaps you anticipate that there
will be roughly disjoint ranges of values in each run due to a
logical/physical correlation, and so you won't have to read that many
of the 1000 tuples, but this approach has no ability to buffer even
one outlier value (unlike replacement selection, in particular my
approach within mergememruns()).

The cost of heapification of 1.01 million tuples to spill 0.01 million
tuples is pretty low (relative to the cost of sorting them in
particular). The only difference between what you say here and what I
actually do is that the remaining tuples are heapified rather than
sorted, and I quicksort everything together to "merge run 1 and run 0"
rather than doing two quicksorts and a merge. I believe that this can
be demonstrated to be cheaper.

Another factor is that the heap could be useful for other stuff in the
future. As Simon Riggs pointed out, for deduplicating values as
they're read in by tuplesort. (Okay, that's really the only other
thing, but it's a good one).

-- 
Peter Geoghegan



pgsql-hackers by date:

Previous
From: Bruce Momjian
Date:
Subject: Re: [PROPOSAL] Backup and recovery of pg_statistic
Next
From: Bruce Momjian
Date:
Subject: Re: Freeze avoidance of very large table.