Re: Using quicksort for every external sort run - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: Using quicksort for every external sort run
Date
Msg-id CAM3SWZREyZCrvx8aHAK-8Uieo-bzuSkXPJpc9ni9ZMOANGRrqg@mail.gmail.com
Whole thread Raw
In response to Re: Using quicksort for every external sort run  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses Re: Using quicksort for every external sort run  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
List pgsql-hackers
Hi Tomas,

Overall, I agree with your summary.

On Sun, Apr 3, 2016 at 5:24 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> So, let me sum this up, the way I understand the current status.
>
>
> 1) overall, the patch seems to be a clear performance improvement

I think that's clear. There are even cases that are over 5x faster,
which are representative of some real workloads (e.g., "CREATE INDEX x
ON numeric_test (a)" when low_cardinality_almost_asc +
maintenance_work_mem=512MB). A lot of the aggregate (datum sort)
cases, and heap tuple cases are 3x - 4x faster.

> 2) it's unlikely we can improve the performance further

I think it's very unlikely that these remaining regressions can be fixed, yes.

> Secondly, master is faster only if there's enough on-CPU cache for the
> replacement sort (for the memtuples heap), but the benchmark is not
> realistic in this respect as it only ran 1 query at a time, so it used the
> whole cache (6MB for i5, 12MB for Xeon).
>
> In reality there will be multiple processes running at the same time (e.g
> backends when running parallel query), significantly reducing the amount of
> cache per process, making the replacement sort inefficient and thus
> eliminating the regressions (by making the master slower).

Agreed. And even though the 8MB work_mem cases always have more than
enough CPU cache to fit the replacement selection heap, it's still no
worse than a mixed picture. The replacement_work_mem=64KB + patch +
8MB (maintenance_)work_mem cases (i.e. replacement selection entirely
disabled) don't always do worse; they are often a draw, and sometimes
do much better. We *still* win in many cases, sometimes by quite a bit
(e.g. "SELECT COUNT(DISTINCT a) FROM int_test" typically loses about
50% of its runtime when patched and RS is disabled at work_mem=8MB).
The cases where we lose at work_mem=8MB involve padding and a
correlation. The really important case of CREATE INDEX on int4 almost
always wins, *even with sorted input* (the
almost-but-not-quite-asc-sorted case loses ~1%). We can shave 20% -
30% off the CREATE INDEX int4 cases with just maintenance_work_mem =
8MB.

Even in these cases with so much CPU cache relative to work_mem, you
need to search for regressed cases to find them, and they are less
representative cases. So, while the picture for the work_mem=8MB
column alone seems kind of bad, if you consider where the regressions
actually occur, you could argue that even that's a draw.

> 3) replacement_sort_mem GUC
>
> I'm not quite sure what's the plan with this GUC. It was useful for
> development, but it seems to me it's pretty difficult to tune it in practice
> (especially if you don't know the internals, which users generally don't).

I agree.

> So I think we should either remove the GUC entirely, or move it to the
> developer section next to trace_sort (and removing it from the conf).

I'll let Robert decide what's best here, but I see your point.

Side note: trace_sort actually is documented. It's a bit weird that we
have those TRACE_SORT macros at all IMV. I think we should rip those
out, and assume every build enables TRACE_SORT, because that's
probably true anyway.

I do think that replacement selection could be put to good use for
CREATE INDEX if the CREATE INDEX utility command had a "presorted"
parameter. Specifically, an implementation of the "presorted" idea
that I recently sketched [1] could do better than any presorted
replacement selection case we've seen so far because it allows the
implementation to optimistically create the index on-the-fly (if that
isn't possible, throw an error), without a second pass over tuples
sorted on tape. Nothing needs to be stored on a tape/temp file *at
all*; the only thing that is stored externally is the index itself.
But this patch doesn't add that feature, which can be worked on
without the user needing to know about replacement_sort_mem in 9.6.

So, I'm not in favor of ripping out the replacement selection code,
but think it could make sense to effectively disable it entirely for
the time being (with some developer feature to turn it back on for
testing). In general, I share your misgivings about the new GUC,
though.

> I'm wondering whether 16MB default is not a bit too much, actually. As
> explained before, that's not the amount of cache we should expect per
> process, so maybe ~2-4MB would be a better default value?

The obvious presorted case is where we have a SERIAL column, but as I
mentioned even that isn't helped by RS. Moreover, it will be
significantly hurt with a default maintenance_work_mem of 64MB. Your
int4 CREATE INDEX cases clearly show this.

> BTW couldn't we tune the value automatically for each sort, using the
> pg_stats.correlation for the sort keys, when available (increasing the
> replacement_sort_mem when correlation is close to 1.0)? Wouldn't that
> improve at least some of the regressions?

Maybe, but that seems hard. That information isn't conveniently
available to the executor/tuplesort, and as we've seen with CREATE
INDEX int4 cases, it's far from clear that we'll win even when there
definitely is presorted input. Replacement selection needs more than a
simple correlation to win, so you'll end up building a cost model with
many new problems if this is to work.

[1] http://www.postgresql.org/message-id/CAM3SWZRFzg1LUK8FBg_goZ8zL0n7k6q83qQjhOV8NDZioA5TEQ@mail.gmail.com
-- 
Peter Geoghegan



pgsql-hackers by date:

Previous
From: Stephen Frost
Date:
Subject: Re: Proposal: RETURNING primary_key()
Next
From: Simon Riggs
Date:
Subject: Re: PATCH: use foreign keys to improve join estimates v1