Thread: New 9.6 external sort guidance around temp_tablespaces and maintenance_work_mem

From: Peter Geoghegan

Postgres 9.6 significantly altered the performance characteristics of
sorting. While almost every external sort will see a benefit without
any change in configuration, there is new scope for DBAs to tune the
system to take better advantage of the improved implementation of
external sorting. We should give them guidance on this, to make it
more likely that such tuning actually happens.

I'd like to discuss where this guidance should be added, but first, a
little background.

Firstly, DBAs may get a significant benefit from making sure that
temp_tablespaces puts temp files on a fast filesystem; doing so would
probably have been far less helpful prior to 9.6. It seems
reasonable to suppose that explicitly setting temp_tablespaces does
not happen all that often on production installations today, since
external sorts were stalled on memory access most of the time with the
old replacement selection approach. While memory stalls remain a big
cost in 9.6, the situation is much improved. External sorts may be
significantly I/O bound far more frequently in 9.6, especially during
merging (which is also optimized in 9.6), and especially during
merging for CREATE INDEX in particular. In the latter case, it's
likely that the system writes index tuples and WAL out as it reads
runs in, with everything on the same filesystem.
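
As a minimal sketch of what that tuning looks like (the path and
tablespace name here are hypothetical), a DBA might point temp files
at a dedicated fast filesystem:

```sql
-- Hypothetical example: /mnt/fast_tmp is a mount point on fast
-- storage, writable by the postgres OS user.
CREATE TABLESPACE fast_tmp LOCATION '/mnt/fast_tmp';

-- Route temporary files (external sort runs, hash spill files)
-- there; set per-session as below, or cluster-wide in
-- postgresql.conf followed by a reload.
SET temp_tablespaces = 'fast_tmp';
```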

Modern filesystems are really good at read-ahead and write-behind. It
really is not at all unexpected for sequential I/O with fast *disks*
to be faster than random *memory* access [1], and so I really doubt
that there is any question that I/O will now matter more. I expect
that blocking on sequential I/O will become much more of a problem
when parallel sort is added, but it's still going to be a problem with
serial sorts on busy production systems servicing many clients. (With
only a single client, this is much less true, but that's seldom the
case.)

Secondly, with 9.6 it's roughly true that the more memory you can
spare for maintenance_work_mem and work_mem, the better. There are
some caveats that I won't rehash right now; the caveats are rather
limited, and DBAs can probably just pretend that the common, simple
intuition "more memory is better, until you OOM" is at last true. I
don't imagine more information (the caveats) is of much practical use.
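
For example (the values here are purely illustrative, not
recommendations), memory settings can be raised cluster-wide with
ALTER SYSTEM:

```sql
-- Illustrative values; the right numbers depend on available RAM
-- and on how many clients might sort or hash concurrently, since
-- work_mem applies per operation, not per cluster.
ALTER SYSTEM SET work_mem = '256MB';
ALTER SYSTEM SET maintenance_work_mem = '2GB';
SELECT pg_reload_conf();  -- apply without a restart
```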

A duality
=========

Adding faster disks to get more sequential write bandwidth is a
practical measure available to many DBAs, which is more important for
sorting now. While not every user can afford to do so, it's nice that
some have the *option* of upgrading to get better performance. It need
not be expensive; we get plenty of benefit from cheaper SATA HDDs,
since, with care, only sequential I/O performance really matters. This
was not really the case in prior releases. At the same time, if you
can afford the memory, you should probably just increase
work_mem/maintenance_work_mem to get another performance benefit.
These two benefits will often be complementary. There may be a
feedback loop that looks a bit like this:

Quicksort is cache oblivious, which implies that we may effectively
use more working memory, which implies longer runs, which implies
fewer merge passes, which implies that sequential I/O performance is
mostly where I/O costs are paid, which implies that you can manage I/O
costs by adding (consumer grade?) SATA HDDs configured in RAID0 for
temp files, which implies that sorts finish sooner, which implies that
in aggregate memory use is lower, which implies that you can afford to
set work_mem/maintenance_work_mem higher still, which implies ... .

This may seem facile, but consider the huge difference in costs I'm
focussing on. There is a huge difference between the cost of random
I/O and sequential I/O (when FS cache is not able to "amortize" the
random I/O), just as there is a huge difference between the cost of an
L1 cache access, and main memory access (a cache miss). So, if I can
be forgiven for using such a hand-wavey term, there is an interesting
duality here. We should avoid "the expensive variety" of I/O (random
I/O) wherever possible, and avoid "the expensive variety" of memory
access (involving a last-level CPU cache miss) as much as possible.
Certainly, the economics of modern hardware strongly support
increasing locality of access at every level. We're talking about
differences of perhaps several orders of magnitude.

I/O bound sorts in earlier releases
-----------------------------------

Perhaps someone has personally seen a DBA adding a new temp_tablespace
on a separate, faster filesystem. Perhaps this clearly improved
performance. I don't think that undermines my argument, though. This
*does* sometimes happen on prior versions of Postgres, but my guess is
that this is almost accidental:

* Increasing work_mem/maintenance_work_mem perversely makes external
sorts *slower* before 9.6. Iterative tuning of these settings on busy
production systems will therefore tend to guide the DBA to decrease
work_mem/maintenance_work_mem settings (maybe external sorts got so
slow that OOMs happened). Whether or not the DBA realizes it, the
counter-intuitive slowdown occurs because replacement selection is
sensitive to CPU cache size.

* As those memory settings are pushed down, runs become smaller, and
more merge steps are required for any larger external sorts performed.
Merge steps increase the amount of I/O, particularly when a few passes
are necessary; *random* I/O suddenly spikes. Of course, this could
have been avoided if work_mem/maintenance_work_mem were higher, but
that's already been ruled out.
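
To make that arithmetic concrete with a hypothetical
back-of-the-envelope (sizes invented for illustration): run size is
roughly proportional to work_mem, so shrinking memory multiplies the
run count, and once the runs no longer fit in a single merge, every
additional pass rereads and rewrites the entire data set:

```sql
-- Hypothetical 100GB (102400MB) external sort; a run is roughly
-- work_mem-sized, so:
SELECT 102400 / 1024 AS runs_at_1gb,   -- 100 runs: one merge pass
       102400 / 64   AS runs_at_64mb;  -- 1600 runs: more passes likely
```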

Documentation
=============

I think we can expand "21.6. Tablespaces" to describe the implications
of these new performance characteristics. I'd like to hear opinions on
how to approach that before proposing a patch, though. The basic
guidance should, IMV, be:

* A temp tablespace with cheaper disks that have good sequential I/O
performance can speed up external sorts quite a lot. Probably not a
great idea to have many temp tablespaces. Use RAID0 instead, because
that performs better, and because it doesn't matter that temp files
are not recoverable if a disk is faulty.

* More memory for sorting and hashing is often better in PostgreSQL
9.6. Notably, the performance of hash joins that spill will tend to
degrade less predictably than the performance of sorts that spill as
less memory is made available. (Perhaps mention the number of external
sort passes?)

* Increasing work_mem/maintenance_work_mem may fail to improve
performance only because sorts then become more I/O bound. When in
doubt, testing is advised. A balance may need to be found, if only to
avoid wasting memory.
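
When testing along those lines, it may be worth mentioning
trace_sort, a developer GUC that logs run counts and merge activity
(the table and column below are hypothetical):

```sql
SET trace_sort = on;             -- developer option: log sort internals
SET client_min_messages = log;   -- surface the LOG output in this session
CREATE INDEX ON some_big_table (some_col);
```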

Thoughts?

[1] https://queue.acm.org/detail.cfm?id=1563874
-- 
Peter Geoghegan



On Thu, Apr 21, 2016 at 08:37:54PM -0700, Peter Geoghegan wrote:
> Documentation
> =============
> 
> I think we can expand "21.6. Tablespaces" to describe the implications
> of these new performance characteristics. I'd like to hear opinions on
> how to approach that before proposing a patch, though. The basic
> guidance should, IMV, be:
> 
> * A temp tablespace with cheaper disks that have good sequential I/O
> performance can speed up external sorts quite a lot. Probably not a
> great idea to have many temp tablespaces. Use RAID0 instead, because
> that performs better, and because it doesn't matter that temp files
> are not recoverable if a disk is faulty.
> 
> * More memory for sorting and hashing is often better in PostgreSQL
> 9.6. Notably, the performance of hash joins that spill will tend to
> degrade less predictably than the performance of sorts that spill as
> less memory is made available. (Perhaps mention the number of external
> sort passes?)
> 
> * Increasing work_mem/maintenance_work_mem may fail to improve
> performance only because sorts then become more I/O bound. When in
> doubt, testing is advised. A balance may need to be found, if only to
> avoid wasting memory.

This seems very detailed.  I think we need much broader coverage of how
the existing GUC variables affect performance before we could cover
this.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+                     Ancient Roman grave inscription +



On Sat, Apr 30, 2016 at 9:30 AM, Bruce Momjian <bruce@momjian.us> wrote:
>> I think we can expand "21.6. Tablespaces" to describe the implications
>> of these new performance characteristics. I'd like to hear opinions on
>> how to approach that before proposing a patch, though.

> This seems very detailed.  I think we need much broader coverage of how
> the existing GUC variables affect performance before we could cover
> this.

While my justification is very detailed, I think that the practical
guidance itself is fairly simple. In fact, making sure that people
don't miss that things have become *simpler* across the board is the
primary point that I'd like to make to users.

As of 9.6, external sorting finally benefits from very large amounts
of memory, certainly in the multiple gigabyte range, where before
performance steadily declined as work_mem increased past a fairly
small amount (less than 100MB in my tests). So, DBAs faced competing,
even contradictory considerations: keep work_mem high enough to make
most sorts internal, but if a sort had to be external, make sure the
work_mem available to it was quite low, say less than 100MB. That set
of characteristics is almost impossible to tune for. Other
operations that use temporary memory bound in size by work_mem have
always benefited from increasing work_mem settings in a more or less
predictable, linear way, so the *general* picture about what to do
becomes far clearer (we don't really *need* to talk about sorting at
all).
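
That kind of iterative tuning might be sketched like this (table and
column are hypothetical); the point is to watch the reported sort
method change as work_mem grows:

```sql
SET work_mem = '64MB';
EXPLAIN ANALYZE SELECT * FROM events ORDER BY created_at;
-- expect something like: Sort Method: external merge  Disk: ...kB

SET work_mem = '2GB';
EXPLAIN ANALYZE SELECT * FROM events ORDER BY created_at;
-- if the sort now fits: Sort Method: quicksort  Memory: ...kB
```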

Presumably due to the old issues with tuplesort, the closest the docs
get to recommending higher work_mem or maintenance_work_mem settings
is: "Larger [maintenance_work_mem] settings might improve performance
for vacuuming and for restoring database dumps". That's it! Since the
performance characteristics of external sorting are now roughly in
line with everything else, why continue to make such a weak statement
in 9.6? It's not hard to understand why we originally equivocated
here, but things have changed.

I had hardly realized that the docs are only lukewarm on the idea
that increasing work_mem will ever help *anything* at all. It's easy to
fail to notice that when you're an expert. We provide *zero* guidance
on how to tune work_mem or maintenance_work_mem. Surely we can do
better than that.

A secondary point I'd like to make is that if and when no further
benefit can be observed from increasing work_mem, it's well worth
considering that more I/O bandwidth for temp files will help.
Obviously, this assumes the DBA avoids swapping when using all that
memory, for example by setting vm.swappiness appropriately on Linux,
while also avoiding OOMs. At the point that no further benefit can be
obtained by increasing work_mem, sequential I/O should be strongly
considered as a possible bottleneck to target. That's a nice problem
to have, because you can buy fairly inexpensive HDDs for temp files
that will increase the point at which higher work_mem settings will no
longer help.
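
The swapping caveat might be handled with something like this sysctl
fragment on Linux (the value is illustrative; the right setting is
workload-dependent):

```
# /etc/sysctl.d/90-postgres.conf (hypothetical filename)
# Prefer reclaiming page cache over swapping out process memory.
vm.swappiness = 10
```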

So, we can talk about this stuff without necessarily even mentioning
external sorting.

I didn't mention it before now, but as it happens the 2009 ACM article
I linked to already (https://queue.acm.org/detail.cfm?id=1563874)
mentions that they found performance fell sharply past a certain point
when using Postgres for a large aggregate SQL query:

"""
Invoking the DBMS’s built-in EXPLAIN facility revealed the problem:
while the query planner chose a reasonable hash table-based
aggregation strategy for small tables, on larger tables it switched to
sorting by grouping columns—a viable, if suboptimal strategy given a
few million rows, but a very poor one when facing a billion.
"""

I think that this must be describing tuplesort's previous use of
replacement selection to sort runs -- it might have been the case that
an internal sort for the GroupAggregate was "viable, if suboptimal"
then, but an external sort was just unacceptable. 9.6 really blurs the
distinction between internal and external sorts a lot of the time. It
was common to see a big external sort with ~5 runs taking no longer
than 110% of the time of an equivalent internal sort when testing the
9.6 sorting stuff. I would like to help DBAs preserve those benefits
when they have hundreds of gigabytes or even terabytes of data to
aggregate. If you look at the graph in that article, it isn't that hard
to imagine that a well-tuned implementation could avoid that
superlinear growth in query runtime, because *superlinear* growth
isn't predicted by any theoretical model. The article is incorrect to
state "There is no pathology here", although that doesn't undermine
the author's argument too much; I find it rather convincing overall.
The fact that a single node will eventually fall short doesn't mean it
isn't important to push your single node further, to maintain an
unsharded Postgres instance as the right choice for longer (by getting
the full benefit of the hardware). Getting this right could allow
unsharded Postgres to go a lot further (parallelism is the final part
of this), which could make all the difference for some users.




On Sat, Apr 30, 2016 at 12:19:02PM -0700, Peter Geoghegan wrote:
> Presumably due to the old issues with tuplesort, the closest the docs
> get to recommending higher work_mem or maintenance_work_mem settings
> is: "Larger [maintenance_work_mem] settings might improve performance
> for vacuuming and for restoring database dumps". That's it! Since the
> performance characteristics of external sorting are now roughly in
> line with everything else, why continue to make such a weak statement
> in 9.6? It's not hard to understand why we originally equivocated
> here, but things have changed.

Yes, this needs updating.  My point is that there are a whole lot of
things we don't talk about in this area, and should, but I would like
it to be of a consistent level of detail for all areas of performance.




On Sat, Apr 30, 2016 at 3:23 PM, Bruce Momjian <bruce@momjian.us> wrote:
> Yes, this needs updating.  My point is that there are a whole lot of
> things we don't talk about in this area, and should, but I would like
> it to be of a consistent level of detail for all areas of performance.

I think that we need to do better generally too, but the existing
handling of performance, such as it is, is not consistent in the level
of detail it goes into. For example, we give far more advice about
setting the value of commit_delay than setting the value of work_mem,
even though that's clearly a niche topic in comparison. You can say
the same thing about effective_io_concurrency. 95%+ of all users don't
use either setting, making that documentation irrelevant to them. I
think that this is simply because it was hard to make a good
recommendation about work_mem, but that's now less true overall. We
don't like equivocating, so we said only the absolute minimum.

ISTM that the area that needs the most attention is planner stuff, and
query workspace memory stuff (e.g. work_mem, temp files). work_mem and
maintenance_work_mem seem like good places to start adding more
practical advice, particularly given that we can avoid mentioning
sorting or hashing and still add value.

Maybe there is a place to emphasize this change in the release notes.
I don't really want to make it about the external sort feature,
though, because enabling higher work_mem settings by making sure that
does not disadvantage external sorts is as much about enabling
HashAggregates as it is about enabling internal sorts.




On Sat, Apr 30, 2016 at 04:23:00PM -0700, Peter Geoghegan wrote:
> Maybe there is a place to emphasize this change in the release notes.
> I don't really want to make it about the external sort feature,
> though, because enabling higher work_mem settings by making sure that
> does not disadvantage external sorts is as much about enabling
> HashAggregates as it is about enabling internal sorts.

We do often mention in the release notes areas that will need retuning.




On Sat, Apr 30, 2016 at 4:26 PM, Bruce Momjian <bruce@momjian.us> wrote:
> On Sat, Apr 30, 2016 at 04:23:00PM -0700, Peter Geoghegan wrote:
>> Maybe there is a place to emphasize this change in the release notes.
>> I don't really want to make it about the external sort feature,
>> though, because enabling higher work_mem settings by making sure that
>> does not disadvantage external sorts is as much about enabling
>> HashAggregates as it is about enabling internal sorts.
>
> We do often mention in the release notes areas that will need retuning.

How do you feel about it in this instance?

As you may have gathered, my perspective is that the external sorting
patches were more about fixing a problem with tuplesort than about
improving its performance. The performance characteristics of the old
approach to sorting runs were all over the place, which made
increasing work_mem like taking one step forward, then two steps
backwards.




On Sat, Apr 30, 2016 at 04:39:22PM -0700, Peter Geoghegan wrote:
> On Sat, Apr 30, 2016 at 4:26 PM, Bruce Momjian <bruce@momjian.us> wrote:
> > On Sat, Apr 30, 2016 at 04:23:00PM -0700, Peter Geoghegan wrote:
> >> Maybe there is a place to emphasize this change in the release notes.
> >> I don't really want to make it about the external sort feature,
> >> though, because enabling higher work_mem settings by making sure that
> >> does not disadvantage external sorts is as much about enabling
> >> HashAggregates as it is about enabling internal sorts.
> >
> > We do often mention in the release notes areas that will need retuning.
> 
> How do you feel about it in this instance?
> 
> As you may have gathered, my perspective is that the external sorting
> patches were more about fixing a problem with tuplesort than about
> improving its performance. The performance characteristics of the old
> approach to sorting runs were all over the place, which made
> increasing work_mem like taking one step forward, then two steps
> backwards.

Yes, we should mention something.
