Re: Parallel tuplesort (for parallel B-Tree index creation) - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: Parallel tuplesort (for parallel B-Tree index creation)
Msg-id 0939f43f-226b-ee57-4a27-2d1b8e18185c@iki.fi
In response to Parallel tuplesort (for parallel B-Tree index creation)  (Peter Geoghegan <pg@heroku.com>)
Responses Re: Parallel tuplesort (for parallel B-Tree index creation)  (Peter Geoghegan <pg@heroku.com>)
List pgsql-hackers
I'm reviewing patches 1-3 in this series, i.e. those patches that are 
not directly related to parallelism, but are independent improvements to 
merging.

Let's begin with patch 1:

On 08/02/2016 01:18 AM, Peter Geoghegan wrote:
> Cap the number of tapes used by external sorts
>
> Commit df700e6b set merge order based on available buffer space (the
> number of tapes was as high as possible while still allowing at least 32
> * BLCKSZ buffer space per tape), rejecting Knuth's theoretically
> justified "sweet spot" of 7 tapes (a merge order of 6 -- Knuth's P),
> improving performance when the sort thereby completed in one pass.
> However, it's still true that there are unlikely to be benefits from
> increasing the number of tapes past 7 once the amount of data to be
> sorted significantly exceeds available memory; that commit probably
> mostly just improved matters where it enabled all merging to be done in
> a final on-the-fly merge.
>
> One problem with the merge order logic established by that commit is
> that with large work_mem settings and data volumes, the tapes previously
> wasted as much as 8% of the available memory budget; tens of thousands
> of tapes could be logically allocated for a sort that will only benefit
> from a few dozen.

Yeah, wasting 8% of the memory budget on this seems like a bad idea. If 
I understand correctly, that makes the runs shorter than necessary, 
leading to more runs.
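
Just to check where that 8% figure comes from (rough arithmetic, using
the constants as I read them in tuplesort.c; a sketch only, the exact
numbers don't matter):

#include <stdio.h>

/* Constants as (I believe) defined in tuplesort.c, with 8 kB blocks. */
#define BLCKSZ                8192
#define TAPE_BUFFER_OVERHEAD  (BLCKSZ * 3)   /* per-tape buffering overhead */
#define MERGE_BUFFER_SIZE     (BLCKSZ * 32)  /* preread workspace per input tape */

int
main(void)
{
    long        work_mem = 1024L * 1024 * 1024;     /* 1 GB, for illustration */

    /* Merge order derived purely from the memory budget, as today. */
    long        max_tapes = (work_mem - TAPE_BUFFER_OVERHEAD) /
                            (MERGE_BUFFER_SIZE + TAPE_BUFFER_OVERHEAD);

    /* Memory reserved up front for the tape buffers alone. */
    long        reserved = max_tapes * TAPE_BUFFER_OVERHEAD;

    printf("max tapes: %ld\n", max_tapes);
    printf("reserved:  %ld bytes (%.1f%% of work_mem)\n",
           reserved, 100.0 * reserved / work_mem);
    return 0;
}

At 1 GB of work_mem that gives ~3700 tapes and ~90 MB (about 8.6%)
reserved for tape buffers before a single tuple has been sorted.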

> A new quasi-arbitrary cap of 501 is applied on the number of tapes that
> tuplesort will ever use (i.e.  merge order is capped at 500 inclusive).
> This is a conservative estimate of the number of runs at which doing all
> merging on-the-fly no longer allows greater overlapping of I/O and
> computation.

Hmm. Surely there are cases where, with > 501 tapes, you could do it in 
one merge pass, but now you need two? And that would hurt performance, no?
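
For example (made-up numbers, and assuming roughly one quicksorted run
per work_mem of input): with work_mem = 1 GB and ~700 GB of input you
get ~700 runs. Today those fit in a single ~700-way merge, but with the
order capped at 500 you need an extra intermediate pass:

#include <math.h>
#include <stdio.h>

/* Idealized pass count for nruns initial runs at a given merge order. */
static int
merge_passes(double nruns, double merge_order)
{
    return (int) ceil(log(nruns) / log(merge_order));
}

int
main(void)
{
    double      nruns = 700.0;      /* ~700 GB of input at 1 GB work_mem */

    printf("order 3700: %d pass(es)\n", merge_passes(nruns, 3700.0));
    printf("order  500: %d pass(es)\n", merge_passes(nruns, 500.0));
    return 0;
}

That prints 1 pass vs. 2 passes, and the extra pass means re-reading and
re-writing a sizable chunk of the data.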

Why do we reserve the buffer space for all the tapes right at the 
beginning? Instead of the single USEMEM(maxTapes * TAPE_BUFFER_OVERHEAD) 
call in inittapes(), couldn't we call USEMEM(TAPE_BUFFER_OVERHEAD) every 
time we start a new run, until we reach maxTapes?
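
To illustrate what I mean (a toy model of the accounting only, not
actual tuplesort code, reusing the made-up 1 GB figures from above):

#include <stdio.h>

#define BLCKSZ                8192
#define TAPE_BUFFER_OVERHEAD  (BLCKSZ * 3)

int
main(void)
{
    long        work_mem = 1024L * 1024 * 1024;     /* 1 GB */
    long        max_tapes = 3744;       /* roughly what 1 GB allows today */
    long        nruns = 10;             /* runs the input actually produces */

    /* Current accounting: all tape buffers charged up front in inittapes(). */
    long        eager = work_mem - max_tapes * TAPE_BUFFER_OVERHEAD;

    /* Suggested accounting: charge one tape's overhead per run started. */
    long        lazy = work_mem - nruns * TAPE_BUFFER_OVERHEAD;

    printf("left for run formation, charged up front: %ld MB\n",
           eager / (1024 * 1024));
    printf("left for run formation, charged per run:  %ld MB\n",
           lazy / (1024 * 1024));
    return 0;
}

With only a handful of runs, that's close to 90 MB of work_mem that
could instead have gone into making the runs longer.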

- Heikki



